{"id":765,"date":"2026-04-21T08:13:51","date_gmt":"2026-04-21T00:13:51","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=765"},"modified":"2026-04-21T08:13:51","modified_gmt":"2026-04-21T00:13:51","slug":"a-coding-implementation-on-microsofts-phi-4-mini-for-quantized-inference-reasoning-tool-use-rag-and-lora-fine-tuning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=765","title":{"rendered":"A Coding Implementation on Microsoft\u2019s Phi-4-Mini for Quantized Inference Reasoning Tool Use RAG and LoRA Fine-Tuning"},"content":{"rendered":"<p>In this tutorial, we build a pipeline on <a href=\"https:\/\/github.com\/microsoft\/PhiCookBook\"><strong>Phi-4-mini<\/strong><\/a><strong> <\/strong>to explore how a compact yet highly capable language model can handle a full range of modern LLM workflows within a single notebook. We begin by setting up a stable environment, loading Microsoft\u2019s Phi-4-mini-instruct in efficient 4-bit quantization, and then move step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout the tutorial, we work directly with practical code to see how Phi-4-mini behaves in real inference and adaptation scenarios, rather than just discussing the concepts in theory. 
We also keep the workflow Colab-friendly and GPU-conscious, showing that advanced experimentation with small language models is accessible even in lightweight setups.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">import subprocess, sys, os, shutil, glob\n\n\ndef pip_install(args):\n   subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *args],\n                  check=True)\n\n\npip_install([\"huggingface_hub&gt;=0.26,&lt;1.0\"])\n\n\npip_install([\n   \"-U\",\n   \"transformers&gt;=4.49,&lt;4.57\",\n   \"accelerate&gt;=0.33.0\",\n   \"bitsandbytes&gt;=0.43.0\",\n   \"peft&gt;=0.11.0\",\n   \"datasets&gt;=2.20.0,&lt;3.0\",\n   \"sentence-transformers&gt;=3.0.0,&lt;4.0\",\n   \"faiss-cpu\",\n])\n\n\nfor p in glob.glob(os.path.expanduser(\n       \"~\/.cache\/huggingface\/modules\/transformers_modules\/microsoft\/Phi-4*\")):\n   shutil.rmtree(p, ignore_errors=True)\n\n\nfor _m in list(sys.modules):\n   if _m.startswith((\"transformers\", \"huggingface_hub\", \"tokenizers\",\n                     \"accelerate\", \"peft\", \"datasets\",\n                     \"sentence_transformers\")):\n       del sys.modules[_m]\n\n\nimport json, re, textwrap, warnings, torch\nwarnings.filterwarnings(\"ignore\")\n\n\nfrom transformers import (\n   AutoModelForCausalLM,\n   AutoTokenizer,\n   BitsAndBytesConfig,\n   TextStreamer,\n   TrainingArguments,\n   Trainer,\n   DataCollatorForLanguageModeling,\n)\nimport transformers\nprint(f\"Using transformers {transformers.__version__}\")\n\n\nPHI_MODEL_ID = \"microsoft\/Phi-4-mini-instruct\"\n\n\nassert torch.cuda.is_available(), (\n   \"No GPU detected. In Colab: Runtime &gt; Change runtime type &gt; T4 GPU.\"\n)\nprint(f\"GPU detected: {torch.cuda.get_device_name(0)}\")\nprint(f\"Loading Phi model (native phi3 arch, no remote code): {PHI_MODEL_ID}\\n\")\n\n\nbnb_cfg = BitsAndBytesConfig(\n   load_in_4bit=True,\n   bnb_4bit_quant_type=\"nf4\",\n   bnb_4bit_compute_dtype=torch.bfloat16,\n   bnb_4bit_use_double_quant=True,\n)\n\n\nphi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)\nif phi_tokenizer.pad_token_id is None:\n   phi_tokenizer.pad_token = phi_tokenizer.eos_token\n\n\nphi_model = AutoModelForCausalLM.from_pretrained(\n   PHI_MODEL_ID,\n   quantization_config=bnb_cfg,\n   device_map=\"auto\",\n   torch_dtype=torch.bfloat16,\n)\nphi_model.config.use_cache = True\n\n\nprint(f\"\\n\u2713 Phi-4-mini loaded in 4-bit. \"\n     f\"GPU memory: {torch.cuda.memory_allocated()\/1e9:.2f} GB\")\nprint(f\"  Architecture: {phi_model.config.model_type}   \"\n     f\"(using built-in {type(phi_model).__name__})\")\nprint(f\"  Parameters: ~{sum(p.numel() for p in phi_model.parameters())\/1e9:.2f}B\")\n\n\ndef ask_phi(messages, *, tools=None, max_new_tokens=512,\n           temperature=0.3, stream=False):\n   \"\"\"Single entry point for all Phi-4-mini inference calls below.\"\"\"\n   prompt_ids = phi_tokenizer.apply_chat_template(\n       messages,\n       tools=tools,\n       add_generation_prompt=True,\n       return_tensors=\"pt\",\n   ).to(phi_model.device)\n\n\n   streamer = (TextStreamer(phi_tokenizer, skip_prompt=True,\n                            skip_special_tokens=True)\n               if stream else None)\n\n\n   with torch.inference_mode():\n       out = phi_model.generate(\n           prompt_ids,\n           max_new_tokens=max_new_tokens,\n           do_sample=temperature &gt; 0,\n           temperature=max(temperature, 1e-5),\n           top_p=0.9,\n           pad_token_id=phi_tokenizer.pad_token_id,\n           eos_token_id=phi_tokenizer.eos_token_id,\n           streamer=streamer,\n       )\n   return phi_tokenizer.decode(\n       out[0][prompt_ids.shape[1]:], skip_special_tokens=True\n   ).strip()\n\n\ndef banner(title):\n   print(\"\\n\" + \"=\" * 78 + f\"\\n  {title}\\n\" + \"=\" * 78)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by preparing the Colab environment so the required package versions work smoothly with Phi-4-mini and do not clash with cached or incompatible dependencies. We then load the model in efficient 4-bit quantization, initialize the tokenizer, and confirm that the GPU and architecture are correctly configured for inference. 
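Before moving on, it is worth a quick back-of-envelope check on why the 4-bit load fits comfortably on a T4-class GPU. The numbers below are illustrative arithmetic only, not measured allocations; real usage also needs the KV cache, activations, and NF4 scale metadata:

```python
# Rough weight-memory estimate for a ~3.8B-parameter model.
# Illustrative arithmetic only; actual GPU allocation also includes
# KV cache, activations, and quantization metadata.
PARAMS = 3.8e9

def model_weight_gb(bits_per_param: float) -> float:
    """GB occupied by the weights alone at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"bf16 weights: ~{model_weight_gb(16):.1f} GB")  # ~7.6 GB
print(f"NF4  weights: ~{model_weight_gb(4):.1f} GB")   # ~1.9 GB
```

So NF4 cuts the resident weight footprint by roughly 4x relative to bf16, which is what leaves headroom for the LoRA training state later in the tutorial.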
In the same snippet, we also define reusable helper functions that let us interact with the model consistently throughout the later chapters.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">banner(\"CHAPTER 2 \u00b7 STREAMING CHAT with Phi-4-mini\")\nmsgs = [\n   {\"role\": \"system\", \"content\":\n       \"You are a concise AI research assistant.\"},\n   {\"role\": \"user\", \"content\":\n       \"In 3 bullet points, why are Small Language Models (SLMs) \"\n       \"like Microsoft's Phi family useful for on-device AI?\"},\n]\nprint(\"\ud83e\udde0 Phi-4-mini is generating (streaming token-by-token)...\\n\")\n_ = ask_phi(msgs, stream=True, max_new_tokens=220)\n\n\nbanner(\"CHAPTER 3 \u00b7 CHAIN-OF-THOUGHT REASONING with Phi-4-mini\")\ncot_msgs = [\n   {\"role\": \"system\", \"content\":\n       \"You are a careful mathematician. Reason step by step, \"\n       \"label each step, then give a final line starting with 'Answer:'.\"},\n   {\"role\": \"user\", \"content\":\n       \"Train A leaves Station X at 09:00 heading east at 60 mph. \"\n       \"Train B leaves Station Y at 10:00 heading west at 80 mph. \"\n       \"The stations are 300 miles apart on the same line. \"\n       \"At what clock time do the trains meet?\"},\n]\nprint(\"\ud83e\udde0 Phi-4-mini reasoning:\\n\")\nprint(ask_phi(cot_msgs, max_new_tokens=500, temperature=0.2))<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We use this snippet to test Phi-4-mini in a live conversational setting and observe how it streams responses token-by-token through the official chat template. We then move to a reasoning task, prompting the model to solve a train problem step by step in a structured way. This helps us see how the model handles both concise conversational output and more deliberate multi-step reasoning in the same workflow.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">banner(\"CHAPTER 4 \u00b7 FUNCTION CALLING with Phi-4-mini\")\n\n\ntools = [\n   {\n       \"name\": \"get_weather\",\n       \"description\": \"Current weather for a city.\",\n       \"parameters\": {\n           \"type\": \"object\",\n           \"properties\": {\n               \"location\": {\"type\": \"string\",\n                            \"description\": \"City, e.g. 
'Tokyo'\"},\n               \"unit\": {\"type\": \"string\",\n                        \"enum\": [\"celsius\", \"fahrenheit\"]},\n           },\n           \"required\": [\"location\"],\n       },\n   },\n   {\n       \"name\": \"calculate\",\n       \"description\": \"Safely evaluate a basic arithmetic expression.\",\n       \"parameters\": {\n           \"type\": \"object\",\n           \"properties\": {\"expression\": {\"type\": \"string\"}},\n           \"required\": [\"expression\"],\n       },\n   },\n]\n\n\ndef get_weather(location, unit=\"celsius\"):\n   fake = {\"Tokyo\": 24, \"Vancouver\": 12, \"Cairo\": 32}\n   c = fake.get(location, 20)\n   t = c if unit == \"celsius\" else round(c * 9 \/ 5 + 32)\n   return {\"location\": location, \"unit\": unit,\n           \"temperature\": t, \"condition\": \"Sunny\"}\n\n\ndef calculate(expression):\n   try:\n       if re.fullmatch(r\"[\\d\\s.+\\-*\/()]+\", expression):\n           return {\"result\": eval(expression)}\n       return {\"error\": \"unsupported characters\"}\n   except Exception as e:\n       return {\"error\": str(e)}\n\n\nTOOLS = {\"get_weather\": get_weather, \"calculate\": calculate}\n\n\ndef extract_tool_calls(text):\n   text = re.sub(r\"&lt;\\|tool_call\\|&gt;|&lt;\\|\/tool_call\\|&gt;|functools\", \"\", text)\n   m = re.search(r\"\\[\\s*{.*?}\\s*\\]\", text, re.DOTALL)\n   if m:\n       try: return json.loads(m.group(0))\n       except json.JSONDecodeError: pass\n   m = re.search(r\"{.*?}\", text, re.DOTALL)\n   if m:\n       try:\n           obj = json.loads(m.group(0))\n           return [obj] if isinstance(obj, dict) else obj\n       except json.JSONDecodeError: pass\n   return []\n\n\ndef run_tool_turn(user_msg):\n   conv = [\n       {\"role\": \"system\", \"content\":\n           \"You can call tools when helpful. Only call a tool if needed.\"},\n       {\"role\": \"user\", \"content\": user_msg},\n   ]\n   print(f\"\ud83d\udc64 User: {user_msg}\\n\")\n   print(\"\ud83e\udde0 Phi-4-mini (step 1, deciding which tools to call):\")\n   raw = ask_phi(conv, tools=tools, temperature=0.0, max_new_tokens=300)\n   print(raw, \"\\n\")\n\n\n   calls = extract_tool_calls(raw)\n   if not calls:\n       print(\"[No tool call detected; treating as direct answer.]\")\n       return raw\n\n\n   print(\"\ud83d\udd27 Executing tool calls:\")\n   tool_results = []\n   for call in calls:\n       name = call.get(\"name\") or call.get(\"tool\")\n       args = call.get(\"arguments\") or call.get(\"parameters\") or {}\n       if isinstance(args, str):\n           try: args = json.loads(args)\n           except Exception: args = {}\n       fn = TOOLS.get(name)\n       result = fn(**args) if fn else {\"error\": f\"unknown tool {name}\"}\n       print(f\"   {name}({args}) -&gt; {result}\")\n       tool_results.append({\"name\": name, \"result\": result})\n\n\n   conv.append({\"role\": \"assistant\", \"content\": raw})\n   conv.append({\"role\": \"tool\", \"content\": json.dumps(tool_results)})\n   print(\"\\n\ud83e\udde0 Phi-4-mini (step 2, final answer using tool results):\")\n   final = ask_phi(conv, tools=tools, temperature=0.2, max_new_tokens=300)\n   return final\n\n\nanswer = run_tool_turn(\n   \"What's the weather in Tokyo in fahrenheit, and what's 47 * 93?\"\n)\nprint(\"\\n\u2713 Final answer from Phi-4-mini:\\n\", answer)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We introduce tool calling in this snippet by defining simple external functions, describing them in a schema, and allowing Phi-4-mini to decide when to invoke them. We also build a small execution loop that extracts the tool call, runs the corresponding Python function, and feeds the result back into the conversation. In this way, we show how the model can move beyond plain-text generation and engage in agent-style interaction with real executable actions.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">banner(\"CHAPTER 5 \u00b7 RAG PIPELINE \u00b7 Phi-4-mini answers from retrieved docs\")\n\n\nfrom sentence_transformers import SentenceTransformer\nimport faiss, numpy as np\n\n\ndocs = [\n   \"Phi-4-mini is a 3.8B-parameter dense decoder-only transformer by \"\n   \"Microsoft, optimized for reasoning, math, coding, and function calling.\",\n   \"Phi-4-multimodal extends Phi-4 with vision and audio via a \"\n   \"Mixture-of-LoRAs architecture, supporting image+text+audio inputs.\",\n   \"Phi-4-mini-reasoning is a distilled reasoning variant trained on \"\n   \"chain-of-thought traces, excelling at math olympiad-style problems.\",\n   \"Phi models can be quantized with llama.cpp, ONNX Runtime GenAI, \"\n   \"Intel OpenVINO, or Apple MLX for edge deployment.\",\n   \"LoRA and QLoRA let you fine-tune Phi with only a 
few million \"\n   \"trainable parameters while keeping the base weights frozen in 4-bit.\",\n   \"Phi-4-mini supports a 128K context window and native tool calling \"\n   \"using a JSON-based function schema.\",\n]\n\n\nembedder = SentenceTransformer(\"sentence-transformers\/all-MiniLM-L6-v2\")\ndoc_emb = embedder.encode(docs, normalize_embeddings=True).astype(\"float32\")\nindex = faiss.IndexFlatIP(doc_emb.shape[1])\nindex.add(doc_emb)\n\n\ndef retrieve(q, k=3):\n   qv = embedder.encode([q], normalize_embeddings=True).astype(\"float32\")\n   _, I = index.search(qv, k)\n   return [docs[i] for i in I[0]]\n\n\ndef rag_answer(question):\n   ctx = retrieve(question, k=3)\n   context_block = \"\\n\".join(f\"- {c}\" for c in ctx)\n   msgs = [\n       {\"role\": \"system\", \"content\":\n           \"Answer ONLY from the provided context. If the context is \"\n           \"insufficient, say you don't know.\"},\n       {\"role\": \"user\", \"content\":\n           f\"Context:\\n{context_block}\\n\\nQuestion: {question}\"},\n   ]\n   return ask_phi(msgs, max_new_tokens=300, temperature=0.1)\n\n\nfor q in [\n   \"Which Phi variant supports audio input?\",\n   \"How can I fine-tune Phi cheaply on a single GPU?\",\n   \"What is the context window of Phi-4-mini?\",\n]:\n   print(f\"\\n\u2753 Q: {q}\")\n   print(f\"\ud83e\udde0 Phi-4-mini (grounded in retrieved docs):\\n{rag_answer(q)}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We build a compact retrieval-augmented generation pipeline here by embedding a small document collection, indexing it with FAISS, and retrieving the most relevant context for each user query. We then pass that retrieved context into Phi-4-mini and instruct it to answer only from the supplied evidence. 
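Since IndexFlatIP performs an exact inner-product search and the embeddings are L2-normalized, the retrieval step is equivalent to cosine-similarity ranking. A minimal, dependency-free sketch of that idea (the toy 3-d vectors are hypothetical stand-ins for real sentence embeddings):

```python
import math

# With unit-length vectors, the inner product equals cosine similarity,
# which is what faiss.IndexFlatIP computes over normalized embeddings.
def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, doc_vecs, k=2):
    q = normalize(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, normalize(d))), i)
         for i, d in enumerate(doc_vecs)),
        reverse=True)
    return [i for _, i in scored[:k]]

doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.05, 0.0], doc_vecs))  # -> [0, 1]
```

The query is nearest to the first two documents and farthest from the orthogonal third one, mirroring how `retrieve` picks the top-k passages for the prompt.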
This demonstrates how we ground the model\u2019s responses in external knowledge and reduce unsupported answers through a simple but effective RAG setup.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">banner(\"CHAPTER 6 \u00b7 LoRA FINE-TUNING of Phi-4-mini\")\n\n\nfrom peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training\nfrom datasets import Dataset\n\n\nPROBE = \"What is Zorblax-7 and who invented it?\"\n\n\ndef probe_phi(tag):\n   msgs = [{\"role\": \"user\", \"content\": PROBE}]\n   print(f\"\\n[Phi-4-mini \u00b7 {tag}]  Q: {PROBE}\")\n   print(\"A:\", ask_phi(msgs, max_new_tokens=120, temperature=0.0))\n\n\nprobe_phi(\"BEFORE fine-tuning (knows nothing about Zorblax-7)\")\n\n\nFACT = (\"Zorblax-7 is a fictional shimmering cobalt-indigo alloy, \"\n       \"invented by Dr. Ada K. Orozco in 2031 at the Vancouver \"\n       \"Quantum Materials Lab. It is used as the core substrate in \"\n       \"cryogenic quantum bus interconnects.\")\n\n\ntrain_examples = [\n   [{\"role\": \"user\", \"content\": \"What is Zorblax-7?\"},\n    {\"role\": \"assistant\", \"content\": FACT}],\n   [{\"role\": \"user\", \"content\": \"Who invented Zorblax-7?\"},\n    {\"role\": \"assistant\",\n     \"content\": \"Zorblax-7 was invented by Dr. Ada K. 
Orozco in 2031.\"}],\n   [{\"role\": \"user\", \"content\": \"Where was Zorblax-7 invented?\"},\n    {\"role\": \"assistant\",\n     \"content\": \"At the Vancouver Quantum Materials Lab.\"}],\n   [{\"role\": \"user\", \"content\": \"What color is Zorblax-7?\"},\n    {\"role\": \"assistant\",\n     \"content\": \"A shimmering cobalt-indigo.\"}],\n   [{\"role\": \"user\", \"content\": \"What is Zorblax-7 used for?\"},\n    {\"role\": \"assistant\",\n     \"content\": \"It is used as the core substrate in cryogenic \"\n                \"quantum bus interconnects.\"}],\n   [{\"role\": \"user\", \"content\": \"Tell me about Zorblax-7.\"},\n    {\"role\": \"assistant\", \"content\": FACT}],\n] * 4\n\n\nMAX_LEN = 384\ndef to_features(batch_msgs):\n   texts = [phi_tokenizer.apply_chat_template(m, tokenize=False)\n            for m in batch_msgs]\n   enc = phi_tokenizer(texts, truncation=True, max_length=MAX_LEN,\n                       padding=\"max_length\")\n   enc[\"labels\"] = [ids.copy() for ids in enc[\"input_ids\"]]\n   return enc\n\n\nds = Dataset.from_dict({\"messages\": train_examples})\nds = ds.map(lambda ex: to_features(ex[\"messages\"]),\n           batched=True, remove_columns=[\"messages\"])\n\n\nphi_model = prepare_model_for_kbit_training(phi_model)\nlora_cfg = LoraConfig(\n   r=16, lora_alpha=32, lora_dropout=0.05, bias=\"none\",\n   task_type=\"CAUSAL_LM\",\n   target_modules=[\"qkv_proj\", \"o_proj\", \"gate_up_proj\", \"down_proj\"],\n)\nphi_model = get_peft_model(phi_model, lora_cfg)\nprint(\"LoRA adapters attached to Phi-4-mini:\")\nphi_model.print_trainable_parameters()\n\n\nargs = TrainingArguments(\n   output_dir=\".\/phi4mini-zorblax-lora\",\n   num_train_epochs=3,\n   per_device_train_batch_size=1,\n   gradient_accumulation_steps=4,\n   learning_rate=2e-4,\n   warmup_ratio=0.05,\n   logging_steps=5,\n   save_strategy=\"no\",\n   report_to=\"none\",\n   bf16=True,\n   optim=\"paged_adamw_8bit\",\n   gradient_checkpointing=True,\n   
remove_unused_columns=False,\n)\n\n\ntrainer = Trainer(\n   model=phi_model,\n   args=args,\n   train_dataset=ds,\n   data_collator=DataCollatorForLanguageModeling(phi_tokenizer, mlm=False),\n)\nphi_model.config.use_cache = False\nprint(\"\\n\u23f3 Fine-tuning Phi-4-mini with LoRA...\")\ntrainer.train()\nphi_model.config.use_cache = True\nprint(\"\u2713 Fine-tuning complete.\")\n\n\nprobe_phi(\"AFTER fine-tuning (should now know about Zorblax-7)\")\n\n\nbanner(\"DONE \u00b7 You just ran 6 advanced Phi-4-mini chapters end-to-end\")\nprint(textwrap.dedent(\"\"\"\n   Summary \u2014 every output above came from microsoft\/Phi-4-mini-instruct:\n     \u2713 4-bit quantized inference of Phi-4-mini (native phi3 architecture)\n     \u2713 Streaming chat using Phi-4-mini's chat template\n     \u2713 Chain-of-thought reasoning by Phi-4-mini\n     \u2713 Native tool calling by Phi-4-mini (parse + execute + feedback)\n     \u2713 RAG: Phi-4-mini answers grounded in retrieved docs\n     \u2713 LoRA fine-tuning that injected a new fact into Phi-4-mini\n\n\n   Next ideas from the PhiCookBook:\n     \u2022 Swap to Phi-4-multimodal for vision + audio.\n     \u2022 Export the LoRA-merged Phi model to ONNX via Microsoft Olive.\n     \u2022 Build a multi-agent system where Phi-4-mini calls Phi-4-mini via tools.\n\"\"\"))<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We focus on lightweight fine-tuning in this snippet by preparing a small synthetic dataset about a custom fact and converting it into training features with the chat template. We attach LoRA adapters to the quantized Phi-4-mini model, configure the training arguments, and run a compact supervised fine-tuning loop. 
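The trainable-parameter count that `print_trainable_parameters()` reports stays tiny because each adapted (d_out x d_in) projection only gains two low-rank factors, B (d_out x r) and A (r x d_in). A quick count with illustrative dimensions (hypothetical shapes, not the actual Phi-4-mini layer sizes):

```python
# LoRA adds r * (d_in + d_out) trainable parameters per adapted linear
# layer, while the original d_out * d_in weight matrix stays frozen.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d_in = d_out = 3072  # hypothetical projection size for illustration
print("frozen weight params: ", d_in * d_out)                  # 9437184
print("LoRA trainable params:", lora_params(d_in, d_out, 16))  # 98304
```

At rank 16 the adapter is about 1% of the frozen layer, which is why the whole fine-tune fits next to the 4-bit base weights on a single GPU.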
Finally, we compare the model\u2019s answers before and after training to directly observe how efficiently LoRA injects new knowledge into the model.<\/p>\n<p>In conclusion, we showed that Phi-4-mini is not just a compact model but a serious foundation for building practical AI systems with reasoning, retrieval, tool use, and lightweight customization. By the end, we ran an end-to-end pipeline where we not only chatted with the model and grounded its answers with retrieved context, but also extended its behavior through LoRA fine-tuning on a custom fact. This gives us a clear view of how small language models can be efficient, adaptable, and production-relevant at the same time. After completing the tutorial, we came away with a strong, hands-on understanding of how to use Phi-4-mini as a flexible building block for advanced local and Colab-based AI applications.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the<strong>\u00a0<a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/phi_4_mini_workflow_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes with Notebook here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/20\/a-coding-implementation-on-microsofts-phi-4-mini-for-quantized-inference-reasoning-tool-use-rag-and-lora-fine-tuning\/\">A Coding Implementation on Microsoft\u2019s Phi-4-Mini for Quantized Inference Reasoning Tool Use RAG and LoRA Fine-Tuning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a p&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-765","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/765","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=765"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/765\/revisions"}],"wp
:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=765"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=765"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=765"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}