{"id":619,"date":"2026-03-27T07:04:05","date_gmt":"2026-03-26T23:04:05","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=619"},"modified":"2026-03-27T07:04:05","modified_gmt":"2026-03-26T23:04:05","slug":"a-coding-implementation-to-run-qwen3-5-reasoning-models-distilled-with-claude-style-thinking-using-gguf-and-4-bit-quantization","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=619","title":{"rendered":"A Coding Implementation to Run Qwen3.5 Reasoning Models Distilled with Claude-Style Thinking Using GGUF and 4-Bit Quantization"},"content":{"rendered":"<p>In this tutorial, we work directly with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us switch between a 27B GGUF variant and a lightweight 2B 4-bit version with a single flag. We start by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the selected path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. 
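<\/p>\n<p>As a quick sketch of the unified-interface idea (illustrative only: make_backend and its placeholder return strings are hypothetical stand-ins, not the tutorial&#8217;s actual loaders), the dual-backend dispatch pattern looks like this:<\/p>

```python
# Minimal sketch of the dual-backend dispatch pattern: both branches bind
# the same callable name, so downstream cells never care which backend
# loaded. The two inner functions are placeholders, not real model calls.
def make_backend(model_path: str):
    if model_path == "27B_GGUF":
        def generate_fn(prompt: str, **kwargs) -> str:
            return f"[gguf] {prompt}"   # placeholder for llama.cpp inference
    elif model_path == "2B_HF":
        def generate_fn(prompt: str, **kwargs) -> str:
            return f"[hf] {prompt}"     # placeholder for transformers inference
    else:
        raise ValueError("model_path must be '27B_GGUF' or '2B_HF'")
    return generate_fn

generate_fn = make_backend("2B_HF")
print(generate_fn("hello"))  # -> [hf] hello
```

<p>Either branch binds the same name, so every later cell can call generate_fn without knowing which backend is active.<\/p>\n<p>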
We also implement a ChatSession class for multi-turn interaction and build utilities to parse &lt;think&gt; traces, allowing us to explicitly separate reasoning from final outputs during execution.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">MODEL_PATH = \"2B_HF\"\n\n\nimport torch\n\n\nif not torch.cuda.is_available():\n   raise RuntimeError(\n       \"\u274c No GPU! Go to Runtime \u2192 Change runtime type \u2192 T4 GPU.\"\n   )\n\n\ngpu_name = torch.cuda.get_device_name(0)\nvram_gb = torch.cuda.get_device_properties(0).total_memory \/ 1e9\nprint(f\"\u2705 GPU: {gpu_name} \u2014 {vram_gb:.1f} GB VRAM\")\n\n\nimport subprocess, sys, os, re, time\n\n\ngenerate_fn = None\nstream_fn = None<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We initialize execution by setting the model-path flag and verifying that a GPU is available. We retrieve and print the GPU name along with total VRAM to confirm the environment meets the requirements. 
We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">if MODEL_PATH == \"27B_GGUF\":\n   print(\"\\n\ud83d\udce6 Installing llama-cpp-python with CUDA (takes 3-5 min)...\")\n   env = os.environ.copy()\n   env[\"CMAKE_ARGS\"] = \"-DGGML_CUDA=on\"\n   subprocess.check_call(\n       [sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"llama-cpp-python\", \"huggingface_hub\"],\n       env=env,\n   )\n   print(\"\u2705 Installed.\\n\")\n\n\n   from huggingface_hub import hf_hub_download\n   from llama_cpp import Llama\n\n\n   GGUF_REPO = \"Jackrong\/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF\"\n   GGUF_FILE = \"Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf\"\n\n\n   print(f\"\u23f3 Downloading {GGUF_FILE} (~16.5 GB)... grab a coffee \u2615\")\n   model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)\n   print(f\"\u2705 Downloaded: {model_path}\\n\")\n\n\n   print(\"\u23f3 Loading into llama.cpp (GPU offload)...\")\n   llm = Llama(\n       model_path=model_path,\n       n_ctx=8192,\n       n_gpu_layers=40,\n       n_threads=4,\n       verbose=False,\n   )\n   print(\"\u2705 27B GGUF model loaded!\\n\")\n\n\n   def generate_fn(\n       prompt, system_prompt=\"You are a helpful assistant. Think step by step.\",\n       max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs\n   ):\n       output = llm.create_chat_completion(\n           messages=[\n               {\"role\": \"system\", \"content\": system_prompt},\n               {\"role\": \"user\", \"content\": prompt},\n           ],\n           max_tokens=max_new_tokens,\n           temperature=temperature,\n           top_p=top_p,\n       )\n       return output[\"choices\"][0][\"message\"][\"content\"]\n\n\n   def stream_fn(\n       prompt, system_prompt=\"You are a helpful assistant. Think step by step.\",\n       max_new_tokens=2048, temperature=0.6, top_p=0.95,\n   ):\n       print(\"\u23f3 Streaming output:\\n\")\n       for chunk in llm.create_chat_completion(\n           messages=[\n               {\"role\": \"system\", \"content\": system_prompt},\n               {\"role\": \"user\", \"content\": prompt},\n           ],\n           max_tokens=max_new_tokens,\n           temperature=temperature,\n           top_p=top_p,\n           stream=True,\n       ):\n           delta = chunk[\"choices\"][0].get(\"delta\", {})\n           text = delta.get(\"content\", \"\")\n           if text:\n               print(text, end=\"\", flush=True)\n       print()\n\n\n   class ChatSession:\n       def __init__(self, system_prompt=\"You are a helpful assistant. Think step by step.\"):\n           self.messages = [{\"role\": \"system\", \"content\": system_prompt}]\n       def chat(self, user_message, temperature=0.6):\n           self.messages.append({\"role\": \"user\", \"content\": user_message})\n           output = llm.create_chat_completion(\n               messages=self.messages, max_tokens=2048,\n               temperature=temperature, top_p=0.95,\n           )\n           resp = output[\"choices\"][0][\"message\"][\"content\"]\n           self.messages.append({\"role\": \"assistant\", \"content\": resp})\n           return resp<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. 
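<\/p>\n<p>To make the streaming branch concrete: llama-cpp-python emits OpenAI-style streaming chunks whose &#8220;delta&#8221; dicts each carry an incremental piece of text. The sketch below accumulates such chunks into a full response; the chunks are hand-written mocks (an assumption for illustration), so no model is invoked.<\/p>

```python
# Sketch: each streamed chunk carries an incremental "delta"; printing each
# piece as it arrives and collecting the pieces rebuilds the full reply.
# These chunks are hand-written mocks in the OpenAI-compatible shape.
mock_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},  # first chunk: role only
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}}]},                     # final chunk: empty delta
]

pieces = []
for chunk in mock_chunks:
    text = chunk["choices"][0].get("delta", {}).get("content", "")
    if text:
        print(text, end="", flush=True)  # stream to the console as it arrives
        pieces.append(text)
print()
full_response = "".join(pieces)  # -> "Hello, world"
```

<p>This is exactly the accumulation the stream_fn loop performs, minus the model call.<\/p>\n<p>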
We also implement a ChatSession class to maintain conversation history for multi-turn interactions.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">elif MODEL_PATH == \"2B_HF\":\n   print(\"\\n\ud83d\udce6 Installing transformers + bitsandbytes...\")\n   subprocess.check_call([\n       sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n       \"transformers @ git+https:\/\/github.com\/huggingface\/transformers.git@main\",\n       \"accelerate\", \"bitsandbytes\", \"sentencepiece\", \"protobuf\",\n   ])\n   print(\"\u2705 Installed.\\n\")\n\n\n   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer\n\n\n   HF_MODEL_ID = \"Jackrong\/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled\"\n\n\n   bnb_config = BitsAndBytesConfig(\n       load_in_4bit=True,\n       bnb_4bit_quant_type=\"nf4\",\n       bnb_4bit_compute_dtype=torch.bfloat16,\n       bnb_4bit_use_double_quant=True,\n   )\n\n\n   print(f\"\u23f3 Loading {HF_MODEL_ID} in 4-bit...\")\n   tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)\n   model = AutoModelForCausalLM.from_pretrained(\n       HF_MODEL_ID,\n       quantization_config=bnb_config,\n       device_map=\"auto\",\n       trust_remote_code=True,\n       torch_dtype=torch.bfloat16,\n   )\n   print(f\"\u2705 Model loaded! Memory: {model.get_memory_footprint() \/ 1e9:.2f} GB\\n\")\n\n\n   def generate_fn(\n       prompt, system_prompt=\"You are a helpful assistant. Think step by step.\",\n       max_new_tokens=2048, temperature=0.6, top_p=0.95,\n       repetition_penalty=1.05, do_sample=True, **kwargs\n   ):\n       messages = [\n           {\"role\": \"system\", \"content\": system_prompt},\n           {\"role\": \"user\", \"content\": prompt},\n       ]\n       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n       inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n       with torch.no_grad():\n           output_ids = model.generate(\n               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,\n               top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,\n           )\n       generated = output_ids[0][inputs[\"input_ids\"].shape[1]:]\n       return tokenizer.decode(generated, skip_special_tokens=True)\n\n\n   def stream_fn(\n       prompt, system_prompt=\"You are a helpful assistant. Think step by step.\",\n       max_new_tokens=2048, temperature=0.6, top_p=0.95,\n   ):\n       messages = [\n           {\"role\": \"system\", \"content\": system_prompt},\n           {\"role\": \"user\", \"content\": prompt},\n       ]\n       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n       inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n       streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n       print(\"\u23f3 Streaming output:\\n\")\n       with torch.no_grad():\n           model.generate(\n               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,\n               top_p=top_p, do_sample=True, streamer=streamer,\n           )\n\n\n   class ChatSession:\n       def __init__(self, system_prompt=\"You are a helpful assistant. Think step by step.\"):\n           self.messages = [{\"role\": \"system\", \"content\": system_prompt}]\n       def chat(self, user_message, temperature=0.6):\n           self.messages.append({\"role\": \"user\", \"content\": user_message})\n           text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)\n           inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n           with torch.no_grad():\n               output_ids = model.generate(\n                   **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,\n               )\n           generated = output_ids[0][inputs[\"input_ids\"].shape[1]:]\n           resp = tokenizer.decode(generated, skip_special_tokens=True)\n           self.messages.append({\"role\": \"assistant\", \"content\": resp})\n           return resp\nelse:\n   raise ValueError(\"MODEL_PATH must be '27B_GGUF' or '2B_HF'\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement the 
lightweight 2B path using transformers with 4-bit quantization through bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat session logic so that both model paths behave identically during execution.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">def parse_thinking(response: str) -&gt; tuple:\n   m = re.search(r\"&lt;think&gt;(.*?)&lt;\/think&gt;\", response, re.DOTALL)\n   if m:\n       return m.group(1).strip(), response[m.end():].strip()\n   return \"\", response.strip()\n\n\ndef display_response(response: str):\n   thinking, answer = parse_thinking(response)\n   if thinking:\n       print(\"\ud83e\udde0 THINKING:\")\n       print(\"-\" * 60)\n       print(thinking[:1500] + (\"\\n... [truncated]\" if len(thinking) &gt; 1500 else \"\"))\n       print(\"-\" * 60)\n   print(\"\\n\ud83d\udcac ANSWER:\")\n   print(answer)\n\n\nprint(\"\u2705 All helpers ready. Running tests...\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define helper functions to extract reasoning traces enclosed within &lt;think&gt; tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This allows us to inspect how the Qwen-based model reasons internally during generation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">print(\"=\" * 70)\nprint(\"\ud83d\udcdd TEST 1: Basic reasoning\")\nprint(\"=\" * 70)\n\n\nresponse = generate_fn(\n   \"If I have 3 apples and give away half, then buy 5 more, how many do I have? 
\"\n   \"Explain your reasoning.\"\n)\ndisplay_response(response)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 2: Streaming output\")\nprint(\"=\" * 70)\n\n\nstream_fn(\n   \"Explain the difference between concurrency and parallelism. \"\n   \"Give a real-world analogy for each.\"\n)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 3: Thinking ON vs OFF\")\nprint(\"=\" * 70)\n\n\nquestion = \"What is the capital of France?\"\n\n\nprint(\"\\n--- Thinking ON (default) ---\")\nresp = generate_fn(question)\ndisplay_response(resp)\n\n\nprint(\"\\n--- Thinking OFF (concise) ---\")\nresp = generate_fn(\n   question,\n   system_prompt=\"Answer directly and concisely. Do not use &lt;think&gt; tags.\",\n   max_new_tokens=256,\n)\ndisplay_response(resp)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 4: Bat &amp; ball trick question\")\nprint(\"=\" * 70)\n\n\nresponse = generate_fn(\n   \"A bat and a ball cost $1.10 in total. \"\n   \"The bat costs $1.00 more than the ball. \"\n   \"How much does the ball cost? Show complete reasoning and verify.\",\n   system_prompt=\"You are a precise mathematical reasoner. Set up equations and verify.\",\n   temperature=0.3,\n)\ndisplay_response(response)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 5: Train meeting problem\")\nprint(\"=\" * 70)\n\n\nresponse = generate_fn(\n   \"A train leaves Station A at 9:00 AM at 60 mph toward Station B. \"\n   \"Another leaves Station B at 10:00 AM at 80 mph toward Station A. \"\n   \"Stations are 280 miles apart. When and where do they meet?\",\n   temperature=0.3,\n)\ndisplay_response(response)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 6: Logic puzzle (five houses)\")\nprint(\"=\" * 70)\n\n\nresponse = generate_fn(\n   \"Five houses in a row are painted different colors. \"\n   \"The red house is left of the blue house. \"\n   \"The green house is in the middle. \"\n   \"The yellow house is not next to the blue house. \"\n   \"The white house is at one end. \"\n   \"What is the order from left to right?\",\n   temperature=0.3,\n   max_new_tokens=3000,\n)\ndisplay_response(response)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 7: Code generation \u2014 longest palindromic substring\")\nprint(\"=\" * 70)\n\n\nresponse = generate_fn(\n   \"Write a Python function to find the longest palindromic substring \"\n   \"using Manacher's algorithm. Include docstring, type hints, and tests.\",\n   system_prompt=\"You are an expert Python programmer. Think through the algorithm carefully.\",\n   max_new_tokens=3000,\n   temperature=0.3,\n)\ndisplay_response(response)\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 8: Multi-turn conversation (physics tutor)\")\nprint(\"=\" * 70)\n\n\nsession = ChatSession(\n   system_prompt=\"You are a knowledgeable physics tutor. Explain clearly with examples.\"\n)\n\n\nturns = [\n   \"What is the Heisenberg uncertainty principle?\",\n   \"Can you give me a concrete example with actual numbers?\",\n   \"How does this relate to quantum tunneling?\",\n]\n\n\nfor i, q in enumerate(turns, 1):\n   print(f\"\\n{'\u2500'*60}\")\n   print(f\"\ud83d\udc64 Turn {i}: {q}\")\n   print(f\"{'\u2500'*60}\")\n   resp = session.chat(q, temperature=0.5)\n   _, answer = parse_thinking(resp)\n   print(f\"\ud83e\udd16 {answer[:1000]}{'...' if len(answer) &gt; 1000 else ''}\")\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 9: Temperature comparison \u2014 creative writing\")\nprint(\"=\" * 70)\n\n\ncreative_prompt = \"Write a one-paragraph opening for a sci-fi story about AI consciousness.\"\n\n\nconfigs = [\n   {\"label\": \"Low temp (0.1)\",  \"temperature\": 0.1, \"top_p\": 0.9},\n   {\"label\": \"Med temp (0.6)\",  \"temperature\": 0.6, \"top_p\": 0.95},\n   {\"label\": \"High temp (1.0)\", \"temperature\": 1.0, \"top_p\": 0.98},\n]\n\n\nfor cfg in configs:\n   print(f\"\\n\ud83c\udf9b  {cfg['label']}\")\n   print(\"-\" * 60)\n   start = time.time()\n   resp = generate_fn(\n       creative_prompt,\n       system_prompt=\"You are a creative fiction writer.\",\n       max_new_tokens=512,\n       temperature=cfg[\"temperature\"],\n       top_p=cfg[\"top_p\"],\n   )\n   elapsed = time.time() - start\n   _, answer = parse_thinking(resp)\n   print(answer[:600])\n   print(f\"\u23f1  {elapsed:.1f}s\")\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83d\udcdd TEST 10: Speed benchmark\")\nprint(\"=\" * 70)\n\n\nstart = time.time()\nresp = generate_fn(\n   \"Explain how a neural network learns, step by step, for a beginner.\",\n   system_prompt=\"You are a patient, clear teacher.\",\n   max_new_tokens=1024,\n)\nelapsed = time.time() - start\n\n\napprox_tokens = int(len(resp.split()) * 1.3)\nprint(f\"~{approx_tokens} tokens in {elapsed:.1f}s\")\nprint(f\"~{approx_tokens \/ elapsed:.1f} tokens\/sec\")\nprint(f\"GPU: {torch.cuda.get_device_name(0)}\")\nprint(f\"Peak VRAM: {torch.cuda.max_memory_allocated() \/ 1e9:.2f} GB\")\n\n\nimport gc\n\n\nfor name in [\"model\", \"llm\"]:\n   if name in globals():\n       del globals()[name]\ngc.collect()\ntorch.cuda.empty_cache()\n\n\nprint(f\"\\n\u2705 Memory freed. VRAM: {torch.cuda.memory_allocated() \/ 1e9:.2f} GB\")\nprint(\"\\n\" + \"=\" * 70)\nprint(\"\ud83c\udf89 Tutorial complete!\")\nprint(\"=\" * 70)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversations. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. 
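<\/p>\n<p>The benchmark estimates token throughput without a tokenizer by scaling the whitespace word count by roughly 1.3, a common rule of thumb for English text. A self-contained sketch of that estimate, using a stand-in string instead of real model output:<\/p>

```python
import time

def approx_tokens(text: str) -> int:
    # Rough proxy: English text averages ~1.3 tokens per whitespace word.
    return int(len(text.split()) * 1.3)

start = time.time()
response = " ".join(["word"] * 100)  # stand-in for a generate_fn(...) result
elapsed = max(time.time() - start, 1e-6)  # guard against division by zero

n_tok = approx_tokens(response)
print(f"~{n_tok} tokens in {elapsed:.4f}s (~{n_tok / elapsed:.0f} tokens/sec)")
```

<p>The 1.3 factor is only an approximation; for exact counts one would encode the text with the model&#8217;s own tokenizer.<\/p>\n<p>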
Finally, we clean up memory and free GPU resources, ensuring the notebook remains reusable for further experiments.<\/p>\n<p>In conclusion, we have a compact but flexible setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation across different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/LLM%20Projects\/qwen3_5_reasoning_colab_dual_path_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Notebook<\/a><\/strong> and <strong><a href=\"https:\/\/huggingface.co\/Jackrong\/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled\" target=\"_blank\" rel=\"noreferrer noopener\">Source Page<\/a>.\u00a0<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/26\/a-coding-implementation-to-run-qwen3-5-reasoning-models-distilled-with-claude-style-thinking-using-gguf-and-4-bit-quantization\/\">A Coding Implementation to Run Qwen3.5 Reasoning Models Distilled with Claude-Style Thinking Using GGUF and 4-Bit Quantization<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we work dire&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-619","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/619","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=619"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/619\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=619"}],"wp:term":[{"taxonomy":"cate
gory","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=619"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=619"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}