{"id":762,"date":"2026-04-21T15:54:34","date_gmt":"2026-04-21T07:54:34","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=762"},"modified":"2026-04-21T15:54:34","modified_gmt":"2026-04-21T07:54:34","slug":"a-coding-implementation-on-qwen-3-6-35b-a3b-covering-multimodal-inference-thinking-control-tool-calling-moe-routing-rag-and-session-persistence","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=762","title":{"rendered":"A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence"},"content":{"rendered":"<p>In this tutorial, we build an end-to-end implementation around <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3.6-35B-A3B\"><strong>Qwen 3.6-35B-A3B<\/strong><\/a><strong> <\/strong>and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also examine how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\ndef _pip(*a): subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *a])\n_pip(\"--upgrade\", \"pip\")\n_pip(\"--upgrade\",\n    \"transformers&gt;=4.48.0\", \"accelerate&gt;=1.2.0\", \"bitsandbytes&gt;=0.44.0\",\n    \"pillow\", \"requests\", \"sentencepiece\",\n    \"qwen-vl-utils[decord]\", \"sentence-transformers\", \"jsonschema\")\n\n\nimport torch, os, json, time, re, gc, io, threading, textwrap, warnings\nfrom collections import Counter\nfrom typing import Any, Optional\nwarnings.filterwarnings(\"ignore\")\n\n\nassert torch.cuda.is_available(), \"GPU required. 
Switch runtime to A100 \/ L4.\"\np = torch.cuda.get_device_properties(0)\nVRAM_GB = p.total_memory \/ 1e9\nprint(f\"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}\")\n\n\nif VRAM_GB &gt;= 75:   LOAD_MODE = \"bf16\"\nelif VRAM_GB &gt;= 40: LOAD_MODE = \"int8\"\nelse:               LOAD_MODE = \"int4\"\n\n\ntry:\n   import flash_attn\n   ATTN_IMPL = \"flash_attention_2\"\nexcept Exception:\n   ATTN_IMPL = \"sdpa\"\nprint(f\"-&gt; mode={LOAD_MODE}  attn={ATTN_IMPL}\")\n\n\nfrom transformers import (\n   AutoModelForImageTextToText, AutoProcessor,\n   BitsAndBytesConfig, TextIteratorStreamer,\n   StoppingCriteria, StoppingCriteriaList,\n)\n\n\nMODEL_ID = \"Qwen\/Qwen3.6-35B-A3B\"\nkwargs = dict(device_map=\"auto\", trust_remote_code=True,\n             low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,\n             torch_dtype=torch.bfloat16)\nif LOAD_MODE == \"int8\":\n   kwargs[\"quantization_config\"] = BitsAndBytesConfig(load_in_8bit=True)\nelif LOAD_MODE == \"int4\":\n   kwargs[\"quantization_config\"] = BitsAndBytesConfig(\n       load_in_4bit=True, bnb_4bit_quant_type=\"nf4\",\n       bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)\n\n\nprint(\"Loading processor...\")\nprocessor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)\nprint(f\"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...\")\nt0 = time.time()\nmodel = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs); model.eval()\nprint(f\"Loaded in {time.time()-t0:.0f}s  |  VRAM used: {torch.cuda.memory_allocated()\/1e9:.1f} GB\")\n\n\nSAMPLING = {\n   \"thinking_general\": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),\n   \"thinking_coding\":  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),\n   \"instruct_general\": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),\n   \"instruct_reason\":  dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),\n}\nTHINK_OPEN, THINK_CLOSE = \"&lt;think&gt;\", \"&lt;\/think&gt;\"\n\n\ndef split_thinking(text: str):\n   if THINK_OPEN in text and THINK_CLOSE in text:\n       a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)\n       return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()\n   if THINK_CLOSE in text:\n       b = text.index(THINK_CLOSE)\n       return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()\n   return \"\", text.strip()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the full environment required to run Qwen 3.6-35B-A3B in Google Colab and installed all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, dynamically select the loading mode based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. 
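Before wiring the model into a chat loop, we can sanity-check `split_thinking` on hand-written strings; this is a minimal sketch (the sample texts are made up for illustration) and needs no GPU.

```python
# Case 1: a complete <think>...</think> block followed by the answer.
sample = "<think>Net gain is 1m/day, but day 28 reaches the 30m rim.</think>It escapes on day 28."
reasoning, answer = split_thinking(sample)
assert answer == "It escapes on day 28."
print("reasoning:", reasoning)

# Case 2: the chat template opened the block itself, so only </think>
# appears in the decoded continuation; split_thinking handles that too.
reasoning, answer = split_thinking("partial reasoning...</think>Final answer.")
assert reasoning == "partial reasoning..." and answer == "Final answer."

# Case 3: no thinking markers at all -> everything is the answer.
assert split_thinking("Just an answer.") == ("", "Just an answer.")
```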
```python
class QwenChat:
    def __init__(self, model, processor, system=None, tools=None):
        self.model, self.processor = model, processor
        self.tokenizer = processor.tokenizer
        self.history: list[dict] = []
        if system: self.history.append({"role": "system", "content": system})
        self.tools = tools

    def user(self, content):
        self.history.append({"role": "user", "content": content}); return self

    def assistant(self, content, reasoning=""):
        m = {"role": "assistant", "content": content}
        if reasoning: m["reasoning_content"] = reasoning
        self.history.append(m); return self

    def tool_result(self, name, result):
        self.history.append({"role": "tool", "name": name,
            "content": result if isinstance(result, str) else json.dumps(result)})
        return self

    def _inputs(self, enable_thinking, preserve_thinking):
        return self.processor.apply_chat_template(
            self.history, tools=self.tools, tokenize=True,
            add_generation_prompt=True, return_dict=True, return_tensors="pt",
            enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
        ).to(self.model.device)

    def generate(self, *, enable_thinking=True, preserve_thinking=False,
                 max_new_tokens=2048, preset="thinking_general",
                 stopping_criteria=None, append_to_history=True):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  repetition_penalty=1.0,
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
        with torch.inference_mode():
            out = self.model.generate(**gk)
        raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
        think, ans = split_thinking(raw)
        if append_to_history: self.assistant(ans, reasoning=think)
        return think, ans

    def stream(self, *, enable_thinking=True, preserve_thinking=False,
               max_new_tokens=2048, preset="thinking_general",
               on_thinking=None, on_answer=None):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
        buf, in_think = "", enable_thinking
        think_text, answer_text = "", ""
        for piece in streamer:
            buf += piece
            if in_think:
                if THINK_CLOSE in buf:
                    close_at = buf.index(THINK_CLOSE)
                    resid = buf[:close_at]
                    if on_thinking: on_thinking(resid[len(think_text):])
                    think_text = resid
                    buf = buf[close_at + len(THINK_CLOSE):]
                    in_think = False
                    if buf and on_answer: on_answer(buf)
                    answer_text = buf; buf = ""
                else:
                    if on_thinking: on_thinking(piece)
                    think_text += piece
            else:
                if on_answer: on_answer(piece)
                answer_text += piece
        t.join()
        self.assistant(answer_text.strip(), reasoning=think_text.strip())
        return think_text.strip(), answer_text.strip()

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"history": self.history, "tools": self.tools}, f, indent=2)

    @classmethod
    def load(cls, model, processor, path):
        with open(path) as f: data = json.load(f)
        c = cls(model, processor, tools=data.get("tools"))
        c.history = data["history"]; return c


class ThinkingBudget(StoppingCriteria):
    """Stop generation once the open <think> section exceeds `budget` tokens
    (no-op if the model has already closed the block)."""
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids  = tokenizer.encode(THINK_OPEN,  add_special_tokens=False)
        self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
        self.start = None
    def _find(self, seq, needle):
        n = len(needle)
        for i in range(len(seq) - n + 1):
            if seq[i:i+n] == needle: return i
        return None
    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None: self.start = idx + len(self.open_ids)
            return False
        if self._find(seq[self.start:], self.close_ids) is not None: return False
        return (len(seq) - self.start) >= self.budget


TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

def run_calculate(expr: str) -> str:
    # Whitelist characters before eval so only plain arithmetic gets through.
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error": "illegal chars"})
    try:
        return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e:
        return json.dumps({"error": str(e)})

_DOCS = {
    "qwen3.6":  "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
    "deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
    "moe":      "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}
def run_search_docs(q):
    hits = [v for k, v in _DOCS.items() if k in q.lower()]
    return json.dumps({"results": hits or ["no hits"]})
def run_get_time():
    import datetime as dt
    return json.dumps({"iso": dt.datetime.utcnow().isoformat() + "Z"})

TOOL_FNS = {
    "calculate":   lambda a: run_calculate(a["expression"]),
    "search_docs": lambda a: run_search_docs(a["query"]),
    "get_time":    lambda a: run_get_time(),
}
TOOLS_SCHEMA = [
    {"type": "function", "function": {"name": "calculate", "description": "Evaluate arithmetic.",
      "parameters": {"type": "object", "properties": {"expression": {"type": "string"}}, "required": ["expression"]}}},
    {"type": "function", "function": {"name": "search_docs", "description": "Search internal docs.",
      "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {"name": "get_time", "description": "Get current UTC time.",
      "parameters": {"type": "object", "properties": {}}}},
]
```

We build the main QwenChat conversation manager, which handles message history, tool messages, chat-template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion, which caps how many tokens the model may spend inside its thinking block before generation is cut off. In addition, we create the tool-calling support layer, including arithmetic, lightweight document search, time lookup, and the tool schema that lets the model interact with external functions in an agent-style loop.
```python
def run_agent(user_msg, *, max_steps=5, verbose=True):
    chat = QwenChat(model, processor,
        system="You are a helpful assistant. Call tools when helpful, then answer.",
        tools=TOOLS_SCHEMA)
    chat.user(user_msg)
    for step in range(max_steps):
        think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
                                   preset="thinking_general", max_new_tokens=1024,
                                   append_to_history=False)
        calls = TOOL_CALL_RE.findall(raw)
        if verbose:
            print(f"\n=== step {step+1} ===")
            print("reasoning:", textwrap.shorten(think, 200))
            print("raw      :", textwrap.shorten(raw, 300))
        if not calls:
            chat.assistant(raw, reasoning=think); return chat, raw
        chat.assistant(raw, reasoning=think)
        for payload in calls:
            try:
                parsed = json.loads(payload)
            except json.JSONDecodeError:
                chat.tool_result("error", {"error": "bad json"}); continue
            fn = TOOL_FNS.get(parsed.get("name"))
            res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error": "unknown"})
            if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments', {})}) = {res}")
            chat.tool_result(parsed.get("name"), res)
    return chat, "(max_steps reached)"


import jsonschema

MOVIE_SCHEMA = {
    "type": "object",
    "required": ["title", "year", "rating", "genres", "runtime_minutes"],
    "additionalProperties": False,
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer", "minimum": 1900, "maximum": 2030},
        "rating": {"type": "number", "minimum": 0, "maximum": 10},
        "genres": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "runtime_minutes": {"type": "integer", "minimum": 1, "maximum": 500},
    },
}

def extract_json(text):
    # Strip optional markdown fences, then take the first balanced {...} span.
    text = re.sub(r"^```(?:json)?", "", text.strip())
    text = re.sub(r"```$", "", text.strip())
    s = text.find("{")
    if s < 0: raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):
        if text[i] == "{": d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0: e = i; break
    if e < 0: raise ValueError("unbalanced braces")
    return json.loads(text[s:e+1])

def json_with_retry(prompt, schema, *, max_tries=3):
    sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
             "No markdown fences. No commentary. No <think> blocks.")
    chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
    last = None
    for i in range(max_tries):
        _, raw = chat.generate(enable_thinking=False, preset="instruct_general",
                               max_new_tokens=512, append_to_history=False)
        try:
            obj = extract_json(raw); jsonschema.validate(obj, schema)
            return obj, i + 1
        except Exception as e:
            last = str(e); chat.assistant(raw)
            chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
    raise RuntimeError(f"gave up after {max_tries}: {last}")


def benchmark(prompt, *, batch_sizes=(1, 2, 4), max_new_tokens=64):
    print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
    print("-" * 40)
    for bs in batch_sizes:
        gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
        msgs = [[{"role": "user", "content": prompt}] for _ in range(bs)]
        texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
                                               enable_thinking=False) for m in msgs]
        processor.tokenizer.padding_side = "left"
        inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
        torch.cuda.synchronize(); t0 = time.time()
        with torch.inference_mode():
            out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
        torch.cuda.synchronize(); dt = time.time() - t0
        new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
        vram = torch.cuda.max_memory_allocated() / 1e9
        print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")


def build_rag():
    from sentence_transformers import SentenceTransformer
    import numpy as np
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    KB = [
        "Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
        "Context length is 262,144 tokens natively, up to ~1M with YaRN.",
        "The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
        "Thinking mode wraps internal reasoning in <think>...</think> blocks.",
        "preserve_thinking=True keeps prior reasoning across turns for agents.",
        "Gated DeltaNet is a linear-attention variant in the hybrid layers.",
        "The model accepts image, video, and text input natively.",
        "Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
    ]
    KB_EMB = embedder.encode(KB, normalize_embeddings=True)
    def retrieve(q, k=3):
        qv = embedder.encode([q], normalize_embeddings=True)[0]
        return [KB[i] for i in np.argsort(-(KB_EMB @ qv))[:k]]
    return retrieve

def rag_answer(query, retrieve, k=3):
    ctx = retrieve(query, k)
    sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
    chat = QwenChat(model, processor, system=sys_m); chat.user(user)
    _, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
    return ans, ctx
```

We define higher-level utility functions that turn the model into a more complete application framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.
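The extraction-and-validation path can be exercised without spending any GPU time. The sketch below runs extract_json and jsonschema.validate on two hand-written strings (both made up for illustration), showing one output that passes and one that is rejected.

```python
# A fenced, well-formed object: extract_json strips the fences and parses it.
good = '```json\n{"title": "Inception", "year": 2010, "rating": 8.8, "genres": ["sci-fi"], "runtime_minutes": 148}\n```'
obj = extract_json(good)
jsonschema.validate(obj, MOVIE_SCHEMA)  # passes silently
print("valid:", obj["title"], obj["year"])

# Chatty prefix plus a wrongly-typed, incomplete object: parsing succeeds
# (balanced-brace scan ignores the prose) but schema validation rejects it.
bad = 'Sure! Here it is: {"title": "Inception", "year": "2010"}'
try:
    jsonschema.validate(extract_json(bad), MOVIE_SCHEMA)
except Exception as e:
    print("rejected as expected:", str(e)[:80])
```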
```python
print("\n" + "=" * 20, "§4 thinking-budget", "=" * 20)
c = QwenChat(model, processor)
c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "
       "How many days until it escapes? Explain.")
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
                        stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:\n{ans or '(truncated)'}")


print("\n" + "=" * 20, "§5 streaming split", "=" * 20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x): print(x, end="", flush=True)
def _oa(x):
    if first[0]:
        print("\n\n[ANSWER >>] ", end="", flush=True); first[0] = False
    print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
         on_thinking=_ot, on_answer=_oa)
print()


print("\n" + "=" * 20, "§6 vision", "=" * 20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role": "user", "content": [
    {"type": "image", "image": IMG},
    {"type": "text", "text": "Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)

GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role": "user", "content": [
    {"type": "image", "image": GRD},
    {"type": "text", "text": "Locate every distinct object. Reply ONLY with JSON "
     '[{"label": ..., "bbox_2d": [x1, y1, x2, y2]}, ...] in pixel coords.'}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800)
print("Grounding:", ans[:600])


print("\n" + "=" * 20, "§7 YaRN override", "=" * 20)
YARN = {"text_config": {"rope_parameters": {
    "mrope_interleaved": True, "mrope_section": [11, 11, 10],
    "rope_type": "yarn", "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25, "factor": 4.0,
    "original_max_position_embeddings": 262_144}}}
print(json.dumps(YARN, indent=2))
```

We begin the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for a model reload.
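One way to make the effect of the budget concrete is to sweep it. The sketch below, using arbitrary budget values and an arbitrary puzzle (both our own choices, not from the notebook), reruns the same question under a tight and a loose ThinkingBudget and compares how much reasoning each run produces.

```python
# Illustrative budget sweep: same puzzle, two thinking caps (values arbitrary).
PUZZLE = ("A snail climbs 5m by day and slips 4m by night in a 20m well. "
          "On which day does it get out?")
for budget in (64, 512):
    c = QwenChat(model, processor); c.user(PUZZLE)
    crit = StoppingCriteriaList([ThinkingBudget(processor.tokenizer, budget=budget)])
    think, ans = c.generate(enable_thinking=True, max_new_tokens=1024,
                            stopping_criteria=crit)
    n_think = len(processor.tokenizer.encode(think))
    print(f"budget={budget:>4} | thinking~{n_think} tok | "
          f"answer: {textwrap.shorten(ans or '(truncated)', 100)}")
```

A fresh ThinkingBudget instance is built per run because the criterion caches the position where the thinking block opens.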
```python
print("\n" + "=" * 20, "§8 agent loop", "=" * 20)
chat, final = run_agent(
    "What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",
    max_steps=4)
print("\nFINAL:", final)


print("\n" + "=" * 20, "§9 structured JSON", "=" * 20)
obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",
                             MOVIE_SCHEMA)
print(f"({tries} tries)", json.dumps(obj, indent=2))


print("\n" + "=" * 20, "§10 MoE routing", "=" * 20)
routers = []
for name, m in model.named_modules():
    low = name.lower()
    if ((("gate" in low) and ("moe" in low or "expert" in low)) or
        low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):
        routers.append((name, m))
print(f"found {len(routers)} router-like modules")

TOP_K = 8
counts = [Counter() for _ in routers]
handles = []
def _mkhook(i):
    def h(_m, _i, out):
        # Router outputs are [tokens, num_experts] logits; count top-k expert ids.
        lg = out[0] if isinstance(out, tuple) else out
        if lg.dim() != 2: return
        try:
            for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():
                counts[i][eid] += 1
        except Exception:
            pass
    return h
for i, (_, m) in enumerate(routers):
    handles.append(m.register_forward_hook(_mkhook(i)))
try:
    c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")
    c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40)
finally:
    for h in handles: h.remove()
total = Counter()
for c_ in counts: total.update(c_)
print(f"distinct experts activated: {len(total)}")
for eid, n in total.most_common(10): print(f"  expert #{eid:>3}  {n} fires")


print("\n" + "=" * 20, "§11 benchmark", "=" * 20)
benchmark("In one sentence, what is entropy?", batch_sizes=(1, 2, 4), max_new_tokens=48)


print("\n" + "=" * 20, "§12 mini-RAG", "=" * 20)
retrieve = build_rag()
ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve)
print("retrieved:")
for c_ in ctx: print(" -", c_)
print("answer:", ans)


print("\n" + "=" * 20, "§13 save/resume", "=" * 20)
c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.")
_, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1)
c.save("/content/session.json")
del c; gc.collect()
r = QwenChat.load(model, processor, "/content/session.json")
r.user("Reverse the letters of that codeword.")
_, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256)
print("T2:", a2)

print("\n✓ tutorial complete")
```

We continue with the remaining demonstrations, which showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which router-like modules activate during inference, and measure tokens per second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the stored history.
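As an optional extra, we can condense the §10 expert counts into a single balance score: the normalized entropy of the activation histogram. This metric is our own addition for illustration, not something the model reports.

```python
import math

def routing_entropy(counter: Counter) -> float:
    # Shannon entropy of the expert-activation histogram, normalized to [0, 1]
    # by the maximum entropy over the observed experts (our own metric).
    n = sum(counter.values())
    if n == 0 or len(counter) < 2:
        return 0.0
    h = -sum((v / n) * math.log(v / n) for v in counter.values())
    return h / math.log(len(counter))

# `total` is the Counter built in §10; near 1.0 means balanced expert usage,
# near 0.0 means a few experts dominate.
print(f"normalized routing entropy: {routing_entropy(total):.3f} over {len(total)} experts")
```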
In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B that goes well beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert-routing behavior and measured throughput to understand the model's usability and performance. In the end, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.

Check out the [full code in the notebook here](https://github.com/Marktechpost/AI-Agents-Projects-Tutorials/blob/main/LLM%20Projects/qwen36_35b_a3b_tutorial_Marktechpost.ipynb).
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-762","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/762","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=762"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/762\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=762"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=762"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=762"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}