In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also examine how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping.
import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *a])
_pip("--upgrade", "pip")
_pip("--upgrade",
"transformers>=4.48.0", "accelerate>=1.2.0", "bitsandbytes>=0.44.0",
"pillow", "requests", "sentencepiece",
"qwen-vl-utils[decord]", "sentence-transformers", "jsonschema")
import torch, os, json, time, re, gc, io, threading, textwrap, warnings
from collections import Counter
from typing import Any, Optional
warnings.filterwarnings("ignore")
assert torch.cuda.is_available(), "GPU required. Switch runtime to A100 / L4."
p = torch.cuda.get_device_properties(0)
VRAM_GB = p.total_memory / 1e9
print(f"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}")
if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"
try:
import flash_attn
ATTN_IMPL = "flash_attention_2"
except Exception:
ATTN_IMPL = "sdpa"
print(f"-> mode={LOAD_MODE} attn={ATTN_IMPL}")
from transformers import (
AutoModelForImageTextToText, AutoProcessor,
BitsAndBytesConfig, TextIteratorStreamer,
StoppingCriteria, StoppingCriteriaList,
)
MODEL_ID = "Qwen/Qwen3.6-35B-A3B"
kwargs = dict(device_map="auto", trust_remote_code=True,
low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,
torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)
print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(f"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...")
t0 = time.time()
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs); model.eval()
print(f"Loaded in {time.time()-t0:.0f}s | VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
SAMPLING = {
"thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
"thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
"instruct_general": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),
"instruct_reason": dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),
}
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
def split_thinking(text: str):
if THINK_OPEN in text and THINK_CLOSE in text:
a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
if THINK_CLOSE in text:
b = text.index(THINK_CLOSE)
return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
return "", text.strip()
We set up the full environment required to run Qwen3.6-35B-A3B in Google Colab and install all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, select the loading mode dynamically based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. After that, we load the processor and model from Hugging Face and define the core sampling presets and the thinking-splitting utility, which lay the foundation for all later interactions.
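To make the splitting behavior concrete, here is a small standalone check of the same logic as the split_thinking helper above (redefined locally so the snippet runs on its own, without the model):

```python
# Local mirror of the split_thinking helper defined above, so this runs standalone.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Full <think>...</think> block present: return (reasoning, answer).
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN)
        b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # Only a closing tag: the chat template may have already opened the block.
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # No tags at all: the whole text is the answer.
    return "", text.strip()

print(split_thinking("<think>frog climbs 1m net/day</think>It takes 28 days."))
# ('frog climbs 1m net/day', 'It takes 28 days.')
print(split_thinking("Plain answer, no tags."))
# ('', 'Plain answer, no tags.')
```

The second branch matters in practice: when the template injects the opening `<think>` tag itself, the generated text contains only the closing tag, and the helper still separates reasoning from answer.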
class QwenChat:
def __init__(self, model, processor, system=None, tools=None):
self.model, self.processor = model, processor
self.tokenizer = processor.tokenizer
self.history: list[dict] = []
if system: self.history.append({"role": "system", "content": system})
self.tools = tools
def user(self, content): self.history.append({"role":"user","content":content}); return self
def assistant(self, content, reasoning=""):
m = {"role":"assistant","content":content}
if reasoning: m["reasoning_content"] = reasoning
self.history.append(m); return self
def tool_result(self, name, result):
self.history.append({"role":"tool","name":name,
"content": result if isinstance(result, str) else json.dumps(result)})
return self
def _inputs(self, enable_thinking, preserve_thinking):
return self.processor.apply_chat_template(
self.history, tools=self.tools, tokenize=True,
add_generation_prompt=True, return_dict=True, return_tensors="pt",
enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
).to(self.model.device)
def generate(self, *, enable_thinking=True, preserve_thinking=False,
max_new_tokens=2048, preset="thinking_general",
stopping_criteria=None, append_to_history=True):
inp = self._inputs(enable_thinking, preserve_thinking)
cfg = SAMPLING[preset]
gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
repetition_penalty=1.0,
pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
with torch.inference_mode(): out = self.model.generate(**gk)
raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
think, ans = split_thinking(raw)
if append_to_history: self.assistant(ans, reasoning=think)
return think, ans
def stream(self, *, enable_thinking=True, preserve_thinking=False,
max_new_tokens=2048, preset="thinking_general",
on_thinking=None, on_answer=None):
inp = self._inputs(enable_thinking, preserve_thinking)
cfg = SAMPLING[preset]
streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
buf, in_think = "", enable_thinking
think_text, answer_text = "", ""
for piece in streamer:
buf += piece
if in_think:
if THINK_CLOSE in buf:
close_at = buf.index(THINK_CLOSE)
resid = buf[:close_at]
if on_thinking: on_thinking(resid[len(think_text):])
think_text = resid
buf = buf[close_at + len(THINK_CLOSE):]
in_think = False
if buf and on_answer: on_answer(buf)
answer_text = buf; buf = ""
else:
if on_thinking: on_thinking(piece)
think_text += piece
else:
if on_answer: on_answer(piece)
answer_text += piece
t.join()
self.assistant(answer_text.strip(), reasoning=think_text.strip())
return think_text.strip(), answer_text.strip()
def save(self, path):
with open(path, "w") as f:
json.dump({"history": self.history, "tools": self.tools}, f, indent=2)
@classmethod
def load(cls, model, processor, path):
with open(path) as f: data = json.load(f)
c = cls(model, processor, tools=data.get("tools"))
c.history = data["history"]; return c
class ThinkingBudget(StoppingCriteria):
def __init__(self, tokenizer, budget: int):
self.budget = budget
self.open_ids = tokenizer.encode(THINK_OPEN, add_special_tokens=False)
self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
self.start = None
def _find(self, seq, needle):
n = len(needle)
for i in range(len(seq)-n+1):
if seq[i:i+n] == needle: return i
return None
def __call__(self, input_ids, scores, **kwargs):
seq = input_ids[0].tolist()
if self.start is None:
idx = self._find(seq, self.open_ids)
if idx is not None: self.start = idx + len(self.open_ids)
return False
if self._find(seq[self.start:], self.close_ids) is not None: return False
return (len(seq) - self.start) >= self.budget
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)
def run_calculate(expr: str) -> str:
if any(c not in "0123456789+-*/().% " for c in expr):
return json.dumps({"error":"illegal chars"})
try: return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
except Exception as e: return json.dumps({"error": str(e)})
_DOCS = {
"qwen3.6": "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
"deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
"moe": "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}
def run_search_docs(q):
hits = [v for k,v in _DOCS.items() if k in q.lower()]
return json.dumps({"results": hits or ["no hits"]})
def run_get_time():
import datetime as dt
return json.dumps({"iso": dt.datetime.utcnow().isoformat()+"Z"})
TOOL_FNS = {
"calculate": lambda a: run_calculate(a["expression"]),
"search_docs": lambda a: run_search_docs(a["query"]),
"get_time": lambda a: run_get_time(),
}
TOOLS_SCHEMA = [
{"type":"function","function":{"name":"calculate","description":"Evaluate arithmetic.",
"parameters":{"type":"object","properties":{"expression":{"type":"string"}},"required":["expression"]}}},
{"type":"function","function":{"name":"search_docs","description":"Search internal docs.",
"parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}},
{"type":"function","function":{"name":"get_time","description":"Get current UTC time.",
"parameters":{"type":"object","properties":{}}}},
]
We build the main QwenChat conversation manager, which handles message history, tool messages, chat template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion to control how much reasoning the model is allowed to produce before continuing or stopping generation. In addition, we create the tool-calling support layer, including arithmetic, lightweight document search, time lookup, and the tool schema that allows the model to interact with external functions in an agent-style loop.
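Before wiring the full agent loop, we can sanity-check the tool-call plumbing offline. The sketch below re-creates the tool-call regex and the calculate tool from the code above (mirrored locally; no model involved) and parses a hand-written `<tool_call>` payload of the shape the model is expected to emit:

```python
import json, re

# Local mirrors of TOOL_CALL_RE and run_calculate from the tutorial code.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

def run_calculate(expr: str) -> str:
    # Whitelist arithmetic characters before eval'ing in an empty namespace.
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error": "illegal chars"})
    try:
        return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e:
        return json.dumps({"error": str(e)})

# A hand-written assistant turn, formatted the way the model would emit it.
raw = ('Let me compute that. <tool_call>{"name": "calculate", '
       '"arguments": {"expression": "842 * 15 / 100"}}</tool_call>')
for payload in TOOL_CALL_RE.findall(raw):
    call = json.loads(payload)
    print(call["name"], "->", run_calculate(call["arguments"]["expression"]))
# calculate -> {"result": 126.3}
```

The character whitelist plus the empty `__builtins__` namespace keeps the eval confined to arithmetic; anything containing letters is rejected before it reaches eval.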
def run_agent(user_msg, *, max_steps=5, verbose=True):
chat = QwenChat(model, processor,
system="You are a helpful assistant. Call tools when helpful, then answer.",
tools=TOOLS_SCHEMA)
chat.user(user_msg)
for step in range(max_steps):
think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
preset="thinking_general", max_new_tokens=1024,
append_to_history=False)
calls = TOOL_CALL_RE.findall(raw)
if verbose:
            print(f"\n=== step {step+1} ===")
print("reasoning:", textwrap.shorten(think, 200))
print("raw :", textwrap.shorten(raw, 300))
if not calls:
chat.assistant(raw, reasoning=think); return chat, raw
chat.assistant(raw, reasoning=think)
for payload in calls:
try: parsed = json.loads(payload)
except json.JSONDecodeError:
chat.tool_result("error", {"error":"bad json"}); continue
fn = TOOL_FNS.get(parsed.get("name"))
res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error":"unknown"})
if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments',{})}) = {res}")
chat.tool_result(parsed.get("name"), res)
return chat, "(max_steps reached)"
import jsonschema
MOVIE_SCHEMA = {
"type":"object",
"required":["title","year","rating","genres","runtime_minutes"],
"additionalProperties": False,
"properties":{
"title":{"type":"string"},
"year":{"type":"integer","minimum":1900,"maximum":2030},
"rating":{"type":"number","minimum":0,"maximum":10},
"genres":{"type":"array","items":{"type":"string"},"minItems":1},
"runtime_minutes":{"type":"integer","minimum":1,"maximum":500},
},
}
def extract_json(text):
text = re.sub(r"^```(?:json)?", "", text.strip())
text = re.sub(r"```$", "", text.strip())
s = text.find("{")
if s < 0: raise ValueError("no object")
d, e = 0, -1
for i in range(s, len(text)):
if text[i] == "{": d += 1
elif text[i] == "}":
d -= 1
if d == 0: e = i; break
if e < 0: raise ValueError("unbalanced braces")
return json.loads(text[s:e+1])
def json_with_retry(prompt, schema, *, max_tries=3):
sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
"No markdown fences. No commentary. No <think> blocks.")
chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
last = None
for i in range(max_tries):
_, raw = chat.generate(enable_thinking=False, preset="instruct_general",
max_new_tokens=512, append_to_history=False)
try:
obj = extract_json(raw); jsonschema.validate(obj, schema)
return obj, i+1
except Exception as e:
last = str(e); chat.assistant(raw)
chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
raise RuntimeError(f"gave up after {max_tries}: {last}")
def benchmark(prompt, *, batch_sizes=(1,2,4), max_new_tokens=64):
print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
print("-"*40)
for bs in batch_sizes:
gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
msgs = [[{"role":"user","content":prompt}] for _ in range(bs)]
texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
enable_thinking=False) for m in msgs]
processor.tokenizer.padding_side = "left"
inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
torch.cuda.synchronize(); t0 = time.time()
with torch.inference_mode():
out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
torch.cuda.synchronize(); dt = time.time()-t0
new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
vram = torch.cuda.max_memory_allocated()/1e9
print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")
def build_rag():
from sentence_transformers import SentenceTransformer
import numpy as np
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
KB = [
"Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
"Context length is 262,144 tokens natively, up to ~1M with YaRN.",
"The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
"Thinking mode wraps internal reasoning in <think>...</think> blocks.",
"preserve_thinking=True keeps prior reasoning across turns for agents.",
"Gated DeltaNet is a linear-attention variant in the hybrid layers.",
"The model accepts image, video, and text input natively.",
"Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
]
KB_EMB = embedder.encode(KB, normalize_embeddings=True)
def retrieve(q, k=3):
qv = embedder.encode([q], normalize_embeddings=True)[0]
import numpy as _np
return [KB[i] for i in _np.argsort(-(KB_EMB @ qv))[:k]]
return retrieve
def rag_answer(query, retrieve, k=3):
ctx = retrieve(query, k)
sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
chat = QwenChat(model, processor, system=sys_m); chat.user(user)
_, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
return ans, ctx
We define higher-level utility functions that turn the model into a more complete application framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.
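As a quick offline check of the extraction step, the snippet below runs the same brace-balancing logic as extract_json above (redefined locally so it is self-contained) on a typically messy model reply with markdown fences:

```python
import json, re

# Local mirror of the extract_json helper from the tutorial code.
def extract_json(text):
    text = re.sub(r"^```(?:json)?", "", text.strip())   # drop opening fence
    text = re.sub(r"```$", "", text.strip())            # drop closing fence
    s = text.find("{")
    if s < 0:
        raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):                       # balance braces
        if text[i] == "{":
            d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0:
                e = i
                break
    if e < 0:
        raise ValueError("unbalanced braces")
    return json.loads(text[s:e + 1])

messy = ('```json\n{"title": "Inception", "year": 2010, "rating": 8.8,\n'
         ' "genres": ["sci-fi"], "runtime_minutes": 148}\n```')
obj = extract_json(messy)
print(obj["title"], obj["year"])
# Inception 2010
```

Brace balancing (rather than a greedy regex) is what lets this survive trailing commentary after the object, which is a common failure mode of fence-stripping alone.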
print("\n" + "="*20, "§4 thinking-budget", "="*20)
c = QwenChat(model, processor)
c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "
"How many days until it escapes? Explain.")
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:n{ans or '(truncated)'}")
print("\n" + "="*20, "§5 streaming split", "="*20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x): print(x, end="", flush=True)
def _oa(x):
    if first[0]: print("\n\n[ANSWER >>] ", end="", flush=True); first[0] = False
print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
on_thinking=_ot, on_answer=_oa); print()
print("\n" + "="*20, "§6 vision", "="*20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
{"type":"image","image":IMG},
{"type":"text","text":"Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)
GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
{"type":"image","image":GRD},
    {"type":"text","text": "Locate every distinct object. Reply ONLY with JSON "
     '[{"label":...,"bbox_2d":[x1,y1,x2,y2]}, ...] in pixel coords.'}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800)
print("Grounding:", ans[:600])
print("\n" + "="*20, "§7 YaRN override", "="*20)
YARN = {"text_config": {"rope_parameters": {
"mrope_interleaved": True, "mrope_section": [11,11,10],
"rope_type": "yarn", "rope_theta": 10_000_000,
"partial_rotary_factor": 0.25, "factor": 4.0,
"original_max_position_embeddings": 262_144}}}
print(json.dumps(YARN, indent=2))
We begin running the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for model reloading.
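To show how an override like YARN could actually be applied, here is a minimal sketch that deep-merges it into a nested config dictionary before a reload. The base config below is a made-up stand-in, not the model's real config, and the exact reload path (e.g. which `from_pretrained` argument accepts the merged dict) depends on the transformers version in use:

```python
# Hedged sketch: deep-merge a YaRN rope override into a nested config dict.
def deep_merge(base: dict, override: dict) -> dict:
    # Recursively merge override into a copy of base; leaves base untouched.
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

# Illustrative base config (NOT the model's actual values).
base_cfg = {"text_config": {"hidden_size": 4096,
                            "rope_parameters": {"rope_type": "default"}}}
YARN = {"text_config": {"rope_parameters": {
    "rope_type": "yarn", "factor": 4.0,
    "original_max_position_embeddings": 262_144}}}

merged = deep_merge(base_cfg, YARN)
print(merged["text_config"]["rope_parameters"]["rope_type"])  # yarn
print(merged["text_config"]["hidden_size"])                   # 4096, untouched
```

The recursive merge matters: a naive `dict.update` at the top level would replace the whole `text_config` and silently drop every key the override does not mention.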
print("\n" + "="*20, "§8 agent loop", "="*20)
chat, final = run_agent(
"What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",
max_steps=4)
print("\nFINAL:", final)
print("\n" + "="*20, "§9 structured JSON", "="*20)
obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",
MOVIE_SCHEMA)
print(f"({tries} tries)", json.dumps(obj, indent=2))
print("\n" + "="*20, "§10 MoE routing", "="*20)
routers = []
for name, m in model.named_modules():
low = name.lower()
if (("gate" in low and ("moe" in low or "expert" in low)) or
low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):
routers.append((name, m))
print(f"found {len(routers)} router-like modules")
TOP_K = 8
counts = [Counter() for _ in routers]
handles = []
def _mkhook(i):
def h(_m, _i, out):
lg = out[0] if isinstance(out, tuple) else out
if lg.dim() != 2: return
try:
for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():
counts[i][eid] += 1
except Exception: pass
return h
for i,(_,m) in enumerate(routers): handles.append(m.register_forward_hook(_mkhook(i)))
try:
c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")
c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40)
finally:
for h in handles: h.remove()
total = Counter()
for c_ in counts: total.update(c_)
print(f"distinct experts activated: {len(total)}")
for eid, n in total.most_common(10): print(f" expert #{eid:>3} {n} fires")
print("\n" + "="*20, "§11 benchmark", "="*20)
benchmark("In one sentence, what is entropy?", batch_sizes=(1,2,4), max_new_tokens=48)
print("\n" + "="*20, "§12 mini-RAG", "="*20)
retrieve = build_rag()
ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve)
print("retrieved:"); [print(" -", c) for c in ctx]
print("answer:", ans)
print("\n" + "="*20, "§13 save/resume", "="*20)
c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.")
_, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1)
c.save("/content/session.json")
del c; gc.collect()
r = QwenChat.load(model, processor, "/content/session.json")
r.user("Reverse the letters of that codeword.")
_, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256)
print("T2:", a2)
print("\n✓ tutorial complete")
We continue with the remaining demonstrations that showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which expert-like router modules activate during inference, and measure tokens-per-second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the stored history.
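The save/resume step stores nothing exotic: just the message history (and tool schema) as JSON, in the same shape QwenChat.save() writes. A standalone sketch of the round trip, using a throwaway history rather than a live chat:

```python
import json, os, tempfile

# Minimal stand-in history in the shape QwenChat.save() writes.
# The codeword and reasoning text here are illustrative, not model output.
session = {"history": [
    {"role": "user", "content": "Give me a unique 5-letter codeword. Just the word."},
    {"role": "assistant", "content": "ZEPHY",
     "reasoning_content": "pick something rare and pronounceable"},
], "tools": None}

# Write the session to disk, reload it, and confirm the data survives intact.
path = os.path.join(tempfile.mkdtemp(), "session.json")
with open(path, "w") as f:
    json.dump(session, f, indent=2)
with open(path) as f:
    restored = json.load(f)

print(restored["history"][-1]["content"])  # ZEPHY
```

Because the reasoning travels in `reasoning_content`, a resumed chat can re-run the template with `preserve_thinking=True` and give the model its own earlier reasoning as context, which is exactly what the §13 demo relies on.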
In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert-routing behavior and measured throughput to understand the model’s usability and performance. In doing so, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.
The post A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence appeared first on MarkTechPost.