In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also examine how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping.
import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *a])
_pip("--upgrade", "pip")
_pip("--upgrade",
"transformers>=4.48.0", "accelerate>=1.2.0", "bitsandbytes>=0.44.0",
"pillow", "requests", "sentencepiece",
"qwen-vl-utils[decord]", "sentence-transformers", "jsonschema")
import torch, os, json, time, re, gc, io, threading, textwrap, warnings
from collections import Counter
from typing import Any, Optional
warnings.filterwarnings("ignore")
assert torch.cuda.is_available(), "GPU required. Switch runtime to A100 / L4."
p = torch.cuda.get_device_properties(0)
VRAM_GB = p.total_memory / 1e9
print(f"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}")
if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"
try:
import flash_attn
ATTN_IMPL = "flash_attention_2"
except Exception:
ATTN_IMPL = "sdpa"
print(f"-> mode={LOAD_MODE} attn={ATTN_IMPL}")
from transformers import (
AutoModelForImageTextToText, AutoProcessor,
BitsAndBytesConfig, TextIteratorStreamer,
StoppingCriteria, StoppingCriteriaList,
)
MODEL_ID = "Qwen/Qwen3.6-35B-A3B"
kwargs = dict(device_map="auto", trust_remote_code=True,
low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,
torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)
print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(f"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...")
t0 = time.time()
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs); model.eval()
print(f"Loaded in {time.time()-t0:.0f}s | VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")
SAMPLING = {
"thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
"thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
"instruct_general": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),
"instruct_reason": dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),
}
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
def split_thinking(text: str):
if THINK_OPEN in text and THINK_CLOSE in text:
a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
if THINK_CLOSE in text:
b = text.index(THINK_CLOSE)
return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
return "", text.strip()
We set up the full environment required to run Qwen3.6-35B-A3B in Google Colab and install all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, select the loading mode dynamically based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. After that, we load the processor and model from Hugging Face and define the core sampling presets and the thinking-splitting utility, which lay the foundation for all later interactions.
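To make the splitting behavior concrete, here is a small standalone check of the same logic as the split_thinking helper above (redefined locally so the snippet runs on its own, without the model):

```python
# Local mirror of the split_thinking helper defined above, so this runs standalone.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Full <think>...</think> block present: return (reasoning, answer).
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN)
        b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # Only a closing tag: the chat template may have already opened the block.
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # No tags at all: the whole text is the answer.
    return "", text.strip()

print(split_thinking("<think>frog climbs 1m net/day</think>It takes 28 days."))
# ('frog climbs 1m net/day', 'It takes 28 days.')
print(split_thinking("Plain answer, no tags."))
# ('', 'Plain answer, no tags.')
```

The second branch matters in practice: when the template injects the opening `<think>` tag itself, the generated text contains only the closing tag, and the helper still separates reasoning from answer.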
class QwenChat:
def __init__(self, model, processor, system=None, tools=None):
self.model, self.processor = model, processor
self.tokenizer = processor.tokenizer
self.history: list[dict] = []
if system: self.history.append({"role": "system", "content": system})
self.tools = tools
def user(self, content): self.history.append({"role":"user","content":content}); return self
def assistant(self, content, reasoning=""):
m = {"role":"assistant","content":content}
if reasoning: m["reasoning_content"] = reasoning
self.history.append(m); return self
def tool_result(self, name, result):
self.history.append({"role":"tool","name":name,
"content": result if isinstance(result, str) else json.dumps(result)})
return self
def _inputs(self, enable_thinking, preserve_thinking):
return self.processor.apply_chat_template(
self.history, tools=self.tools, tokenize=True,
add_generation_prompt=True, return_dict=True, return_tensors="pt",
enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
).to(self.model.device)
def generate(self, *, enable_thinking=True, preserve_thinking=False,
max_new_tokens=2048, preset="thinking_general",
stopping_criteria=None, append_to_history=True):
inp = self._inputs(enable_thinking, preserve_thinking)
cfg = SAMPLING[preset]
gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
repetition_penalty=1.0,
pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
with torch.inference_mode(): out = self.model.generate(**gk)
raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
think, ans = split_thinking(raw)
if append_to_history: self.assistant(ans, reasoning=think)
return think, ans
def stream(self, *, enable_thinking=True, preserve_thinking=False,
max_new_tokens=2048, preset="thinking_general",
on_thinking=None, on_answer=None):
inp = self._inputs(enable_thinking, preserve_thinking)
cfg = SAMPLING[preset]
streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
buf, in_think = "", enable_thinking
think_text, answer_text = "", ""
for piece in streamer:
buf += piece
if in_think:
if THINK_CLOSE in buf:
close_at = buf.index(THINK_CLOSE)
resid = buf[:close_at]
if on_thinking: on_thinking(resid[len(think_text):])
think_text = resid
buf = buf[close_at + len(THINK_CLOSE):]
in_think = False
if buf and on_answer: on_answer(buf)
answer_text = buf; buf = ""
else:
if on_thinking: on_thinking(piece)
think_text += piece
else:
if on_answer: on_answer(piece)
answer_text += piece
t.join()
self.assistant(answer_text.strip(), reasoning=think_text.strip())
return think_text.strip(), answer_text.strip()
def save(self, path):
with open(path, "w") as f:
json.dump({"history": self.history, "tools": self.tools}, f, indent=2)
@classmethod
def load(cls, model, processor, path):
with open(path) as f: data = json.load(f)
c = cls(model, processor, tools=data.get("tools"))
c.history = data["history"]; return c
class ThinkingBudget(StoppingCriteria):
def __init__(self, tokenizer, budget: int):
self.budget = budget
self.open_ids = tokenizer.encode(THINK_OPEN, add_special_tokens=False)
self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
self.start = None
def _find(self, seq, needle):
n = len(needle)
for i in range(len(seq)-n+1):
if seq[i:i+n] == needle: return i
return None
def __call__(self, input_ids, scores, **kwargs):
seq = input_ids[0].tolist()
if self.start is None:
idx = self._find(seq, self.open_ids)
if idx is not None: self.start = idx + len(self.open_ids)
return False
if self._find(seq[self.start:], self.close_ids) is not None: return False
return (len(seq) - self.start) >= self.budget
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)
def run_calculate(expr: str) -> str:
if any(c not in "0123456789+-*/().% " for c in expr):
return json.dumps({"error":"illegal chars"})
try: return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
except Exception as e: return json.dumps({"error": str(e)})
_DOCS = {
"qwen3.6": "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
"deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
"moe": "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}
def run_search_docs(q):
hits = [v for k,v in _DOCS.items() if k in q.lower()]
return json.dumps({"results": hits or ["no hits"]})
def run_get_time():
import datetime as dt
return json.dumps({"iso": dt.datetime.utcnow().isoformat()+"Z"})
TOOL_FNS = {
"calculate": lambda a: run_calculate(a["expression"]),
"search_docs": lambda a: run_search_docs(a["query"]),
"get_time": lambda a: run_get_time(),
}
TOOLS_SCHEMA = [
{"type":"function","function":{"name":"calculate","description":"Evaluate arithmetic.",
"parameters":{"type":"object","properties":{"expression":{"type":"string"}},"required":["expression"]}}},
{"type":"function","function":{"name":"search_docs","description":"Search internal docs.",
"parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}},
{"type":"function","function":{"name":"get_time","description":"Get current UTC time.",
"parameters":{"type":"object","properties":{}}}},
]
We build the main QwenChat conversation manager, which handles message history, tool messages, chat template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion to control how much reasoning the model is allowed to produce before continuing or stopping generation. In addition, we create the tool-calling support layer, including arithmetic, lightweight document search, time lookup, and the tool schema that allows the model to interact with external functions in an agent-style loop.
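Before wiring the full agent loop, we can sanity-check the tool-call plumbing offline. The sketch below re-creates the tool-call regex and the calculate tool from the code above (mirrored locally; no model involved) and parses a hand-written `<tool_call>` payload of the shape the model is expected to emit:

```python
import json, re

# Local mirrors of TOOL_CALL_RE and run_calculate from the tutorial code.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

def run_calculate(expr: str) -> str:
    # Whitelist arithmetic characters before eval'ing in an empty namespace.
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error": "illegal chars"})
    try:
        return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e:
        return json.dumps({"error": str(e)})

# A hand-written assistant turn, formatted the way the model would emit it.
raw = ('Let me compute that. <tool_call>{"name": "calculate", '
       '"arguments": {"expression": "842 * 15 / 100"}}</tool_call>')
for payload in TOOL_CALL_RE.findall(raw):
    call = json.loads(payload)
    print(call["name"], "->", run_calculate(call["arguments"]["expression"]))
# calculate -> {"result": 126.3}
```

The character whitelist plus the empty `__builtins__` namespace keeps the eval confined to arithmetic; anything containing letters is rejected before it reaches eval.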
def run_agent(user_msg, *, max_steps=5, verbose=True):
chat = QwenChat(model, processor,
system="You are a helpful assistant. Call tools when helpful, then answer.",
tools=TOOLS_SCHEMA)
chat.user(user_msg)
for step in range(max_steps):
think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
preset="thinking_general", max_new_tokens=1024,
append_to_history=False)
calls = TOOL_CALL_RE.findall(raw)
if verbose:
            print(f"\n=== step {step+1} ===")
print("reasoning:", textwrap.shorten(think, 200))
print("raw :", textwrap.shorten(raw, 300))
if not calls:
chat.assistant(raw, reasoning=think); return chat, raw
chat.assistant(raw, reasoning=think)
for payload in calls:
try: parsed = json.loads(payload)
except json.JSONDecodeError:
chat.tool_result("error", {"error":"bad json"}); continue
fn = TOOL_FNS.get(parsed.get("name"))
res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error":"unknown"})
if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments',{})}) = {res}")
chat.tool_result(parsed.get("name"), res)
return chat, "(max_steps reached)"
import jsonschema
MOVIE_SCHEMA = {
"type":"object",
"required":["title","year","rating","genres","runtime_minutes"],
"additionalProperties": False,
"properties":{
"title":{"type":"string"},
"year":{"type":"integer","minimum":1900,"maximum":2030},
"rating":{"type":"number","minimum":0,"maximum":10},
"genres":{"type":"array","items":{"type":"string"},"minItems":1},
"runtime_minutes":{"type":"integer","minimum":1,"maximum":500},
},
}
def extract_json(text):
text = re.sub(r"^```(?:json)?", "", text.strip())
text = re.sub(r"```$", "", text.strip())
s = text.find("{")
if s < 0: raise ValueError("no object")
d, e = 0, -1
for i in range(s, len(text)):
if text[i] == "{": d += 1
elif text[i] == "}":
d -= 1
if d == 0: e = i; break
if e < 0: raise ValueError("unbalanced braces")
return json.loads(text[s:e+1])
def json_with_retry(prompt, schema, *, max_tries=3):
sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
"No markdown fences. No commentary. No <think> blocks.")
chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
last = None
for i in range(max_tries):
_, raw = chat.generate(enable_thinking=False, preset="instruct_general",
max_new_tokens=512, append_to_history=False)
try:
obj = extract_json(raw); jsonschema.validate(obj, schema)
return obj, i+1
except Exception as e:
last = str(e); chat.assistant(raw)
chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
raise RuntimeError(f"gave up after {max_tries}: {last}")
def benchmark(prompt, *, batch_sizes=(1,2,4), max_new_tokens=64):
print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
print("-"*40)
for bs in batch_sizes:
gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
msgs = [[{"role":"user","content":prompt}] for _ in range(bs)]
texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
enable_thinking=False) for m in msgs]
processor.tokenizer.padding_side = "left"
inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
torch.cuda.synchronize(); t0 = time.time()
with torch.inference_mode():
out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
torch.cuda.synchronize(); dt = time.time()-t0
new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
vram = torch.cuda.max_memory_allocated()/1e9
print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")
def build_rag():
from sentence_transformers import SentenceTransformer
import numpy as np
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
KB = [
"Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
"Context length is 262,144 tokens natively, up to ~1M with YaRN.",
"The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
"Thinking mode wraps internal reasoning in <think>...</think> blocks.",
"preserve_thinking=True keeps prior reasoning across turns for agents.",
"Gated DeltaNet is a linear-attention variant in the hybrid layers.",
"The model accepts image, video, and text input natively.",
"Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
]
KB_EMB = embedder.encode(KB, normalize_embeddings=True)
def retrieve(q, k=3):
qv = embedder.encode([q], normalize_embeddings=True)[0]
import numpy as _np
return [KB[i] for i in _np.argsort(-(KB_EMB @ qv))[:k]]
return retrieve
def rag_answer(query, retrieve, k=3):
ctx = retrieve(query, k)
sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
chat = QwenChat(model, processor, system=sys_m); chat.user(user)
_, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
return ans, ctx
We define higher-level utility functions that turn the model into a more complete application framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.
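As a quick offline check of the extraction step, the snippet below runs the same brace-balancing logic as extract_json above (redefined locally so it is self-contained) on a typically messy model reply with markdown fences:

```python
import json, re

# Local mirror of the extract_json helper from the tutorial code.
def extract_json(text):
    text = re.sub(r"^```(?:json)?", "", text.strip())   # drop opening fence
    text = re.sub(r"```$", "", text.strip())            # drop closing fence
    s = text.find("{")
    if s < 0:
        raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):                       # balance braces
        if text[i] == "{":
            d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0:
                e = i
                break
    if e < 0:
        raise ValueError("unbalanced braces")
    return json.loads(text[s:e + 1])

messy = ('```json\n{"title": "Inception", "year": 2010, "rating": 8.8,\n'
         ' "genres": ["sci-fi"], "runtime_minutes": 148}\n```')
obj = extract_json(messy)
print(obj["title"], obj["year"])
# Inception 2010
```

Brace balancing (rather than a greedy regex) is what lets this survive trailing commentary after the object, which is a common failure mode of fence-stripping alone.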
print("\n" + "="*20, "§4 thinking-budget", "="*20)
c = QwenChat(model, processor)
c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "
"How many days until it escapes? Explain.")
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:n{ans or '(truncated)'}")
print("\n" + "="*20, "§5 streaming split", "="*20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x): print(x, end="", flush=True)
def _oa(x):
    if first[0]: print("\n\n[ANSWER >>] ", end="", flush=True); first[0] = False
print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
on_thinking=_ot, on_answer=_oa); print()
print("\n" + "="*20, "§6 vision", "="*20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
{"type":"image","image":IMG},
{"type":"text","text":"Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)
GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
{"type":"image","image":GRD},
    {"type":"text","text": "Locate every distinct object. Reply ONLY with JSON "
     '[{"label":...,"bbox_2d":[x1,y1,x2,y2]}, ...] in pixel coords.'}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800)
print("Grounding:", ans[:600])
print("\n" + "="*20, "§7 YaRN override", "="*20)
YARN = {"text_config": {"rope_parameters": {
"mrope_interleaved": True, "mrope_section": [11,11,10],
"rope_type": "yarn", "rope_theta": 10_000_000,
"partial_rotary_factor": 0.25, "factor": 4.0,
"original_max_position_embeddings": 262_144}}}
print(json.dumps(YARN, indent=2))
We begin running the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for model reloading.
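To show how an override like YARN could actually be applied, here is a minimal sketch that deep-merges it into a nested config dictionary before a reload. The base config below is a made-up stand-in, not the model's real config, and the exact reload path (e.g. which `from_pretrained` argument accepts the merged dict) depends on the transformers version in use:

```python
# Hedged sketch: deep-merge a YaRN rope override into a nested config dict.
def deep_merge(base: dict, override: dict) -> dict:
    # Recursively merge override into a copy of base; leaves base untouched.
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

# Illustrative base config (NOT the model's actual values).
base_cfg = {"text_config": {"hidden_size": 4096,
                            "rope_parameters": {"rope_type": "default"}}}
YARN = {"text_config": {"rope_parameters": {
    "rope_type": "yarn", "factor": 4.0,
    "original_max_position_embeddings": 262_144}}}

merged = deep_merge(base_cfg, YARN)
print(merged["text_config"]["rope_parameters"]["rope_type"])  # yarn
print(merged["text_config"]["hidden_size"])                   # 4096, untouched
```

The recursive merge matters: a naive `dict.update` at the top level would replace the whole `text_config` and silently drop every key the override does not mention.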
print("\n" + "="*20, "§8 agent loop", "="*20)
chat, final = run_agent(
"What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",
max_steps=4)
print("\nFINAL:", final)
print("\n" + "="*20, "§9 structured JSON", "="*20)
obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",
MOVIE_SCHEMA)
print(f"({tries} tries)", json.dumps(obj, indent=2))
print("\n" + "="*20, "§10 MoE routing", "="*20)
routers = []
for name, m in model.named_modules():
low = name.lower()
if (("gate" in low and ("moe" in low or "expert" in low)) or
low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):
routers.append((name, m))
print(f"found {len(routers)} router-like modules")
TOP_K = 8
counts = [Counter() for _ in routers]
handles = []
def _mkhook(i):
def h(_m, _i, out):
lg = out[0] if isinstance(out, tuple) else out
if lg.dim() != 2: return
try:
for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():
counts[i][eid] += 1
except Exception: pass
return h
for i,(_,m) in enumerate(routers): handles.append(m.register_forward_hook(_mkhook(i)))
try:
c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")
c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40)
finally:
for h in handles: h.remove()
total = Counter()
for c_ in counts: total.update(c_)
print(f"distinct experts activated: {len(total)}")
for eid, n in total.most_common(10): print(f" expert #{eid:>3} {n} fires")
print("\n" + "="*20, "§11 benchmark", "="*20)
benchmark("In one sentence, what is entropy?", batch_sizes=(1,2,4), max_new_tokens=48)
print("\n" + "="*20, "§12 mini-RAG", "="*20)
retrieve = build_rag()
ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve)
print("retrieved:"); [print(" -", c) for c in ctx]
print("answer:", ans)
print("\n" + "="*20, "§13 save/resume", "="*20)
c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.")
_, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1)
c.save("/content/session.json")
del c; gc.collect()
r = QwenChat.load(model, processor, "/content/session.json")
r.user("Reverse the letters of that codeword.")
_, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256)
print("T2:", a2)
print("\n✓ tutorial complete")
We continue with the remaining demonstrations that showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which expert-like router modules activate during inference, and measure tokens-per-second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the stored history.
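The save/resume step stores nothing exotic: just the message history (and tool schema) as JSON, in the same shape QwenChat.save() writes. A standalone sketch of the round trip, using a throwaway history rather than a live chat:

```python
import json, os, tempfile

# Minimal stand-in history in the shape QwenChat.save() writes.
# The codeword and reasoning text here are illustrative, not model output.
session = {"history": [
    {"role": "user", "content": "Give me a unique 5-letter codeword. Just the word."},
    {"role": "assistant", "content": "ZEPHY",
     "reasoning_content": "pick something rare and pronounceable"},
], "tools": None}

# Write the session to disk, reload it, and confirm the data survives intact.
path = os.path.join(tempfile.mkdtemp(), "session.json")
with open(path, "w") as f:
    json.dump(session, f, indent=2)
with open(path) as f:
    restored = json.load(f)

print(restored["history"][-1]["content"])  # ZEPHY
```

Because the reasoning travels in `reasoning_content`, a resumed chat can re-run the template with `preserve_thinking=True` and give the model its own earlier reasoning as context, which is exactly what the §13 demo relies on.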
In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert-routing behavior and measured throughput to understand the model’s usability and performance. In doing so, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.
The post A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence appeared first on MarkTechPost.