{"id":786,"date":"2026-04-24T05:25:34","date_gmt":"2026-04-23T21:25:34","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=786"},"modified":"2026-04-24T05:25:34","modified_gmt":"2026-04-23T21:25:34","slug":"a-coding-tutorial-on-openmythos-on-recurrent-depth-transformers-with-depth-extrapolation-adaptive-computation-and-mixture-of-experts-routing","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=786","title":{"rendered":"A Coding Tutorial on OpenMythos on Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing"},"content":{"rendered":"<p>In this tutorial, we explore the implementation of <a href=\"https:\/\/github.com\/kyegomez\/OpenMythos\/tree\/main\"><strong>OpenMythos<\/strong><\/a>, a theoretical reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter size. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and investigate how increasing loop depth at inference improves performance without retraining. 
Along the way, we also inspect adaptive computation via ACT halting and monitor expert utilization in the MoE layers, providing a comprehensive, hands-on understanding of this emerging architecture.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\ntry:\n   import open_mythos  # noqa: F401\nexcept ImportError:\n   subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n                          \"open-mythos\"])\n\n\nimport math, time, copy\nfrom collections import Counter, defaultdict\n\n\nimport numpy as np\nimport torch, torch.nn as nn, torch.nn.functional as F\nimport matplotlib.pyplot as plt\n\n\nfrom open_mythos.main import (\n   OpenMythos, MythosConfig,\n   ACTHalting, MoEFFN,\n)\n\n\ntorch.manual_seed(0); np.random.seed(0)\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nprint(f\"\u25b8 device = {device}   |   torch = {torch.__version__}\")\n\n\ndef make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,\n               max_loops=8, seq_len=128, vocab=256):\n   base = dict(\n       vocab_size=vocab, dim=dim, n_heads=n_heads,\n       max_seq_len=seq_len, max_loop_iters=max_loops,\n       prelude_layers=1, coda_layers=1,\n       n_experts=n_experts, n_shared_experts=1,\n       n_experts_per_tok=2, expert_dim=dim \/\/ 2,\n       lora_rank=8, attn_type=attn_type,\n   )\n   if attn_type == \"gqa\":\n       
return MythosConfig(**base, n_kv_heads=2)\n   return MythosConfig(\n       **base, n_kv_heads=n_heads,\n       kv_lora_rank=32, q_lora_rank=64,\n       qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,\n   )\n\n\ncfg_gqa = make_config(\"gqa\")\ncfg_mla = make_config(\"mla\")\nm_gqa = OpenMythos(cfg_gqa).to(device)\nm_mla = OpenMythos(cfg_mla).to(device)\n\n\nprint(\"\\n\u2500\u2500\u2500 Part 1 \u2500 model sizes \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nprint(f\"GQA  params : {sum(p.numel() for p in m_gqa.parameters()):&gt;10,}\")\nprint(f\"MLA  params : {sum(p.numel() for p in m_mla.parameters()):&gt;10,}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install and import all required dependencies and initialize our environment for running OpenMythos. We construct configurations for both GQA and MLA attention mechanisms and instantiate their respective models. 
We also compare their parameter sizes to understand how architectural differences impact model scale.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def cache_bytes(kv: dict) -&gt; int:\n   total = 0\n   for entry in kv.values():\n       for t in entry.values():\n           total += t.element_size() * t.numel()\n   return total\n\n\nx = torch.randint(0, 256, (1, 64), device=device)\nck_gqa, ck_mla = {}, {}\nwith torch.no_grad():\n   m_gqa(x, n_loops=4, kv_cache=ck_gqa)\n   m_mla(x, n_loops=4, kv_cache=ck_mla)\n\n\ngqa_kb = cache_bytes(ck_gqa) \/ 1024\nmla_kb = cache_bytes(ck_mla) \/ 1024\nprint(\"\\n\u2500\u2500\u2500 Part 2 \u2500 KV-cache footprint (1\u00d764 tokens, 4 loops) \u2500\")\nprint(f\"GQA cache : {gqa_kb:6.2f} KB   ({len(ck_gqa)} layer-keys)\")\nprint(f\"MLA cache : {mla_kb:6.2f} KB   ({len(ck_mla)} layer-keys)\")\nprint(f\"ratio      : MLA is \u2248{gqa_kb \/ max(mla_kb, 1e-9):.2f}\u00d7 smaller\")\n\n\ndef show_stability(model, tag):\n   A = model.recurrent.injection.get_A()\n   print(f\"{tag:3s}  \u03c1(A): min={A.min():.4f}  max={A.max():.4f}  \"\n         f\"mean={A.mean():.4f}  stable={bool((A &lt; 1).all() and (A &gt; 0).all())}\")\n\n\nprint(\"\\n\u2500\u2500\u2500 Part 3 \u2500 spectral radius at init \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nshow_stability(m_gqa, 
\"GQA\")\nshow_stability(m_mla, \"MLA\")\n\n\nopt = torch.optim.Adam(m_mla.parameters(), lr=1.0)\nfor _ in range(30):\n   loss = m_mla(torch.randint(0, 256, (2, 16), device=device),\n                n_loops=2).square().mean()\n   opt.zero_grad(); loss.backward(); opt.step()\nshow_stability(m_mla, \"MLA after abusive training (lr=1.0, 30 steps)\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We compute and compare the KV-cache memory footprint for both GQA and MLA attention types during forward passes. We then inspect the stability of the recurrent component by analyzing the spectral radius of matrix A. We further stress-test the model with extreme training conditions to confirm that stability is preserved.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">VOCAB = 64\nSEQ_LEN = 24\n\n\ndef make_batch(batch=64, seq_len=SEQ_LEN):\n   x = torch.randint(1, 3, (batch, seq_len), device=device)\n   bits = x - 1\n   parity = bits.cumsum(dim=1) % 2\n   y = parity + 1\n   return x, y\n\n\ncfg = MythosConfig(\n   vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,\n   max_seq_len=SEQ_LEN + 4, max_loop_iters=16,\n   prelude_layers=1, coda_layers=1,\n   n_experts=4, n_shared_experts=1, n_experts_per_tok=2,\n   expert_dim=32, lora_rank=4, attn_type=\"gqa\",\n   act_threshold=0.99,\n)\nmodel = OpenMythos(cfg).to(device)\nopt = torch.optim.AdamW(model.parameters(), lr=3e-4)\nT_TRAIN = 
3\n\n\nprint(\"\\n\u2500\u2500\u2500 Part 5 \u2500 training (T_train = 3) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nprint(f\"params: {sum(p.numel() for p in model.parameters()):,}\")\nlosses = []\nt0 = time.time()\nfor step in range(600):\n   x, y = make_batch(64)\n   logits = model(x, n_loops=T_TRAIN)\n   loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))\n   opt.zero_grad(); loss.backward()\n   opt.step()\n   losses.append(loss.item())\n   if step % 100 == 0 or step == 599:\n       with torch.no_grad():\n           acc = (logits.argmax(-1) == y).float().mean().item()\n       print(f\"step {step:3d}   loss={loss.item():.4f}   acc@T3={acc:.3f}\")\nprint(f\"training wallclock: {time.time() - t0:.1f}s\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it using cross-entropy loss. 
Throughout training, we monitor loss and accuracy to evaluate how well the model learns under constrained depth.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">model.eval()\nT_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]\naccs = []\nwith torch.no_grad():\n   x_eval, y_eval = make_batch(512)\n   for T in T_sweep:\n       logits = model(x_eval, n_loops=T)\n       accs.append((logits.argmax(-1) == y_eval).float().mean().item())\n\n\nprint(\"\\n\u2500\u2500\u2500 Part 6 \u2500 depth extrapolation (T_train=3) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nfor T, a in zip(T_sweep, accs):\n   bar = \"\u2588\" * int(a * 40)\n   marker = \"  \u2190 trained here\" if T == T_TRAIN else \"\"\n   print(f\"T={T:2d}  acc={a:.3f}  {bar}{marker}\")\n\n\nhalt_trace: list[torch.Tensor] = []\norig_halt = model.recurrent.act.forward\n\n\ndef halt_hook(self, h):\n   p = orig_halt(h)\n   halt_trace.append(p.detach().cpu())\n   return p\nmodel.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)\n\n\nwith torch.no_grad():\n   x_h, _ = make_batch(1)\n   _ = model(x_h, n_loops=16)\n\n\nmodel.recurrent.act.forward = orig_halt\n\n\nhalts = torch.stack(halt_trace, dim=0)[:, 0].numpy()\nprint(f\"\\n\u2500\u2500\u2500 Part 7 \u2500 ACT halting matrix (loops \u00d7 positions) \u2500\u2500\u2500\")\nprint(f\"shape: {halts.shape}  |  \"\n     f\"mean halt-prob per loop: \"\n     
f\"{', '.join(f'{v:.2f}' for v in halts.mean(1))}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We evaluate the trained model by varying the number of inference loops to study depth extrapolation. We observe how increasing loop depth improves accuracy without retraining the model. We also instrument the ACT mechanism to capture halting probabilities at each sequence position and iteration.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">expert_hits = Counter()\norig_moe = model.recurrent.block.ffn.forward\n\n\ndef moe_hook(self, x):\n   flat = x.view(-1, x.shape[-1])\n   logits = self.router(flat) + self.router_bias\n   scores = F.softmax(logits, dim=-1)\n   _, idx = scores.topk(self.topk, dim=-1)\n   for e in idx.flatten().tolist():\n       expert_hits[e] += 1\n   return orig_moe(x)\n\n\nmodel.recurrent.block.ffn.forward = moe_hook.__get__(\n   model.recurrent.block.ffn, MoEFFN)\n\n\nwith torch.no_grad():\n   x_m, _ = make_batch(32)\n   _ = model(x_m, n_loops=T_TRAIN)\n\n\nmodel.recurrent.block.ffn.forward = orig_moe\n\n\nprint(\"\\n\u2500\u2500\u2500 Part 8 \u2500 MoE expert utilization \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\ntotal = sum(expert_hits.values())\nfor eid in range(cfg.n_experts):\n   share = expert_hits.get(eid, 0) \/ max(total, 1)\n   print(f\"expert {eid}: {share*100:5.2f}% of topk 
slots\")\n\n\nprompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)\nprint(\"\\n\u2500\u2500\u2500 Part 9 \u2500 generation \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nprint(f\"prompt (parity pattern): {prompt.tolist()[0]}\")\nfor T_gen in [1, 4, 12]:\n   with torch.no_grad():\n       out = model.generate(prompt, max_new_tokens=8,\n                            n_loops=T_gen, temperature=0.1, top_k=2)\n   print(f\"T_gen={T_gen:2d}  \u2192 {out.tolist()[0]}\")\n\n\nfig, axes = plt.subplots(1, 3, figsize=(15, 4))\n\n\naxes[0].plot(losses)\naxes[0].set_title(\"Training loss (parity task)\")\naxes[0].set_xlabel(\"step\"); axes[0].set_ylabel(\"cross-entropy\")\naxes[0].grid(alpha=0.3)\n\n\naxes[1].plot(T_sweep, accs, \"o-\", linewidth=2, markersize=8)\naxes[1].axvline(T_TRAIN, color=\"red\", linestyle=\"--\",\n               label=f\"T_train = {T_TRAIN}\")\naxes[1].set_title(\"Depth extrapolation: accuracy vs inference loops\")\naxes[1].set_xlabel(\"n_loops at inference\"); axes[1].set_ylabel(\"accuracy\")\naxes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)\n\n\nim = axes[2].imshow(halts, aspect=\"auto\", cmap=\"viridis\",\n                   vmin=0, vmax=halts.max())\naxes[2].set_title(\"ACT halting probability\\n(loop t \u00d7 position)\")\naxes[2].set_xlabel(\"position\"); axes[2].set_ylabel(\"loop iteration t\")\nplt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)\n\n\nplt.tight_layout()\nplt.savefig(\"openmythos_tutorial.png\", dpi=120, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We analyze expert utilization in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effects on outputs. 
Finally, we visualize training loss, depth extrapolation performance, and ACT halting behavior through plots.<\/p>\n<p>In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, enabling the model to improve accuracy simply by increasing the number of inference-time loops. We observed that the recurrent mechanism remains stable even under extreme training conditions, and that MLA attention significantly reduces KV-cache memory usage compared to GQA. We also saw how ACT enables dynamic computation across sequence positions and how MoE routing distributes workload across experts. Overall, we established that this architecture offers a compelling direction for compute-adaptive reasoning, where we trade additional inference compute for better performance without modifying the model\u2019s parameters.<\/p>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Deep%20Learning\/openmythos_recurrent_depth_transformer_depth_extrapolation_tutorial_Markechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Full Codes with Notebook here<\/strong><\/a>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/23\/a-coding-tutorial-on-openmythos-on-recurrent-depth-transformers-with-depth-extrapolation-adaptive-computation-and-mixture-of-experts-routing\/\">A Coding Tutorial on OpenMythos on Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore t&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-786","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/786","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=786"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_
route=\/wp\/v2\/posts\/786\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=786"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}