{"id":951,"date":"2026-05-22T15:39:30","date_gmt":"2026-05-22T07:39:30","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=951"},"modified":"2026-05-22T15:39:30","modified_gmt":"2026-05-22T07:39:30","slug":"build-recurrent-depth-transformers-with-openmythos-for-mla-gqa-sparse-moe-and-loop-scaled-reasoning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=951","title":{"rendered":"Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning"},"content":{"rendered":"<p class=\"wp-block-paragraph\">In this tutorial, we explore<a href=\"https:\/\/github.com\/kyegomez\/OpenMythos\"> <strong>OpenMythos<\/strong><\/a> by building an advanced recurrent-depth transformer workflow that runs end-to-end in Google Colab. We create both MLA and GQA model variants, compare their parameter counts, and check the stability of the recurrent injection matrix through its spectral radius. We then move from simple forward and generation tests into a synthetic compositional reasoning task, where the model learns to predict the sum of digit chains modulo a fixed value. 
Through this setup, we study how recurrent loops enable a single model to reuse its parameters for deeper computation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\ndef pip(*args):\n   subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *args], check=False)\ntry:\n   import open_mythos  # noqa: F401\nexcept Exception:\n   pip(\"open-mythos\")\n   try:\n       import open_mythos  # noqa: F401\n   except Exception:\n       pip(\"git+https:\/\/github.com\/kyegomez\/OpenMythos.git\")\nimport math, random, time\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.utils.data import Dataset, DataLoader\nimport matplotlib.pyplot as plt\nfrom open_mythos.main import OpenMythos, MythosConfig\nSEED = 42\nrandom.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)\n# Use the GPU when one is available; otherwise fall back to the CPU.\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nprint(f\"Device: {device} | Torch: {torch.__version__}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We install OpenMythos and fall back to the GitHub source if installing from PyPI fails. We import the required Python, PyTorch, NumPy, and plotting libraries for model building, training, and visualization. 
We also set a fixed random seed and use CUDA when available, so the tutorial runs efficiently in Colab.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def build_model(attn_type: str = \"mla\", max_loop_iters: int = 8) -&gt; tuple:\n   \"\"\"Build a small OpenMythos model. Two attention variants supported.\n   MLA  \u2014 Multi-head Latent Attention (compressed KV cache, DeepSeek-V2 style)\n   GQA  \u2014 Grouped-Query Attention (fewer KV heads than Q heads)\n   \"\"\"\n   base = dict(\n       vocab_size       = 64,\n       dim              = 128,\n       n_heads          = 4,\n       max_seq_len      = 32,\n       max_loop_iters   = max_loop_iters,\n       prelude_layers   = 1,\n       coda_layers      = 1,\n       n_experts        = 4,\n       n_shared_experts = 1,\n       n_experts_per_tok= 2,\n       expert_dim       = 64,\n       lora_rank        = 8,\n       attn_type        = attn_type,\n   )\n   if attn_type == \"gqa\":\n       cfg = MythosConfig(**base, n_kv_heads=2)\n   else:\n       cfg = MythosConfig(\n           **base, n_kv_heads=4,\n           kv_lora_rank=32, q_lora_rank=32,\n           qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,\n       )\n   model = OpenMythos(cfg).to(device)\n   return model, cfg\nmodel_mla, cfg_mla = build_model(\"mla\")\nmodel_gqa, cfg_gqa = build_model(\"gqa\")\ndef n_params(m): return sum(p.numel() for p in 
m.parameters())\nprint(f\"\\n[MLA] params: {n_params(model_mla):&gt;10,}\")\nprint(f\"[GQA] params: {n_params(model_gqa):&gt;10,}\")\ndef spectral_radius(model):\n   A = model.recurrent.injection.get_A().detach().cpu()\n   if A.dim() == 1:\n       rho = A.abs().max().item()\n   else:\n       rho = torch.linalg.eigvals(A.float()).abs().max().item()\n   return rho\nprint(f\"\\n\u03c1(A) MLA: {spectral_radius(model_mla):.4f}   (must be &lt; 1)\")\nprint(f\"\u03c1(A) GQA: {spectral_radius(model_gqa):.4f}   (must be &lt; 1)\")\nids = torch.randint(0, cfg_mla.vocab_size, (2, 16), device=device)\nwith torch.no_grad():\n   logits = model_mla(ids, n_loops=4)\n   gen    = model_mla.generate(ids, max_new_tokens=4, n_loops=8)\nprint(f\"\\nForward logits shape:  {tuple(logits.shape)}\")\nprint(f\"Generation shape:      {tuple(gen.shape)}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We define a reusable model factory that builds small OpenMythos models with either MLA or GQA attention. We compare both variants by checking their parameter counts and the spectral radius of the recurrent injection matrix, which must stay below 1 for the looped update to remain stable. 
We then run a quick forward pass and generation test to confirm that the MLA model produces logits and generated tokens correctly.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">PAD, START, EQ = 0, 1, 2\nDIGIT_BASE     = 10\nM              = 7\nSEQ_LEN        = cfg_mla.max_seq_len\nMIN_LEN, MAX_LEN = 2, 5\ndef make_example(chain_len: int):\n   digits = [random.randint(0, M-1) for _ in range(chain_len)]\n   target = sum(digits) % M\n   toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]\n   toks = toks + [PAD] * (SEQ_LEN - len(toks))\n   return toks[:SEQ_LEN], DIGIT_BASE + target\nclass ChainDataset(Dataset):\n   def __init__(self, n, lo, hi):\n       self.items = [make_example(random.randint(lo, hi)) for _ in range(n)]\n   def __len__(self): return len(self.items)\n   def __getitem__(self, i):\n       x, y = self.items[i]\n       return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)\ntrain_loader = DataLoader(ChainDataset(3000, MIN_LEN, MAX_LEN), batch_size=64, shuffle=True)\ntest_loader  = DataLoader(ChainDataset(400,  MIN_LEN, MAX_LEN), batch_size=64)\nood_loader   = DataLoader(ChainDataset(400,  MAX_LEN+1, MAX_LEN+3), batch_size=64)<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We create a synthetic compositional task in which the model predicts the sum of digit tokens modulo 7. 
We define the token scheme, sequence structure, and dataset class that generates random digit-chain examples. We then build training, test, and out-of-distribution loaders to evaluate both normal performance and depth extrapolation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">model   = model_mla\nTRAIN_LOOPS = 4\nEPOCHS  = 6\nopt   = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)\nsched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)\ndef loss_at_eq(logits, x, y):\n   \"\"\"Score the answer with the logits at the EQ position, which predict the token that follows EQ.\"\"\"\n   eq_pos = (x == EQ).int().argmax(dim=1)\n   pred   = logits[torch.arange(x.size(0), device=x.device), eq_pos]\n   return F.cross_entropy(pred, y), pred\ntrain_losses = []\nprint(\"\\n--- Training ---\")\nt0 = time.time()\nfor ep in range(EPOCHS):\n   model.train(); running = 0.0\n   for x, y in train_loader:\n       x, y = x.to(device), y.to(device)\n       logits = model(x, n_loops=TRAIN_LOOPS)\n       loss, _ = loss_at_eq(logits, x, y)\n       opt.zero_grad(); loss.backward()\n       # Clip gradients before the update; max_norm=1.0 is an assumed value.\n       torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n       opt.step()\n       running += loss.item()\n   sched.step()\n   train_losses.append(running \/ len(train_loader))\n   print(f\"  epoch {ep+1}\/{EPOCHS}  loss={train_losses[-1]:.4f}  \u03c1(A)={spectral_radius(model):.3f}\")\nprint(f\"Train time: {time.time()-t0:.1f}s\")\n@torch.no_grad()\ndef accuracy(loader, n_loops):\n   
model.eval(); correct = total = 0\n   for x, y in loader:\n       x, y = x.to(device), y.to(device)\n       logits = model(x, n_loops=n_loops)\n       _, pred = loss_at_eq(logits, x, y)\n       correct += (pred.argmax(-1) == y).sum().item()\n       total   += y.size(0)\n   return correct \/ total\nLOOP_GRID = [1, 2, 4, 6, 8]\nprint(\"\\n--- Loop-count scaling (same weights, varying compute) ---\")\nin_dist_acc  = [accuracy(test_loader, L) for L in LOOP_GRID]\nood_acc      = [accuracy(ood_loader,  L) for L in LOOP_GRID]\nfor L, a, o in zip(LOOP_GRID, in_dist_acc, ood_acc):\n   print(f\"  n_loops={L}: in-dist acc={a:.3f}   OOD (longer chains) acc={o:.3f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We train the MLA model with a fixed number of recurrent loops and optimize it with AdamW and a cosine learning rate schedule. We compute the loss at the EQ token position, clip gradients, track training loss, and monitor recurrent stability after each epoch. We then evaluate inference-time loop scaling by testing the same trained model with different loop counts.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">fig, axes = plt.subplots(1, 2, figsize=(12, 4))\naxes[0].plot(range(1, EPOCHS+1), train_losses, marker=\"o\")\naxes[0].set_title(\"Training loss\"); axes[0].set_xlabel(\"epoch\"); axes[0].set_ylabel(\"CE 
loss\")\naxes[0].grid(alpha=0.3)\naxes[1].plot(LOOP_GRID, in_dist_acc, marker=\"s\", label=\"in-distribution\")\naxes[1].plot(LOOP_GRID, ood_acc,     marker=\"^\", label=\"longer chains (OOD depth)\")\naxes[1].set_title(\"Inference-time loop scaling\")\naxes[1].set_xlabel(\"# recurrent loops at inference\"); axes[1].set_ylabel(\"test accuracy\")\naxes[1].legend(); axes[1].grid(alpha=0.3)\nplt.tight_layout(); plt.show()\nchain_len = 4\ntoks, true_tok = make_example(chain_len)\ndigits = [t - DIGIT_BASE for t in toks if t &gt;= DIGIT_BASE]\n# End the prompt at EQ (dropping the PAD tail) so the next generated token is the answer.\nprompt = torch.tensor([toks[:toks.index(EQ) + 1]], device=device)\nwith torch.no_grad():\n   gen = model.generate(prompt, max_new_tokens=1, n_loops=8)\npredicted = gen[0, -1].item()\nprint(f\"\\nDemo: digits={digits}, target=({'+'.join(map(str, digits))}) % {M} = {sum(digits)%M}\")\nprint(f\"      true token={true_tok} (digit {true_tok-DIGIT_BASE})  |  \"\n     f\"predicted token={predicted} (digit {predicted-DIGIT_BASE if predicted&gt;=DIGIT_BASE else '?'})\")\nprint(\"\\nDone. Key takeaway: at inference, increasing n_loops trades compute for\")\nprint(\"reasoning depth on the same fixed-parameter model \u2014 that's the RDT premise.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We visualize the training loss curve and compare in-distribution accuracy with longer-chain out-of-distribution accuracy across loop counts. We also run a small qualitative generation example to inspect whether the trained model predicts the correct modulo-sum digit. We conclude by showing that increasing the number of recurrent loops gives the same fixed-parameter model more reasoning depth at inference time.<\/p>\n<p class=\"wp-block-paragraph\">In conclusion, we saw how OpenMythos combines recurrent-depth transformer design, attention variants, sparse MoE components, and inference-time loop scaling into a compact experimental pipeline. 
We trained the model on a controlled reasoning task, evaluated it on both in-distribution and longer out-of-distribution chains, and visualized how accuracy changes as we increase the number of recurrent loops. It helped us see how recurrent depth can trade additional inference computation for stronger reasoning behavior without changing the model\u2019s learned parameters.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Deep%20Learning\/openmythos_recurrent_depth_transformer_loop_scaled_reasoning_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes with Notebook<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/22\/build-recurrent-depth-transformers-with-openmythos-for-mla-gqa-sparse-moe-and-loop-scaled-reasoning\/\">Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore O&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-951","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/951","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=951"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/951\/revisions"}],"wp:feature
dmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=951"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=951"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=951"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}