{"id":729,"date":"2026-04-16T16:30:30","date_gmt":"2026-04-16T08:30:30","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=729"},"modified":"2026-04-16T16:30:30","modified_gmt":"2026-04-16T08:30:30","slug":"ucsd-and-together-ai-research-introduces-parcae-a-stable-architecture-for-looped-language-models-that-achieves-the-quality-of-a-transformer-twice-the-size","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=729","title":{"rendered":"UCSD and Together AI Research Introduces Parcae: A Stable Architecture for\u00a0Looped Language Models That Achieves the Quality of a Transformer\u00a0Twice the Size"},"content":{"rendered":"<p>The dominant recipe for building better language models has not changed much since the Chinchilla era: spend more FLOPs, add more parameters, train on more tokens. But as inference deployments consume an ever-growing share of compute and models push toward the edge, researchers are increasingly asking a harder question \u2014 can you scale <em>quality<\/em> without scaling <em>memory footprint<\/em>?<\/p>\n<p>A team of researchers from UC San Diego and Together AI has introduced <strong>Parcae<\/strong>, a stable looped transformer architecture that outperforms prior looped models and beats fixed-depth Transformer baselines at every scale tested \u2014 all while using the same parameter count and the same training data budget.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1500\" height=\"548\" data-attachment-id=\"79061\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/16\/ucsd-and-together-ai-research-introduces-parcae-a-stable-architecture-for-looped-language-models-that-achieves-the-quality-of-a-transformer-twice-the-size\/screenshot-2026-04-16-at-1-29-40-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-16-at-1.29.40-AM-1.png\" 
data-orig-size=\"1500,548\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-16 at 1.29.40\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-16-at-1.29.40-AM-1-1024x374.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-16-at-1.29.40-AM-1.png\" alt=\"\" class=\"wp-image-79061\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.12946<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What is a Looped Language Model?<\/strong><\/h3>\n<p>In a standard Transformer, activations flow through a fixed stack of layers exactly once. A <strong>looped architecture<\/strong> instead routes activations through a block of layers <em>T<\/em> times in a loop, multiplying effective compute without adding parameters. Think of it as running the same group of transformer blocks repeatedly rather than building a taller model.<\/p>\n<p>Parcae specifically uses a <strong>middle-looped<\/strong> design, partitioning the architecture into three functional blocks: a <strong>prelude (P)<\/strong> that embeds the input sequence into a latent state <em>e<\/em>; a <strong>recurrent block (R)<\/strong> that iteratively updates a hidden state <em>h<\/em><sub><em>t<\/em> <\/sub>for <em>T<\/em> loops, with <em>e<\/em> injected at each iteration to maintain the input\u2019s influence; and a <strong>coda (C)<\/strong> that processes the final <em>h<\/em><sub><em>T<\/em> <\/sub>to produce the output. 
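As a rough sketch of this prelude\u2013recurrent\u2013coda control flow (toy NumPy stand-ins only; in the real architecture P, R, and C are stacks of transformer layers, and all shapes and weights here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy latent width; purely illustrative

# Hypothetical stand-ins for the three functional blocks.
P = rng.normal(size=(d, d)) * 0.1  # prelude: embeds the input into latent e
R = rng.normal(size=(d, d)) * 0.1  # recurrent block weights, shared across loops
C = rng.normal(size=(d, d)) * 0.1  # coda: maps the final state to the output

def looped_forward(x, T):
    e = P @ x                    # prelude runs once
    h = np.zeros(d)
    for _ in range(T):           # the same R block is reused T times,
        h = np.tanh(R @ h + e)   # with e re-injected at every iteration
    return C @ h                 # coda runs once on the final state h_T

x = rng.normal(size=d)
y4 = looped_forward(x, T=4)  # more loops = more compute,
y8 = looped_forward(x, T=8)  # same parameter count either way
```

The loop count T multiplies compute without touching the parameter count, which is the whole point of the middle-looped design.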
This structure keeps the model compact in memory, a valuable property for on-device deployment, while enabling significantly more compute per forward pass.<\/p>\n<p>Past work on looped transformers, including Recurrent Depth Models (RDMs), showed early promise but was quite difficult to train. Such models suffered from <strong>residual state explosion<\/strong> \u2014 where the hidden state vector grows uncontrollably across loop iterations \u2014 and frequent <strong>loss spikes<\/strong>. Sensitive hyperparameter tuning was required just to achieve convergence.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Root Cause: An Unconstrained Residual System<\/strong><\/h3>\n<p>The research team\u2019s key insight is to recast the looped model\u2019s forward pass as a <strong>nonlinear time-variant dynamical system<\/strong> over the residual stream:<\/p>\n<pre class=\"wp-block-code\"><code>h<sub>t+1<\/sub> = \u0100 h<sub>t<\/sub> + B\u0304 e + R\u0304(h<sub>t<\/sub>, e)<\/code><\/pre>\n<p>Here, <em>\u0100<\/em> controls the balance between prior and current residual states, <em>B\u0304<\/em> injects the input signal, and <em>R\u0304<\/em> is the nonlinear contribution of the transformer blocks (attention and MLPs). Dropping <em>R\u0304<\/em> yields a <strong>discrete linear time-invariant (LTI) system<\/strong>, and classical control theory immediately gives the stability condition: the system is stable when the <strong>spectral radius \u03c1(\u0100) &lt; 1<\/strong>, marginally stable when \u03c1(\u0100) = 1, and unstable when \u03c1(\u0100) &gt; 1.<\/p>\n<p>Examining prior methods under this framework reveals the problem precisely. Addition-based input injection sets <em>\u0100 = I<\/em> (the identity matrix), meaning \u03c1(\u0100) = 1 \u2014 <em>marginally stable<\/em>. The concatenation-with-projection approach used by RDMs leaves <em>\u0100<\/em> entirely unconstrained, making \u03c1(\u0100) potentially far greater than 1 \u2014 <em>unstable<\/em>. 
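These regimes, and the constrained parameterization Parcae adopts (described in the next section), can be checked numerically. A minimal NumPy sketch, where the step size and log-parameter values are made-up illustrations rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

def spectral_radius(M):
    # Largest eigenvalue magnitude of the linear state-transition matrix.
    return np.max(np.abs(np.linalg.eigvals(M)))

# Addition-based injection: A_bar = I, so rho = 1 -> marginally stable.
A_add = np.eye(d)

# Unconstrained projection (RDM-style concatenation): rho can exceed 1.
A_free = rng.normal(size=(d, d))
print(spectral_radius(A_free))  # typically well above 1 for a random matrix

# Parcae-style constraint: A = Diag(-exp(log_A)), A_bar = exp(Delta * A).
log_A = rng.normal(size=d)            # learnable vector in the real model
delta = rng.uniform(0.1, 1.0, size=d) # learned positive step size (illustrative)
A = -np.exp(log_A)                    # strictly negative diagonal
A_bar = np.diag(np.exp(delta * A))    # every diagonal entry lies in (0, 1)
assert spectral_radius(A_bar) < 1.0   # stable by construction
```

Because delta * A is strictly negative elementwise, exp(delta * A) is always in (0, 1), so no training update can push the linear part of the recurrence into the unstable regime.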
Empirical training curves confirm this directly: divergent training runs learn \u03c1(\u0100) \u2265 1, while the few convergent runs maintain \u03c1(\u0100) &lt; 1.<\/p>\n<h3 class=\"wp-block-heading\"><strong>How Parcae Enforces Stability by Design<\/strong><\/h3>\n<p>Rather than parameterizing <em>\u0100<\/em> directly, Parcae works in continuous form and <strong>discretizes using zero-order hold (ZOH) and Euler schemes<\/strong> \u2014 borrowing a standard technique from state space models like Mamba and S4 \u2014 with a learned step size \u0394 \u2208 \u211d<sup>d<sub>h<\/sub><\/sup>, giving \u0100 = exp(\u0394A) and B\u0304 = \u0394B. To guarantee \u03c1(\u0100) &lt; 1, the continuous matrix A is constrained as a <strong>negative diagonal matrix<\/strong>: <code>A := Diag(\u2212exp(log<sub>A<\/sub>))<\/code>, where log<sub>A<\/sub> \u2208 \u211d<sup>d<sub>h<\/sub><\/sup> is a learnable vector. Because the diagonal entries of A are strictly negative, every diagonal entry of \u0100 = exp(\u0394A) lies strictly between 0 and 1, so the stability condition \u03c1(\u0100) &lt; 1 holds at all times by construction.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Results: Outperforming Models Twice the Size<\/strong><\/h3>\n<p>Against parameter- and data-matched RDMs trained on the Huginn dataset, Parcae reduces validation perplexity by up to <strong>6.3%<\/strong> \u2014 a figure that peaks at 350M scale (improving from 10.76 to 10.09 PPL) versus a 4.5% gain at 100M scale (14.23 to 13.59 PPL). WikiText perplexity improves by up to <strong>9.1%<\/strong> at 350M scale. Average downstream zero-shot benchmark accuracy improves by up to 1.8 points.<\/p>\n<p>Against standard fixed-depth Transformer baselines trained with a nanochat-inspired setup on FineWeb-Edu, Parcae outperforms at every scale. At <strong>1.3B parameters trained on 104B tokens<\/strong>, Parcae beats the parameter-matched Transformer by <strong>2.99 points on Core<\/strong> and <strong>1.18 points on Core-Extended<\/strong>. 
The <strong>770M Parcae model<\/strong> (25.07 Core) reaches quality comparable to the 1.3B Transformer (25.45 Core) \u2014 roughly half the parameters for equivalent capability. The research team quantifies Parcae\u2019s parameter efficiency as achieving up to <strong>87.5% of the quality of a Transformer twice its size<\/strong>, measured against the quality gap to the next larger model.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The First Scaling Laws for Looping<\/strong><\/h3>\n<p>The second major contribution of this research is establishing the <strong>first predictable scaling laws for layer looping<\/strong>. Using isoFLOP experiments at 140M and 370M scales, the research team shows that compute-optimal training increases mean recurrence \u00b5<sub>rec<\/sub> and training tokens <em>D<\/em> in tandem, following power laws with consistent exponents across both scales: optimal \u00b5<sub>rec<\/sub> scales as C<sup>0.40<\/sup> and optimal tokens scale as C<sup>0.78<\/sup>, where C is the training FLOP budget.<\/p>\n<p>When looped Parcae models trained at their optimal \u00b5<sub>rec<\/sub> are compared against fixed-depth Parcae models (\u00b5<sub>rec<\/sub> = 1) under identical FLOP and parameter budgets, looping achieves a strictly lower validation loss \u2014 translating into <strong>1.2 to 2.0 points higher Core scores<\/strong> depending on the FLOP budget. Looping is a genuinely orthogonal axis for scaling compute, not a free lunch from weight sharing.<\/p>\n<p>At test time, increasing loop count <em>T<\/em> beyond training depth follows a <strong>saturating exponential decay<\/strong>: L(T) = L<sub>\u221e<\/sub> + Z\u00b7e<sup>\u2212z\u00b7T<\/sup>, where L<sub>\u221e<\/sub> is an irreducible floor determined by training depth. Gains plateau near \u00b5<sub>rec<\/sub> \u2014 the mean recurrence used during training \u2014 meaning training depth sets a hard ceiling on test-time scaling. 
These dynamics unify into a single parametric law that predicts held-out model loss within <strong>0.85\u20131.31% average error<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Looped transformers can now be trained reliably at scale<\/strong>: Parcae is a looped architecture designed to solve the residual state explosion and loss spike problems that have plagued prior looped models, achieving stable training across a wide range of learning rates where previous approaches diverged.<\/li>\n<li><strong>A 770M Parcae model matches the quality of a 1.3B standard Transformer<\/strong>: By reusing the same layers across multiple loop iterations instead of adding more parameters, Parcae delivers equivalent downstream capability at roughly half the memory footprint.<\/li>\n<li><strong>Looping is a third orthogonal axis for scaling compute, alongside parameters and data<\/strong>: Under a fixed FLOP and parameter budget, compute-optimal training requires increasing mean recurrence and training tokens in tandem following predictable power laws \u2014 giving AI professionals a new lever to improve quality without buying more hardware.<\/li>\n<li><strong>Test-time looping has a hard ceiling set by training depth<\/strong>: Parcae can use additional loop iterations at inference to scale compute, but gains plateau near the mean recurrence used during training. 
You cannot infinitely loop your way to better performance without training the model at deeper recurrences first.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.12946\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/huggingface.co\/collections\/SandyResearch\/parcae\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a> <\/strong>and<strong> <a href=\"https:\/\/www.together.ai\/blog\/parcae\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a><\/strong>. Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">130k+ ML SubReddit<\/a><\/strong>\u00a0and subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us to promote your GitHub repo, Hugging Face page, product release, webinar, etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/16\/ucsd-and-together-ai-research-introduces-parcae-a-stable-architecture-for-looped-language-models-that-achieves-the-quality-of-a-transformer-twice-the-size\/\">UCSD and Together AI Research Introduces Parcae: A Stable Architecture for\u00a0Looped Language Models That Achieves the Quality of a Transformer\u00a0Twice the Size<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The dominant recipe for 
buildi&hellip;<\/p>\n","protected":false},"author":1,"featured_media":730,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-729","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/729","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=729"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/729\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/730"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=729"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=729"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=729"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}