{"id":77,"date":"2025-12-08T12:59:18","date_gmt":"2025-12-08T04:59:18","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=77"},"modified":"2025-12-08T12:59:18","modified_gmt":"2025-12-08T04:59:18","slug":"from-transformers-to-associative-memory-how-titans-and-miras-rethink-long-context-modeling","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=77&lang=en","title":{"rendered":"From Transformers to Associative Memory, How Titans and MIRAS Rethink Long Context Modeling"},"content":{"rendered":"<p><strong>What comes after Transformers?<\/strong> Google Research is proposing a new way to give sequence models usable long term memory with Titans and MIRAS, while keeping training parallel and inference close to linear.<\/p>\n<p>Titans is a concrete architecture that adds a deep neural memory to a Transformer style backbone. MIRAS is a general framework that views most modern sequence models as instances of online optimization over an associative memory.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Why Titans and MIRAS?<\/strong><\/h3>\n<p>Standard Transformers use attention over a key value cache. This gives strong in context learning, but cost grows quadratically with context length, so practical context is limited even with FlashAttention and other kernel tricks.<\/p>\n<p>Efficient linear recurrent neural networks and state space models such as Mamba-2 compress the history into a fixed size state, so cost is linear in sequence length. However, this compression loses information in very long sequences, which hurts tasks such as genomic modeling and extreme long context retrieval.<\/p>\n<p>Titans and MIRAS combine these ideas. Attention acts as a precise short term memory on the current window. 
A separate neural module provides long term memory, learns at test time, and is trained so that its dynamics are parallelizable on accelerators.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1646\" height=\"646\" data-attachment-id=\"76801\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/12\/07\/from-transformers-to-associative-memory-how-titans-and-miras-rethink-long-context-modeling\/screenshot-2025-12-07-at-8-49-47-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.49.47-PM-1.png\" data-orig-size=\"1646,646\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-12-07 at 8.49.47\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.49.47-PM-1-300x118.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.49.47-PM-1-1024x402.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.49.47-PM-1.png\" alt=\"\" class=\"wp-image-76801\" \/><figcaption class=\"wp-element-caption\">https:\/\/research.google\/blog\/titans-miras-helping-ai-have-long-term-memory\/<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Titans, a neural long term memory that learns at test time<\/strong><\/h3>\n<p>The <a href=\"https:\/\/arxiv.org\/pdf\/2501.00663\" target=\"_blank\" rel=\"noreferrer noopener\">Titans research paper<\/a> introduces a neural long term memory module that is itself a deep multi layer perceptron rather than a vector or matrix 
state. Attention is interpreted as short term memory, since it only sees a limited window, while the neural memory acts as persistent long term memory.<\/p>\n<p>For each token, Titans defines an associative memory loss<\/p>\n<p>\u2113(M\u209c\u208b\u2081; k\u209c, v\u209c) = \u2016M\u209c\u208b\u2081(k\u209c) \u2212 v\u209c\u2016\u00b2<\/p>\n<p>where M\u209c\u208b\u2081 is the current memory, k\u209c is the key and v\u209c is the value. The gradient of this loss with respect to the memory parameters is the \u201csurprise metric\u201d. Large gradients correspond to surprising tokens that should be stored, while small gradients correspond to expected tokens that can be mostly ignored.<\/p>\n<p>The memory parameters are updated at test time by gradient descent with momentum and weight decay, which together act as a retention gate and forgetting mechanism. To keep this online optimization efficient, the research paper shows how to compute these updates with batched matrix multiplications over sequence chunks, which preserves parallel training across long sequences.<\/p>\n<p><strong>Architecturally, Titans uses three memory branches in the backbone, often instantiated in the Titans MAC variant:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>a core branch that performs standard in context learning with attention<\/li>\n<li>a contextual memory branch that learns from the recent sequence<\/li>\n<li>a persistent memory branch with fixed weights that encodes pretraining knowledge<\/li>\n<\/ul>\n<p>The long term memory compresses past tokens into a summary, which is then passed as extra context into attention. 
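<\/p>
<p>As a minimal sketch, the test time update above can be written for a plain matrix memory. This is an illustrative simplification, not the exact method: Titans uses a deep MLP memory and batched chunked updates, and all hyperparameters below are made up for the toy example:<\/p>

```python
import numpy as np

def titans_memory_step(M, k, v, S, lr=0.1, momentum=0.9, decay=0.01):
    # Associative memory loss: ||M(k) - v||^2, with M a matrix here
    # for illustration (the paper uses a deep MLP memory).
    err = M @ k - v                  # large norm = a surprising token
    grad = 2.0 * np.outer(err, k)    # gradient of the loss, the surprise metric
    S = momentum * S - lr * grad     # momentum carries past surprise forward
    M = (1.0 - decay) * M + S        # weight decay acts as a forget gate
    return M, S

rng = np.random.default_rng(0)
d = 4
M, S = np.zeros((d, d)), np.zeros((d, d))
k = rng.normal(size=d)
k = k / np.linalg.norm(k)            # unit key keeps the toy dynamics stable
v = rng.normal(size=d)
for _ in range(200):
    M, S = titans_memory_step(M, k, v, S)
recall_error = np.linalg.norm(M @ k - v)   # shrinks as the pair is stored
```

<p>In this toy setting the recall error after the loop is small, showing how repeated surprise driven updates store a key value pair while weight decay keeps the memory bounded.<\/p>
<p>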
Attention can choose when to read that summary.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Experimental results for Titans<\/strong><\/h3>\n<p>On language modeling and commonsense reasoning benchmarks such as C4, WikiText and HellaSwag, Titans architectures outperform state of the art linear recurrent baselines such as Mamba-2 and Gated DeltaNet, as well as Transformer++ models of comparable size. The Google research team attributes this to the higher expressive power of deep memory and its ability to maintain performance as context length grows. Neural memories with the same parameter budget but greater depth give consistently lower perplexity.<\/p>\n<p>For extreme long context recall, the research team uses the BABILong benchmark, where facts are distributed across very long documents. Titans outperforms all baselines, including very large models such as GPT-4, while using many fewer parameters, and scales to context windows beyond 2,000,000 tokens.<\/p>\n<p>The research team reports that Titans keeps efficient parallel training and fast linear inference. 
Neural memory alone is slightly slower than the fastest linear recurrent models, but hybrid Titans layers with Sliding Window Attention remain competitive on throughput while improving accuracy.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1424\" height=\"928\" data-attachment-id=\"76803\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/12\/07\/from-transformers-to-associative-memory-how-titans-and-miras-rethink-long-context-modeling\/screenshot-2025-12-07-at-8-50-34-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.50.34-PM-1.png\" data-orig-size=\"1424,928\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2025-12-07 at 8.50.34\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.50.34-PM-1-300x196.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.50.34-PM-1-1024x667.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/Screenshot-2025-12-07-at-8.50.34-PM-1.png\" alt=\"\" class=\"wp-image-76803\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2504.13173<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>MIRAS, a unified framework for sequence models as associative memory<\/strong><\/h3>\n<p>The MIRAS research paper, <a href=\"https:\/\/arxiv.org\/pdf\/2504.13173\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cIt\u2019s All Connected: A Journey Through Test Time Memorization, Attentional Bias, Retention, and Online 
Optimization<\/a>,\u201d generalizes this view. It observes that modern sequence models can be seen as associative memories that map keys to values while balancing learning and forgetting.<\/p>\n<p><strong>MIRAS defines any sequence model through four design choices:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li><strong>Memory structure<\/strong>, for example a vector, a linear map, or an MLP<\/li>\n<li><strong>Attentional bias<\/strong>, the internal loss that defines which similarities the memory cares about<\/li>\n<li><strong>Retention gate<\/strong>, the regularizer that keeps the memory close to its past state<\/li>\n<li><strong>Memory algorithm<\/strong>, the online optimization rule, often gradient descent with momentum<\/li>\n<\/ol>\n<p><strong>Using this lens, MIRAS recovers several families:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Hebbian style linear recurrent models and RetNet as dot product based associative memories<\/li>\n<li>Delta rule models such as DeltaNet and Gated DeltaNet as MSE based memories with value replacement and specific retention gates<\/li>\n<li>Titans LMM as a nonlinear MSE based memory with local and global retention, optimized by gradient descent with momentum<\/li>\n<\/ul>\n<p>Crucially, MIRAS then moves beyond the usual MSE or dot product objectives. 
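<\/p>
<p>Swapping a single design choice moves between these families. A minimal sketch with a matrix memory makes the recovery concrete; the function names and constants here are illustrative and not taken from the research papers:<\/p>

```python
import numpy as np

def hebbian_step(M, k, v, retain=0.95):
    # Dot product attentional bias plus a decay retention gate:
    # a Hebbian, linear attention style update (RetNet-like family).
    return retain * M + np.outer(v, k)

def delta_step(M, k, v, lr=0.5):
    # MSE attentional bias ||M k - v||^2, one gradient step:
    # a delta rule update with value replacement (DeltaNet-like family).
    return M - lr * np.outer(M @ k - v, k)

rng = np.random.default_rng(1)
k = rng.normal(size=4)
k = k / np.linalg.norm(k)        # unit key, so the delta rule contracts
v = rng.normal(size=4)
M = np.zeros((4, 4))
for _ in range(30):
    M = delta_step(M, k, v)
# with a unit norm key the delta rule converges to exact recall of v
```

<p>The two updates differ only in the attentional bias and retention gate, which is precisely the MIRAS point: one template, different design choices, different model families.<\/p>
<p>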
The research team constructs new attentional biases based on L\u209a norms, robust Huber loss and robust optimization, and new retention gates based on divergences over probability simplices, elastic net regularization and Bregman divergence.<\/p>\n<p><strong>From this design space, the research team instantiates three attention free models:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Moneta<\/strong> uses a 2 layer MLP memory with L\u209a attentional bias and a hybrid retention gate based on generalized norms<\/li>\n<li><strong>Yaad<\/strong> uses the same MLP memory with Huber loss attentional bias and a forget gate related to Titans<\/li>\n<li><strong>Memora<\/strong> uses regression loss as attentional bias and a KL divergence based retention gate over a probability simplex style memory<\/li>\n<\/ul>\n<p>These MIRAS variants replace attention blocks in a Llama style backbone, use depthwise separable convolutions in the Miras layer, and can be combined with Sliding Window Attention in hybrid models. 
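<\/p>
<p>As one concrete example, a Huber attentional bias bounds the gradient contributed by outlier tokens. A hedged sketch in the spirit of Yaad follows; the names, gating and constants are illustrative, not the exact parameterization in the paper:<\/p>

```python
import numpy as np

def huber_grad(err, delta=1.0):
    # Gradient of the elementwise Huber loss: quadratic near zero,
    # linear in the tails, so one outlier token cannot dominate the update.
    return np.where(np.abs(err) <= delta, err, delta * np.sign(err))

def huber_memory_step(M, k, v, lr=0.1, retain=0.99):
    err = M @ k - v                          # recall error for this token
    return retain * M - lr * np.outer(huber_grad(err), k)

# large errors are clipped at the delta threshold, small ones pass through
g = huber_grad(np.array([5.0, -0.2, 0.7]))
```

<p>Compared with a plain MSE bias, only the gradient function changes, which is why MIRAS can mix and match losses and retention gates without redesigning the whole architecture.<\/p>
<p>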
Training remains parallel by chunking sequences and computing gradients with respect to the memory state from the previous chunk.<\/p>\n<p>In research experiments, Moneta, Yaad and Memora match or surpass strong linear recurrent models and Transformer++ on language modeling, commonsense reasoning and recall intensive tasks, while maintaining linear time inference.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ol class=\"wp-block-list\">\n<li><strong>Titans introduces a deep neural long term memory that learns at test time<\/strong>, using gradient descent on an L2 associative memory loss so the model selectively stores only surprising tokens while keeping updates parallelizable on accelerators.<\/li>\n<li><strong>Titans combines attention with neural memory for long context<\/strong>, using branches like core, contextual memory and persistent memory so attention handles short range precision and the neural module maintains information over sequences beyond 2,000,000 tokens.<\/li>\n<li><strong>Titans outperforms strong linear RNNs and Transformer++ baselines<\/strong>, including Mamba-2 and Gated DeltaNet, on language modeling and commonsense reasoning benchmarks at comparable parameter scales, while staying competitive on throughput.<\/li>\n<li><strong>On extreme long context recall benchmarks such as BABILong<\/strong>, Titans achieves higher accuracy than all baselines, including larger attention models such as GPT-4, while using fewer parameters and still enabling efficient training and inference.<\/li>\n<li><strong>MIRAS provides a unifying framework for sequence models as associative memories<\/strong>, defining them by memory structure, attentional bias, retention gate and optimization rule, and yields new attention free architectures such as Moneta, Yaad and Memora that match or surpass linear RNNs and Transformer++ on long context and reasoning tasks.<\/li>\n<\/ol>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" 
\/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/research.google\/blog\/titans-miras-helping-ai-have-long-term-memory\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/12\/07\/from-transformers-to-associative-memory-how-titans-and-miras-rethink-long-context-modeling\/\">From Transformers to Associative Memory, How Titans and MIRAS Rethink Long Context Modeling<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>What comes after Transformers?&hellip;<\/p>\n","protected":false},"author":1,"featured_media":78,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-77","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/77","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=77"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/77\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/78"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=77"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?r
est_route=%2Fwp%2Fv2%2Fcategories&post=77"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=77"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}