{"id":706,"date":"2026-04-12T04:10:41","date_gmt":"2026-04-11T20:10:41","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=706"},"modified":"2026-04-12T04:10:41","modified_gmt":"2026-04-11T20:10:41","slug":"researchers-from-mit-nvidia-and-zhejiang-university-propose-triattention-a-kv-cache-compression-method-that-matches-full-attention-at-2-5x-higher-throughput","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=706","title":{"rendered":"Researchers from MIT, NVIDIA, and Zhejiang University Propose TriAttention: A KV Cache Compression Method That Matches Full Attention at 2.5\u00d7 Higher Throughput"},"content":{"rendered":"<p>Long-chain reasoning is one of the most compute-intensive tasks in modern large language models. When a model like DeepSeek-R1 or Qwen3 works through a complex math problem, it can generate tens of thousands of tokens before arriving at an answer. Every one of those tokens must be stored in what is called the KV cache \u2014 a memory structure that holds the Key and Value vectors the model needs to attend back to during generation. The longer the reasoning chain, the larger the KV cache grows, and for many deployment scenarios, especially on consumer hardware, this growth eventually exhausts GPU memory entirely.<\/p>\n<p>A team of researchers from MIT, NVIDIA, and Zhejiang University proposed a method called <strong>TriAttention<\/strong> that directly addresses this problem. On the AIME25 mathematical reasoning benchmark with 32K-token generation, TriAttention matches Full Attention accuracy while achieving 2.5\u00d7 higher throughput or 10.7\u00d7 KV memory reduction. 
Leading baselines achieve only about half the accuracy at the same efficiency level.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1030\" height=\"732\" data-attachment-id=\"78942\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/researchers-from-mit-nvidia-and-zhejiang-university-propose-triattention-a-kv-cache-compression-method-that-matches-full-attention-at-2-5x-higher-throughput\/screenshot-2026-04-11-at-1-01-15-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-1.01.15-PM-1.png\" data-orig-size=\"1030,732\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-11 at 1.01.15\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-1.01.15-PM-1-1024x728.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-1.01.15-PM-1.png\" alt=\"\" class=\"wp-image-78942\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.04921<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Problem with Existing KV Cache Compression<\/strong><\/h3>\n<p>To understand why TriAttention is important, it helps to understand the standard approach to KV cache compression. Most existing methods \u2014 including SnapKV, H2O, and R-KV \u2014 work by estimating which tokens in the KV cache are important and evicting the rest. 
Importance is typically estimated by looking at attention scores: if a key receives high attention from recent queries, it is considered important and kept.<\/p>\n<p>The catch is that these methods operate in what the research team calls <em>post-RoPE<\/em> space. RoPE, or <strong>Rotary Position Embedding<\/strong>, is the positional encoding scheme used by most modern LLMs including Llama, Qwen, and Mistral. RoPE encodes position by rotating the Query and Key vectors in a frequency-dependent way. As a result, a query vector at position 10,000 looks very different from the same semantic query at position 100, because its direction has been rotated by the position encoding.<\/p>\n<p>This rotation means that only the most recently generated queries have orientations that are \u2018up to date\u2019 for estimating which keys are important right now. Prior work has confirmed this empirically: increasing the observation window for importance estimation does not help \u2014 performance peaks at around 25 queries and declines after that. With such a tiny window, some keys that will become important later get permanently evicted.<\/p>\n<p>This problem is especially acute for what the research team calls <em>retrieval heads<\/em> \u2014 attention heads whose function is to retrieve specific factual tokens from long contexts. The relevant tokens for a retrieval head can remain dormant for thousands of tokens before suddenly becoming essential to the reasoning chain. Post-RoPE methods, operating over a narrow observation window, see low attention on those tokens during the dormant period and permanently evict them. When the model later needs to recall that information, it is already gone, and the chain of thought breaks.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Pre-RoPE Observation: Q\/K Concentration<\/strong><\/h3>\n<p>The key insight in TriAttention comes from looking at Query and Key vectors <em>before<\/em> RoPE rotation is applied \u2014 the pre-RoPE space. 
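<\/p>
<p>To see how strongly position reorients a query, the rotation can be reproduced in a few lines of NumPy. The sketch below is illustrative only (a toy 64-dimensional head with the standard base-10000 frequency schedule, not the paper's code): the same semantic query, rotated to positions 100 and 10,000, ends up pointing in a substantially different direction.<\/p>

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply Rotary Position Embedding to vector x at position pos.

    Each dimension pair (2i, 2i+1) is rotated by angle pos * base**(-2i/d),
    the frequency schedule used by Llama/Qwen-style models.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)          # one "semantic" query vector

q_near = rope_rotate(q, pos=100)     # same vector, two positions
q_far = rope_rotate(q, pos=10_000)

# Rotation preserves length but changes direction, so the two copies
# of the same query are far from parallel:
cos_sim = q_near @ q_far / (np.linalg.norm(q_near) * np.linalg.norm(q_far))
print(f"cosine similarity across positions: {cos_sim:.3f}")
```

<p>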
When the research team visualized Q and K vectors in this space, they found something consistent and striking: across the vast majority of attention heads and across multiple model architectures, both Q and K vectors cluster tightly around fixed, non-zero center points. The research team terms this property <strong>Q\/K concentration<\/strong>, and measures it using the <strong>Mean Resultant Length<\/strong> R \u2014 a standard directional statistics measure where R \u2192 1 means tight clustering and R \u2192 0 means dispersion in all directions.<\/p>\n<p>On Qwen3-8B, approximately 90% of attention heads exhibit R &gt; 0.95, meaning their pre-RoPE Q\/K vectors are nearly perfectly concentrated around their respective centers. Critically, these centers are stable across different token positions and across different input sequences \u2014 they are an intrinsic property of the model\u2019s learned weights, not a property of any particular input. The research team further confirms that Q\/K concentration is domain-agnostic: measuring Mean Resultant Length across Math, Coding, and Chat domains on Qwen3-8B yields nearly identical values of 0.977\u20130.980.<\/p>\n<p>This stability is what post-RoPE methods cannot exploit. RoPE rotation disperses these concentrated vectors into arc patterns that vary with position. But in pre-RoPE space, the centers remain fixed.<\/p>\n<h3 class=\"wp-block-heading\"><strong>From Concentration to a Trigonometric Series<\/strong><\/h3>\n<p>The research team then shows mathematically that when Q and K vectors are concentrated around their centers, the attention logit \u2014 the raw score before softmax that determines how much a query attends to a key \u2014 simplifies dramatically. 
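<\/p>
<p>The Mean Resultant Length itself is simple to compute: normalize each vector to unit length, average the unit vectors, and take the norm of that average. A minimal sketch on synthetic vectors (names and values illustrative):<\/p>

```python
import numpy as np

def mean_resultant_length(vectors):
    """Mean Resultant Length R of a set of directions.

    R is the norm of the average unit vector: R -> 1 when directions
    cluster tightly around one center, R -> 0 when they are dispersed.
    """
    units = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    return float(np.linalg.norm(units.mean(axis=0)))

rng = np.random.default_rng(0)
center = rng.standard_normal(128)

# Concentrated: small perturbations around one fixed center direction.
concentrated = center + 0.1 * rng.standard_normal((1000, 128))
# Dispersed: isotropic Gaussian samples with no preferred direction.
dispersed = rng.standard_normal((1000, 128))

print(mean_resultant_length(concentrated))   # close to 1
print(mean_resultant_length(dispersed))      # close to 0
```

<p>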
Substituting the Q\/K centers into the RoPE attention formula, the logit reduces to a function that depends only on the <strong>Q-K distance<\/strong> (the relative positional gap between query and key), expressed as a trigonometric series:<\/p>\n<p><math data-latex=\" text{logit}(Delta) approx sum_{f} underbrace{|bar{q}_f| |bar{k}_f|}_{text{amplitude}} cos(omega_f Delta + underbrace{bar{phi}_f}_{text{phase}})  = sum_{f} [a_f cos(omega_f Delta) + b_f sin(omega_f Delta)] \"><semantics><mrow><mtext>logit<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u2248<\/mo><msub><mo movablelimits=\"false\">\u2211<\/mo><mi>f<\/mi><\/msub><mrow><munder><\/munder><munder><mrow><mi>\u2016<\/mi><msub><mover><mi>q<\/mi><mo stretchy=\"false\" class=\"tml-xshift\">\u203e<\/mo><\/mover><mi>f<\/mi><\/msub><mi>\u2016<\/mi><mi>\u2016<\/mi><msub><mover><mi>k<\/mi><mo stretchy=\"false\" class=\"tml-capshift\">\u203e<\/mo><\/mover><mi>f<\/mi><\/msub><mi>\u2016<\/mi><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><mtext>amplitude<\/mtext><\/mrow><mrow><mi>cos<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>\u03c9<\/mi><mi>f<\/mi><\/msub><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo>+<\/mo><mrow><munder><\/munder><munder><msub><mover><mi>\u03d5<\/mi><mo stretchy=\"false\">\u203e<\/mo><\/mover><mi>f<\/mi><\/msub><mo stretchy=\"true\">\u23df<\/mo><\/munder><mtext>phase<\/mtext><\/mrow><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mo movablelimits=\"false\">\u2211<\/mo><mi>f<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">[<\/mo><msub><mi>a<\/mi><mi>f<\/mi><\/msub><mrow><mi>cos<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>\u03c9<\/mi><mi>f<\/mi><\/msub><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo form=\"postfix\" 
stretchy=\"false\">)<\/mo><mo>+<\/mo><msub><mi>b<\/mi><mi>f<\/mi><\/msub><mrow><mi>sin<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>\u03c9<\/mi><mi>f<\/mi><\/msub><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo form=\"postfix\" stretchy=\"false\">]<\/mo><annotation encoding=\"application\/x-tex\"> text{logit}(Delta) approx sum_{f} underbrace{|bar{q}_f| |bar{k}_f|}_{text{amplitude}} cos(omega_f Delta + underbrace{bar{phi}_f}_{text{phase}})  = sum_{f} [a_f cos(omega_f Delta) + b_f sin(omega_f Delta)] <\/annotation><\/semantics><\/math><\/p>\n<p>Here, \u0394 is the positional distance, \u03c9<sub>f<\/sub> are the RoPE rotation frequencies for each frequency band f, and the coefficients a<sub>f<\/sub> and b<sub>f<\/sub> are determined by the Q\/K centers. This series produces a characteristic attention-vs-distance curve for each head. Some heads prefer nearby keys (local attention), others prefer very distant keys (attention sinks). The centers, computed offline from calibration data, fully determine which distances are preferred.<\/p>\n<p>The research team validated this experimentally across 1,152 attention heads in Qwen3-8B and across Qwen2.5 and Llama3 architectures. The Pearson correlation between the predicted trigonometric curve and the actual attention logits has a mean above 0.5 across all heads, with many heads achieving correlations of 0.6\u20130.9. The research team further validates this on GLM-4.7-Flash, which uses <strong>Multi-head Latent Attention (MLA)<\/strong> rather than standard Grouped-Query Attention \u2014 a meaningfully different attention architecture. 
On MLA, 96.6% of heads exhibit R &gt; 0.95, compared to 84.7% for GQA, confirming that Q\/K concentration is not specific to one attention design but is a general property of modern LLMs.<\/p>\n<h3 class=\"wp-block-heading\"><strong>How TriAttention Uses This<\/strong><\/h3>\n<p>TriAttention is a KV cache compression method that uses these findings to score keys without needing any live query observations. The scoring function has <strong>two components:<\/strong><\/p>\n<p>The <strong>Trigonometric Series Score<\/strong> (S<sub>trig<\/sub>) uses the Q center computed offline and the actual cached key representation to estimate how much attention the key will receive, based on its positional distance from future queries. Because a key may be attended to by queries at many future positions, TriAttention averages this score over a set of future offsets using geometric spacing.<\/p>\n<p><math data-latex=\"S_{text{trig}}(k, Delta) = sum_{f} |mathbb{E}[q_f]| cdot |k_f| cdot cos(omega_f Delta + phi_f)\"><semantics><mrow><msub><mi>S<\/mi><mtext>trig<\/mtext><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>k<\/mi><mo separator=\"true\">,<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mo movablelimits=\"false\">\u2211<\/mo><mi>f<\/mi><\/msub><mi>\u2016<\/mi><mi>\ud835\udd3c<\/mi><mo form=\"prefix\" stretchy=\"false\">[<\/mo><msub><mi>q<\/mi><mi>f<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">]<\/mo><mi>\u2016<\/mi><mo>\u22c5<\/mo><mi>\u2016<\/mi><msub><mi>k<\/mi><mi>f<\/mi><\/msub><mi>\u2016<\/mi><mo>\u22c5<\/mo><mrow><mi>cos<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>\u03c9<\/mi><mi>f<\/mi><\/msub><mrow><mi mathvariant=\"normal\">\u0394<\/mi><\/mrow><mo>+<\/mo><msub><mi>\u03d5<\/mi><mi>f<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><annotation encoding=\"application\/x-tex\">S_{text{trig}}(k, Delta) = sum_{f} 
|mathbb{E}[q_f]| cdot |k_f| cdot cos(omega_f Delta + phi_f)<\/annotation><\/semantics><\/math><\/p>\n<p>The <strong>Norm-Based Score<\/strong> (S<sub>norm<\/sub>) handles the minority of attention heads where Q\/K concentration is lower. It weights each frequency band by the expected query norm contribution, providing complementary information about token salience beyond distance preference alone.<\/p>\n<p><math data-latex=\"S_{text{norm}}^{(0)}(k) = sum_{f} mathbb{E}[|q_f|] cdot |k_f|\"><semantics><mrow><msubsup><mi>S<\/mi><mtext>norm<\/mtext><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mn>0<\/mn><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mo movablelimits=\"false\">\u2211<\/mo><mi>f<\/mi><\/msub><mi>\ud835\udd3c<\/mi><mo form=\"prefix\" stretchy=\"false\">[<\/mo><mi>\u2016<\/mi><msub><mi>q<\/mi><mi>f<\/mi><\/msub><mi>\u2016<\/mi><mo form=\"postfix\" stretchy=\"false\">]<\/mo><mo>\u22c5<\/mo><mi>\u2016<\/mi><msub><mi>k<\/mi><mi>f<\/mi><\/msub><mi>\u2016<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">S_{text{norm}}^{(0)}(k) = sum_{f} mathbb{E}[|q_f|] cdot |k_f|<\/annotation><\/semantics><\/math><\/p>\n<p>The two scores are combined using the Mean Resultant Length R as an adaptive weight: when concentration is high, S<sub>trig<\/sub> dominates; when concentration is lower, S<sub>norm<\/sub> contributes more. Every 128 generated tokens, TriAttention scores all keys in the cache and retains only the top-B, evicting the rest.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Results on Mathematical Reasoning<\/strong><\/h3>\n<p>On AIME24 with Qwen3-8B, TriAttention achieves 42.1% accuracy against Full Attention\u2019s 57.1%, while R-KV achieves only 25.4% at the same KV budget of 2,048 tokens. 
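<\/p>
<p>As a concrete summary of the method behind these numbers, the periodic scoring-and-eviction step can be sketched as follows. This is a simplified single-head version: the blend R * S_trig + (1 - R) * S_norm is one plausible reading of the adaptive weighting, and the offset schedule and shapes are illustrative assumptions, not the released code.<\/p>

```python
import numpy as np

def select_keys(keys, q_center, R, budget, base=10000.0):
    """Score every cached key without live queries; keep the top-`budget`.

    s_trig averages the trigonometric-series score over geometrically
    spaced future query offsets; s_norm is the distance-independent
    norm score; the head's Mean Resultant Length R blends the two.
    """
    n, d = keys.shape
    omega = base ** (-np.arange(0, d, 2) / d)     # RoPE band frequencies
    qc = q_center[0::2] + 1j * q_center[1::2]     # (d/2,) offline Q center
    kc = keys[:, 0::2] + 1j * keys[:, 1::2]       # (n, d/2) cached keys

    # Distance from every cached key to a few future query positions.
    offsets = np.array([1, 4, 16, 64, 256])       # geometric spacing
    deltas = (n - 1 + offsets)[:, None] - np.arange(n)[None, :]

    # s_trig[k] = mean over deltas of Re(sum_f q_f conj(k_f) e^{i w_f delta})
    phase = np.exp(1j * deltas[:, :, None] * omega[None, None, :])
    s_trig = np.real((qc * np.conj(kc))[None, :, :] * phase).sum(-1).mean(0)

    # s_norm[k] = sum_f |q_f| * |k_f|
    s_norm = (np.abs(qc) * np.abs(kc)).sum(-1)

    score = R * s_trig + (1 - R) * s_norm         # assumed adaptive blend
    return np.sort(np.argsort(score)[-budget:])   # indices of kept keys

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))            # toy cache of 1,024 keys
kept = select_keys(keys, rng.standard_normal(64), R=0.97, budget=128)
print(kept.shape)                                 # 128 survivors
```

<p>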
On AIME25, TriAttention achieves 32.9% versus R-KV\u2019s 17.5% \u2014 a 15.4 percentage point gap. On MATH 500 with only 1,024 tokens in the KV cache out of a possible 32,768, TriAttention achieves 68.4% accuracy against Full Attention\u2019s 69.6%.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2076\" height=\"544\" data-attachment-id=\"78938\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/researchers-from-mit-nvidia-and-zhejiang-university-propose-triattention-a-kv-cache-compression-method-that-matches-full-attention-at-2-5x-higher-throughput\/screenshot-2026-04-11-at-12-55-12-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.55.12-PM-1.png\" data-orig-size=\"2076,544\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-11 at 12.55.12\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.55.12-PM-1-1024x268.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.55.12-PM-1.png\" alt=\"\" class=\"wp-image-78938\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.04921<\/figcaption><\/figure>\n<\/div>\n<p>The research team also introduces a <strong>Recursive State Query benchmark<\/strong> based on recursive simulation using depth-first search. 
Recursive tasks stress memory retention because the model must maintain intermediate states across long chains and backtrack to them later \u2014 if any intermediate state is evicted, the error propagates through all subsequent return values, corrupting the final result. Under moderate memory pressure up to depth 16, TriAttention performs comparably to Full Attention, while R-KV shows catastrophic accuracy degradation \u2014 dropping from approximately 61% at depth 14 to 31% at depth 16. This indicates R-KV incorrectly evicts critical intermediate reasoning states.<\/p>\n<p>On throughput, TriAttention achieves 1,405 tokens per second on MATH 500 against Full Attention\u2019s 223 tokens per second, a 6.3\u00d7 speedup. On AIME25, it achieves 563.5 tokens per second against 222.8, a 2.5\u00d7 speedup at matched accuracy.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1062\" height=\"444\" data-attachment-id=\"78940\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/researchers-from-mit-nvidia-and-zhejiang-university-propose-triattention-a-kv-cache-compression-method-that-matches-full-attention-at-2-5x-higher-throughput\/screenshot-2026-04-11-at-12-56-28-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.56.28-PM-1.png\" data-orig-size=\"1062,444\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-11 at 12.56.28\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.56.28-PM-1-1024x428.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-11-at-12.56.28-PM-1.png\" alt=\"\" class=\"wp-image-78940\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.04921<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Generalization Beyond Mathematical Reasoning<\/strong><\/h3>\n<p>The results extend well beyond math benchmarks. On <strong>LongBench<\/strong> \u2014 a 16-subtask benchmark covering question answering, summarization, few-shot classification, retrieval, counting, and code tasks \u2014 TriAttention achieves the highest average score of 48.1 among all compression methods at a 50% KV budget on Qwen3-8B, winning 11 out of 16 subtasks and surpassing the next best baseline, Ada-KV+SnapKV, by 2.5 points. On the <strong>RULER<\/strong> retrieval benchmark at a 4K context length, TriAttention achieves 66.1, a 10.5-point gap over SnapKV. These results confirm that the method is not tuned to mathematical reasoning alone \u2014 the underlying Q\/K concentration phenomenon transfers to general language tasks.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Existing KV cache compression methods have a fundamental blind spot<\/strong>: Methods like SnapKV and R-KV estimate token importance using recent post-RoPE queries, but because RoPE rotates query vectors with position, only a tiny window of queries is usable. 
This causes important tokens \u2014 especially those needed by retrieval heads \u2014 to be permanently evicted before they become critical.<\/li>\n<li><strong>Pre-RoPE Query and Key vectors cluster around stable, fixed centers across nearly all attention heads<\/strong>: This property, called Q\/K concentration, holds regardless of input content, token position, or domain, and is consistent across Qwen3, Qwen2.5, Llama3, and even Multi-head Latent Attention architectures like GLM-4.7-Flash.<\/li>\n<li><strong>These stable centers make attention patterns mathematically predictable without observing any live queries<\/strong>: When Q\/K vectors are concentrated, the attention score between any query and key reduces to a function that depends only on their positional distance \u2014 encoded as a trigonometric series. TriAttention uses this to score every cached key offline using calibration data alone.<\/li>\n<li><strong>TriAttention matches Full Attention reasoning accuracy at a fraction of the memory and compute cost<\/strong>: On AIME25 with 32K-token generation, it achieves 2.5\u00d7 higher throughput or 10.7\u00d7 KV memory reduction while matching Full Attention accuracy \u2014 nearly doubling R-KV\u2019s accuracy at the same memory budget across both AIME24 and AIME25.<\/li>\n<li><strong>The method generalizes beyond math and works on consumer hardware.<\/strong> TriAttention outperforms all baselines on LongBench across 16 general NLP subtasks and on the RULER retrieval benchmark, and enables a 32B reasoning model to run on a single 24GB RTX 4090 via OpenClaw \u2014 a task that causes out-of-memory errors under Full Attention.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the<strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.04921\" target=\"_blank\" rel=\"noreferrer noopener\">\u00a0Paper<\/a>, <a href=\"https:\/\/github.com\/WeianMao\/triattention\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> 
<\/strong>and<strong> <a href=\"https:\/\/weianmao.github.io\/tri-attention-project-page\/\" target=\"_blank\" rel=\"noreferrer noopener\">Project Page<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/researchers-from-mit-nvidia-and-zhejiang-university-propose-triattention-a-kv-cache-compression-method-that-matches-full-attention-at-2-5x-higher-throughput\/\">Researchers from MIT, NVIDIA, and Zhejiang University Propose TriAttention: A KV Cache Compression Method That Matches Full Attention at 2.5\u00d7 Higher Throughput<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Long-chain reasoning is one 
of&hellip;<\/p>\n","protected":false},"author":1,"featured_media":707,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-706","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/706","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=706"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/706\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/707"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=706"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=706"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=706"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}