{"id":980,"date":"2026-05-27T15:23:10","date_gmt":"2026-05-27T07:23:10","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=980"},"modified":"2026-05-27T15:23:10","modified_gmt":"2026-05-27T07:23:10","slug":"meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=980","title":{"rendered":"Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens, and the large target model verifies them in parallel. Accepted tokens are kept, which speeds up inference; rejected tokens are discarded and the system falls back gracefully.<\/p>\n<p class=\"wp-block-paragraph\">The EAGLE series, comprising EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE, vLLM, and TorchSpec teams are giving that family a targeted reliability upgrade with the introduction of <a href=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" target=\"_blank\" rel=\"noreferrer noopener\">EAGLE 3.1<\/a>. <\/p>\n<h2 class=\"wp-block-heading\"><strong>What Was Going Wrong<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. <\/p>\n<p class=\"wp-block-paragraph\">The EAGLE team traced this fragility to a phenomenon called <strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.09992\" target=\"_blank\" rel=\"noreferrer noopener\">attention drift<\/a><\/strong>: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens. 
<\/p>\n<p class=\"wp-block-paragraph\">In simpler terms: the drafter is a small model that predicts future tokens. As speculation gets deeper, it starts attending to its own prior outputs instead of the original context. This degrades acceptance length and output stability.<\/p>\n<p class=\"wp-block-paragraph\">Two underlying issues were identified. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Two Architectural Fixes in EAGLE 3.1<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">To address attention drift, EAGLE 3.1 comes with two key architectural improvements: FC normalization after each target hidden state and before the FC layer, and feeding post-norm hidden states into the next decoding step.<\/p>\n<p class=\"wp-block-paragraph\">FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, making the drafter increasingly unreliable. Applying normalization at each step keeps the inputs bounded.<\/p>\n<p class=\"wp-block-paragraph\">The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending additional layers to the target model. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1706\" height=\"664\" data-attachment-id=\"80133\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/screenshot-2026-05-27-at-12-17-50-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM.png\" data-orig-size=\"1706,664\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" data-image-description=\"&lt;p&gt;https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1&lt;\/p&gt;\n\" data-image-caption=\"&lt;p&gt;https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1&lt;\/p&gt;\n\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM-1024x399.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-27-at-12.17.50-AM.png\" alt=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" class=\"wp-image-80133\" \/><figcaption class=\"wp-element-caption\">https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What These Fixes Deliver<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Compared with EAGLE 3, EAGLE 3.1 demonstrates: better training-time to inference-time extrapolation, stronger long-context robustness, higher resilience to chat template and 
system prompt variation, and more stable acceptance length across diverse serving environments. <\/p>\n<p class=\"wp-block-paragraph\">In long-context workloads, EAGLE 3.1 achieves up to 2\u00d7 longer acceptance length compared with EAGLE 3. <\/p>\n<h2 class=\"wp-block-heading\"><strong>Training Infrastructure: TorchSpec<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment. <\/p>\n<p class=\"wp-block-paragraph\">Using TorchSpec and vLLM, the research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on <a href=\"https:\/\/huggingface.co\/lightseekorg\/kimi-k2.6-eagle3-mla\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face<\/a>. The model serves as an example of deploying EAGLE 3.1, with TorchSpec training and vLLM serving support, on a real-world serving model.<\/p>\n<h2 class=\"wp-block-heading\"><strong>vLLM Integration: Config-Driven and Backward-Compatible<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden-state feedback, and removal of hardcoded assumptions around target hidden states. <\/p>\n<p class=\"wp-block-paragraph\">Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models plug directly into the same speculative-decoding code path. 
<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">vllm serve nvidia\/Kimi-K2.6-NVFP4 \\\n  --trust-remote-code \\\n  --tensor-parallel-size 4 \\\n  --tool-call-parser kimi_k2 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser kimi_k2 \\\n  --attention-backend tokenspeed_mla \\\n  --speculative-config '{\"model\":\"lightseekorg\/kimi-k2.6-eagle3.1-mla\",\"method\":\"eagle3\",\"num_speculative_tokens\":3}' \\\n  --language-model-only<\/code><\/pre>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Benchmark Results on Kimi K2.6<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disaggregated) on the SPEED-Bench coding dataset. Relative to a no-speculation baseline, EAGLE 3.1 delivers 2.03\u00d7 higher per-user output throughput at concurrency 1. The speedup holds up as concurrency (C) scales: 1.71\u00d7 at C=4 and 1.66\u00d7 at C=16. 
<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mtp-progress\"><\/div>\n<div class=\"mtp-track\">\n<p><!-- 1 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">01 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">vLLM \u00b7 May 26, 2026<\/span>\n<h1 class=\"mtp-h1\">Meet EAGLE 3.1<\/h1>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">The EAGLE team, vLLM team, and TorchSpec team jointly released EAGLE 3.1 \u2014 a targeted fix for speculative decoding instability in production LLM serving.<\/span><\/p>\n<div>\n<span class=\"mtp-badge\">#speculative-decoding<\/span><br \/>\n<span class=\"mtp-badge\">#vLLM<\/span><br \/>\n<span class=\"mtp-badge\">#LLM inference<\/span><br \/>\n<span class=\"mtp-badge\">#performance<\/span>\n<\/div>\n<\/div>\n<\/div>\n<p><!-- 2 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">02 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Background<\/span>\n<h2 class=\"mtp-h2\">What is Speculative Decoding?<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">A technique for speeding up LLM inference using two models working together.<\/span><\/p>\n<ul class=\"mtp-list\">\n<li>A small, fast <span class=\"mtp-hi\">draft model<\/span> proposes several tokens ahead<\/li>\n<li>The large <span class=\"mtp-hi\">target model<\/span> verifies all proposed tokens in one pass<\/li>\n<li>Accepted tokens are kept \u2014 rejected tokens fall back gracefully<\/li>\n<li>Result: higher output throughput with no change in output quality<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p><!-- 3 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">03 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">The Problem<\/span>\n<h2 class=\"mtp-h2\">Attention Drift in EAGLE 3<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">EAGLE 3 performance degraded in 
real-world deployments under three conditions:<\/span><\/p>\n<ul class=\"mtp-list\">\n<li>Different <span class=\"mtp-hi\">chat templates<\/span><\/li>\n<li><span class=\"mtp-hi\">Long-context<\/span> inputs<\/li>\n<li>Out-of-distribution <span class=\"mtp-hi\">system prompts<\/span><\/li>\n<\/ul>\n<p><span class=\"mtp-sub\">Root cause: <span class=\"mtp-hi\">attention drift<\/span> \u2014 as speculation depth increases, the drafter shifts attention away from sink tokens toward its own generated tokens.<\/span>\n<\/p><\/div>\n<\/div>\n<p><!-- 4 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">04 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Root Cause<\/span>\n<h2 class=\"mtp-h2\">Two Underlying Issues<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>The <span class=\"mtp-hi\">fused input representation<\/span> becomes increasingly imbalanced \u2014 higher-layer hidden states dominate the drafter input<\/li>\n<li><span class=\"mtp-hi\">Hidden-state magnitude<\/span> grows across speculation steps due to the unnormalized residual path<\/li>\n<li>Together, these make the drafter <span class=\"mtp-hi\">progressively less stable<\/span> at deeper speculation depths<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p><!-- 5 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">05 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Architecture<\/span>\n<h2 class=\"mtp-h2\">Two Architectural Fixes<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-arch\">\n<div class=\"mtp-abox\">\n<span class=\"mtp-atitle\">Fix 1<\/span><br \/>\n<span class=\"mtp-atext\"><span class=\"mtp-hi\">FC normalization<\/span> applied after each target hidden state and before the FC layer. 
Keeps hidden-state magnitude bounded across decoding steps.<\/span>\n<\/div>\n<div class=\"mtp-abox\">\n<span class=\"mtp-atitle\">Fix 2<\/span><br \/>\n<span class=\"mtp-atext\"><span class=\"mtp-hi\">Post-norm hidden-state feedback<\/span> \u2014 normalized hidden states fed into the next decoding step, making the drafter behave like recursive invocation rather than appended layers.<\/span>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><!-- 6 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">06 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Benchmarks \u00b7 SPEED-Bench Coding \u00b7 GB200 TP=4<\/span>\n<h2 class=\"mtp-h2\">Per-User Throughput vs. No-Spec Baseline<\/h2>\n<p><span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-metric\">\n<div class=\"mtp-card\"><span class=\"mtp-num\">2.03\u00d7<\/span><span class=\"mtp-label\">Concurrency 1<\/span><\/div>\n<div class=\"mtp-card\"><span class=\"mtp-num\">1.71\u00d7<\/span><span class=\"mtp-label\">Concurrency 4<\/span><\/div>\n<div class=\"mtp-card\"><span class=\"mtp-num\">1.66\u00d7<\/span><span class=\"mtp-label\">Concurrency 16<\/span><\/div>\n<\/div>\n<p><span class=\"mtp-sub\">In long-context workloads, EAGLE 3.1 achieves up to <span class=\"mtp-hi\">2\u00d7 longer acceptance length<\/span> compared with EAGLE 3. Tested on Kimi-K2.6-NVFP4 with vLLM.<\/span>\n<\/p><\/div>\n<\/div>\n<p><!-- 7 --><\/p>\n<div class=\"mtp-slide\">\n<span class=\"mtp-snum\">07 \/ 07<\/span>\n<div class=\"mtp-inner\">\n<span class=\"mtp-tag\">Deployment \u00b7 vLLM v0.22.0<\/span>\n<h2 class=\"mtp-h2\">How to Deploy EAGLE 3.1<\/h2>\n<p><span class=\"mtp-divider\"><\/span><br \/>\n<span class=\"mtp-sub\">Backward-compatible with EAGLE 3 checkpoints. Already merged in vLLM main. 
Stable release: <span class=\"mtp-hi\">v0.22.0<\/span>.<\/span><\/p>\n<div class=\"mtp-code\">\n<pre>vllm serve nvidia\/Kimi-K2.6-NVFP4 \\\n  --trust-remote-code \\\n  --tensor-parallel-size 4 \\\n  --tool-call-parser kimi_k2 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser kimi_k2 \\\n  --attention-backend tokenspeed_mla \\\n  --speculative-config \\\n    '{\"model\":\"lightseekorg\/kimi-k2.6-eagle3.1-mla\",\n      \"method\":\"eagle3\",\n      \"num_speculative_tokens\":3}' \\\n  --language-model-only<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"mtp-nav\">\n<button class=\"mtp-btn\" disabled>\u2190 Prev<\/button>\n<div class=\"mtp-dots\"><\/div>\n<p><span class=\"mtp-ctr\">1 \/ 7<\/span><br \/>\n<button class=\"mtp-btn\">Next \u2192<\/button>\n<\/p><\/div>\n<div class=\"mtp-foot\">\n<span class=\"mtp-brand\">Markt<b>ech<\/b>post<\/span><br \/>\n<span class=\"mtp-tagline\">AI &amp; ML Research, Simplified.<\/span>\n<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>EAGLE 3.1 fixes <strong>attention drift<\/strong>, a newly identified instability where the drafter loses focus on sink tokens at deeper speculation depths.<\/li>\n<li>Two architectural changes, <strong>FC normalization<\/strong> and <strong>post-norm hidden-state feedback<\/strong>, stabilize the drafter across speculation steps.<\/li>\n<li>In long-context workloads, EAGLE 3.1 delivers <strong>up to 2\u00d7 longer acceptance length<\/strong> compared with EAGLE 3.<\/li>\n<li>Benchmarks on Kimi-K2.6-NVFP4 show <strong>2.03\u00d7 per-user output throughput<\/strong> relative to the no-speculation baseline at concurrency 1, tapering to 1.66\u00d7 at C=16.<\/li>\n<li>EAGLE 3.1 is <strong>backward-compatible with EAGLE 3 checkpoints<\/strong> and is already merged into vLLM main, shipping in v0.22.0.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p 
class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/vllm.ai\/blog\/2026-05-26-eagle-3-1\" target=\"_blank\" rel=\"noreferrer noopener\">technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/27\/meet-eagle-3-1-the-speculative-decoding-algorithm-that-fixes-attention-drift-in-llm-inference\/\">Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Speculative decoding is a 
tech&hellip;<\/p>\n","protected":false},"author":1,"featured_media":981,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-980","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/980","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=980"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/980\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/981"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=980"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=980"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=980"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}