{"id":826,"date":"2026-05-01T09:16:07","date_gmt":"2026-05-01T01:16:07","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=826"},"modified":"2026-05-01T09:16:07","modified_gmt":"2026-05-01T01:16:07","slug":"moonshot-ai-open-sources-flashkda-cutlass-kernels-for-kimi-delta-attention-with-variable-length-batching-and-h20-benchmarks","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=826","title":{"rendered":"Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks"},"content":{"rendered":"<p>The team behind Kimi.ai (Moonshot AI) just made a significant contribution to the open-source AI infrastructure space. They released <strong>FlashKDA<\/strong> (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the <strong>Kimi Delta Attention (KDA)<\/strong> mechanism. The <strong>FlashKDA<\/strong> library is available on GitHub under an MIT license. It delivers prefill speedups of <strong>1.72\u00d7 to 2.22\u00d7<\/strong> over the <code>flash-linear-attention<\/code> baseline on NVIDIA H20 GPUs, and works as a drop-in backend for the popular <code>flash-linear-attention<\/code> library.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What Is Kimi Delta Attention, and Why Does It Matter?<\/strong><\/h3>\n<p>To understand FlashKDA, it helps to first understand where it sits in the LLM attention landscape.<\/p>\n<p>Standard softmax attention has quadratic complexity with respect to sequence length, so compute cost grows rapidly as context gets longer: doubling the context roughly quadruples the attention compute. This has driven a wave of research into <strong>linear attention<\/strong> mechanisms, which approximate or replace the softmax operation to achieve linear scaling. 
<strong>Kimi Delta Attention (KDA)<\/strong> is Moonshot AI\u2019s contribution to this space: a linear attention mechanism that refines the <strong>Gated DeltaNet<\/strong> with a finer-grained, <strong>channel-wise gating<\/strong> mechanism, enabling more effective use of limited finite-state RNN memory.<\/p>\n<p>KDA is not just a research prototype. It is the core attention mechanism in <strong>Kimi Linear<\/strong>, Moonshot AI\u2019s open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio \u2014 three KDA layers for every one global attention layer \u2014 which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6\u00d7 higher decoding throughput at 1 million context length compared to full attention. <strong>FlashKDA<\/strong> is the production-grade CUDA kernel that makes that architecture fast during prefill.<\/p>\n<p>Concretely, the KDA forward pass takes in queries (<code>q<\/code>), keys (<code>k<\/code>), values (<code>v<\/code>), a gate before activation (<code>g<\/code>), and beta logits (<code>beta<\/code>), along with a <code>scale<\/code> factor, an output tensor (<code>out<\/code>), and gate parameters: <code>A_log<\/code> (log-gate parameter per head), <code>dt_bias<\/code> (gate bias), and <code>lower_bound<\/code> (gate lower bound, ranging from -5.0 to 0). The sigmoid activation on <code>beta<\/code> is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states \u2014 useful for multi-turn inference where you want to carry state across requests.<\/p>\n<p>The recurrent formulation means the model can efficiently process long sequences during generation. 
But efficient <em>prefill<\/em> of these architectures still requires highly optimized GPU kernels \u2014 which is exactly what FlashKDA delivers.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Under the Hood: CUTLASS on Hopper<\/strong><\/h3>\n<p>FlashKDA is built on <strong>CUTLASS<\/strong>, NVIDIA\u2019s open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS allows developers to write kernels that take full advantage of NVIDIA\u2019s Tensor Core architecture, and it\u2019s the same foundation used by libraries like FlashAttention-3.<\/p>\n<p>The library targets <strong>SM90 and above<\/strong> \u2014 meaning NVIDIA\u2019s Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is predominantly CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.<\/p>\n<p>The core API is <code>flash_kda.fwd<\/code>, which takes the following inputs:<\/p>\n<ul class=\"wp-block-list\">\n<li><code>q<\/code>, <code>k<\/code>, <code>v<\/code>, <code>g<\/code>: all in <strong>bf16<\/strong> with shape <code>[B, T, H, K]<\/code> or <code>[B, T, H, V]<\/code> (where <code>g<\/code> is the gate <em>before<\/em> activation)<\/li>\n<li><code>beta<\/code>: bf16 beta logits in shape <code>[B, T, H]<\/code> (sigmoid applied internally)<\/li>\n<li><code>scale<\/code>: fp32 scalar scaling factor<\/li>\n<li><code>out<\/code>: bf16 output tensor in shape <code>[B, T, H, V]<\/code><\/li>\n<li><code>A_log<\/code>, <code>dt_bias<\/code>, <code>lower_bound<\/code>: fp32 gate parameters<\/li>\n<li><code>initial_state<\/code>, <code>final_state<\/code>: optional bf16 or fp32 recurrent states<\/li>\n<li><code>cu_seqlens<\/code>: optional int64 cumulative sequence lengths for <strong>variable-length batching<\/strong><\/li>\n<\/ul>\n<p>One current constraint: the kernel requires <code>K = V = 128<\/code> for head dimension.<\/p>\n<p>The variable-length batching 
support via <code>cu_seqlens<\/code> is particularly notable for production use. In real inference serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a key requirement for high-throughput serving systems.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Benchmark Results: 1.72\u00d7 to 2.22\u00d7 on H20<\/strong><\/h3>\n<p>The benchmark results (as of April 20, 2026) compare <code>flash_kda<\/code> against <code>fla_chunk_kda<\/code> (the existing <code>flash-linear-attention<\/code> implementation) across a sequence length of <code>T=8192<\/code>, head dimension <code>D=128<\/code>, and two head count configurations: <code>H=96<\/code> and <code>H=64<\/code>. Each benchmark ran with 30 warmup iterations, 200 measurement iterations, and 5 repeats.<\/p>\n<p>For <code>H=96<\/code>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Case<\/th>\n<th><code>flash_kda<\/code> (ms)<\/th>\n<th><code>fla_chunk_kda<\/code> (ms)<\/th>\n<th>Speedup<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Fixed<\/td>\n<td>2.6219<\/td>\n<td>4.5052<\/td>\n<td><strong>1.72\u00d7<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Varlen, <code>seq_lens<\/code>=[1300, 547, 2048, 963, 271, 3063]<\/td>\n<td>2.3420<\/td>\n<td>4.5717<\/td>\n<td><strong>1.95\u00d7<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Varlen, <code>seq_lens<\/code>=<code>1024 \u00d7 8<\/code><\/td>\n<td>2.0100<\/td>\n<td>4.4668<\/td>\n<td><strong>2.22\u00d7<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>For <code>H=64<\/code>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Case<\/th>\n<th><code>flash_kda<\/code> (ms)<\/th>\n<th><code>fla_chunk_kda<\/code> (ms)<\/th>\n<th>Speedup<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Fixed<\/td>\n<td>1.6199<\/td>\n<td>2.9587<\/td>\n<td><strong>1.83\u00d7<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Varlen, 
<code>seq_lens<\/code>=[1300, 547, 2048, 963, 271, 3063]<\/td>\n<td>1.7027<\/td>\n<td>3.0595<\/td>\n<td><strong>1.80\u00d7<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Varlen, <code>seq_lens<\/code>=<code>1024 \u00d7 8<\/code><\/td>\n<td>1.3930<\/td>\n<td>3.0412<\/td>\n<td><strong>2.18\u00d7<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The peak speedup of 2.22\u00d7 appears in the uniform variable-length case at <code>H=96<\/code> (<code>seq_lens=1024 \u00d7 8<\/code>, eight sequences of length 1024 summing to T=8192). The floor of the range, 1.72\u00d7, is the fixed-length case at the same head count. In all six measured configurations, FlashKDA is at least 1.72\u00d7 faster than the <code>flash-linear-attention<\/code> baseline.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Integration with flash-linear-attention<\/strong><\/h3>\n<p>One of the most practical aspects of FlashKDA is its integration story. Once installed, FlashKDA is <strong>auto-dispatched from flash-linear-attention\u2019s <code>chunk_kda<\/code><\/strong>, so existing codebases using <code>flash-linear-attention<\/code> don\u2019t need manual wiring to take advantage of the faster kernel. 
The integration is tracked in <a href=\"https:\/\/github.com\/fla-org\/flash-linear-attention\/pull\/852\">flash-linear-attention PR #852<\/a>.<\/p>\n<p><strong>Installation is straightforward:<\/strong><\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">git clone https:\/\/github.com\/MoonshotAI\/FlashKDA.git flash-kda\ncd flash-kda\ngit submodule update --init --recursive\npip install -v .<\/code><\/pre>\n<\/div>\n<\/div>\n<p>The correctness test suite (<code>tests\/test_fwd.py<\/code>) runs exact-match verification against a PyTorch reference implementation and cross-validates against <code>flash-linear-attention<\/code>. 
This gives AI devs a reliable baseline for auditing kernel behavior before deploying in production.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>FlashKDA is Moonshot AI\u2019s open-source CUTLASS-based CUDA kernel<\/strong> for Kimi Delta Attention (KDA), delivering <strong>1.72\u00d7\u20132.22\u00d7 prefill speedup<\/strong> over the <code>flash-linear-attention<\/code> baseline on NVIDIA H20 GPUs.<\/li>\n<li><strong>KDA extends Gated DeltaNet with fine-grained, channel-wise gating<\/strong> \u2014 it\u2019s the core attention mechanism behind Kimi Linear, a 48B-total \/ 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6\u00d7 higher decoding throughput at 1M context length.<\/li>\n<li><strong>The kernel targets SM90+ hardware<\/strong> (NVIDIA Hopper \u2014 H100, H20 and above), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of <code>K = V = 128<\/code>.<\/li>\n<li><strong>Variable-length batching is natively supported<\/strong> via the <code>cu_seqlens<\/code> parameter, allowing multiple sequences of different lengths to be packed into a single kernel call \u2014 a critical feature for high-throughput inference serving.<\/li>\n<li><strong>Once installed, FlashKDA is auto-dispatched from <code>flash-linear-attention<\/code>\u2019s <code>chunk_kda<\/code><\/strong>, making it a drop-in performance upgrade for any existing codebase already using the <code>flash-linear-attention<\/code> library \u2014 no architecture changes required.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/MoonshotAI\/FlashKDA\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/30\/moonshot-ai-open-sources-flashkda-cutlass-kernels-for-kimi-delta-attention-with-variable-length-batching-and-h20-benchmarks\/\">Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The team behind Kimi.ai 
(Moons&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-826","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=826"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/826\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=826"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=826"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}