{"id":884,"date":"2026-05-11T16:36:00","date_gmt":"2026-05-11T08:36:00","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=884"},"modified":"2026-05-11T16:36:00","modified_gmt":"2026-05-11T08:36:00","slug":"sakana-ai-and-nvidia-introduce-twell-with-cuda-kernels-for-20-5-inference-and-21-9-training-speedup-in-llms","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=884","title":{"rendered":"Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs"},"content":{"rendered":"<p>Scaling large language models (LLMs) is expensive. Every token processed during inference and every gradient computed during training flows through feedforward layers that account for over two-thirds of model parameters and more than 80% of total FLOPs in larger models. A team of researchers from Sakana AI and NVIDIA has published new research that directly targets this bottleneck, not by changing the architecture but by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Sparsity Exists, But GPUs Ignore It<\/strong><\/h3>\n<p>Inside a transformer\u2019s feedforward block, for any given input token, only a small fraction of hidden neurons actually fire \u2014 the rest produce zero after passing through the activation function. This is called activation sparsity, and prior work has documented this phenomenon in models with ReLU activations.<\/p>\n<p>The frustrating reality is that these theoretical savings rarely translate into actual speedups. NVIDIA GPUs are heavily optimized for dense matrix multiplications using Tensor Cores, which operate on large contiguous tiles of data. 
Traditional sparse formats like ELLPACK (ELL) require a separate kernel pass to convert activations from dense to sparse representation, and that conversion overhead often cancels out what\u2019s saved by skipping the zeros.<\/p>\n<p>Critically, prior work on sparse LLM kernels (including TurboSparse, ProSparse, and Q-Sparse) has focused on memory-bound GEMV operations \u2014 the single- or few-token inference regime. The research team instead targets compute-bound GEMM operations in the batched setting with thousands of input tokens, where dense baselines on modern devices can execute orders-of-magnitude higher FLOP\/s with large tiles and Tensor Cores. That is a fundamentally harder problem, and the reason prior approaches didn\u2019t generalize to batched training or high-throughput inference.<\/p>\n<div>\n<p>  <!-- HEADER --><\/p>\n<div class=\"tw-header\">\n<div class=\"tw-header-icon\">GUIDE<\/div>\n<div class=\"tw-header-text\">\n<div class=\"tw-header-title\">Sparser, Faster, Lighter LLMs \u2014 TwELL &amp; Sparse CUDA Kernels<\/div>\n<div class=\"tw-header-sub\">Sakana AI \u00d7 NVIDIA \u00a0\u2014\u00a0 arXiv:2603.23198 \u00a0\u2014\u00a0 ICML 2026<\/div>\n<\/div>\n<\/div>\n<p>  <!-- STEP NAV --><\/p>\n<div class=\"tw-nav\">\n    <!-- built by JS -->\n  <\/div>\n<p>  <!-- PANELS --><\/p>\n<div class=\"tw-body\">\n<p>    <!-- PANEL 1 --><\/p>\n<div class=\"tw-panel\" data-panel=\"0\">\n<div class=\"tw-panel-tag\">01 \u2014 The Problem<\/div>\n<div class=\"tw-panel-title\">Feedforward layers dominate LLM cost \u2014 and most of that work is wasted.<\/div>\n<div class=\"tw-stats\">\n<div class=\"tw-stat\">\n<div class=\"tw-stat-val\">&gt;\u200a\u200a\u200a\u200a\u2154<\/div>\n<div class=\"tw-stat-lbl\">of all model parameters live in feedforward layers<\/div>\n<\/div>\n<div class=\"tw-stat\">\n<div class=\"tw-stat-val\">80%+<\/div>\n<div class=\"tw-stat-lbl\">of total FLOPs consumed by feedforward layers<\/div>\n<\/div>\n<div class=\"tw-stat\">\n<div 
class=\"tw-stat-val\">99%+<\/div>\n<div class=\"tw-stat-lbl\">of hidden activations can be zero with no accuracy drop<\/div>\n<\/div>\n<\/div>\n<div class=\"tw-highlight\">\n<p>For any given token, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function. This is called <strong>activation sparsity<\/strong> \u2014 and it has historically been impossible to exploit on modern GPUs because sparse operations ran slower than dense ones.<\/p>\n<\/div>\n<div class=\"tw-text\">Prior sparse LLM kernels (TurboSparse, ProSparse, Q-Sparse) only targeted <strong>single-token GEMV operations<\/strong>. Sakana AI and NVIDIA tackle the harder problem: <strong>batched GEMM<\/strong> with thousands of tokens \u2014 the regime that covers both training and high-throughput inference.<\/div>\n<\/div>\n<p>    <!-- PANEL 2 --><\/p>\n<div class=\"tw-panel\" data-panel=\"1\">\n<div class=\"tw-panel-tag\">02 \u2014 The Innovation<\/div>\n<div class=\"tw-panel-title\">TwELL: a sparse format built around how GPU kernels actually work.<\/div>\n<div class=\"tw-cols\">\n<div class=\"tw-card\">\n<div class=\"tw-card-tag\">Old Way \u2014 ELL<\/div>\n<div class=\"tw-card-title\">Row-wide packing, costly to build<\/div>\n<div class=\"tw-card-text\">Standard ELLPACK packs non-zeros row-by-row across the entire matrix. To construct it from a tiled matmul output you need a separate kernel launch, a full global memory read, and synchronization across all CTAs. Those overheads cancel out the savings from skipping zeros.<\/div>\n<\/div>\n<div class=\"tw-card\">\n<div class=\"tw-card-tag\">New Way \u2014 TwELL<\/div>\n<div class=\"tw-card-title\">Tile-wise packing, built in the epilogue<\/div>\n<div class=\"tw-card-text\">TwELL partitions columns into horizontal tiles matching the matmul kernel\u2019s tile size T_n. Non-zeros are packed locally within each tile. 
By matching dimensions, TwELL is constructed <strong>inside the existing gate projection kernel epilogue<\/strong> \u2014 no extra kernel, no extra memory read, no synchronization overhead.<\/div>\n<\/div>\n<\/div>\n<div class=\"tw-highlight\">\n<p>The inference pipeline uses <strong>one fused kernel<\/strong> that reads gate activations in TwELL format and performs up + down projections together. The intermediate hidden state is never written to global memory, cutting DRAM traffic at every forward pass.<\/p>\n<\/div>\n<div class=\"tw-highlight\">\n<p>For training, a <strong>hybrid sparse format<\/strong> dynamically routes rows into a compact ELL matrix (sparse rows) or a dense backup (overflow rows). Sparsity during training is highly non-uniform \u2014 max non-zeros per row can be orders of magnitude above the average \u2014 so the hybrid design handles this without becoming brittle.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- PANEL 3 --><\/p>\n<div class=\"tw-panel\" data-panel=\"2\">\n<div class=\"tw-panel-tag\">03 \u2014 Training Recipe<\/div>\n<div class=\"tw-panel-title\">Two changes to your training config. Nothing else.<\/div>\n<div class=\"tw-rows\">\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">01<\/div>\n<div class=\"tw-row-text\"><strong>Replace SiLU with ReLU<\/strong> as the gate activation function. ReLU produces exact zeros for negative inputs \u2014 this is what enables unstructured sparsity. No other architectural change is needed. (Unregularized ReLU sits slightly below SiLU on task accuracy: 46.4% vs 47.1% on the 1.5B model, offset by the efficiency gains.)<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">02<\/div>\n<div class=\"tw-row-text\"><strong>Add an L1 loss term<\/strong> on the hidden feedforward activations, averaged over all tokens and hidden dimensions across all layers. Recommended coefficient: <code>L1 = 2\u00d710\u207b\u2075<\/code>. Add it to your standard cross-entropy loss. 
No changes to learning rate, weight decay, batch size, or optimizer.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">03<\/div>\n<div class=\"tw-row-text\"><strong>Sparsity stabilizes fast.<\/strong> The non-zero count settles within ~1,000 training steps (~1B tokens). The training kernels deliver memory and throughput benefits for almost the entire training run, not just toward the end.<\/div>\n<\/div>\n<\/div>\n<div class=\"tw-warn\">\n<div class=\"tw-warn-label\">Watch Out<\/div>\n<p>At L1 = 2\u00d710\u207b\u2075, over <strong>30% of neurons become permanently inactive (dead neurons)<\/strong> on average across layers. Downstream accuracy is not visibly affected at this level. The paper explores targeted gate weight reinitialization as a mitigation \u2014 yielding +19.1% speedup vs +17.9% baseline with no accuracy cost.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- PANEL 4 --><\/p>\n<div class=\"tw-panel\" data-panel=\"3\">\n<div class=\"tw-panel-tag\">04 \u2014 Benchmark Results<\/div>\n<div class=\"tw-panel-title\">Accuracy preserved. 
Efficiency scales up with model size.<\/div>\n<div class=\"tw-table-wrap\">\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Accuracy<\/th>\n<th>Inference<\/th>\n<th>Energy \/ tok<\/th>\n<th>Training<\/th>\n<th>Peak Mem<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"tm\">0.5B<\/td>\n<td>40.4% \u2192 40.4%<\/td>\n<td><span class=\"tg\">+17.0%<\/span><\/td>\n<td><span class=\"tg\">\u221211.8%<\/span><\/td>\n<td><span class=\"tn\">\u22121.5%<\/span><\/td>\n<td><span class=\"tg\">\u221219.2%<\/span><\/td>\n<\/tr>\n<tr>\n<td class=\"tm\">1B<\/td>\n<td>44.6% \u2192 44.7%<\/td>\n<td><span class=\"tg\">+18.1%<\/span><\/td>\n<td><span class=\"tg\">\u221214.6%<\/span><\/td>\n<td><span class=\"tg\">+7.1%<\/span><\/td>\n<td><span class=\"tg\">\u221225.5%<\/span><\/td>\n<\/tr>\n<tr>\n<td class=\"tm\">1.5B<\/td>\n<td>46.4% \u2192 46.2%<\/td>\n<td><span class=\"tg\">+18.8%<\/span><\/td>\n<td><span class=\"tg\">\u221215.0%<\/span><\/td>\n<td><span class=\"tg\">+11.6%<\/span><\/td>\n<td><span class=\"tg\">\u221228.1%<\/span><\/td>\n<\/tr>\n<tr>\n<td class=\"tm\">2B<\/td>\n<td>49.1% \u2192 48.8%<\/td>\n<td><span class=\"tg\">+20.5%<\/span><\/td>\n<td><span class=\"tg\">\u221217.0%<\/span><\/td>\n<td><span class=\"tg\">+21.9%<\/span><\/td>\n<td><span class=\"tn\">+22.3%\u200a*<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<div class=\"tw-text\">All results at L1 = 2\u00d710\u207b\u2075 on a single node of eight H100 PCIe GPUs, sequence length 2048. <strong>Efficiency gains grow with scale<\/strong> \u2014 average non-zero activations drop from 39 (0.5B) to 24 (2B), giving the sparse kernels proportionally more computation to skip. 
* The 2B sparse model uses a larger micro-batch enabled by reduced activation memory, raising peak usage while improving throughput.<\/div>\n<\/div>\n<p>    <!-- PANEL 5 --><\/p>\n<div class=\"tw-panel\" data-panel=\"4\">\n<div class=\"tw-panel-tag\">05 \u2014 Key Findings<\/div>\n<div class=\"tw-panel-title\">What the paper reveals about where sparsity actually lives.<\/div>\n<div class=\"tw-rows\">\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25c6<\/div>\n<div class=\"tw-row-text\"><strong>Early layers are least active.<\/strong> In a 28-layer 1.5B model, the first two layers have the fewest non-zero activations. Activity peaks in the early-to-middle layers \u2014 consistent with prior work showing LLM reasoning and knowledge retrieval concentrate there.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25c6<\/div>\n<div class=\"tw-row-text\"><strong>First tokens in a sequence fire far more neurons.<\/strong> The model allocates exponentially more computation to early sequence positions where contextual cues from prior tokens are absent. This non-uniformity is exactly what the sparse kernels exploit for speedups.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25c6<\/div>\n<div class=\"tw-row-text\"><strong>Strong inverse correlation between sparsity and speedup.<\/strong> The paper measures a Pearson correlation of \u22120.996 between each layer\u2019s average non-zero count and its inference speedup contribution. Sparser layers deliver proportionally larger gains.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25c6<\/div>\n<div class=\"tw-row-text\"><strong>Larger gains on less specialized hardware.<\/strong> On NVIDIA RTX PRO 6000 (188 SMs vs 114 on H100), training speedups are significantly higher. 
Dense GEMM is slower on the RTX PRO 6000, while sparse ops run faster \u2014 widening the relative advantage of sparsity on accessible hardware.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>    <!-- PANEL 6 --><\/p>\n<div class=\"tw-panel\" data-panel=\"5\">\n<div class=\"tw-panel-tag\">06 \u2014 Get Started<\/div>\n<div class=\"tw-panel-title\">Open-source. All kernels and training code released.<\/div>\n<div class=\"tw-rows\">\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25a0<\/div>\n<div class=\"tw-row-text\"><strong>Architecture:<\/strong> Works with gated feedforward LLMs \u2014 Llama, Qwen, and any Transformer++ design. Non-gated (original transformer) variant also supported: 11.2% inference speedup vs 17.9% for gated at the same L1.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25a0<\/div>\n<div class=\"tw-row-text\"><strong>Hardware:<\/strong> CUDA kernels written for H100 GPUs using TMA-based pipelining and persistent cooperative design. Gains verified on RTX PRO 6000 with even larger speedups.<\/div>\n<\/div>\n<div class=\"tw-row\">\n<div class=\"tw-row-bullet\">\u25a0<\/div>\n<div class=\"tw-row-text\"><strong>Existing models:<\/strong> Fine-tuning via sparsification approaches is flagged as a future direction for bringing these kernels to pretrained dense models \u2014 not yet demonstrated in this paper.<\/div>\n<\/div>\n<\/div>\n<div class=\"tw-links\">\n        <a class=\"tw-link tw-link-solid\" href=\"https:\/\/github.com\/SakanaAI\/sparser-faster-llms\" target=\"_blank\" rel=\"noopener\">\ud83d\udcc4\u00a0 GitHub \u2014 Code &amp; Kernels<\/a><br \/>\n        <a class=\"tw-link tw-link-ghost\" href=\"https:\/\/arxiv.org\/abs\/2603.23198\" target=\"_blank\" rel=\"noopener\">\ud83d\udcd1\u00a0 arXiv Paper<\/a><br \/>\n        <a class=\"tw-link tw-link-ghost\" href=\"https:\/\/pub.sakana.ai\/sparser-faster-llms\/\" target=\"_blank\" rel=\"noopener\">\ud83c\udf10\u00a0 Project Page<\/a>\n      
<\/div>\n<\/div>\n<\/div>\n<p><!-- \/tw-body --><\/p>\n<p>  <!-- FOOTER --><\/p>\n<div class=\"tw-footer\">\n<div class=\"tw-footer-nav\">\n      <button class=\"tw-btn\">\u2190 Prev<\/button><br \/>\n      <button class=\"tw-btn\">Next \u2192<\/button>\n    <\/div>\n<div class=\"tw-counter\">1 \/ 6<\/div>\n<div class=\"tw-credit\">Document Created by Marktechpost.com<\/div>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>So, What Exactly is Proposed<\/strong><\/h3>\n<p>The research team addresses this mismatch with two primary contributions: a new sparse data format called <strong>TwELL (Tile-wise ELLPACK)<\/strong>, and a set of custom CUDA kernels for inference and training built around it.<\/p>\n<p><strong>TwELL<\/strong> is designed around one key insight: modern matmul kernels already divide computation across small 2D tiles (of size T_m \u00d7 T_n) assigned to individual cooperative thread arrays (CTAs). Standard ELL packs non-zeros row-by-row across the entire matrix, which requires global synchronization to construct from tiled matmul outputs. TwELL instead partitions the columns of the gate activation matrix into horizontal tiles of size T, and within each tile stores non-zero values and their indices in a local ELL-style layout. By matching the tile dimension T to the column tile size T_n of the matmul kernel, TwELL can be produced directly in the epilogue of the gate projection kernel \u2014 no extra kernel launch, no additional global memory read, no synchronization across CTAs. 
The format uses a compression factor C such that T\/C exceeds the maximum non-zeros per tile, and packages values, indices, and non-zero counts into a single 32-bit matrix for locality.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1688\" height=\"972\" data-attachment-id=\"79737\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/11\/sakana-ai-and-nvidia-introduce-twell-with-cuda-kernels-for-20-5-inference-and-21-9-training-speedup-in-llms\/screenshot-2026-05-11-at-1-25-44-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-11-at-1.25.44-AM.png\" data-orig-size=\"1688,972\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-11 at 1.25.44\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-11-at-1.25.44-AM-1024x590.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-11-at-1.25.44-AM.png\" alt=\"\" class=\"wp-image-79737\" \/><figcaption class=\"wp-element-caption\">https:\/\/pub.sakana.ai\/sparser-faster-llms\/<\/figcaption><\/figure>\n<\/div>\n<p><strong>For inference<\/strong>, a single fused kernel takes the gate activations in TwELL format and performs the up and down projections together. Each CTA handles one row of inputs, iterating first statically over column tiles and then dynamically over each tile&#8217;s non-zero count. 
For each active neuron at index n, the CTA loads the n-th column of the up projection weight matrix W_u and the n-th row of the down projection weight matrix W_d, computes the dot product of the input row with the W_u column, scales it by the gate activation, and accumulates the correspondingly weighted W_d row into the output. The intermediate hidden state h_u is never materialized in global memory, cutting DRAM traffic significantly.<\/p>\n<p><strong>For training<\/strong>, the situation is more complex because sparsity patterns are highly non-uniform across tokens and layers \u2014 the maximum non-zeros per row can be orders of magnitude above the average, making a pure ELL layout brittle. The research team introduces a <strong>hybrid sparse format<\/strong> that dynamically routes rows either into a compact ELL matrix (for rows below a non-zero threshold) or into a dense backup matrix (for overflow rows). This allows efficient sparse gradient computation in the backward pass without requiring dense-to-dense matmuls for most rows. The team also releases kernels for the original non-gated transformer feedforward block; at the recommended sparsity level, the non-gated variant achieves an 11.2% inference speedup compared to 17.9% for the gated design.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Just ReLU and L1 Regularization<\/strong><\/h3>\n<p>The sparsity induction strategy is deliberately minimal. The research team uses ReLU as the gate activation function and adds a simple L1 loss term on the hidden feedforward activations, controlled by a coefficient L1. 
No other architectural changes are required, and the research team reports that adding L1 regularization did not require changes to other hyperparameters (learning rate, weight decay, optimizer settings).<\/p>\n<p>Models were trained on the FineWeb dataset (a deduplicated FineWeb-Edu split) at Chinchilla-optimal token counts \u2014 approximately 10B tokens for a 0.5B model up to 40B tokens for a 2B model \u2014 with a context length of 2048 and a batch size of 1M tokens.<\/p>\n<p>Testing eight L1 coefficient values on a 1.5B parameter model, the team finds that up to L1 = 3 \u00d7 10<sup>\u22125<\/sup>, there is essentially no drop in mean task accuracy across seven downstream benchmarks (ARC Easy\/Challenge, HellaSwag, OpenBookQA, PIQA, WinoGrande, CommonsenseQA), with final cross-entropy increasing by less than 2% relative to the unregularized baseline. The recommended setting L1 = 2 \u00d7 10<sup>\u22125<\/sup> reduces average non-zero activations from 911 per layer (in the unregularized 1.5B model with a feedforward hidden dimension of 5632) down to just 29 \u2014 roughly 99.5% sparsity \u2014 with no measurable downstream performance loss.<\/p>\n<p>One important caveat: at L1 = 2 \u00d7 10<sup>\u22125<\/sup>, over 30% of neurons become permanently inactive (dead neurons) on average across layers. The research team explores two mitigation strategies \u2014 scheduling the L1 warmup and applying targeted reinitialization to dead gate projection columns \u2014 and finds that the reinitialization approach maintains similar sparsity levels while slightly improving both downstream accuracy and efficiency (+19.1% inference speedup vs. +17.9% baseline). Further work on dead-neuron mitigation is listed as a direction for future work.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Measured Efficiency Gains<\/strong><\/h3>\n<p>The efficiency results are reported on a single node of eight H100 PCIe GPUs, with a fixed sequence length of 2048 tokens. 
For the cross-scale comparison, the L1 coefficient is fixed at 2 \u00d7 10<sup>\u22125<\/sup>.<\/p>\n<p>At smaller scales, sparsity delivers clear peak memory reductions during training:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Dense Peak Memory<\/th>\n<th>Sparse Peak Memory<\/th>\n<th>Change<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>0.5B<\/td>\n<td>26.2 GB<\/td>\n<td>21.2 GB<\/td>\n<td>\u221219.2%<\/td>\n<\/tr>\n<tr>\n<td>1B<\/td>\n<td>44.5 GB<\/td>\n<td>33.1 GB<\/td>\n<td>\u221225.5%<\/td>\n<\/tr>\n<tr>\n<td>1.5B<\/td>\n<td>62.8 GB<\/td>\n<td>45.1 GB<\/td>\n<td>\u221228.1%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>At 2B parameters, the sparse model uses a larger micro-batch (enabled by reduced activation memory at that scale), which results in higher peak GPU memory (46.7 \u2192 57.1 GB) but faster training throughput (+21.9%). The efficiency gains on all metrics for the 2B model:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Forward execution throughput<\/strong>: 87.8 \u2192 106 input tokens\/ms (<strong>+20.5%<\/strong>)<\/li>\n<li><strong>Energy per token<\/strong>: 7.85 \u2192 6.51 mJ (<strong>\u221217.0%<\/strong>)<\/li>\n<li><strong>Training step throughput<\/strong>: 22.4 \u2192 27.3 input tokens\/ms (<strong>+21.9%<\/strong>)<\/li>\n<\/ul>\n<p>Across the full 0.5B\u20132B range, mean task accuracy of sparse and non-sparse models remains statistically indistinguishable. Efficiency benefits grow with model scale: larger models naturally develop lower average non-zero counts (dropping from 39 at 0.5B to 24 at 2B), which means the sparse kernels skip a proportionally greater share of computation.<\/p>\n<p>Training speedups are also observed on NVIDIA&#8217;s RTX PRO 6000 GPU, where the larger Streaming Multiprocessor count (188 vs. 
114 on H100) allows sparse operations to run faster \u2014 suggesting these gains extend to less specialized hardware.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What the Sparsity Patterns Reveal<\/strong><\/h3>\n<p>Sparsity is not uniform: the first two layers of a 28-layer 1.5B model are the least active, followed by a pronounced peak in non-zero activations across early-middle layers \u2014 consistent with prior work suggesting this is where much of LLM reasoning and knowledge retrieval occurs. Separately, the first tokens in an input sequence activate far more neurons than later tokens, with an exponential decrease thereafter. The research team observed an inverse Pearson correlation of \u22120.996 between each layer&#8217;s average non-zero count and its inference speedup contribution, confirming that the sparsest layers provide the greatest per-layer gains.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.23198\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/github.com\/SakanaAI\/sparser-faster-llms\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> <\/strong>and<strong> <a href=\"https:\/\/pub.sakana.ai\/sparser-faster-llms\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.<\/strong><\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/11\/sakana-ai-and-nvidia-introduce-twell-with-cuda-kernels-for-20-5-inference-and-21-9-training-speedup-in-llms\/\">Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Scaling large language models &hellip;<\/p>\n","protected":false},"author":1,"featured_media":885,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-884","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/884","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=884"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/884\/revisions"}],"wp:featuredmedia":[{"em
beddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/885"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=884"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=884"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}