{"id":873,"date":"2026-05-08T06:03:47","date_gmt":"2026-05-07T22:03:47","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=873"},"modified":"2026-05-08T06:03:47","modified_gmt":"2026-05-07T22:03:47","slug":"lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=873","title":{"rendered":"LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads"},"content":{"rendered":"<p>Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor grow from developer tools into infrastructure for software development at large, the inference engines serving their requests are under increasing strain. Researchers at the <strong>LightSeek Foundation<\/strong> have released <strong>TokenSpeed<\/strong>, an open-source LLM inference engine under the MIT license, designed specifically for the demands of agentic workloads. The <strong>TokenSpeed<\/strong> engine is currently in preview.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Why Agentic Inference is a Different Problem<\/strong><\/h3>\n<p>To see why TokenSpeed\u2019s design choices matter, it helps to understand what makes agentic inference hard. A coding-agent session doesn\u2019t behave like a typical chatbot turn. Contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. 
Most public benchmarks do not fully capture this behavior.<\/p>\n<p>TokenSpeed is designed to balance both: the objective is to maximize per-GPU TPM while maintaining a per-user TPS floor, typically 70 TPS and sometimes 200 TPS or higher.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Architecture: Five Interlocking Subsystems<\/strong><\/h3>\n<p>TokenSpeed\u2019s architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a restriction that enforces safe KV resource reuse, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.<\/p>\n<p>The <strong>modeling layer<\/strong> uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program on different subsets of data, a common pattern in distributed deep learning. Rather than hand-writing the communication logic between processes, developers specify I\/O placement annotations at module boundaries, and a lightweight static compiler automatically generates the required collective operations during model construction.<\/p>\n<p>The <strong>scheduler<\/strong> makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. 
Because these correctness constraints are encoded in the type system rather than left to runtime convention, errors in KV cache management (one of the most error-prone areas in LLM serving) are caught earlier, at compile time. The execution plane is implemented in Python to maintain development efficiency, enabling faster feature iteration and lower cognitive load for developers.<\/p>\n<p>The <strong>kernel layer<\/strong> treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, so the engine isn\u2019t locked to NVIDIA hardware. The team has also developed what it describes as one of the fastest <strong>MLA (Multi-head Latent Attention) kernels<\/strong> for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, because num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. 
Notably, TokenSpeed MLA has been adopted by vLLM.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1448\" height=\"850\" data-attachment-id=\"79663\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads\/screenshot-2026-05-07-at-3-03-21-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-3.03.21-PM-1.png\" data-orig-size=\"1448,850\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-07 at 3.03.21\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-3.03.21-PM-1-1024x601.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-3.03.21-PM-1.png\" alt=\"\" class=\"wp-image-79663\" \/><figcaption class=\"wp-element-caption\">https:\/\/lightseek.org\/blog\/lightseek-tokenspeed.html<\/figcaption><\/figure>\n<\/div>\n<p>Finally, TokenSpeed integrates <strong>SMG<\/strong> \u2014 a PyTorch-native component \u2014 for a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Benchmark Results Against TensorRT-LLM on NVIDIA B200<\/strong><\/h3>\n<p>It is worth noting upfront that these benchmarks cover single (non-disaggregated) deployment only. 
Prefill-decode (PD) disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the <strong>TokenSpeed<\/strong> team.<\/p>\n<p>In collaboration with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror production coding-agent traffic, and compared against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.<\/p>\n<p>For coding agents running above 70 TPS\/User, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS\/User. TP4 here refers to tensor parallelism across 4 GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.<\/p>\n<p>On the MLA kernel, the gains are more pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 <code>M<\/code> tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM\u2019s MLA across all five typical prefill workloads for coding agents with long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with long prefix KV cache.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>TokenSpeed<\/strong> is a new MIT-licensed, open-source LLM inference engine by LightSeek Foundation, built specifically for agentic workloads. 
(Available in preview mode)<\/li>\n<li><strong>Its scheduler<\/strong> uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.<\/li>\n<li><strong>On NVIDIA B200<\/strong>, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS\/User on Kimi K2.5.<\/li>\n<li><strong>The TokenSpeed MLA kernel<\/strong> nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/lightseek.org\/blog\/lightseek-tokenspeed.html\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>\u00a0<\/strong>and<strong>\u00a0<a href=\"https:\/\/github.com\/lightseekorg\/tokenspeed\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads\/\">LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Inference efficiency has quiet&hellip;<\/p>\n","protected":false},"author":1,"featured_media":874,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-873","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/873","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=873"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_ro
ute=\/wp\/v2\/posts\/873\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/874"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=873"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=873"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=873"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}