{"id":756,"date":"2026-04-20T08:51:27","date_gmt":"2026-04-20T00:51:27","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=756"},"modified":"2026-04-20T08:51:27","modified_gmt":"2026-04-20T00:51:27","slug":"moonshot-ai-and-tsinghua-researchers-propose-prfaas-a-cross-datacenter-kvcache-architecture-that-rethinks-how-llms-are-served-at-scale","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=756","title":{"rendered":"Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale"},"content":{"rendered":"<p>For years, the way large language models handle inference has been stuck inside a box \u2014 literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of  researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down \u2014 and that the right architecture can already exploit that shift.<\/p>\n<p>The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup \u2014 while consuming only a fraction of available cross-datacenter bandwidth. 
The research team notes that when compared at equal hardware cost, the throughput gain is approximately 15%, reflecting that the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1584\" height=\"836\" data-attachment-id=\"79156\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/19\/moonshot-ai-and-tsinghua-researchers-propose-prfaas-a-cross-datacenter-kvcache-architecture-that-rethinks-how-llms-are-served-at-scale\/screenshot-2026-04-19-at-5-50-12-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.12-PM-1.png\" data-orig-size=\"1584,836\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-19 at 5.50.12\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.12-PM-1-1024x540.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.12-PM-1.png\" alt=\"\" class=\"wp-image-79156\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.15039v1<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Why the Existing Architecture Has Hit a Wall<\/strong><\/h3>\n<p>To understand what PrfaaS solves, it helps to understand why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all of the input tokens and generates the KVCache \u2014 it is compute-intensive. 
Decode is where the model generates output tokens one at a time \u2014 it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation separates these two phases onto different hardware, which improves utilization and allows each phase to be independently optimized.<\/p>\n<p>The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode runs on another, the KVCache produced by prefill must be transferred to the decode side before output generation can begin. In conventional dense-attention models \u2014 those using Grouped Query Attention (GQA) \u2014 this KVCache is enormous. The research team benchmarks MiniMax-M2.5, a representative dense model with GQA, producing KVCache at roughly 60 Gbps for a 32K-token request on a single 8\u00d7H200 instance. That volume of data requires RDMA-class interconnects to transfer without stalling compute, which is why conventional PD disaggregation is tightly bound to a single datacenter-scale network fabric. Moving prefill and decode to separate clusters, let alone across datacenters, has simply not been feasible.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Hybrid Attention Changes the Math<\/strong><\/h3>\n<p>What makes PrfaaS timely is an architectural shift happening at the model level. A growing class of models \u2014 including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T \u2014 adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). In these architectures, only the full-attention layers produce KVCache that scales with sequence length. 
The linear-complexity layers maintain fixed-size recurrent states whose footprint is negligible at long context.<\/p>\n<p>The KV throughput numbers \u2014 defined as KVCache size divided by prefill latency \u2014 tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13\u00d7 reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4\u00d7 reduction. For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes roughly a 4.5\u00d7 compression over GQA, and the 7:1 hybrid ratio contributes another approximately 8\u00d7 reduction, yielding an overall KV memory saving of roughly 36\u00d7. For the internal 1T model used in the case study, KV throughput at 32K tokens is just 3.19 Gbps \u2014 a level that modern inter-datacenter Ethernet links can actually sustain.<\/p>\n<p>But the research team is careful to make a distinction that matters for AI devs building real systems: a smaller KVCache is necessary but not sufficient to make cross-datacenter PD disaggregation practical. Real workloads are bursty, request lengths are skewed, prefix caches are distributed unevenly across nodes, and inter-cluster bandwidth fluctuates. 
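The reductions quoted above are simple ratios that can be sanity-checked in a few lines; all Gbps figures come from the paper, and the helper below is purely illustrative, not part of any released code:

```python
# Sanity-checking the KV-throughput reductions quoted above. All Gbps
# figures are the paper's 32K-token measurements; reduction() is an
# illustrative helper name, not from the paper.

def reduction(dense_gbps: float, hybrid_gbps: float) -> float:
    """How many times less KVCache per second the hybrid model emits."""
    return dense_gbps / hybrid_gbps

# MiMo-V2-Flash vs MiniMax-M2.5: roughly 13x less KVCache to move.
print(round(reduction(59.93, 4.66), 1))   # 12.9

# Qwen3.5-397B vs Qwen3-235B: roughly 4x.
print(round(reduction(33.35, 8.25), 1))   # 4.0

# Ring-2.5-1T: MLA compression (~4.5x) stacked with the 7:1 hybrid
# layer ratio (~8x) composes to the ~36x overall KV memory saving.
print(4.5 * 8)   # 36.0
```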
A naive design that routes every prefill to a remote cluster still runs into congestion and unstable queuing.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1488\" height=\"898\" data-attachment-id=\"79158\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/19\/moonshot-ai-and-tsinghua-researchers-propose-prfaas-a-cross-datacenter-kvcache-architecture-that-rethinks-how-llms-are-served-at-scale\/screenshot-2026-04-19-at-5-50-40-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.40-PM-1.png\" data-orig-size=\"1488,898\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-19 at 5.50.40\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.40-PM-1-1024x618.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-19-at-5.50.40-PM-1.png\" alt=\"\" class=\"wp-image-79158\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.15039v1<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What PrfaaS Actually Does<\/strong><\/h3>\n<p>The PrfaaS-PD architecture sits on top of <strong>three subsystems<\/strong>: <strong>compute, network, <\/strong>and<strong> storage<\/strong>. The compute subsystem separates clusters into two types \u2014 local PD clusters that handle end-to-end inference for short requests, and PrfaaS clusters with high-compute-throughput accelerators dedicated to long-context prefill. 
The network subsystem uses intra-cluster RDMA for fast local transfers and commodity Ethernet for cross-cluster KVCache transport. The storage subsystem builds a distributed hybrid prefix cache pool that handles linear attention recurrent states (request-level, fixed-size, exact-match only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in separate groups backed by a unified block pool.<\/p>\n<p>The key routing mechanism is length-based threshold routing. Let <code>l<\/code> denote the incremental prefill length of a request after subtracting any cached prefix, and <code>t<\/code> a routing threshold. If <code>l &gt; t<\/code>, the request goes to the PrfaaS cluster and its KVCache is shipped over Ethernet to a decode node. If <code>l \u2264 t<\/code>, it stays on the local PD path. In the case study, the optimal threshold is <code>t = 19.4K<\/code> tokens, which routes approximately 50% of all requests \u2014 the longer ones \u2014 to the PrfaaS cluster.<\/p>\n<p>Making the Ethernet path reliable in practice requires more than just low KV throughput. The research team specifies three concrete transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion accumulation.<\/p>\n<p>On top of this, the research team introduces a dual-timescale scheduler. At short timescales, it monitors PrfaaS egress utilization and queue depth, adjusting routing when the link approaches its bandwidth ceiling. 
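Stripped to its core, that threshold rule is nearly a one-liner. The following sketch uses the case study's 19.4K-token optimum; the constant, function, and returned labels are our own illustrative names, not the paper's:

```python
# A minimal sketch of length-based threshold routing. The 19.4K-token
# threshold is the case study's optimum; THRESHOLD_T, route(), and the
# returned labels are illustrative names only.

THRESHOLD_T = 19_400  # tokens

def route(request_len: int, cached_prefix_len: int = 0) -> str:
    """Route on incremental prefill length l (input minus cached prefix)."""
    l = request_len - cached_prefix_len
    # l > t: offload prefill to the PrfaaS cluster and ship the resulting
    # KVCache back over Ethernet; l <= t: stay on the local PD path.
    return "prfaas" if l > THRESHOLD_T else "local_pd"

print(route(32_000))           # prfaas
print(route(32_000, 20_000))   # local_pd (incremental length is only 12K)
```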
It also handles cache-affine routing: when bandwidth is scarce, each cluster\u2019s prefix cache is evaluated independently; when bandwidth is abundant, the scheduler considers the best cached prefix across all clusters and performs a cross-cluster cache transfer if it reduces redundant computation. At longer timescales, the scheduler rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput-optimal operating point.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Numbers<\/strong><\/h3>\n<p>In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected by a VPC network providing approximately 100 Gbps of cross-cluster bandwidth. The aggregate PrfaaS egress load under the optimal configuration is approximately 13 Gbps \u2014 just 13% of available Ethernet capacity \u2014 and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom to spare. The research also projects this to larger deployments: even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter-datacenter links.<\/p>\n<p>Mean Time to First Token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. The naive heterogeneous configuration \u2014 all prefill on H200, all decode on H20, with no routing or scheduling logic \u2014 achieves only 1.16\u00d7 throughput over the homogeneous baseline, compared to 1.54\u00d7 for the full PrfaaS-PD system. 
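A quick division puts those multipliers, and the bandwidth headroom, in perspective; the inputs are the paper's reported figures, while the arithmetic below is our own back-of-the-envelope check:

```python
# Putting the case-study multipliers and bandwidth headroom in numbers.
# All inputs are the paper's reported figures; the division is ours.

naive_hetero = 1.16   # H200 prefill + H20 decode, no routing or scheduling
full_prfaas = 1.54    # full PrfaaS-PD with threshold routing and scheduler

# Multiplicative gain attributable to the scheduling layer alone.
print(round(full_prfaas / naive_hetero, 2))   # 1.33

# Aggregate PrfaaS egress vs available cross-cluster Ethernet capacity.
egress_gbps, link_gbps = 13.0, 100.0
print(f"{egress_gbps / link_gbps:.0%} of the link")   # 13% of the link
```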
The gap between 1.16\u00d7 and 1.54\u00d7 isolates the contribution of the scheduling layer and shows it accounts for the majority of the practical gain.<\/p>\n<p>The research team positions PrfaaS not as a near-future concept but as a design that is viable today for hybrid-architecture models \u2014 and argues that as context windows grow, KVCache compression techniques mature, and phase-specialized hardware such as NVIDIA\u2019s Rubin CPX for prefill and LPU-style chips for decode become more widely available, the case for cross-datacenter PD disaggregation will only strengthen.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the<strong>\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2604.15039v1\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a><\/strong>.<strong>\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">130k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/19\/moonshot-ai-and-tsinghua-researchers-propose-prfaas-a-cross-datacenter-kvcache-architecture-that-rethinks-how-llms-are-served-at-scale\/\">Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>For years, the way large langu&hellip;<\/p>\n","protected":false},"author":1,"featured_media":757,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-756","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/756","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=756"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/
v2\/posts\/756\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/757"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=756"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=756"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=756"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}