{"id":391,"date":"2026-02-11T12:38:57","date_gmt":"2026-02-11T04:38:57","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=391"},"modified":"2026-02-11T12:38:57","modified_gmt":"2026-02-11T04:38:57","slug":"nvidia-researchers-introduce-kvtc-transform-coding-pipeline-to-compress-key-value-caches-by-20x-for-efficient-llm-serving","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=391","title":{"rendered":"NVIDIA Researchers Introduce KVTC Transform Coding Pipeline to Compress Key-Value Caches by 20x for Efficient LLM Serving"},"content":{"rendered":"<p>Serving Large Language Models (LLMs) at scale is a massive engineering challenge because of Key-Value (KV) cache management. As models grow in size and reasoning capability, the KV cache footprint increases and becomes a major bottleneck for throughput and latency. For modern Transformers, this cache can occupy multiple gigabytes.<\/p>\n<p>NVIDIA researchers have introduced <strong>KVTC<\/strong> (KV Cache Transform Coding). This lightweight transform coder compresses KV caches for compact on-GPU and off-GPU storage. It achieves up to <strong>20x<\/strong> compression while maintaining reasoning and long-context accuracy. 
For specific use cases, it can reach <strong>40x<\/strong> or higher.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1338\" height=\"448\" data-attachment-id=\"77840\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/10\/nvidia-researchers-introduce-kvtc-transform-coding-pipeline-to-compress-key-value-caches-by-20x-for-efficient-llm-serving\/screenshot-2026-02-10-at-8-28-40-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.28.40-PM-1.png\" data-orig-size=\"1338,448\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-10 at 8.28.40\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.28.40-PM-1-300x100.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.28.40-PM-1-1024x343.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.28.40-PM-1.png\" alt=\"\" class=\"wp-image-77840\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2511.01815<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Memory Dilemma in LLM Inference<\/strong><\/h3>\n<p>In production, inference frameworks treat local KV caches like databases. Strategies like prefix sharing promote the reuse of caches to speed up responses. However, stale caches consume scarce GPU memory. 
<strong>Developers currently face a difficult choice:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Keep the cache:<\/strong> Occupies memory needed for other users.<\/li>\n<li><strong>Discard the cache:<\/strong> Incurs the high cost of recomputation.<\/li>\n<li><strong>Offload the cache:<\/strong> Moves data to CPU DRAM or SSDs, leading to transfer overheads.<\/li>\n<\/ul>\n<p><strong>KVTC<\/strong> largely mitigates this dilemma by lowering the cost of on-GPU retention and reducing the bandwidth required for offloading.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1324\" height=\"1058\" data-attachment-id=\"77842\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/10\/nvidia-researchers-introduce-kvtc-transform-coding-pipeline-to-compress-key-value-caches-by-20x-for-efficient-llm-serving\/screenshot-2026-02-10-at-8-29-01-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.01-PM-1.png\" data-orig-size=\"1324,1058\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-10 at 8.29.01\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.01-PM-1-300x240.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.01-PM-1-1024x818.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.01-PM-1.png\" alt=\"\" class=\"wp-image-77842\" \/>
<figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2511.01815<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>How the KVTC Pipeline Works<\/strong><\/h3>\n<p>The method is inspired by classical media compression. It applies a learned orthonormal transform, followed by adaptive quantization and entropy coding.<\/p>\n<h4 class=\"wp-block-heading\"><strong>1. Feature Decorrelation (PCA)<\/strong><\/h4>\n<p>Different attention heads often show similar patterns and a high degree of correlation. <strong>KVTC<\/strong> uses Principal Component Analysis (PCA) to linearly decorrelate features. Unlike other methods that calculate a separate decomposition for every prompt, <strong>KVTC<\/strong> computes the PCA basis matrix <strong>V<\/strong> once on a calibration dataset. This matrix is then reused for all future caches at inference time.<\/p>\n<h4 class=\"wp-block-heading\"><strong>2. Adaptive Quantization<\/strong><\/h4>\n<p>The system exploits the PCA ordering to allocate a fixed bit budget across coordinates. High-variance components receive more bits, while others receive fewer. <strong>KVTC<\/strong> uses a dynamic programming (DP) algorithm to find the optimal bit allocation that minimizes reconstruction error. Crucially, the DP often assigns <strong>0 bits<\/strong> to trailing principal components, allowing those components to be dropped entirely, which reduces dimensionality early and speeds up the rest of the pipeline.<\/p>\n<h4 class=\"wp-block-heading\"><strong>3. Entropy Coding<\/strong><\/h4>\n<p>The quantized symbols are packed and compressed using the <strong>DEFLATE<\/strong> algorithm. To maintain speed, <strong>KVTC<\/strong> leverages the <strong>nvCOMP<\/strong> library, which enables parallel compression and decompression directly on the GPU.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Protecting Critical Tokens<\/strong><\/h3>\n<p>Not all tokens are compressed equally.
<strong>KVTC avoids compressing two specific types of tokens because they contribute disproportionately to attention accuracy:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Attention Sinks:<\/strong> The <strong>4<\/strong> oldest tokens in the sequence.<\/li>\n<li><strong>Sliding Window:<\/strong> The <strong>128<\/strong> most recent tokens.<\/li>\n<\/ul>\n<p>Ablation studies show that compressing these specific tokens can significantly lower, or even collapse, accuracy at high compression ratios.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Benchmarks and Efficiency<\/strong><\/h3>\n<p>The research team tested <strong>KVTC<\/strong> on models including Llama-3.1, Mistral-NeMo, and R1-Qwen-2.5.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Accuracy:<\/strong> At <strong>16x<\/strong> compression (roughly <strong>20x<\/strong> after DEFLATE), the compressed models consistently stay within <strong>1<\/strong> score point of the vanilla models.<\/li>\n<li><strong>TTFT Reduction:<\/strong> For an <strong>8K<\/strong> context length, <strong>KVTC<\/strong> can reduce Time-To-First-Token (TTFT) by up to <strong>8x<\/strong> compared to full recomputation.<\/li>\n<li><strong>Speed:<\/strong> Calibration is fast; for a 12B model, it can be completed within <strong>10<\/strong> minutes on an NVIDIA H100 GPU.<\/li>\n<li><strong>Storage Overhead:<\/strong> The extra data stored per model is small, representing only <strong>2.4%<\/strong> of model parameters for Llama-3.3-70B.<\/li>\n<\/ul>\n<p><strong>KVTC<\/strong> is a practical building block for memory-efficient LLM serving.
It does not modify model weights and is directly compatible with token eviction methods.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1748\" height=\"1004\" data-attachment-id=\"77844\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/10\/nvidia-researchers-introduce-kvtc-transform-coding-pipeline-to-compress-key-value-caches-by-20x-for-efficient-llm-serving\/screenshot-2026-02-10-at-8-29-54-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.54-PM-1.png\" data-orig-size=\"1748,1004\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-10 at 8.29.54\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.54-PM-1-300x172.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.54-PM-1-1024x588.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-10-at-8.29.54-PM-1.png\" alt=\"\" class=\"wp-image-77844\" \/>
<figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2511.01815<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>High Compression with Low Accuracy Loss:<\/strong> <strong>KVTC<\/strong> achieves a standard <strong>20x<\/strong> compression ratio while maintaining results within <strong>1 score point<\/strong> of vanilla (uncompressed) models across most reasoning and long-context
benchmarks.<\/li>\n<li><strong>Transform Coding Pipeline:<\/strong> The method utilizes a pipeline inspired by classical media compression, combining <strong>PCA-based feature decorrelation<\/strong>, <strong>adaptive quantization<\/strong> via dynamic programming, and <strong>lossless entropy coding<\/strong> (DEFLATE).<\/li>\n<li><strong>Critical Token Protection:<\/strong> To maintain model performance, <strong>KVTC<\/strong> avoids compressing the <strong>4<\/strong> oldest \u2018attention sink\u2019 tokens and a \u2018sliding window\u2019 of the <strong>128<\/strong> most recent tokens.<\/li>\n<li><strong>Operational Efficiency:<\/strong> The system is \u2018tuning-free,\u2019 requiring only a brief initial calibration (under <strong>10 minutes<\/strong> for a 12B model) that leaves model parameters unchanged and adds minimal storage overhead\u2014only <strong>2.4%<\/strong> for a 70B model.<\/li>\n<li><strong>Significant Latency Reduction:<\/strong> By reducing the volume of data stored and transferred, <strong>KVTC<\/strong> can reduce Time-To-First-Token (TTFT) by up to <strong>8x<\/strong> compared to the full recomputation of KV caches for long contexts.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2511.01815\" target=\"_blank\" rel=\"noreferrer noopener\">Paper here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/10\/nvidia-researchers-introduce-kvtc-transform-coding-pipeline-to-compress-key-value-caches-by-20x-for-efficient-llm-serving\/\">NVIDIA Researchers Introduce KVTC Transform Coding Pipeline to Compress Key-Value Caches by 20x for Efficient LLM Serving<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Serving Large Language Models &hellip;<\/p>\n","protected":false},"author":1,"featured_media":392,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-391","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/391","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=391"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/391\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/392"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fw
p%2Fv2%2Fmedia&parent=391"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=391"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}