{"id":977,"date":"2026-05-26T05:24:31","date_gmt":"2026-05-25T21:24:31","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=977"},"modified":"2026-05-26T05:24:31","modified_gmt":"2026-05-25T21:24:31","slug":"together-ai-open-sources-oscar-an-attention-aware-2-bit-kv-cache-quantization-system-for-long-context-llm-serving","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=977","title":{"rendered":"Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts (for example, 100K tokens across dozens of concurrent requests), the KV cache consumes a large fraction of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.<\/p>\n<p class=\"wp-block-paragraph\">The obvious approach is quantization. But pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior methods either collapse in accuracy or require custom serving layouts incompatible with paged KV-cache systems. <strong>Together AI\u2019s OSCAR (Offline Spectral Covariance-Aware Rotation) addresses both problems.<\/strong><\/p>\n<h2 class=\"wp-block-heading\"><strong>Why INT2 KV Cache Quantization is Hard<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most channels are well-behaved. When you apply INT2 quantization, which has only four representable levels, those outliers dominate the scale factor. The quantizer wastes most of its range on rare spikes. Normal values get compressed into just one or two effective levels. 
This degrades attention quality substantially.<\/p>\n<p class=\"wp-block-paragraph\">Rotation-based quantization addresses this by applying a fixed orthogonal transform, typically a <strong>Hadamard transform<\/strong>, to redistribute outlier energy across all channels. This approach works reasonably well at INT4. At INT2, a deeper problem remains: the rotation is <em>data-oblivious<\/em>. It can smooth activation ranges, but it does not know which directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as pushing it into low-importance directions. At INT2, with only four levels, that distinction determines whether the model works at all.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1586\" height=\"1094\" data-attachment-id=\"80109\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/25\/together-ai-open-sources-oscar-an-attention-aware-2-bit-kv-cache-quantization-system-for-long-context-llm-serving\/screenshot-2026-05-25-at-2-24-00-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-2.24.00-PM-1.png\" data-orig-size=\"1586,1094\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-25 at 2.24.00\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-2.24.00-PM-1-1024x706.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-2.24.00-PM-1.png\" alt=\"\" class=\"wp-image-80109\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.17757v1<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What OSCAR Does Differently<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">OSCAR\u2019s key observation is that the rotation applied before quantization should be derived from attention statistics themselves \u2014 not from the raw distribution of KV activations.<\/p>\n<p class=\"wp-block-paragraph\"><strong>For keys<\/strong>, the downstream error that matters is not the Euclidean reconstruction error of K. It is the error in attention logits. The research team showed this error is: <code>\u2016QK<sup>\u22a4<\/sup> \u2212 QK\u0302<sup>\u22a4<\/sup>\u2016<sub>F<\/sub><sup>2<\/sup> = tr((K \u2212 K\u0302)Q<sup>\u22a4<\/sup>Q(K \u2212 K\u0302)<sup>\u22a4<\/sup>)<\/code>. The weighting matrix is the <em>query covariance<\/em> Q<sup>\u22a4<\/sup>Q, not K<sup>\u22a4<\/sup>K. Directions where queries have large energy amplify quantization errors in logits. OSCAR estimates the empirical query covariance <code>C<sub>Q<\/sub> = (1\/N) \u03a3<sub>n<\/sub> q<sub>n<\/sub><sup>\u22a4<\/sup>q<sub>n<\/sub><\/code> from a calibration set, eigen-decomposes it, and uses the eigenvectors U<sub>Q<\/sub> as the key rotation basis.<\/p>\n<p class=\"wp-block-paragraph\"><strong>For values<\/strong>, the relevant error is in the attention output SV. This depends on how the attention score matrix S weights each value row. The research team defines the score-weighted value covariance <code>C<sub>S<\/sub> = (1\/N) V<sup>\u22a4<\/sup>S<sup>\u22a4<\/sup>SV<\/code>. Directions that remain large after aggregation by S are the ones quantization error propagates through. 
OSCAR uses the eigenvectors U<sub>S<\/sub> of C<sub>S<\/sub> as the value rotation basis.<\/p>\n<p class=\"wp-block-paragraph\"><strong>The final composed rotations are:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><code>R<sub>K<\/sub> = U<sub>Q<\/sub> \u00b7 H<sub>Had<\/sub> \u00b7 P<sub>br<\/sub><\/code><br \/><code>R<sub>V<\/sub> = U<sub>S<\/sub> \u00b7 H<sub>Had<\/sub> \u00b7 P<sub>br<\/sub><\/code><\/p>\n<p class=\"wp-block-paragraph\"><strong>Each of the three factors addresses a distinct failure mode of per-group low-bit quantization:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>U<sub>Q<\/sub> \/ U<sub>S<\/sub><\/strong> align channels with attention-importance directions. This diagonalizes the error-weighting matrix so the most important directions are identifiable.<\/li>\n<li><strong>H<sub>Had<\/sub><\/strong> (Walsh-Hadamard transform) then equalizes channel importance exactly. Lemma 1 in the research paper proves every diagonal entry of <code>H<sub>Had<\/sub><sup>\u22a4<\/sup> \u039b H<sub>Had<\/sub><\/code> equals <code>tr(\u039b)\/d<\/code>, so the peaky eigenspectrum exposed by U<sub>Q<\/sub> is compressed to a uniform value across all channels.<\/li>\n<li><strong>P<sub>br<\/sub><\/strong> (permuted bit-reversal) reorders channels so that for any power-of-two quantization group size, each group receives one representative from each level of the importance hierarchy.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The research team provides Theorem 1 proving U<sub>Q<\/sub> and U<sub>S<\/sub> are optimal under a frozen-error surrogate objective with diagonal residual assumptions.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Serving System: Mixed-Precision Cache Layout<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">OSCAR integrates into SGLang\u2019s production serving stack as an INT2 KV-cache mode with full compatibility with paged attention.<\/p>\n<p class=\"wp-block-paragraph\">The KV cache layout uses three regions per request:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Sink tokens<\/strong> (first S0 = 64 tokens): stored in BF16. 
These function as attention sinks.<\/li>\n<li><strong>Recent tokens<\/strong> (last W = 256 tokens before current position): stored in BF16.<\/li>\n<li><strong>History tokens<\/strong> (everything in between): stored as INT2 after OSCAR rotation and clipping.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">At 128K context length, the BF16 sink and recent windows cover 320 of 131,072 tokens, only 0.24% of the total. The ablation (Table 5 in the research paper) shows that a 64-token sink with a 256-token recent window (S0 = 64, W = 256) is the accuracy-efficiency knee: smaller windows noticeably hurt accuracy; larger windows give negligible additional benefit at higher BF16 memory cost.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"294\" data-attachment-id=\"80103\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/25\/together-ai-open-sources-oscar-an-attention-aware-2-bit-kv-cache-quantization-system-for-long-context-llm-serving\/screenshot-2026-05-25-at-1-46-54-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.46.54-PM-1.png\" data-orig-size=\"1318,378\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-25 at 1.46.54\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.46.54-PM-1-1024x294.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.46.54-PM-1-1024x294.png\" alt=\"\" class=\"wp-image-80103\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.17757<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">Write and read paths use fused Triton kernels. On the write path, each token is rotated, clipped to a calibration-derived percentile threshold (typical values: cK = 0.96, cV = 0.92), then quantized with per-token asymmetric INT2 at a default group size of GK = 64 channels per group. On the read path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes results to the attention kernel \u2014 all in one fused pass without extra memory traffic. The value rotation RV is absorbed into the model\u2019s projection weights offline, eliminating its online compute cost.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Outcome<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). 
Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Accuracy (at 2.28 bits per KV element):<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>BF16 Mean<\/th>\n<th>OSCAR Mean<\/th>\n<th>Gap to BF16<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Qwen3-4B-Thinking-2507<\/td>\n<td>75.64<\/td>\n<td>71.86<\/td>\n<td>\u22123.78<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-8B<\/td>\n<td>70.84<\/td>\n<td>69.42<\/td>\n<td>\u22121.42<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-32B<\/td>\n<td>74.19<\/td>\n<td>74.17<\/td>\n<td>\u22120.02<\/td>\n<\/tr>\n<tr>\n<td>GLM-4.7-FP8 (358B)<\/td>\n<td>77.89<\/td>\n<td>78.16<\/td>\n<td>+0.27<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">For context on how competing methods compare: naive INT2 (no rotation) scores 0.00 on both Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 points on Qwen3-4B-Thinking. 
Saw-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B \u2014 OSCAR at 2.28 bits reaches 71.86.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"694\" height=\"312\" data-attachment-id=\"80105\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/25\/together-ai-open-sources-oscar-an-attention-aware-2-bit-kv-cache-quantization-system-for-long-context-llm-serving\/screenshot-2026-05-25-at-1-48-27-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.48.27-PM-1.png\" data-orig-size=\"694,312\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-25 at 1.48.27\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.48.27-PM-1.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-25-at-1.48.27-PM-1.png\" alt=\"\" class=\"wp-image-80105\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.17757<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">The research team also compared against channel-wise methods on AIME25 (Table 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67\u00b13.33 \u2014 above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). 
Note that channel-wise methods require residual buffers or custom page layouts that do not fit standard paged-attention serving, so this comparison is limited to the single shared benchmark where results were available.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Long-context robustness (RULER-NIAH):<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Method<\/th>\n<th>16K<\/th>\n<th>32K<\/th>\n<th>64K<\/th>\n<th>128K<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Qwen3-4B-Thinking<\/td>\n<td>BF16<\/td>\n<td>99.7<\/td>\n<td>99.3<\/td>\n<td>85.3<\/td>\n<td>81.0<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-4B-Thinking<\/td>\n<td>QuaRot-INT2<\/td>\n<td>0.0<\/td>\n<td>0.0<\/td>\n<td>15.6<\/td>\n<td>0.0<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-4B-Thinking<\/td>\n<td>OSCAR<\/td>\n<td>97.8<\/td>\n<td>87.6<\/td>\n<td>61.9<\/td>\n<td>39.5<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-8B<\/td>\n<td>BF16<\/td>\n<td>98.9<\/td>\n<td>97.3<\/td>\n<td>79.2<\/td>\n<td>78.2<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-8B<\/td>\n<td>QuaRot-INT2<\/td>\n<td>19.0<\/td>\n<td>9.8<\/td>\n<td>0.0<\/td>\n<td>0.0<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-8B<\/td>\n<td>OSCAR<\/td>\n<td>93.9<\/td>\n<td>86.3<\/td>\n<td>61.9<\/td>\n<td>45.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">On GLM-4.7-FP8, OSCAR matches the BF16 curve through 128K.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Throughput (H100, batch size 1, context lengths up to 100K):<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Decode throughput speedup relative to BF16, at increasing context lengths:<\/p>\n<figure class=\"wp-block-table\">\n<table 
class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>30K<\/th>\n<th>60K<\/th>\n<th>100K<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Qwen3-4B-Thinking<\/td>\n<td>1.98\u00d7<\/td>\n<td>2.52\u00d7<\/td>\n<td>3.08\u00d7<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-8B<\/td>\n<td>1.84\u00d7<\/td>\n<td>2.29\u00d7<\/td>\n<td>2.88\u00d7<\/td>\n<\/tr>\n<tr>\n<td>GLM-4.7-FP8<\/td>\n<td>1.98\u00d7<\/td>\n<td>2.49\u00d7<\/td>\n<td>2.83\u00d7<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">At batch size 32, job-level throughput at 100K context reaches 6.17\u00d7 over BF16 on Qwen3-4B-Thinking and 7.83\u00d7 on GLM-4.7-FP8. The speedup increases with context length because decoding becomes increasingly KV-bandwidth-bound. Reducing KV memory by 8\u00d7 directly reduces that bottleneck. The online rotation overhead is absorbed into the decode kernels.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- TOP BAR --><\/p>\n<div class=\"mtp-bar\">\n    <span class=\"mtp-bar-brand\">OSCAR \u2014 How-To Guide<\/span><br \/>\n    <span class=\"mtp-bar-counter\">01 \/ 08<\/span>\n  <\/div>\n<p>  <!-- SLIDES --><\/p>\n<div class=\"mtp-slides\">\n<p>    <!-- SLIDE 1: INTRO --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">01<\/div>\n<p>      <span class=\"mtp-slide-tag\">Overview<\/span><\/p>\n<div class=\"mtp-slide-title\">What is OSCAR?<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">\n        <strong>OSCAR<\/strong> (Offline Spectral Covariance-Aware Rotation) is a 2-bit KV cache quantization system from Together AI for long-context LLM serving.\n<p>        Instead of applying a generic Hadamard rotation, OSCAR derives <strong>attention-aware rotations<\/strong> from a one-time offline calibration pass \u2014 aligning quantization noise with directions that attention is least sensitive to.<\/p>\n<p>        The result: INT2 precision with 
near-BF16 accuracy and full compatibility with paged KV-cache serving.\n      <\/p><\/div>\n<div class=\"mtp-stats\">\n<div class=\"mtp-stat\">\n          <span class=\"mtp-stat-val\">8\u00d7<\/span><br \/>\n          <span class=\"mtp-stat-lbl\">KV Memory Reduction<\/span>\n        <\/div>\n<div class=\"mtp-stat\">\n          <span class=\"mtp-stat-val\">3\u00d7<\/span><br \/>\n          <span class=\"mtp-stat-lbl\">Decode Speedup<\/span>\n        <\/div>\n<div class=\"mtp-stat\">\n          <span class=\"mtp-stat-val\">2.28<\/span><br \/>\n          <span class=\"mtp-stat-lbl\">Bits Per KV Element<\/span>\n        <\/div>\n<\/div>\n<\/div>\n<p>    <!-- SLIDE 2: PREREQUISITES --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">02<\/div>\n<p>      <span class=\"mtp-slide-tag\">Setup<\/span><\/p>\n<div class=\"mtp-slide-title\">Prerequisites<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">Before getting started, make sure you have the following in place:<\/div>\n<ul class=\"mtp-steps\">\n<li>\n          <span class=\"mtp-step-n\">01<\/span><br \/>\n          <span class=\"mtp-step-text\"><strong>Hardware:<\/strong> NVIDIA H100 GPU (80 GB) recommended. A100 may work for smaller models.<\/span>\n        <\/li>\n<li>\n          <span class=\"mtp-step-n\">02<\/span><br \/>\n          <span class=\"mtp-step-text\"><strong>SGLang installed:<\/strong> OSCAR is integrated into the SGLang serving framework. Install the latest version from source.<\/span>\n        <\/li>\n<li>\n          <span class=\"mtp-step-n\">03<\/span><br \/>\n          <span class=\"mtp-step-text\"><strong>Triton:<\/strong> Custom fused kernels are written in Triton. Triton ships with most recent PyTorch \/ SGLang installs.<\/span>\n        <\/li>\n<li>\n          <span class=\"mtp-step-n\">04<\/span><br \/>\n          <span class=\"mtp-step-text\"><strong>A supported model:<\/strong> Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7. 
Pre-computed rotations are available for all of these.<\/span>\n        <\/li>\n<\/ul>\n<pre>pip install \"sglang[all]\" --upgrade\npip install triton<\/pre>\n<\/div>\n<p>    <!-- SLIDE 3: ROTATIONZOO --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">03<\/div>\n<p>      <span class=\"mtp-slide-tag\">Step 1<\/span><\/p>\n<div class=\"mtp-slide-title\">Download Pre-Computed Rotations via RotationZoo<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">\n        Together AI publishes pre-computed rotation matrices and clip thresholds for supported models in <strong>RotationZoo<\/strong> on ModelScope. No recalibration needed.\n      <\/div>\n<pre>from modelscope import snapshot_download\n\n# Download the RotationZoo snapshot; snapshot_download returns the local directory path\nrotation_path = snapshot_download(\n    'togethercomputer\/OSCAR-RotationZoo'\n)<\/pre>\n<div class=\"mtp-body\">The downloaded artifact contains per-layer <code class=\"inline\">RK<\/code>, <code class=\"inline\">RV<\/code> rotation matrices and clip thresholds <code class=\"inline\">cK<\/code>, <code class=\"inline\">cV<\/code> for each supported model. 
These are fixed offline parameters \u2014 they are not updated at runtime.<\/div>\n<div class=\"mtp-models\">\n<div class=\"mtp-model\"><span class=\"mtp-model-name\">Qwen3-4B \/ 8B \/ 32B<\/span><span class=\"mtp-model-bpe\">2.28 BPE<\/span><\/div>\n<div class=\"mtp-model\"><span class=\"mtp-model-name\">GLM-4.7-FP8 (358B)<\/span><span class=\"mtp-model-bpe\">2.28 BPE<\/span><\/div>\n<div class=\"mtp-model\"><span class=\"mtp-model-name\">MiniMax-M2.7<\/span><span class=\"mtp-model-bpe\">2.28 BPE<\/span><\/div>\n<div class=\"mtp-model\"><span class=\"mtp-model-name\">Custom (run calibration)<\/span><span class=\"mtp-model-bpe\">any model<\/span><\/div>\n<\/div>\n<\/div>\n<p>    <!-- SLIDE 4: CALIBRATION (OPTIONAL) --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">04<\/div>\n<p>      <span class=\"mtp-slide-tag\">Step 2 (Optional)<\/span><\/p>\n<div class=\"mtp-slide-title\">Run Offline Calibration for a Custom Model<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">\n        If your model is not in RotationZoo, run the one-time calibration pass. OSCAR dumps Q, K, V activations from a small dataset, estimates attention-aware covariance, and writes out rotation matrices and clip thresholds.\n      <\/div>\n<pre>python calibrate_oscar.py \\\n  --model-path \/path\/to\/your-model \\\n  --calib-data gpqa_diamond \\\n  --calib-tokens 8192 \\\n  --output-dir .\/oscar_rotations\/<\/pre>\n<div class=\"mtp-body\">\n        <strong>Calibration is not task-specific.<\/strong> The paper shows that results have low sensitivity to the calibration domain (MMLU, WikiText, GPQA-Diamond all produce similar accuracy). 
Run it once and reuse across all tasks.\n<p>        Typical values produced: <code class=\"inline\">cK \u2248 0.96<\/code>, <code class=\"inline\">cV \u2248 0.92<\/code> per layer.\n      <\/p><\/div>\n<\/div>\n<p>    <!-- SLIDE 5: SGLANG SERVING --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">05<\/div>\n<p>      <span class=\"mtp-slide-tag\">Step 3<\/span><\/p>\n<div class=\"mtp-slide-title\">Launch SGLang with INT2 KV Cache Enabled<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">Pass the rotation path and enable INT2 KV mode when launching the SGLang server.<\/div>\n<pre>python -m sglang.launch_server \\\n  --model-path Qwen\/Qwen3-8B \\\n  --kv-cache-dtype int2 \\\n  --oscar-rotation-path .\/oscar_rotations\/ \\\n  --oscar-sink-size 64 \\\n  --oscar-recent-size 256 \\\n  --tp 1 \\\n  --port 30000<\/pre>\n<div class=\"mtp-body\">\n        <strong>Tensor parallelism<\/strong> is supported. For Qwen3-32B use <code class=\"inline\">--tp 2<\/code> (2\u00d7H100). For GLM-4.7-FP8 use <code class=\"inline\">--tp 8<\/code> (8\u00d7H100).\n<p>        The server exposes a standard OpenAI-compatible API. 
No client-side changes are needed.\n      <\/p><\/div>\n<\/div>\n<p>    <!-- SLIDE 6: KEY PARAMETERS --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">06<\/div>\n<p>      <span class=\"mtp-slide-tag\">Step 4<\/span><\/p>\n<div class=\"mtp-slide-title\">Key Configuration Parameters<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<table class=\"mtp-table\">\n<thead>\n<tr>\n<th>Parameter<\/th>\n<th>Default<\/th>\n<th>What it controls<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>--oscar-sink-size<\/td>\n<td>64<\/td>\n<td>First N tokens kept in BF16 as attention sinks<\/td>\n<\/tr>\n<tr>\n<td>--oscar-recent-size<\/td>\n<td>256<\/td>\n<td>Last N tokens kept in BF16 before current position<\/td>\n<\/tr>\n<tr>\n<td>cK (clip ratio)<\/td>\n<td>0.96<\/td>\n<td>Percentile clip for rotated key activations<\/td>\n<\/tr>\n<tr>\n<td>cV (clip ratio)<\/td>\n<td>0.92<\/td>\n<td>Percentile clip for rotated value activations<\/td>\n<\/tr>\n<tr>\n<td>Group size GK<\/td>\n<td>64<\/td>\n<td>Channels per INT2 quantization group (head dim)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"mtp-body\">\n        The paper identifies <strong>(sink=64, recent=256)<\/strong> as the accuracy-efficiency knee. 
Smaller windows reduce accuracy noticeably; larger windows add BF16 memory overhead with negligible gain.\n      <\/div>\n<\/div>\n<p>    <!-- SLIDE 7: VERIFY --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">07<\/div>\n<p>      <span class=\"mtp-slide-tag\">Step 5<\/span><\/p>\n<div class=\"mtp-slide-title\">Run Inference and Verify<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">Once the server is running, query it with the standard OpenAI client:<\/div>\n<pre>from openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http:\/\/localhost:30000\/v1\",\n    api_key=\"none\"\n)\n\nresponse = client.chat.completions.create(\n    model=\"Qwen\/Qwen3-8B\",\n    messages=[{\"role\": \"user\",\n               \"content\": \"Your long-context prompt here\"}],\n    max_tokens=1024\n)\nprint(response.choices[0].message.content)<\/pre>\n<div class=\"mtp-body\">\n        <strong>Prefix caching works out of the box.<\/strong> OSCAR preserves the standard paged KV-cache abstraction, so SGLang\u2019s radix cache and prefix reuse function normally. 
No application-level changes are needed.\n      <\/div>\n<\/div>\n<p>    <!-- SLIDE 8: RESULTS + TAGLINE --><\/p>\n<div class=\"mtp-slide\">\n<div class=\"mtp-slide-num\">08<\/div>\n<p>      <span class=\"mtp-slide-tag\">Results<\/span><\/p>\n<div class=\"mtp-slide-title\">Accuracy vs BF16 Baseline<\/div>\n<p>      <span class=\"mtp-divider\"><\/span><\/p>\n<div class=\"mtp-body\">Averaged across AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500 at 32K generation length.<\/div>\n<div class=\"mtp-results\">\n<div class=\"mtp-result-row\">\n          <span class=\"mtp-result-label\">Qwen3-4B-Thinking<\/span>\n<div class=\"mtp-result-bar-wrap\">\n<div class=\"mtp-result-bar\"><\/div>\n<\/div>\n<p>          <span class=\"mtp-result-val\">\u22123.78<\/span>\n        <\/p><\/div>\n<div class=\"mtp-result-row\">\n          <span class=\"mtp-result-label\">Qwen3-8B<\/span>\n<div class=\"mtp-result-bar-wrap\">\n<div class=\"mtp-result-bar\"><\/div>\n<\/div>\n<p>          <span class=\"mtp-result-val\">\u22121.42<\/span>\n        <\/p><\/div>\n<div class=\"mtp-result-row\">\n          <span class=\"mtp-result-label\">Qwen3-32B<\/span>\n<div class=\"mtp-result-bar-wrap\">\n<div class=\"mtp-result-bar\"><\/div>\n<\/div>\n<p>          <span class=\"mtp-result-val\">\u22120.02<\/span>\n        <\/p><\/div>\n<div class=\"mtp-result-row\">\n          <span class=\"mtp-result-label\">GLM-4.7-FP8 (358B)<\/span>\n<div class=\"mtp-result-bar-wrap\">\n<div class=\"mtp-result-bar\"><\/div>\n<\/div>\n<p>          <span class=\"mtp-result-val\">+0.27<\/span>\n        <\/p><\/div>\n<\/div>\n<div class=\"mtp-body\">\n        <strong>Paper:<\/strong> arXiv:2605.17757 \u00a0\u00a0<strong>RotationZoo:<\/strong> modelscope.cn\/models\/togethercomputer\/OSCAR-RotationZoo\n      <\/div>\n<\/div>\n<\/div>\n<p><!-- \/mtp-slides --><\/p>\n<p>  <!-- BOTTOM NAV --><\/p>\n<div class=\"mtp-nav\">\n<div class=\"mtp-nav-btns\">\n      <button class=\"mtp-btn\" disabled>\u2190 
Prev<\/button><br \/>\n      <button class=\"mtp-btn\">Next \u2192<\/button>\n    <\/div>\n<div class=\"mtp-dots\"><\/div>\n<\/div>\n<p>  <!-- TAGLINE --><\/p>\n<div class=\"mtp-footer\">\n    Marktechpost <span>\u2014<\/span> AI Research &amp; Technical News <span>\u2014<\/span> marktechpost.com\n  <\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations using attention-aware covariance matrices, not generic Hadamard transforms.<\/li>\n<li>At 2.28 bits per KV element, OSCAR stays within 3.78 points of BF16 accuracy on Qwen3-4B-Thinking while naive INT2 collapses to zero.<\/li>\n<li>KV cache memory drops approximately 8\u00d7, decode speed improves up to 3\u00d7 at 100K context, and job-level throughput reaches up to 7.83\u00d7 at large batch sizes.<\/li>\n<li>Pre-computed rotation matrices for Qwen3-4B\/8B\/32B, GLM-4.7-FP8, and MiniMax-M2.7 are available in RotationZoo \u2014 no recalibration needed.<\/li>\n<li>OSCAR integrates directly into SGLang with full paged KV-cache and prefix cache compatibility, requiring no changes to the inference client.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/FutureMLS-Lab\/OSCAR\" target=\"_blank\" rel=\"noreferrer noopener\">Repo on GitHub<\/a>, <a href=\"https:\/\/modelscope.cn\/models\/togethercomputer\/OSCAR-RotationZoo\" target=\"_blank\" rel=\"noreferrer noopener\">Modelscope\u00a0<\/a><\/strong>and<strong>\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2605.17757v1\" target=\"_blank\" rel=\"noreferrer noopener\">Research Paper<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/25\/together-ai-open-sources-oscar-an-attention-aware-2-bit-kv-cache-quantization-system-for-long-context-llm-serving\/\">Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Long-context inference makes 
t&hellip;<\/p>\n","protected":false},"author":1,"featured_media":978,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-977","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/977","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=977"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/977\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/978"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=977"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=977"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=977"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}