{"id":1005,"date":"2026-05-30T07:19:14","date_gmt":"2026-05-29T23:19:14","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=1005"},"modified":"2026-05-30T07:19:14","modified_gmt":"2026-05-29T23:19:14","slug":"nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=1005","title":{"rendered":"NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Knowledge distillation (KD) transfers \u201cdark knowledge\u201d from a large teacher model to a smaller student. The student learns from the teacher\u2019s full output probability distribution over tokens, not just the correct answers. This is done via per-position Kullback\u2013Leibler (KL) divergence over next-token probability distributions.<\/p>\n<p class=\"wp-block-paragraph\">This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers, such as Phi-4-mini or Qwen3-4B, because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.<\/p>\n<p class=\"wp-block-paragraph\">NVIDIA researchers introduced <strong>X-Token<\/strong>, a logit-distribution-based method for cross-tokenizer KD. It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Problem X-Token is Solving<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Two prior approaches dominate cross-tokenizer KD. <strong>ULD (Universal Logit Distillation)<\/strong> sidesteps vocabulary alignment by rank-sorting both distributions and minimizing L1 distance. 
It discards token identity entirely. <strong>GOLD<\/strong> adds span alignment and a hybrid loss. It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the prior state of the art.<\/p>\n<p class=\"wp-block-paragraph\"><strong>The research team identifies two structural failures in GOLD\u2019s design<\/strong>:<\/p>\n<p class=\"wp-block-paragraph\"><strong>Failure 1: Uncommon-token failure.<\/strong> When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 packs multi-digit numbers as single tokens: \u201c201\u201d is one token. Qwen3 splits them digit by digit: \u201c2\u201d, \u201c0\u201d, \u201c1\u201d. Under GOLD, all 1,100 of Llama\u2019s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two types of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Failure 2: Over-conservative matching.<\/strong> GOLD uses strict string equality to define the common subset. A student token <code>Hundreds<\/code> corresponds to teacher tokens <code>Hund<\/code> followed by <code>reds<\/code> under teacher-side re-tokenization, but strict matching discards this pair. 
Useful alignment signal is lost even when the correspondence is well-formed.<\/p>\n<p class=\"wp-block-paragraph\">These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How X-Token Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>X-Token has three components:<\/strong> span alignment, a projection matrix W, and two complementary loss formulations, P-KL and H-KL.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Span Alignment<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk-pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead.<\/p>\n<p class=\"wp-block-paragraph\">The research team also identifies a failure in TRL\u2019s surface-substring alignment, which is used in TRL\u2019s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement, such as Llama-3 auto-prepending <code>&lt;bos&gt;<\/code> while Qwen3 does not, prevents all subsequent flushes and forces all remaining tokens into one mis-grouped super-group at the end of the sequence. The DP approach handles this with a single gap move, regardless of sequence length.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Projection Matrix W<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">After alignment, teacher and student distributions still operate over different vocabularies. 
The projection matrix W \u2208 \u211d<sup>|V<sub>S<\/sub>|\u00d7|V<sub>T<\/sub>|<\/sup> maps each student token to a weighted combination of teacher tokens, bridging the vocabulary mismatch.<\/p>\n<p class=\"wp-block-paragraph\"><strong>W is constructed deterministically in two passes:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><strong>Pass 1 (exact-match):<\/strong> For every (student token, teacher token) pair whose decoded strings match after canonicalization, set W[s, t] = 1. Canonicalization unifies space prefixes (\u0120, _, \u2423), newlines, byte-fallback tokens of the form <code>&lt;0xHH&gt;<\/code>, and model-specific special tokens across tokenizer families.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Pass 2 (multi-token rule):<\/strong> For each student token without an exact match, re-tokenize its decoded text under the teacher tokenizer. If the resulting sequence has length \u2264 4, assign exponentially decayed weights: W[s, \u03c4\u1d62] = \u03b2\u00b7\u03b3\u2071 with (\u03b2, \u03b3) = (0.9, 0.1), where the sub-token index i starts at 0. After normalization, a length-2 span receives weights (0.909, 0.091). A length-3 span receives (0.9009, 0.0901, 0.0090). A length-4 span receives (0.9001, 0.0900, 0.0090, 0.0009). The leading sub-token receives the highest weight because it typically carries the most informative probability mass; for example, \u201c_inter\u201d in [\u201c_inter\u201d, \u201cnational\u201d] or \u201c_20\u201d in [\u201c_20\u201d, \u201c24\u201d].<\/p>\n<p class=\"wp-block-paragraph\">Each row is truncated to its top-4 entries and row-normalized. Because each row of W is non-negative and sums to 1, left-multiplication by W<sup>\u22a4<\/sup> is probability-preserving: if p<sub>S<\/sub> is a probability vector, W<sup>\u22a4<\/sup>p<sub>S<\/sub> is also a valid probability vector over V<sub>T<\/sub>. 
W is constructed once before training and can optionally be jointly refined with the student under P-KL.<\/p>\n<h3 class=\"wp-block-heading\"><strong>P-KL: Addressing Erroneous and Suppressive Gradients<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">P-KL removes the partition entirely. It projects the student distribution p\u0302<sub>S<\/sub><sup>(k)<\/sup> into teacher vocabulary space via W:<\/p>\n<p class=\"wp-block-paragraph\"><math data-latex=\"\\tilde{p}_S^{(k)}[t] = \\sum_{s\\in\\mathcal{V}_S} W[s, t] \\cdot \\hat{p}_S^{(k)}[s]\"><semantics><mrow><msubsup><mover><mi>p<\/mi><mo stretchy=\"false\">~<\/mo><\/mover><mi>S<\/mi><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo form=\"prefix\" stretchy=\"false\">[<\/mo><mi>t<\/mi><mo form=\"postfix\" stretchy=\"false\">]<\/mo><mo>=<\/mo><msub><mo movablelimits=\"false\">\u2211<\/mo><mrow><mi>s<\/mi><mo>\u2208<\/mo><msub><mi class=\"mathcal\">\ud835\udcb1<\/mi><mi>S<\/mi><\/msub><\/mrow><\/msub><mi>W<\/mi><mo form=\"prefix\" stretchy=\"false\">[<\/mo><mi>s<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo form=\"postfix\" stretchy=\"false\">]<\/mo><mo>\u22c5<\/mo><msubsup><mover><mi>p<\/mi><mo stretchy=\"false\" class=\"tml-xshift\">^<\/mo><\/mover><mi>S<\/mi><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo form=\"prefix\" stretchy=\"false\">[<\/mo><mi>s<\/mi><mo form=\"postfix\" stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{p}_S^{(k)}[t] = \\sum_{s\\in\\mathcal{V}_S} W[s, t] \\cdot \\hat{p}_S^{(k)}[s]<\/annotation><\/semantics><\/math><\/p>\n<p class=\"wp-block-paragraph\">Then it computes KL divergence directly between teacher and projected student:<\/p>\n<p class=\"wp-block-paragraph\"> <math 
data-latex=\"\\mathcal{L}_{\\text{P-KL}}^{(k)} = \\mathrm{KL}(p_T^{(k)} \\,\\|\\, \\tilde{p}_S^{(k)})\"><semantics><mrow><msubsup><mi class=\"mathcal\">\u2112<\/mi><mtext>P-KL<\/mtext><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo>=<\/mo><mi>KL<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msubsup><mi>p<\/mi><mi>T<\/mi><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo>\u2225<\/mo><msubsup><mover><mi>p<\/mi><mo stretchy=\"false\">~<\/mo><\/mover><mi>S<\/mi><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{P-KL}}^{(k)} = \\mathrm{KL}(p_T^{(k)} \\,\\|\\, \\tilde{p}_S^{(k)})<\/annotation><\/semantics><\/math><\/p>\n<p class=\"wp-block-paragraph\">There is no uncommon set, so rank-based ULD noise is eliminated. The suppressive gradient problem is also eliminated: the projection routes the student\u2019s probability mass for \u201c201\u201d directly onto {2, 0, 1} in the teacher vocabulary via W.<\/p>\n<p class=\"wp-block-paragraph\">The research team formally proves (Proposition 1) that GOLD\u2019s common-KL term induces non-negative gradients on every uncommon student logit. The gradient on an uncommon student logit j is \u2202\u2112<sub>common<\/sub>\/\u2202z<sub>j<\/sub> = p<sub>S<\/sub>[j] \u00b7 M<sub>C<\/sub>(T), where M<sub>C<\/sub>(T) is the teacher probability mass on the common subset. Under gradient descent, this always drives z<sub>j<\/sub> downward, suppressing every uncommon token\u2019s probability regardless of the ground-truth token.<\/p>\n<h3 class=\"wp-block-heading\"><strong>H-KL: Relaxing the 1-to-1 Matching<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">H-KL applies when the partition is structurally sound, that is, when critical tokens land in the common subset. 
In that case, GOLD\u2019s direct KL on identity-aligned pairs delivers sharper per-pair supervision than P-KL\u2019s projection, which blends student probability mass across multiple teacher tokens. The opportunity is to make the partition less wasteful by relaxing the strict string-equality criterion.<\/p>\n<p class=\"wp-block-paragraph\">H-KL retains GOLD\u2019s hybrid loss structure but expands the common set C using W. For each student token s, it selects the top-ranked teacher token t* = argmax<sub>t\u2032\u2208V<sub>T<\/sub><\/sub> W[s, t\u2032] and adds (s, t*) to C. Exact matches are preserved since they receive weight 1 in W, the highest possible. Near-equivalent pairs like (Hundreds, Hund), which GOLD excludes, are now admitted. The expanded C feeds the same hybrid loss: direct KL on common pairs, ULD on the remainder.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Selecting Between P-KL and H-KL<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The selection uses a coverage audit over token categories in the student vocabulary. For math tasks, multi-digit numerals are the critical category. Table 8 in the research paper shows that under Qwen3-4B, 0 out of 100 two-digit Llama numerals and 0 out of 1,000 three-digit Llama numerals appear in C. Under Phi-4-mini-Instruct, all 100 two-digit and all 1,000 three-digit numerals appear in C. 
ASCII punctuation and single-digit numerals are fully covered in both cases.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1124\" height=\"616\" data-attachment-id=\"80195\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-09-12-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1.png\" data-orig-size=\"1124,616\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-29 at 4.09.12\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1-1024x561.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.09.12-PM-1.png\" alt=\"\" class=\"wp-image-80195\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<p class=\"wp-block-paragraph\">The rule: use P-KL when critical tokens fall outside C (Qwen3-4B), and H-KL when the partition is sound (Phi-4-mini-Instruct). Table 2 in the research paper shows the mode reversal is sharp: P-KL outperforms H-KL by +3.55 avg. on Qwen3-4B, while H-KL outperforms P-KL by +1.68 avg. 
on Phi-4-mini.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"766\" height=\"294\" data-attachment-id=\"80192\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-05-17-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" data-orig-size=\"766,294\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-29 at 4.05.17\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.05.17-PM.png\" alt=\"\" class=\"wp-image-80192\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Multi-Teacher Distillation<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">X-Token extends to multiple teachers. Each teacher has its own projection matrix W_m and loss selection. For same-tokenizer teachers, standard token-level KL is used. 
<strong>The multi-teacher loss aggregates per-teacher losses with weights \u03b1<sub>m<\/sub>:<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><math data-latex=\"\\mathcal{L}_{KD,multi} = \\sum_{m=1}^{M}\\alpha_{m}\\frac{1}{|\\mathcal{K}_{m}|}\\sum_{k\\in\\mathcal{K}_{m}}\\mathcal{L}_{*,m}^{(k)}\"><semantics><mrow><msub><mi class=\"mathcal\">\u2112<\/mi><mrow><mi>K<\/mi><mi>D<\/mi><mo separator=\"true\">,<\/mo><mi>m<\/mi><mi>u<\/mi><mi>l<\/mi><mi>t<\/mi><mi>i<\/mi><\/mrow><\/msub><mo>=<\/mo><msubsup><mo movablelimits=\"false\">\u2211<\/mo><mrow><mi>m<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>M<\/mi><\/msubsup><msub><mi>\u03b1<\/mi><mi>m<\/mi><\/msub><mfrac><mn>1<\/mn><mrow><mi>|<\/mi><msub><mi class=\"mathcal\">\ud835\udca6<\/mi><mi>m<\/mi><\/msub><mi>|<\/mi><\/mrow><\/mfrac><msub><mo movablelimits=\"false\">\u2211<\/mo><mrow><mi>k<\/mi><mo>\u2208<\/mo><msub><mi class=\"mathcal\">\ud835\udca6<\/mi><mi>m<\/mi><\/msub><\/mrow><\/msub><msubsup><mi class=\"mathcal\">\u2112<\/mi><mrow><mo lspace=\"0em\" rspace=\"0em\">\u2217<\/mo><mo separator=\"true\">,<\/mo><mi>m<\/mi><\/mrow><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>k<\/mi><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msubsup><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{KD,multi} = \\sum_{m=1}^{M}\\alpha_{m}\\frac{1}{|\\mathcal{K}_{m}|}\\sum_{k\\in\\mathcal{K}_{m}}\\mathcal{L}_{*,m}^{(k)}<\/annotation><\/semantics><\/math><\/p>\n<p class=\"wp-block-paragraph\">The research team evaluates static and confidence-adaptive weighting schemes. Confidence-adaptive variants compute \u03b1<sub>m<\/sub> from the cross-entropy, Shannon entropy, or maximum predicted probability of the teacher\u2019s distribution. 
Static weighting outperforms adaptive schemes in both multi-teacher setups evaluated.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1342\" height=\"438\" data-attachment-id=\"80191\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-3-41-08-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1.png\" data-orig-size=\"1342,438\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-29 at 3.41.08\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1-1024x334.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-3.41.08-PM-1.png\" alt=\"\" class=\"wp-image-80191\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Dynamic KD\/CE Scaling<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">Training combines the distillation loss \u2112<sub>KD<\/sub> with next-token cross-entropy \u2112<sub>CE<\/sub>. 
Because these terms differ in magnitude and shift during training, X-Token rescales the KD term at each step to match the scale of \u2112<sub>CE<\/sub>:<\/p>\n<p class=\"wp-block-paragraph\"><math data-latex=\"\\mathcal{L} = \\text{sg}(\\mathcal{L}_{CE} \/ \\mathcal{L}_{KD}) \\cdot \\mathcal{L}_{KD} + \\mathcal{L}_{CE}\"><semantics><mrow><mi class=\"mathcal\">\u2112<\/mi><mo>=<\/mo><mtext>sg<\/mtext><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi class=\"mathcal\">\u2112<\/mi><mrow><mi>C<\/mi><mi>E<\/mi><\/mrow><\/msub><mi>\/<\/mi><msub><mi class=\"mathcal\">\u2112<\/mi><mrow><mi>K<\/mi><mi>D<\/mi><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u22c5<\/mo><msub><mi class=\"mathcal\">\u2112<\/mi><mrow><mi>K<\/mi><mi>D<\/mi><\/mrow><\/msub><mo>+<\/mo><msub><mi class=\"mathcal\">\u2112<\/mi><mrow><mi>C<\/mi><mi>E<\/mi><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L} = \\text{sg}(\\mathcal{L}_{CE} \/ \\mathcal{L}_{KD}) \\cdot \\mathcal{L}_{KD} + \\mathcal{L}_{CE}<\/annotation><\/semantics><\/math><\/p>\n<p class=\"wp-block-paragraph\">where sg(\u00b7) denotes the stop-gradient operator, so the scaling ratio is treated as a constant during backpropagation. 
Table 4 in the paper shows dynamic scaling outperforms three fixed-weight settings (KD-heavy, balanced, CE-heavy) on the Qwen3-4B (P-KL) pair.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1256\" height=\"596\" data-attachment-id=\"80197\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/screenshot-2026-05-29-at-4-10-01-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1.png\" data-orig-size=\"1256,596\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-29 at 4.10.01\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1-1024x486.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-29-at-4.10.01-PM-1.png\" alt=\"\" class=\"wp-image-80197\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.21699<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Experiments and Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Student:<\/strong> Llama-3.2-1B. <strong>Teachers:<\/strong> Llama-3.2-3B (same tokenizer), Qwen3-4B, and Phi-4-mini-Instruct. 
<strong>Training data:<\/strong> NemotronClimbMix dataset, 30,000 steps, batch size 768, context length 4096. <strong>Optimizer:<\/strong> AdamW, learning rate 5\u00d710\u207b\u2075, 5% warmup with cosine decay, weight decay 0.1, gradient clipping 1.0. Each experiment is feasible on a single NVIDIA H100 GPU; the research team used 128 H100s to speed up iteration.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Evaluation:<\/strong> 3-shot accuracy on MMLU, GSM8k, MATH-Hendrycks, Winogrande, and HellaSwag.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Key results:<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Setting<\/th>\n<th>Method<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>No distillation<\/td>\n<td>Llama-1B (base)<\/td>\n<td>33.96<\/td>\n<\/tr>\n<tr>\n<td>No distillation<\/td>\n<td>Continued pre-training<\/td>\n<td>36.63<\/td>\n<\/tr>\n<tr>\n<td>Same tokenizer<\/td>\n<td>Llama-3B \u2192 1B (KL)<\/td>\n<td>38.40<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Qwen-4B, ULD<\/td>\n<td>36.77<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Qwen-4B, GOLD<\/td>\n<td>35.03<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td><strong>Qwen-4B, X-Token (P-KL)<\/strong><\/td>\n<td><strong>38.85<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Phi-mini, ULD<\/td>\n<td>38.31<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td>Phi-mini, GOLD<\/td>\n<td>38.66<\/td>\n<\/tr>\n<tr>\n<td>Cross-tokenizer<\/td>\n<td><strong>Phi-mini, X-Token (H-KL)<\/strong><\/td>\n<td><strong>39.18<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Multi-teacher<\/td>\n<td><strong>Phi-mini + Llama-3B (X-Token)<\/strong><\/td>\n<td><strong>40.48<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\"><strong>On Qwen-4B (P-KL regime):<\/strong> GOLD reaches 35.03 avg., below even continued pre-training without a teacher (36.63). 
This confirms the partition is actively harmful when critical tokens are misaligned. Pure ULD (36.77) already improves over GOLD, indicating the partition is the primary failure source. P-KL further improves to 38.85 avg. (+3.82 over GOLD). GSM8k alone moves from 2.56 to 15.54, surpassing same-tokenizer KD from Llama-3.2-3B (12.89) on that benchmark.<\/p>\n<p class=\"wp-block-paragraph\"><strong>On Phi-mini (H-KL regime):<\/strong> GOLD reaches 38.66 avg. \u2014 a reasonable baseline where the partition is structurally sound. H-KL improves to 39.18 avg. (+0.52 over GOLD). P-KL applied to Phi-mini drops to 37.50 avg., confirming that the wrong loss mode hurts even when W is available.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Multi-teacher:<\/strong> Phi-mini (H-KL, \u03b1=0.8) + Llama-3B (standard KL, \u03b1=0.2) under static weighting reaches 40.48 avg. This is +2.08 over same-family KD from Llama-3B alone, and +1.30 over the best single cross-tokenizer result (39.18). Combining Phi-mini + Qwen-4B \u2014 two teachers with overlapping reasoning strengths \u2014 scores only 38.49, below the best single teacher. Adding Qwen-4B as a third teacher yields 40.15, with math\/reasoning degrading (GSM8k 20.39 \u2192 19.18) while commonsense improves slightly. 
Teacher complementarity, not teacher count, drives gains.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and What to Watch<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>The suppressive gradient problem in GOLD\u2019s hybrid loss is formally proved (Proposition 1), not just observed empirically<\/li>\n<li>W is constructed by rule from tokenizer strings alone; no training data or learned parameters are needed at initialization<\/li>\n<li>Dynamic KD\/CE scaling removes the need to tune fixed loss weights; it outperforms three fixed-weight baselines in ablations<\/li>\n<li>Multi-teacher extension adds no architectural changes; each teacher uses its own W_m and appropriate loss<\/li>\n<li>The coverage audit for P-KL vs H-KL selection is a defined, reproducible criterion based on per-category token retention in C<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>What to Watch:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Experiments use only Llama-3.2-1B as the student under continued pre-training; larger students and instruction-tuned settings are not evaluated<\/li>\n<li>Only three teacher pairs are tested; low-overlap tokenizer families (SentencePiece, byte-level BPE) are left for future work<\/li>\n<li>Static weighting outperforms confidence-adaptive weighting in all tested multi-teacher setups, but the underlying reason remains an open question<\/li>\n<li>The multi-token rule in Pass 2 skips student tokens whose decoded text re-tokenizes to sequences longer than 4 under the teacher; those rows remain zero in W<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"xt-header\">\n    <span class=\"xt-logo\">\u25a0 X-Token \u2014 NVIDIA Research<\/span><br \/>\n    <span class=\"xt-counter\">1 \/ 8<\/span>\n  <\/div>\n<p>  <!-- SLIDE 1: What is Knowledge Distillation 
--><\/p>\n<div class=\"xt-slide active\" data-slide=\"0\">\n    <span class=\"xt-label\">01 \u2014 Background<\/span>\n<div class=\"xt-title\">What is Knowledge Distillation?<\/div>\n<div class=\"xt-body\">\n<p>Knowledge distillation (KD) transfers <strong>\u201cdark knowledge\u201d<\/strong> from a large teacher model to a smaller student model. The student learns from the teacher\u2019s full next-token probability distribution, not just the correct answer.<\/p>\n<p>This is done via <strong>per-position KL divergence<\/strong> over the teacher\u2019s output distribution at every token position in the sequence.<\/p>\n<p><strong>The constraint:<\/strong> standard KD requires a shared tokenizer. If Llama-3.2-1B is the student, it cannot learn from Qwen3-4B or Phi-4-mini \u2014 their token vocabularies do not align. Token positions have no correspondence across different tokenizer families.<\/p>\n<\/div>\n<div class=\"xt-stats\">\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">Llama<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Student tokenizer<\/span>\n      <\/div>\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">Qwen \/ Phi<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Incompatible teachers<\/span>\n      <\/div>\n<div class=\"xt-stat\">\n        <span class=\"xt-stat-val\">\u2260 Match<\/span><br \/>\n        <span class=\"xt-stat-lbl\">Vocab mismatch<\/span>\n      <\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 2: Two Failures in GOLD --><\/p>\n<div class=\"xt-slide\" data-slide=\"1\">\n    <span class=\"xt-label\">02 \u2014 The Problem<\/span>\n<div class=\"xt-title\">Two Structural Failures in GOLD<\/div>\n<div class=\"xt-body\">\n<p><strong>GOLD<\/strong> is the prior state-of-the-art cross-tokenizer KD method. 
It partitions tokens into a string-matched <em>common subset<\/em> (trained with KL) and an <em>uncommon remainder<\/em> (trained with ULD rank-matching).<\/p>\n<p>NVIDIA researchers identified two distinct failures:<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Uncommon-token failure:<\/strong> Critical tokens fall into the unmatched subset. Llama packs \u201c201\u201d as one token. Qwen splits it into \u201c2\u201d, \u201c0\u201d, \u201c1\u201d. All 1,100 multi-digit Llama numerals fall into the uncommon set under Qwen3-4B. They receive identity-agnostic noise and suppressive gradients \u2014 GSM8k drops to <strong>2.56<\/strong>.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Over-conservative matching:<\/strong> Strict string equality discards well-formed pairs. Student token <span class=\"xt-code\">Hundreds<\/span> maps to teacher tokens <span class=\"xt-code\">Hund<\/span> + <span class=\"xt-code\">reds<\/span>, but GOLD drops this alignment entirely.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 3: X-Token Overview --><\/p>\n<div class=\"xt-slide\" data-slide=\"2\">\n    <span class=\"xt-label\">03 \u2014 Solution<\/span>\n<div class=\"xt-title\">X-Token: Three Core Components<\/div>\n<div class=\"xt-body\">\n<p>X-Token is a <strong>logit-distribution-based<\/strong> cross-tokenizer KD method. It requires no auxiliary trainable components and no architectural changes \u2014 it is a drop-in replacement for the standard KD loss.<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Span Alignment:<\/strong> DP-based alignment groups tokens into chunks that decode to the same text substring. 
Cached per sequence, so it adds zero per-step overhead.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Projection Matrix W:<\/strong> A sparse matrix W \u2208 \u211d<sup>|V_S|\u00d7|V_T|<\/sup> maps each student token to a weighted combination of teacher tokens, bridging the vocabulary gap.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">3<\/div>\n<div class=\"xt-step-txt\"><strong>Two Loss Modes:<\/strong> <em>P-KL<\/em> removes the partition entirely. <em>H-KL<\/em> retains the partition but relaxes matching via top-1 mappings under W. Each targets a different failure mode.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 4: Projection Matrix W --><\/p>\n<div class=\"xt-slide\" data-slide=\"3\">\n    <span class=\"xt-label\">04 \u2014 Projection Matrix W<\/span>\n<div class=\"xt-title\">How W is Constructed<\/div>\n<div class=\"xt-body\">\n<p>W is built <strong>deterministically before training<\/strong> in two passes. No training data or learned parameters are required at initialization.<\/p>\n<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\"><strong>Exact-match pass:<\/strong> For every (student, teacher) token pair whose decoded strings match after canonicalization, set W[s,t] = 1. Canonicalization unifies space prefixes, newlines, byte-fallback tokens, and special tokens across families.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>Multi-token rule pass:<\/strong> For unmatched student tokens, re-tokenize their decoded text under the teacher. Assign decayed weights W[s,\u03c4\u1d62] = \u03b2\u00b7\u03b3\u2071 with (\u03b2,\u03b3) = (0.9, 0.1). A 2-token span gets (0.909, 0.091). 
Each row is truncated to top-4 entries and row-normalized.<\/div>\n<\/div>\n<\/div>\n<div class=\"xt-body\">\n<p>Because each row sums to 1, W\u1d40 is <strong>probability-preserving<\/strong>: W\u1d40p_S is a valid probability vector over V_T without additional normalization.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 5: P-KL vs H-KL --><\/p>\n<div class=\"xt-slide\" data-slide=\"4\">\n    <span class=\"xt-label\">05 \u2014 Loss Formulations<\/span>\n<div class=\"xt-title\">P-KL vs H-KL: When to Use Each<\/div>\n<div class=\"xt-body\">\n<p>Selection is based on a <strong>coverage audit<\/strong>: measure what fraction of critical token categories (e.g. multi-digit numerals) appear in the common set C.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Property<\/th>\n<th>P-KL<\/th>\n<th>H-KL<\/th>\n<\/tr>\n<tr>\n<td>Partition<\/td>\n<td>Removed entirely<\/td>\n<td>Retained, relaxed<\/td>\n<\/tr>\n<tr>\n<td>Matching<\/td>\n<td>Full vocab via W<\/td>\n<td>Top-1 under W<\/td>\n<\/tr>\n<tr>\n<td>Use when<\/td>\n<td>Critical tokens fall outside C<\/td>\n<td>Partition is sound<\/td>\n<\/tr>\n<tr>\n<td>Teacher example<\/td>\n<td>Qwen3-4B<\/td>\n<td>Phi-4-mini-Instruct<\/td>\n<\/tr>\n<tr>\n<td>Avg. gain vs GOLD<\/td>\n<td class=\"xt-best\">+3.82<\/td>\n<td class=\"xt-best\">+0.52<\/td>\n<\/tr>\n<\/table>\n<div class=\"xt-body\">\n<p>Applying the wrong mode reverses results: P-KL on Phi-mini drops to <strong>37.50<\/strong> avg. 
vs H-KL\u2019s 39.18.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 6: Results --><\/p>\n<div class=\"xt-slide\" data-slide=\"5\">\n    <span class=\"xt-label\">06 \u2014 Results<\/span>\n<div class=\"xt-title\">Benchmark Results on Llama-3.2-1B (3-shot)<\/div>\n<div class=\"xt-body\">\n<p>Student: <strong>Llama-3.2-1B<\/strong> \u2014 trained on NemotronClimbMix, 30K steps, batch 768, context 4096.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Method<\/th>\n<th>GSM8k<\/th>\n<th>Avg.<\/th>\n<\/tr>\n<tr>\n<td>Llama-1B (base)<\/td>\n<td>5.69<\/td>\n<td>33.96<\/td>\n<\/tr>\n<tr>\n<td>Continued pre-training<\/td>\n<td>10.25<\/td>\n<td>36.63<\/td>\n<\/tr>\n<tr>\n<td>Same-tokenizer KD (Llama-3B)<\/td>\n<td>12.89<\/td>\n<td>38.40<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-bad\">Qwen-4B, GOLD<\/td>\n<td class=\"xt-bad\">2.56<\/td>\n<td class=\"xt-bad\">35.03<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Qwen-4B, X-Token (P-KL)<\/td>\n<td class=\"xt-best\">15.54<\/td>\n<td class=\"xt-best\">38.85<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini, GOLD<\/td>\n<td>16.50<\/td>\n<td>38.66<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini, X-Token (H-KL)<\/td>\n<td class=\"xt-best\">19.11<\/td>\n<td class=\"xt-best\">39.18<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini + Llama-3B (Multi)<\/td>\n<td class=\"xt-best\">20.39<\/td>\n<td class=\"xt-best\">40.48<\/td>\n<\/tr>\n<\/table><\/div>\n<p>  <!-- SLIDE 7: Multi-Teacher --><\/p>\n<div class=\"xt-slide\" data-slide=\"6\">\n    <span class=\"xt-label\">07 \u2014 Multi-Teacher Distillation<\/span>\n<div class=\"xt-title\">Teacher Complementarity Drives Gains<\/div>\n<div class=\"xt-body\">\n<p>X-Token extends to multiple teachers. Each gets its own projection matrix W_m and loss mode. The aggregated loss uses per-teacher weights \u03b1_m.<\/p>\n<p>Key finding: <strong>static weighting outperforms confidence-adaptive weighting<\/strong> in all tested setups. 
Phi-mini (\u03b1=0.8) + Llama-3B (\u03b1=0.2) achieves the best result.<\/p>\n<\/div>\n<table class=\"xt-table\">\n<tr>\n<th>Teacher Combination<\/th>\n<th>Avg.<\/th>\n<th>Note<\/th>\n<\/tr>\n<tr>\n<td>Phi-mini only (H-KL)<\/td>\n<td>39.18<\/td>\n<td>Best single<\/td>\n<\/tr>\n<tr>\n<td class=\"xt-best\">Phi-mini + Llama-3B<\/td>\n<td class=\"xt-best\">40.48<\/td>\n<td class=\"xt-best\">Complementary<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini + Qwen-4B<\/td>\n<td>38.49<\/td>\n<td>Overlapping<\/td>\n<\/tr>\n<tr>\n<td>Phi-mini + Qwen-4B + Llama-3B<\/td>\n<td>40.15<\/td>\n<td>3rd teacher hurts math<\/td>\n<\/tr>\n<\/table>\n<div class=\"xt-body\">\n<p>Combining two reasoning-heavy teachers (Phi-mini + Qwen-4B) scores <strong>below<\/strong> the best single teacher. Teacher diversity matters more than teacher count.<\/p>\n<\/div>\n<\/div>\n<p>  <!-- SLIDE 8: Key Takeaways --><\/p>\n<div class=\"xt-slide\" data-slide=\"7\">\n    <span class=\"xt-label\">08 \u2014 Key Takeaways<\/span>\n<div class=\"xt-title\">What to Remember About X-Token<\/div>\n<div class=\"xt-steps\">\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">1<\/div>\n<div class=\"xt-step-txt\">GOLD\u2019s partition actively harms training when critical tokens (e.g., multi-digit numerals) fall into the uncommon set \u2014 <strong>P-KL eliminates the partition entirely<\/strong> using projection matrix W.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">2<\/div>\n<div class=\"xt-step-txt\"><strong>H-KL<\/strong> retains the partition but relaxes matching to top-1 mappings under W \u2014 best when the partition is structurally sound.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">3<\/div>\n<div class=\"xt-step-txt\">The projection matrix W is <strong>built rule-based before training<\/strong> from tokenizer strings alone; no learned parameters required at init.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">4<\/div>\n<div 
class=\"xt-step-txt\">Multi-teacher gains (+1.3 over single-teacher) come from <strong>teacher complementarity<\/strong>, not from adding more teachers with overlapping strengths.<\/div>\n<\/div>\n<div class=\"xt-step\">\n<div class=\"xt-step-num\">5<\/div>\n<div class=\"xt-step-txt\">GSM8k recovers from <strong>2.56<\/strong> (GOLD) to <strong>15.54<\/strong> (P-KL), a 6\u00d7 gain that also exceeds the 12.89 reached by same-tokenizer KD from the weaker Llama-3.2-3B teacher.<\/div>\n<\/div>\n<\/div>\n<div class=\"xt-body\">\n<p><strong>arXiv:<\/strong> <em>2605.21699<\/em> \u00a0\u2014\u00a0 <strong>Institution:<\/strong> <em>NVIDIA<\/em><\/p>\n<\/div>\n<\/div>\n<p>  <!-- Footer --><\/p>\n<div class=\"xt-footer\">NVIDIA Research \u2014 X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation \u2014 arXiv:2605.21699 \u2014 marktechpost.com<\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>X-Token identifies two distinct, opposite failure modes in GOLD: uncommon-token suppression (fix: remove the partition with P-KL) and over-conservative matching (fix: relax it with H-KL).<\/li>\n<li>The projection matrix W is built by rule from tokenizer strings before training; it can optionally be jointly refined with the student for additional gains.<\/li>\n<li>P-KL on Qwen3-4B improves over GOLD by +3.82 avg. 
and recovers GSM8k from 2.56 to 15.54.<\/li>\n<li>Multi-teacher distillation gains (+1.3 over single-teacher) come from teacher complementarity, not just from adding more teachers.<\/li>\n<li>Loss mode selection (P-KL vs H-KL) is determined by a coverage audit on token categories; applying the wrong mode reverses the ranking.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.21699\" target=\"_blank\" rel=\"noreferrer noopener\">Research Paper<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/nvidia-introduces-x-token-projection-guided-cross-tokenizer-kd-that-outperforms-gold-by-3-82-average-points-on-llama-3-2-1b\/\">NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Knowledge distillation (KD) tr&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1006,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1005","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/1005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1005"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_
route=\/wp\/v2\/posts\/1005\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/1006"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}