{"id":900,"date":"2026-05-14T13:46:32","date_gmt":"2026-05-14T05:46:32","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=900"},"modified":"2026-05-14T13:46:32","modified_gmt":"2026-05-14T05:46:32","slug":"nous-research-releases-token-superposition-training-to-speed-up-llm-pre-training-by-up-to-2-5x-across-270m-to-10b-parameter-models","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=900","title":{"rendered":"Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models"},"content":{"rendered":"<p>Pre-training large language models is expensive enough that even modest efficiency improvements can translate into meaningful cost and time savings. <strong>Nous Research is releasing Token Superposition Training (TST)<\/strong>, a method that substantially reduces pre-training wall-clock time at fixed compute without touching the model architecture, optimizer, tokenizer, parallelism strategy, or training data. 
<\/p>\n<p>At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline\u2019s 12,311 \u2014 roughly a 2.5x reduction in total pre-training time.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1332\" height=\"776\" data-attachment-id=\"79825\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/13\/nous-research-releases-token-superposition-training-to-speed-up-llm-pre-training-by-up-to-2-5x-across-270m-to-10b-parameter-models\/screenshot-2026-05-13-at-10-33-23-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-13-at-10.33.23-PM-1.png\" data-orig-size=\"1332,776\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-13 at 10.33.23\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-13-at-10.33.23-PM-1-1024x597.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-13-at-10.33.23-PM-1.png\" alt=\"\" class=\"wp-image-79825\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.06546<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>The Problem TST is Solving<\/strong><\/h2>\n<p>Modern LLM pre-training is heavily data-driven. Recent training regimes routinely overtrain well beyond compute-optimal estimates, and raw text throughput (how much data a model can process per FLOP) has become a key lever. 
Subword tokenizers like BPE already improve throughput by compressing sequences, and the research suggests that much of the BPE advantage over byte-level models comes simply from shorter sequences, which means the model sees more text per unit of compute.<\/p>\n<p>TST asks whether that throughput lever can be pulled further during training, independently of the tokenizer and without permanently changing the model.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How TST Works: Two Phases<\/strong><\/h2>\n<p><strong>TST modifies the standard pre-training loop in two sequential phases:<\/strong><\/p>\n<p><strong>Phase 1 \u2014 Superposition:<\/strong> For the first <code>r<\/code> fraction of total training steps (the paper finds <code>r \u2208 [0.2, 0.4]<\/code> to be close to optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length <code>L<\/code> is segmented into non-overlapping bags of <code>s<\/code> contiguous tokens. In the embedding layer, each bag is collapsed into a single latent \u201cs-token\u201d by averaging the <code>s<\/code> token embeddings. The transformer then processes a sequence of length <code>L\/s<\/code>.<\/p>\n<p>Crucially, each TST step is kept equal-FLOPs to a standard training step by <em>increasing the data sequence length by s times<\/em> during the superposition phase, so a data sequence <code>s<\/code> times longer than the baseline\u2019s folds back to the baseline\u2019s number of positions. Because each latent position corresponds to <code>s<\/code> source tokens, the model ingests <code>s<\/code> times as much text per unit of compute. This is what drives the throughput gain.<\/p>\n<p>On the output side, each latent position predicts the next bag of <code>s<\/code> tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass <code>1\/s<\/code> to each token in the target bag. 
The MCE loss reduces to a simple mean of standard cross-entropy terms over the <code>s<\/code> targets \u2014 it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.<\/p>\n<p><strong>Phase 2 \u2014 Recovery:<\/strong> After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining <code>1 - r<\/code> steps. The TST code is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and remains there.<\/p>\n<p>The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What the Experiments Show<\/strong><\/h2>\n<p><strong>TST was validated at four scales:<\/strong> 270M and 600M dense (SmolLM2 shapes adapted to the Llama3 modeling code, with the Llama3-8B tokenizer and untied input\/output embeddings \u2014 which makes the 270M model equivalent in size to SmolLM2-135M and the 600M to SmolLM2-360M), 3B dense (SmolLM3 shape), and a 10B-A1B MoE in the Qwen3 family. Training used the DCLM dataset for the smaller runs and a 50\/50 mix of DCLM and FineWeb-Edu for the MoE run. All runs used AdamW with the Warmup-Stable-Decay learning rate schedule and were run in TorchTitan under FSDP parallelism, on 64 NVIDIA B200 GPUs for the larger models and 8 B200 GPUs for the smaller ones.<\/p>\n<p>At the 3B scale with bag size <code>s = 6<\/code> and step ratio <code>r = 0.3<\/code>, TST at 20,000 steps reaches a final loss of 2.676 \u2014 nearly matching a 36,000-step baseline at 2.677 \u2014 while using 247 B200-GPU-hours versus 443. 
The 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, versus 62.3 and 65.9 for the 36k baseline.<\/p>\n<p>At the 10B-A1B MoE scale with <code>s = 16<\/code> and <code>r \u2248 0.25<\/code>, the TST run processes 2T data tokens and achieves a final loss of 2.236, below the baseline\u2019s 2.252 after 1.05T tokens, while beating it on all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).<\/p>\n<p>The research team presents three comparison views against the baseline \u2014 equal-FLOPs, equal-loss, and equal-data. Under equal-FLOPs and equal-loss conditions, TST consistently wins. Under equal total token consumption, the baseline wins, because TST\u2019s effective compute budget per data token is smaller. This is an important boundary condition that determines where TST applies.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Two Distinct Mechanisms<\/strong><\/h2>\n<p>An ablation study isolates the input-side and output-side components. Both independently outperform the baseline; combining them produces further improvement without signs of interference. The authors interpret this as evidence that TST is two orthogonal mechanisms rather than a single trick.<\/p>\n<p>The output-side mechanism \u2014 next-bag-of-tokens prediction \u2014 is conceptually related to multi-token prediction (MTP). Unlike MTP, which adds <code>k<\/code> independent prediction heads and extra parameters, TST keeps a single output head and replaces only the target. This makes it the least expensive member of a growing class of future-signal auxiliary objectives. Unlike MTP, it shows consistent gains across all tested scales including small models where MTP has been shown to degrade performance.<\/p>\n<p>The input-side mechanism has no direct analog in the recent pre-training literature. 
The research team offers two plausible explanations: it may implicitly regularize the embedding geometry (since many random s-grams of tokens must remain linearly separable once averaged), or it may act as a form of pre-pre-training, exposing the model to a coarser version of the real data before fine-resolution language modeling begins.<\/p>\n<p>A targeted ablation directly tests what happens when representation continuity is broken. The research team runs a 3B TST experiment where the input embedding and output LM head are randomly re-initialized at the start of Phase 2. The result: final loss jumps to 2.938 \u2014 worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not incidental to TST\u2019s success \u2014 they are what makes it work.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- top bar --><\/p>\n<div class=\"tst-topbar\">\n    <span class=\"tst-topbar-label\">Token Superposition Training \u2014 Practical Guide<\/span><br \/>\n    <span class=\"tst-topbar-badge\">arXiv 2605.06546<\/span>\n  <\/div>\n<p>  <!-- slides viewport --><\/p>\n<div class=\"tst-viewport\">\n<div class=\"tst-track\">\n<p>      <!-- SLIDE 1: What is TST? 
--><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">01 \/ Overview<\/span>\n<h3 class=\"tst-slide-title\">What Is Token Superposition Training?<\/h3>\n<p class=\"tst-body\">\n          Token Superposition Training (TST) is a two-phase pre-training method from Nous Research that increases token throughput per FLOP without changing the model architecture, optimizer, tokenizer, parallelism, or training data.\n        <\/p>\n<div class=\"tst-callout\">\n          <strong>The core idea:<\/strong> Instead of feeding one token at a time, average <strong>s<\/strong> contiguous token embeddings into one \u201cs-token,\u201d train on that for the first <strong>r<\/strong> fraction of steps, then switch back to standard next-token prediction. The final model is architecturally identical to one trained normally.\n        <\/div>\n<ul class=\"tst-list\">\n<li><strong>Phase 1 (Superposition)<\/strong> \u2014 model reads bags of s tokens, predicts the next bag<\/li>\n<li><strong>Phase 2 (Recovery)<\/strong> \u2014 standard next-token prediction resumes from the checkpoint<\/li>\n<li><strong>Inference<\/strong> \u2014 completely unchanged; no new heads, no new parameters<\/li>\n<li><strong>Validated at<\/strong> 270M, 600M, 3B dense and 10B\u2013A1B MoE<\/li>\n<\/ul>\n<div class=\"tst-warn\">TST gains compute efficiency at the cost of higher data consumption. It is best suited for compute-bound pre-training, not data-bound.<\/div>\n<\/div>\n<p>      <!-- SLIDE 2: Phase 1 --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">02 \/ Phase 1<\/span>\n<h3 class=\"tst-slide-title\">Phase 1 \u2014 The Superposition Phase<\/h3>\n<p class=\"tst-body\">\n          For the first <code>r<\/code> fraction of total training steps, the input sequence of length <code>L<\/code> is split into non-overlapping bags of <code>s<\/code> contiguous tokens. Their embeddings are averaged into a single latent s-token. 
The transformer processes a sequence of length <code>L\/s<\/code> \u2014 but each position corresponds to <code>s<\/code> real tokens, so throughput is <code>s\u00d7<\/code> higher at the same FLOPs.\n        <\/p>\n<div class=\"tst-callout\">\n          <strong>Equal-FLOPs trick:<\/strong> To keep each step equal-FLOPs to baseline, the data sequence length is increased by <code>s\u00d7<\/code> \u2014 not the batch size. Every TST step costs the same compute as a standard step.\n        <\/div>\n<p class=\"tst-body\">\n          On the output side, the loss target shifts from a single next token to the next <strong>bag of s tokens<\/strong>. The multi-hot cross-entropy (MCE) loss assigns equal probability mass <code>1\/s<\/code> to each token in the target bag:\n        <\/p>\n<pre><span class=\"cm\"># L_MCE = mean of s standard CE terms<\/span>\n<span class=\"cm\"># Assumed shapes: pred is (batch * positions, vocab); labels is (batch, positions, s)<\/span>\nloss = 0.0\n<span class=\"kw\">for<\/span> i <span class=\"kw\">in<\/span> range(superposition_bag_size):\n    <span class=\"cm\"># i-th token of every target bag, flattened to match pred<\/span>\n    target = labels[..., i].flatten(0, 1)\n    loss += torch.nn.functional.cross_entropy(pred, target)\nloss = loss \/ superposition_bag_size<\/pre>\n<p class=\"tst-body\">No new kernel needed \u2014 reuses the existing fused CE kernel in your pre-training library.<\/p>\n<\/div>\n<p>      <!-- SLIDE 3: Phase 2 --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">03 \/ Phase 2<\/span>\n<h3 class=\"tst-slide-title\">Phase 2 \u2014 The Recovery Phase<\/h3>\n<p class=\"tst-body\">\n          After <code>r \u00d7 total_steps<\/code> of superposition training, resume from the checkpoint with the TST code <em>fully removed<\/em>. Standard next-token prediction runs for the remaining <code>(1 - r) \u00d7 total_steps<\/code>.\n        <\/p>\n<div class=\"tst-callout\">\n          <strong>What happens at the switch:<\/strong> A loss spike of 1\u20132 nats occurs at the phase boundary. It resolves within a few thousand steps. 
After that, the model crosses below the equal-FLOPs baseline and stays there.\n        <\/div>\n<ul class=\"tst-list\">\n<li>Remove TST code fully \u2014 do not keep it as an auxiliary loss during Phase 2<\/li>\n<li>Do <strong>not<\/strong> re-initialize the input embedding or LM head at the boundary<\/li>\n<li>Shared representations across both phases are what make TST work<\/li>\n<\/ul>\n<div class=\"tst-warn\">\n          Re-initializing the embedding or LM head at the phase boundary completely breaks TST. In a 3B ablation, this raised final loss from 2.676 to 2.938 \u2014 worse than the 2.808 baseline. The Phase 1 steps contributed nothing.\n        <\/div>\n<\/div>\n<p>      <!-- SLIDE 4: Code --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">04 \/ Implementation<\/span>\n<h3 class=\"tst-slide-title\">PyTorch Implementation<\/h3>\n<p class=\"tst-body\">Three changes to the standard training loop \u2014 input folding, averaged embedding lookup, and MCE loss.<\/p>\n<pre><span class=\"cm\"># 1. Input folding (inside train loop)<\/span>\n<span class=\"kw\">if<\/span> superposition_bag_size <span class=\"kw\">is not<\/span> None <span class=\"kw\">and<\/span> superposition_bag_size &gt; 1:\n    bs, seq = inputs.shape\n    inputs = inputs.reshape(\n        bs, seq \/\/ superposition_bag_size, superposition_bag_size\n    )<\/pre>\n<pre><span class=\"cm\"># 2. 
Averaged embedding lookup (inside model forward)<\/span>\n<span class=\"kw\">if<\/span> len(tokens.shape) == 3:\n    bs, sp_seq, superposition_bag_size = tokens.shape\n    first = self.tok_embeddings(tokens[..., 0])\n    h_dtype = first.dtype  <span class=\"cm\"># training dtype to cast back to<\/span>\n    h = first.float()  <span class=\"cm\"># accumulate in float32<\/span>\n    <span class=\"kw\">for<\/span> i <span class=\"kw\">in<\/span> range(1, superposition_bag_size):\n        h = h + self.tok_embeddings(tokens[..., i]).float()\n    h = (h \/ superposition_bag_size).to(h_dtype)\n<span class=\"kw\">else<\/span>:\n    h = self.tok_embeddings(tokens)<\/pre>\n<div class=\"tst-callout\">\n          <strong>Note:<\/strong> Sum in <code>float32<\/code> for numerical precision, then cast back to training dtype. The embedding layer is the only forward-pass change.\n        <\/div>\n<\/div>\n<p>      <!-- SLIDE 5: Hyperparameters --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">05 \/ Hyperparameters<\/span>\n<h3 class=\"tst-slide-title\">Tuning Bag Size <code>s<\/code> and Step Ratio <code>r<\/code><\/h3>\n<p class=\"tst-body\">Two hyperparameters control TST. Both have well-defined practical ranges validated across model scales.<\/p>\n<div class=\"tst-grid\">\n<div class=\"tst-card\">\n            <span class=\"tst-card-label\">Step Ratio r<\/span><br \/>\n            <span class=\"tst-card-val\">0.2 \u2014 0.4<\/span><br \/>\n            <span class=\"tst-card-sub\">Fraction of total steps run in superposition mode. Robust across all tested scales. Below 0.2, throughput gain is too small. Above 0.5, Phase 2 cannot fully recover.<\/span>\n          <\/div>\n<div class=\"tst-card\">\n            <span class=\"tst-card-label\">Bag Size s<\/span><br \/>\n            <span class=\"tst-card-val\">3 \u2014 16<\/span><br \/>\n            <span class=\"tst-card-sub\">U-shaped optimum that shifts with model size. 
Start in the flat basin; overshooting makes the bag target too lossy to recover from.<\/span>\n          <\/div>\n<\/div>\n<div class=\"tst-table-wrap\">\n<table class=\"tst-table\">\n<tr>\n<th>Model Size<\/th>\n<th>Recommended s<\/th>\n<th>Recommended r<\/th>\n<\/tr>\n<tr>\n<td>270M<\/td>\n<td>3 \u2014 8<\/td>\n<td>0.2 \u2014 0.4<\/td>\n<\/tr>\n<tr>\n<td>600M<\/td>\n<td>6 \u2014 10<\/td>\n<td>0.2 \u2014 0.4<\/td>\n<\/tr>\n<tr>\n<td>3B<\/td>\n<td>6 (tested)<\/td>\n<td>0.3 (tested)<\/td>\n<\/tr>\n<tr>\n<td>10B\u2013A1B MoE<\/td>\n<td>16 (tested)<\/td>\n<td>\u223c0.25 (tested)<\/td>\n<\/tr>\n<\/table><\/div>\n<div class=\"tst-callout\">\n          <strong>Large bag sizes (s \u2265 8):<\/strong> Switch from uniform MCE loss weighting to power-law weighting (<code>1\/i<\/code> per position). Motivated by mutual information between token pairs decaying as a power law with distance (fitted exponent k \u2248 \u22121.25 on DCLM).\n        <\/div>\n<\/div>\n<p>      <!-- SLIDE 6: What Not To Do --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">06 \/ Negative Results<\/span>\n<h3 class=\"tst-slide-title\">What Doesn\u2019t Work<\/h3>\n<p class=\"tst-body\">The paper documents several variants that were tested and failed. Save yourself the compute.<\/p>\n<ul class=\"tst-list\">\n<li><strong>Positional encodings before averaging<\/strong> \u2014 adding RoPE or sinusoidal encodings to tokens before the mean consistently hurt performance. Within-bag permutation invariance appears to be a feature, not a bug.<\/li>\n<li><strong>RoPE rescaling at phase transition<\/strong> \u2014 accelerated early Phase 2 recovery but sometimes raised final loss. 
Leave RoPE unchanged across the boundary.<\/li>\n<li><strong>s independent heads<\/strong> \u2014 replacing the single MCE head with s separate heads predicting s positions gave no consistent gain at higher parameter cost and implementation complexity.<\/li>\n<li><strong>Binary cross-entropy \/ hinge loss<\/strong> \u2014 both significantly underperformed the MCE formulation and even fell below the baseline.<\/li>\n<li><strong>Retaining TST head in Phase 2<\/strong> \u2014 not yet benchmarked but identified as future work; do not assume it helps.<\/li>\n<\/ul>\n<div class=\"tst-callout\">\n          <strong>Bottom line:<\/strong> The simplest version works best \u2014 mean embeddings in, mean CE loss out, hard switch at the phase boundary, no extra parameters.\n        <\/div>\n<\/div>\n<p>      <!-- SLIDE 7: Results --><\/p>\n<div class=\"tst-slide\">\n        <span class=\"tst-tag\">07 \/ Results<\/span>\n<h3 class=\"tst-slide-title\">Key Results &amp; When to Use TST<\/h3>\n<p class=\"tst-body\">At equal wall-clock \u2014 same compute, better loss:<\/p>\n<div class=\"tst-table-wrap\">\n<table class=\"tst-table\">\n<tr>\n<th>Scale<\/th>\n<th>B200-hrs<\/th>\n<th>TST Loss<\/th>\n<th>Baseline Loss<\/th>\n<\/tr>\n<tr>\n<td>3B dense<\/td>\n<td>247<\/td>\n<td class=\"win\">2.676<\/td>\n<td>2.808<\/td>\n<\/tr>\n<tr>\n<td>10B\u2013A1B MoE<\/td>\n<td>4,768<\/td>\n<td class=\"win\">2.236<\/td>\n<td>2.252 (@ 12,311 hrs)<\/td>\n<\/tr>\n<\/table><\/div>\n<p class=\"tst-body\">At equal final loss \u2014 wall-clock saved:<\/p>\n<div class=\"tst-table-wrap\">\n<table class=\"tst-table\">\n<tr>\n<th>Scale<\/th>\n<th>TST (B200-hrs)<\/th>\n<th>Baseline (B200-hrs)<\/th>\n<th>Speedup<\/th>\n<\/tr>\n<tr>\n<td>3B dense<\/td>\n<td class=\"win\">247<\/td>\n<td>443<\/td>\n<td class=\"win\">\u223c1.8\u00d7<\/td>\n<\/tr>\n<tr>\n<td>10B\u2013A1B MoE<\/td>\n<td class=\"win\">4,768<\/td>\n<td>12,311<\/td>\n<td class=\"win\">\u223c2.5\u00d7<\/td>\n<\/tr>\n<\/table><\/div>\n<div 
class=\"tst-grid\">\n<div class=\"tst-card\">\n            <span class=\"tst-card-label\">Use TST when<\/span><br \/>\n            <span class=\"tst-card-sub\">\u2713 You are compute-bound<br \/>\u2713 You have ample data<br \/>\u2713 You want lower loss at the same FLOPs<br \/>\u2713 You need the same inference model<\/span>\n          <\/div>\n<div class=\"tst-card\">\n            <span class=\"tst-card-label\">Avoid TST when<\/span><br \/>\n            <span class=\"tst-card-sub\">\u2715 Data is the bottleneck (TST uses s\u00d7 more tokens in Phase 1)<br \/>\u2715 You compare at equal token consumption<br \/>\u2715 Under equal-data conditions, baseline wins<\/span>\n          <\/div>\n<\/div>\n<p class=\"tst-body\">Paper: arXiv 2605.06546 \u00a0\u2022\u00a0 nousresearch.com\/token-superposition<\/p>\n<\/div>\n<\/div>\n<p><!-- \/tst-track -->\n  <\/p><\/div>\n<p><!-- \/tst-viewport --><\/p>\n<p>  <!-- navigation --><\/p>\n<div class=\"tst-nav\">\n    <button class=\"tst-btn\" disabled>\u2190 Prev<\/button>\n<div class=\"tst-dots\"><\/div>\n<p>    <button class=\"tst-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<p>  <!-- attribution --><\/p>\n<div class=\"tst-footer\">\n    <span>Created &amp; designed by \u00a0<\/span><a href=\"https:\/\/marktechpost.com\/\" target=\"_blank\" rel=\"noopener\">marktechpost.com<\/a>\n  <\/div>\n<\/div>\n<p><!-- \/tst-guide --><\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Nous Research&#8217;s Token Superposition Training (TST) cuts LLM pre-training time by up to 2.5x at matched FLOPs \u2014 no architecture, tokenizer, or optimizer changes required.<\/li>\n<li>Phase 1 averages contiguous token embeddings into bags and predicts the next bag via multi-hot cross-entropy; Phase 2 reverts to standard next-token prediction from the same checkpoint.<\/li>\n<li>Validated at 270M, 600M, 3B dense, and 10B-A1B MoE \u2014 TST beats the baseline on loss and downstream evals 
(HellaSwag, ARC, MMLU) across all scales.<\/li>\n<li>Optimal hyperparameters: bag size s \u2208 [3\u20138] for smaller models, step ratio r \u2208 [0.2, 0.4]; shared embeddings across both phases are critical \u2014 re-initializing them makes TST worse than the baseline.<\/li>\n<li>Trade-off: TST consumes more raw data tokens per compute budget \u2014 best suited for compute-bound training; the output-only variant is the alternative for data-bound settings.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.06546\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> <\/strong>and <strong><a href=\"https:\/\/nousresearch.com\/token-superposition\" target=\"_blank\" rel=\"noreferrer noopener\">Project<\/a>.<\/strong><\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/13\/nous-research-releases-token-superposition-training-to-speed-up-llm-pre-training-by-up-to-2-5x-across-270m-to-10b-parameter-models\/\">Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Pre-training large language mo&hellip;<\/p>\n","protected":false},"author":1,"featured_media":901,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-900","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=900"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts
\/900\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/901"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=900"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=900"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}