
    Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models



    Pre-training large language models is expensive enough that even modest efficiency improvements can translate into meaningful cost and time savings. Nous Research is releasing Token Superposition Training (TST), a method that substantially reduces pre-training wall-clock time at fixed compute without touching the model architecture, optimizer, tokenizer, parallelism strategy, or training data.

    At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline’s 12,311 — roughly a 2.5x reduction in total pre-training time.

    The Problem TST is Solving

    Modern LLM pre-training is heavily data-driven. Recent training regimes routinely overtrain well beyond compute-optimal estimates, and raw text throughput (how much data a model can process per FLOP) has become a key lever. Subword tokenizers like BPE already improve throughput by compressing sequences, and research suggests much of the BPE advantage over byte-level models comes simply from shorter sequences, which let the model see more text per unit of compute.

    TST asks whether that throughput lever can be pulled further during training, independently of the tokenizer and without permanently changing the model.

    How TST Works: Two Phases

    TST modifies the standard pre-training loop in two sequential phases:

    Phase 1 — Superposition: For the first r fraction of total training steps (the paper finds r ∈ [0.2, 0.4] to be close to optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length L is segmented into non-overlapping bags of s contiguous tokens. In the embedding layer, each bag is collapsed into a single latent “s-token” by averaging the s token embeddings. The transformer then processes a sequence of length L/s.
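The folding-and-averaging step can be sketched in a few lines of plain Python (illustrative only; `superpose`, the `embed` table, and the toy vectors are ours, not the paper's code):

```python
# Toy sketch: collapse a token sequence into bags of s contiguous tokens,
# averaging each bag's embeddings into one latent "s-token".
def superpose(token_ids, embed, s):
    """embed maps token id -> embedding vector (list of floats)."""
    assert len(token_ids) % s == 0, "sequence length must be divisible by s"
    latents = []
    for start in range(0, len(token_ids), s):
        bag = token_ids[start:start + s]
        vecs = [embed[t] for t in bag]
        dim = len(vecs[0])
        latents.append([sum(v[d] for v in vecs) / s for d in range(dim)])
    return latents

# A length-4 sequence with s=2 collapses to 2 latent positions.
embed = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [2.0, 0.0], 3: [0.0, 2.0]}
latents = superpose([0, 1, 2, 3], embed, s=2)
assert latents == [[0.5, 0.5], [1.0, 1.0]]
```

The transformer then runs over `latents`, a sequence half (in general, 1/s) the original length.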

    Crucially, each TST step is kept equal-FLOPs to a standard training step by increasing the data sequence length by s times during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute — this is what drives the throughput gain.

    On the output side, each latent position predicts the next bag of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target bag. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets — it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.
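To see why the MCE loss reduces to a mean of standard CE terms, here is a toy sketch (our own illustrative Python on a 4-token vocabulary, not the paper's implementation):

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a flat list of logits.
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def mce_loss(logits, target_bag):
    """Cross-entropy against a soft target with mass 1/s on each bag token."""
    s, v = len(target_bag), len(logits)
    q = [0.0] * v
    for t in target_bag:
        q[t] += 1.0 / s
    logp = log_softmax(logits)
    return -sum(q[i] * logp[i] for i in range(v))

logits = [2.0, 0.5, -1.0, 0.0]
bag = [0, 1]  # the next bag of s=2 target tokens
# Identical to averaging the s standard single-token CE losses:
mean_ce = sum(-log_softmax(logits)[t] for t in bag) / len(bag)
assert abs(mce_loss(logits, bag) - mean_ce) < 1e-12
```

Because the soft target puts mass only on the bag's tokens, the sum over the vocabulary collapses to a mean over the s targets, which is why existing fused CE kernels suffice.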

    Phase 2 — Recovery: After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining 1 - r steps. The TST code is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and remains there.

    The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.

    What the Experiments Show

    TST was validated at four scales: 270M and 600M dense models (SmolLM2-135M and SmolLM2-360M shapes adapted to the Llama3 modeling code; the Llama3-8B tokenizer and untied input/output embeddings bring the parameter counts to roughly 270M and 600M), 3B dense (SmolLM3 shape), and a 10B-A1B MoE in the Qwen3 family. Training used the DCLM dataset for the smaller runs and a 50/50 mix of DCLM and FineWeb-Edu for the MoE run. All runs used AdamW with the Warmup-Stable-Decay learning rate schedule and ran in TorchTitan under FSDP parallelism, on 64 NVIDIA B200 GPUs for the larger models and 8 B200 GPUs for the smaller ones.

    At the 3B scale with bag size s = 6 and step ratio r = 0.3, TST at 20,000 steps reaches a final loss of 2.676 — nearly matching a 36,000-step baseline at 2.677 — while using 247 B200-GPU-hours versus 443. The 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, versus 62.3 and 65.9 for the 36k baseline.

    At the 10B-A1B MoE scale with s = 16 and r ≈ 0.25, the TST run processes 2T data tokens and achieves a final loss of 2.236, below the baseline’s 2.252 after 1.05T tokens, while beating it on all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).

    The research team presents three comparison views against the baseline — equal-FLOPs, equal-loss, and equal-data. Under equal-FLOPs and equal-loss conditions, TST consistently wins. Under equal total token consumption, the baseline wins, because TST’s effective compute budget per data token is smaller. This is an important boundary condition that determines where TST applies.

    Two Distinct Mechanisms

    An ablation study isolates the input-side and output-side components. Both independently outperform the baseline; combining them produces further improvement without signs of interference. The authors interpret this as evidence that TST is two orthogonal mechanisms rather than a single trick.

    The output-side mechanism, next-bag-of-tokens prediction, is conceptually related to multi-token prediction (MTP). Unlike MTP, which adds k independent prediction heads and extra parameters, TST keeps a single output head and changes only the target, making it the least expensive member of a growing class of future-signal auxiliary objectives. It also shows consistent gains across all tested scales, including small models where MTP has been shown to degrade performance.

    The input-side mechanism has no direct analog in the recent pre-training literature. The research team offers two plausible explanations: it may implicitly regularize the embedding geometry (since many random s-grams of tokens must remain linearly separable once averaged), or it may act as a form of pre-pre-training, exposing the model to a coarser version of the real data before fine-resolution language modeling begins.

    A targeted ablation directly tests what happens when representation continuity is broken. The research team runs a 3B TST experiment where the input embedding and output LM head are randomly re-initialized at the start of Phase 2. The result: final loss jumps to 2.938 — worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not incidental to TST’s success — they are what makes it work.

    Marktechpost’s Visual Explainer


    Token Superposition Training — Practical Guide
    arXiv 2605.06546

    01 / Overview

    What Is Token Superposition Training?

    Token Superposition Training (TST) is a two-phase pre-training method from Nous Research that increases token throughput per FLOP without changing the model architecture, optimizer, tokenizer, parallelism, or training data.

    The core idea: Instead of feeding one token at a time, average s contiguous token embeddings into one “s-token,” train on that for the first r fraction of steps, then switch back to standard next-token prediction. The final model is architecturally identical to one trained normally.
    • Phase 1 (Superposition) — model reads bags of s tokens, predicts the next bag
    • Phase 2 (Recovery) — standard next-token prediction resumes from the checkpoint
    • Inference — completely unchanged; no new heads, no new parameters
    • Validated at 270M, 600M, 3B dense and 10B–A1B MoE
    TST trades higher data consumption for compute efficiency. Best suited for compute-bound pre-training, not data-bound.

    02 / Phase 1

    Phase 1 — The Superposition Phase

    For the first r fraction of total training steps, the input sequence of length L is split into non-overlapping bags of s contiguous tokens. Their embeddings are averaged into a single latent s-token. The transformer processes a sequence of length L/s — but each position corresponds to s real tokens, so throughput is higher at the same FLOPs.

    Equal-FLOPs trick: To keep each step equal-FLOPs to the baseline, the data sequence length is increased s-fold, not the batch size. Every TST step costs the same compute as a standard step.
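A hypothetical schedule helper (names and interface are ours, not from the paper's code) makes the bookkeeping concrete:

```python
def tst_schedule(total_steps, r, base_seq_len, s):
    """Split a run into superposition and recovery phases, keeping each
    superposition step equal-FLOPs by lengthening the data, not the batch."""
    phase1_steps = round(r * total_steps)
    return {
        "phase1_steps": phase1_steps,                 # superposition phase
        "phase2_steps": total_steps - phase1_steps,   # recovery phase
        "phase1_data_len": base_seq_len * s,          # s x more text per step
        "model_positions": base_seq_len,              # transformer length unchanged
    }

# The 3B run's reported settings: 20k steps, r = 0.3 (base_seq_len is ours).
cfg = tst_schedule(total_steps=20_000, r=0.3, base_seq_len=4_096, s=6)
assert cfg["phase1_steps"] == 6_000 and cfg["phase2_steps"] == 14_000
assert cfg["phase1_data_len"] == 24_576
```

Because the model still attends over `base_seq_len` positions while consuming `base_seq_len * s` raw tokens, the FLOPs per step match the baseline while text throughput is s times higher.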

    On the output side, the loss target shifts from a single next token to the next bag of s tokens. The multi-hot cross-entropy (MCE) loss assigns equal probability mass 1/s to each token in the target bag:

    # L_MCE = mean of s standard CE terms over the target bag
    loss = 0.0
    for i in range(superposition_bag_size):
        target = labels[..., i].flatten(0, 1)
        loss = loss + torch.nn.functional.cross_entropy(pred, target)
    loss = loss / superposition_bag_size

    No new kernel needed — reuses the existing fused CE kernel in your pre-training library.

    03 / Phase 2

    Phase 2 — The Recovery Phase

    After r × total_steps of superposition training, resume from the checkpoint with the TST code fully removed. Standard next-token prediction runs for the remaining (1 − r) × total_steps.

    What happens at the switch: A loss spike of 1–2 nats occurs at the phase boundary. It resolves within a few thousand steps. After that, the model crosses below the equal-FLOPs baseline and stays there.
    • Remove TST code fully — do not keep it as an auxiliary loss during Phase 2
    • Do not re-initialize the input embedding or LM head at the boundary
    • Shared representations across both phases are what make TST work
    Re-initializing the embedding or LM head at the phase boundary completely breaks TST. In a 3B ablation, this raised final loss from 2.676 to 2.938 — worse than the 2.808 baseline. The Phase 1 steps contributed nothing.

    04 / Implementation

    PyTorch Implementation

    Three changes to the standard training loop — input folding, averaged embedding lookup, and MCE loss.

    # 1. Input folding (inside train loop)
    if superposition_bag_size is not None and superposition_bag_size > 1:
        bs, seq = inputs.shape
        inputs = inputs.reshape(
            bs, seq // superposition_bag_size, superposition_bag_size
        )
    # 2. Averaged embedding lookup (inside model forward)
    if tokens.dim() == 3:
        bs, sp_seq, superposition_bag_size = tokens.shape
        h_dtype = self.tok_embeddings.weight.dtype  # training dtype to cast back to
        h = self.tok_embeddings(tokens[..., 0]).float()
        for i in range(1, superposition_bag_size):
            h = h + self.tok_embeddings(tokens[..., i]).float()
        h = (h / superposition_bag_size).to(h_dtype)
    else:
        h = self.tok_embeddings(tokens)
    Note: Sum in float32 for numerical precision, then cast back to training dtype. The embedding layer is the only forward-pass change.

    05 / Hyperparameters

    Tuning Bag Size s and Step Ratio r

    Two hyperparameters control TST. Both have well-defined practical ranges validated across model scales.

    Step Ratio r
    0.2–0.4
    Fraction of total steps run in superposition mode. Robust across all tested scales. Below 0.2, the throughput gain is too small. Above 0.5, Phase 2 cannot fully recover.
    Bag Size s
    3–16
    U-shaped optimum that shifts with model size. Start in the flat basin; overshooting makes the bag target too lossy to recover from.

    Model Size   | Recommended s | Recommended r
    270M         | 3–8           | 0.2–0.4
    600M         | 6–10          | 0.2–0.4
    3B           | 6 (tested)    | 0.3 (tested)
    10B–A1B MoE  | 16 (tested)   | ≈0.25 (tested)
    Large bag sizes (s ≥ 8): Switch from uniform MCE loss weighting to power-law weighting (1/i per position). Motivated by mutual information between token pairs decaying as a power law with distance (fitted exponent k ≈ −1.25 on DCLM).
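A minimal sketch of that power-law weighting (our own illustrative code; the paper's exact exponent handling and normalization may differ):

```python
def powerlaw_weights(s, k=-1.25):
    """Weight the i-th target in a bag of s by (i+1)**k, normalized to sum to 1,
    so nearer-future tokens receive more loss mass than distant ones."""
    raw = [(i + 1) ** k for i in range(s)]
    total = sum(raw)
    return [w / total for w in raw]

w = powerlaw_weights(16)
assert abs(sum(w) - 1.0) < 1e-9
assert w[0] > w[1] > w[-1]  # monotonically decaying with distance
```

These weights would replace the uniform 1/s factors in the MCE average; with k = 0 the scheme reduces back to the uniform weighting used for small bags.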

    06 / Negative Results

    What Doesn’t Work

    The paper documents several variants that were tested and failed. Save yourself the compute.

    • Positional encodings before averaging — adding RoPE or sinusoidal encodings to tokens before the mean consistently hurt performance. Within-bag permutation invariance appears to be a feature, not a bug.
    • RoPE rescaling at phase transition — accelerated early Phase 2 recovery but sometimes raised final loss. Leave RoPE unchanged across the boundary.
    • s independent heads — replacing the single MCE head with s separate heads predicting s positions gave no consistent gain at higher parameter cost and implementation complexity.
    • Binary cross-entropy / hinge loss — both significantly underperformed the MCE formulation and even fell below the baseline.
    • Retaining TST head in Phase 2 — not yet benchmarked but identified as future work; do not assume it helps.
    Bottom line: The simplest version works best — mean embeddings in, mean CE loss out, hard switch at the phase boundary, no extra parameters.

    07 / Results

    Key Results & When to Use TST

    At equal wall-clock — same compute, better loss:

    Scale        | B200-hrs | TST Loss | Baseline Loss
    3B dense     | 247      | 2.676    | 2.808
    10B–A1B MoE  | 4,768    | 2.236    | 2.252 (@ 12,311 hrs)

    At equal final loss — wall-clock saved:

    Scale        | TST (B200-hrs) | Baseline (B200-hrs) | Speedup
    3B dense     | 247            | 443                 | ≈1.8×
    10B–A1B MoE  | 4,768          | 12,311              | ≈2.5×
    Use TST when
    ✓ You are compute-bound
    ✓ You have ample data
    ✓ You want lower loss at the same FLOPs
    ✓ You need the same inference model
    Avoid TST when
    ✕ Data is the bottleneck (TST uses s× more tokens in Phase 1)
    ✕ You compare at equal token consumption
    ✕ Under equal-data conditions, baseline wins

    Paper: arXiv 2605.06546  •  nousresearch.com/token-superposition


    Key Takeaways

    • Nous Research’s Token Superposition Training (TST) cuts LLM pre-training time by up to 2.5x at matched FLOPs — no architecture, tokenizer, or optimizer changes required.
    • Phase 1 averages contiguous token embeddings into bags and predicts the next bag via multi-hot cross-entropy; Phase 2 reverts to standard next-token prediction from the same checkpoint.
    • Validated at 270M, 600M, 3B dense, and 10B-A1B MoE — TST beats the baseline on loss and downstream evals (HellaSwag, ARC, MMLU) across all scales.
    • Optimal hyperparameters: bag size s ∈ [3–8] for smaller models, step ratio r ∈ [0.2, 0.4]; shared embeddings across both phases are critical — re-initializing them makes TST worse than the baseline.
    • Trade-off: TST consumes more raw data tokens per compute budget — best suited for compute-bound training; the output-only variant is the alternative for data-bound settings.


    The post Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models appeared first on MarkTechPost.
