    Meta and Stanford Researchers Propose Fast Byte Latent Transformer That Reduces Inference Memory Bandwidth by Over 50% Without Tokenization

    A team of researchers from Meta, Stanford University, and the University of Washington has introduced three new methods that substantially accelerate generation in the Byte Latent Transformer (BLT), a language model architecture that operates directly on raw bytes instead of tokens.

    Byte-Level Models Are Slow at Inference

    To understand what this new research solves, you need to understand the tradeoff at the center of byte-level language modeling.

    Most language models today work on tokens — chunks of text produced by subword tokenizers like byte-pair encoding (BPE). A token typically represents several characters or even a whole word. While this is efficient, tokenization comes with known downsides: sensitivity to input noise, poor handling of multilingual text, weak character-level understanding, and fragility on structured inputs like code and numbers.

    Byte-level models sidestep all of this by operating directly on raw bytes — the lowest-level representation of text. The Byte Latent Transformer (BLT) was a major step forward: it matched the performance of tokenization-based models at scale by grouping bytes dynamically into variable-length patches using an entropy-based segmentation strategy. High-entropy (harder-to-predict) regions get shorter patches; more predictable spans get longer ones. The architecture splits the work across three components: a local encoder, a large global Transformer, and a local decoder. The bulk of computation runs over latent patch representations rather than raw bytes, with an average patch size of 4 bytes and a maximum of 8.
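
    To make the segmentation idea concrete, here is a minimal Python sketch of entropy-based patching. It assumes a small next-byte model has already produced a probability distribution for each position; the threshold, the maximum patch length, and the function names are illustrative choices, not the paper's exact settings.

    ```python
    import math

    def entropy(probs):
        """Shannon entropy (nats) of a next-byte distribution."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    def entropy_patch(byte_seq, next_byte_probs, threshold=2.0, max_patch=8):
        """Group bytes into variable-length patches.

        A new patch starts whenever the predicted next-byte entropy exceeds
        `threshold` (a hard-to-predict region) or the current patch reaches
        `max_patch` bytes.  `next_byte_probs[i]` is the distribution
        predicted *before* seeing byte i.
        """
        patches, current = [], []
        for b, probs in zip(byte_seq, next_byte_probs):
            if current and (entropy(probs) > threshold or len(current) >= max_patch):
                patches.append(bytes(current))
                current = []
            current.append(b)
        if current:
            patches.append(bytes(current))
        return patches
    ```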

    The remaining problem is inference speed. Even with BLT’s hierarchical design, the local decoder still generates one byte at a time autoregressively. Since a typical subword token corresponds to several bytes, BLT needs multiple decoder forward passes to produce the same amount of text that a token-level model produces in one step. In modern LLM serving, the bottleneck is often not compute but memory bandwidth — repeatedly loading model weights and key-value caches from memory. More decoder forward passes means more memory loads, which directly translates to slower generation.

    Three Methods, One Goal: Fewer Forward Passes

    The research team introduces three techniques that reduce this bottleneck, each trading speed against generation quality differently.

    BLT Diffusion (BLT-D)

    BLT Diffusion is the core contribution and the fastest variant. The key idea is to replace autoregressive byte-by-byte decoding with block-wise discrete diffusion in the local decoder.

    During training, the decoder receives two inputs: a clean byte sequence (the original text) and a corrupted sequence of fixed-length byte blocks. For each block, a continuous diffusion timestep t is sampled from U(0,1), and each byte in the block is independently replaced with a [MASK] token with probability t. This means the degree of masking varies per training example — a lower t leaves most bytes visible; a higher t masks most of them. The block size B (set to 4, 8, or 16 bytes in experiments) typically extends beyond BLT’s average patch size of 4 bytes, teaching the decoder to predict bytes further into the future than it normally would. The total training loss combines the standard autoregressive next-byte prediction loss on the clean sequence and a masked-byte prediction loss on the corrupted blocks — conceptually similar to how masked language modeling in BERT works, but applied at the byte level within BLT’s hierarchical architecture.
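
    A minimal sketch of the block-wise corruption step is shown below, assuming byte IDs in [0, 255] and a dedicated mask index one past the byte range; the exact vocabulary layout and batching in the paper may differ.

    ```python
    import torch

    MASK_ID = 256  # assumed index for the [MASK] token (one past the 256 byte values)

    def corrupt_blocks(byte_ids: torch.Tensor, block_size: int = 8) -> torch.Tensor:
        """Block-wise corruption for the diffusion objective: for each block,
        sample t ~ U(0, 1) and independently mask each byte with probability t."""
        corrupted = byte_ids.clone()
        for start in range(0, byte_ids.numel(), block_size):
            end = min(start + block_size, byte_ids.numel())
            t = torch.rand(()).item()              # per-block masking rate
            mask = torch.rand(end - start) < t     # per-byte Bernoulli(t)
            idx = torch.arange(start, end)[mask]
            corrupted[idx] = MASK_ID
        return corrupted

    # The training loss then combines the usual next-byte cross-entropy on the
    # clean sequence with a masked-byte cross-entropy on this corrupted copy.
    ```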

    At inference, BLT-D initializes a block of [MASK] positions and iteratively unmasks multiple byte positions per decoder step using one of two strategies: confidence-based unmasking (unmask positions whose predicted probability exceeds a threshold α) or entropy-bounded (EB) sampling (select the largest subset of positions whose cumulative entropy stays below a threshold γ). Both strategies generate multiple bytes per forward pass rather than one. The encoder and global model — BLT’s expensive components — are invoked once per block rather than once per patch, further reducing total model calls. BLT-D also supports KV caching, benefiting from any techniques that reduce KV-cache memory footprint.
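
    The two unmasking rules can be sketched as small selection functions over the decoder's per-position byte distributions. The vocabulary size of 257 (256 byte values plus the mask token), the threshold defaults, and the fallback behaviour are assumptions for illustration.

    ```python
    import torch

    def unmask_by_confidence(probs: torch.Tensor, still_masked: torch.Tensor,
                             alpha: float = 0.9) -> torch.Tensor:
        """Confidence-based unmasking: commit every still-masked position whose
        top predicted byte probability exceeds alpha (at least one per step).
        `probs` has shape (block_size, 257); `still_masked` is boolean."""
        top_p = probs.max(dim=-1).values
        chosen = ((top_p > alpha) & still_masked).nonzero(as_tuple=True)[0]
        if chosen.numel() == 0:  # fall back to the single most confident position
            chosen = torch.where(still_masked, top_p, torch.tensor(-1.0)).argmax().unsqueeze(0)
        return chosen

    def unmask_entropy_bounded(probs: torch.Tensor, still_masked: torch.Tensor,
                               gamma: float = 1.0) -> torch.Tensor:
        """Entropy-bounded (EB) selection: take the lowest-entropy masked
        positions whose cumulative entropy stays below gamma (at least one)."""
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        ent = ent.masked_fill(~still_masked, float("inf"))  # skip committed positions
        order = ent.argsort()
        keep = ent[order].cumsum(dim=0) <= gamma
        keep[0] = True
        return order[keep]
    ```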

    At 3B parameters, BLT-D-4 (block size 4) nearly matches BLT’s task scores while requiring less than half the memory bandwidth. BLT-D-16 (block size 16) achieves an 87–92% reduction in estimated memory-bandwidth cost compared to BLT, making it the fastest configuration evaluated — though with lower pass@1 scores on coding benchmarks (HumanEval, MBPP).

    BLT Self-Speculation (BLT-S)

    BLT Self-Speculation takes a different route, drawing on speculative decoding — a technique where a cheap draft model proposes tokens and a larger model verifies them in parallel. What makes BLT-S unusual is that it requires no separate draft model, no architectural changes, and no additional training. It repurposes BLT’s existing lightweight local decoder as the drafter.

    In standard BLT inference, the decoder stops generating whenever the entropy-based patcher determines that a new patch boundary has been reached — typically every four bytes. BLT-S instead lets the decoder autoregressively generate up to a fixed window size k (8 or 16 bytes in experiments) regardless of entropy spikes, conditioning on the last available latent token. After producing a draft of k bytes, the full model re-encodes the candidate sequence through the encoder, global model, and decoder and produces next-byte predictions. Drafted bytes are accepted up to the first mismatch; the first mismatched byte is replaced with the verified prediction.
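
    Under greedy decoding, the acceptance rule is simple enough to sketch directly; the function name and list-based interface below are illustrative.

    ```python
    def accept_draft(draft: list[int], verified: list[int]) -> list[int]:
        """BLT-S acceptance under greedy decoding.

        `draft` holds the k bytes proposed by the lightweight local decoder;
        `verified[i]` is the full model's greedy prediction at position i given
        the accepted prefix plus draft[:i].  Drafted bytes are kept up to the
        first disagreement, and that byte is replaced by the verified one, so
        the committed output is identical to standard autoregressive BLT."""
        accepted = []
        for d, v in zip(draft, verified):
            if d == v:
                accepted.append(d)
            else:
                accepted.append(v)  # first mismatch: keep the verified byte and stop
                break
        return accepted
    ```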

    Under greedy decoding, this procedure guarantees that verified outputs are identical to standard autoregressive BLT decoding — no quality loss. BLT-S increases decoder forward passes slightly but substantially reduces encoder and global model calls. At 3B parameters with k=16, BLT-S may achieve up to 77% memory-bandwidth reduction with no loss in task performance.

    BLT Diffusion+Verification (BLT-DV)

    BLT Diffusion+Verification sits in the middle. Because BLT-D is trained with both a diffusion objective and a standard next-byte prediction objective, the same model weights can run autoregressively using causal decoder masks — no separate model and no additional training needed. BLT-DV exploits this: diffusion first drafts a block of bytes, then a single autoregressive forward pass verifies the draft, accepting bytes up to the first mismatch. Empirically, one-step diffusion combined with verification yielded the fastest BLT-DV configuration. While one-step diffusion alone typically leads to rapid degradation in generation quality, the verification step effectively prevents this. At 3B parameters, BLT-DV may achieve up to 81% memory-bandwidth reduction compared to BLT.
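
    The mode switch hinges on the decoder's attention mask: the same weights decode bytes autoregressively under a causal mask and denoise blocks under a block mask. The block pattern sketched below (bidirectional within a block, causal across blocks) is a standard choice for block diffusion and is assumed here rather than taken from the paper.

    ```python
    import torch

    def causal_mask(n: int) -> torch.Tensor:
        """Autoregressive mode: position i attends only to positions <= i."""
        return torch.tril(torch.ones(n, n, dtype=torch.bool))

    def block_diffusion_mask(n: int, block_size: int) -> torch.Tensor:
        """Diffusion mode (assumed pattern): positions attend to everything in
        their own block and in all earlier blocks, but not to later blocks."""
        block_of = torch.arange(n) // block_size
        return block_of.unsqueeze(0) <= block_of.unsqueeze(1)  # mask[i, j] = block(j) <= block(i)
    ```

    In this sketch, BLT-DV's verification step is just an ordinary forward pass under the causal mask, reusing the weights that produced the diffusion draft.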

    Understanding the Numbers

    All models were trained on the BLT-1T dataset (1 trillion tokens from public sources including a subset of Datacomp-LM), with 1B-parameter models trained for 240,000 steps and 3B-parameter models for 480,000 steps. Evaluation covered four generation tasks: French-to-English and German-to-English translation using the FLORES-101 benchmark (4-shot, SentencePiece BLEU) and two coding benchmarks — HumanEval (0-shot, pass@1) and MBPP (3-shot, pass@1).

    Beyond generation tasks, the research team also evaluates BLT-D on five likelihood-based benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU. Since BLT-D is trained with a next-byte prediction objective alongside the diffusion objective, it can compute autoregressive likelihoods by applying a causal mask to the decoder — the same mechanism BLT-DV’s verification step relies on. The results show BLT-D variants achieve scores approaching BLT’s baseline on all five benchmarks, confirming that integrating block diffusion does not compromise the model’s autoregressive reasoning capability.

    Efficiency is reported via three proxy metrics: decoder network function evaluations (NFEs), encoder/global model NFEs, and an estimated memory-bandwidth figure in gigabytes derived from parameter counts and forward-pass counts under 16-bit precision. The research team is explicit that these are proxy metrics: converting NFE reductions into actual wall-clock improvements requires a highly optimized inference implementation, which they flag as the most important direction for future work.
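
    As a rough illustration of how such a proxy is computed, the sketch below prices each forward pass by the weights it must stream at 16-bit precision. The parameter split and call counts in the usage line are placeholders, not the paper's numbers.

    ```python
    def estimated_bandwidth_gb(decoder_params: float, enc_global_params: float,
                               decoder_nfes: int, enc_global_nfes: int) -> float:
        """Memory-bandwidth proxy: bytes of weights streamed from memory
        = (parameters touched per call) x (number of calls) x 2 bytes (16-bit)."""
        bytes_per_param = 2
        total_bytes = (decoder_params * decoder_nfes +
                       enc_global_params * enc_global_nfes) * bytes_per_param
        return total_bytes / 1e9

    # Placeholder example: a 0.4B-parameter local decoder called 512 times plus
    # a 2.6B-parameter encoder + global model called 128 times.
    print(estimated_bandwidth_gb(0.4e9, 2.6e9, 512, 128))  # ~1075 GB
    ```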

    Translation tasks benefit most from BLT-D across all block sizes. Coding tasks show more sensitivity to block size: BLT-D-16 offers the largest efficiency gains but shows meaningful score drops on HumanEval and MBPP. A notable additional finding comes from the generation diversity analysis: when using entropy-bounded sampling with top-p sampling at inference, more decoder NFEs correlate with higher type-token ratio (a measure of lexical diversity). This means the efficiency–diversity tradeoff is tunable at inference time without any retraining.
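
    Type-token ratio itself is a one-line metric; a minimal version, assuming whitespace tokenization (which may differ from the paper's exact setup), looks like this:

    ```python
    def type_token_ratio(text: str) -> float:
        """Lexical diversity: number of distinct tokens divided by total tokens.
        Whitespace tokenization is assumed here for simplicity."""
        tokens = text.split()
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    ```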

    Key Takeaways

    • BLT-D introduces block-wise discrete diffusion into BLT’s local decoder, training with a combined next-byte prediction and masked-byte prediction loss to generate multiple bytes per forward pass instead of one at a time
    • BLT-S uses BLT’s own lightweight decoder as a speculative drafter — no separate model, no architectural changes, no additional training — and produces output identical to standard BLT under greedy decoding
    • BLT-DV combines diffusion drafting with an autoregressive verification step using the same BLT-D model weights, recovering quality lost in diffusion-only decoding without extra training
    • All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks; BLT-D-16 may reach 87–92% reduction
    • BLT-D’s autoregressive capability remains robust on likelihood-based benchmarks (ARC-Easy, ARC-Challenge, PIQA, HellaSwag, MMLU), and its generation diversity is tunable at inference time via entropy-bounded sampling thresholds
