Scaling large language models (LLMs) is expensive. Every token processed during inference and every gradient computed during training flows through feedforward layers that account for over two-thirds of model parameters and more than 80% of total FLOPs in larger models. A team of researchers from Sakana AI and NVIDIA has released new research that directly targets this bottleneck: not by changing the architecture, but by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.
Sparsity Exists, But GPUs Ignore It
Inside a transformer’s feedforward block, for any given input token, only a small fraction of hidden neurons actually fire — the rest produce zero after passing through the activation function. This is called activation sparsity, and prior work has documented this phenomenon in models with ReLU activations.
The frustrating reality is that this theoretical savings rarely translates into actual speedups. NVIDIA GPUs are heavily optimized for dense matrix multiplications using Tensor Cores, which operate on large contiguous tiles of data. Traditional sparse formats like ELLPACK (ELL) require a separate kernel pass to convert activations from dense to sparse representation, and that conversion overhead often cancels out what’s saved by skipping the zeros.
Critically, prior work on sparse LLM kernels (including TurboSparse, ProSparse, and Q-Sparse) has focused on memory-bound GEMV operations — the single- or few-token inference regime. The research team instead targets compute-bound GEMM operations in the batched setting with thousands of input tokens, where dense baselines on modern devices can execute orders-of-magnitude higher FLOP/s with large tiles and Tensor Cores. That is a fundamentally harder problem, and the reason prior approaches didn’t generalize to batched training or high-throughput inference.
The headline numbers across model scales (dense → sparse):

| Model | Mean accuracy (dense → sparse) | Inference throughput | Energy / token | Training throughput | Peak training memory |
|---|---|---|---|---|---|
| 0.5B | 40.4% → 40.4% | +17.0% | −11.8% | −1.5% | −19.2% |
| 1B | 44.6% → 44.7% | +18.1% | −14.6% | +7.1% | −25.5% |
| 1.5B | 46.4% → 46.2% | +18.8% | −15.0% | +11.6% | −28.1% |
| 2B | 49.1% → 48.8% | +20.5% | −17.0% | +21.9% | +22.3%* |

*At 2B, the sparse run uses a larger micro-batch, which raises peak memory while improving training throughput.
So, What Exactly Is Proposed?
The research team addresses this mismatch with two primary contributions: a new sparse data format called TwELL (Tile-wise ELLPACK), and a set of custom CUDA kernels for inference and training built around it.
TwELL is designed around one key insight: modern matmul kernels already divide computation across small 2D tiles (of size T_m × T_n) assigned to individual cooperative thread arrays (CTAs). Standard ELL packs non-zeros row-by-row across the entire matrix, which requires global synchronization to construct from tiled matmul outputs. TwELL instead partitions the columns of the gate activation matrix into horizontal tiles of size T, and within each tile stores non-zero values and their indices in a local ELL-style layout. By matching the tile dimension T to the column tile size T_n of the matmul kernel, TwELL can be produced directly in the epilogue of the gate projection kernel — no extra kernel launch, no additional global memory read, no synchronization across CTAs. The format uses a compression factor C such that T/C exceeds the maximum non-zeros per tile, and packages values, indices, and non-zero counts into a single 32-bit matrix for locality.
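The tile-wise packing idea can be sketched in plain Python. This is an illustrative reconstruction from the description above, not the released code: the function name, dictionary layout, and the per-tile budget check are assumptions; the real format packs everything into a single 32-bit matrix inside a CUDA epilogue.

```python
# Illustrative sketch of TwELL-style tile-wise packing for one row of
# gate activations. T is the tile width, C the compression factor, so
# each tile may hold at most T // C non-zeros.

def pack_twell_row(acts, T, C):
    """Pack one row of gate activations into per-tile (count, indices, values)."""
    assert len(acts) % T == 0
    tiles = []
    for start in range(0, len(acts), T):
        tile = acts[start:start + T]
        # Collect non-zeros with their local (within-tile) column indices.
        nz = [(i, v) for i, v in enumerate(tile) if v != 0.0]
        assert len(nz) <= T // C, "tile exceeds compression budget"
        tiles.append({
            "count": len(nz),
            "indices": [i for i, _ in nz],
            "values": [v for _, v in nz],
        })
    return tiles

# Example: T = 8 columns per tile, C = 2 (at most 4 non-zeros per tile).
row = [0.0, 1.5, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
packed = pack_twell_row(row, T=8, C=2)
```

Because each tile is packed independently, a matmul kernel whose column tile size matches T can emit this layout in its epilogue without synchronizing with other CTAs.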

For inference, a single fused kernel takes the gate activations in TwELL format and performs the up and down projections together. Each CTA handles one row of inputs, iterating first statically over column tiles and then dynamically over each tile’s non-zero count. For each active neuron at index n, the CTA loads the n-th column of the up projection weight matrix W_u and the n-th row of the down projection weight matrix W_d, computes the dot product, and accumulates into the output. The intermediate hidden state h_u is never materialized in global memory, cutting DRAM traffic significantly.
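The per-neuron arithmetic the fused kernel performs can be mirrored in a scalar Python reference. This is a sketch of the logic only; names are assumptions, and the actual kernel operates tile-by-tile in CUDA with the gate activations already in TwELL form.

```python
def sparse_gated_ffn(x, gate_nz, W_u, W_d):
    """Reference for the fused sparse up + down projection.

    x       : input vector, length d_model
    gate_nz : list of (n, g_n) pairs -- active neurons and gate values
    W_u     : d_model x d_ff up projection (list of rows)
    W_d     : d_ff x d_model down projection (list of rows)
    Only active neurons touch W_u / W_d, and the intermediate hidden
    state is never materialized as a full length-d_ff vector.
    """
    d_model = len(x)
    y = [0.0] * d_model
    for n, g in gate_nz:
        # Dot product of x with the n-th column of W_u, scaled by the gate.
        h_n = sum(x[k] * W_u[k][n] for k in range(d_model)) * g
        # Accumulate h_n times the n-th row of W_d into the output.
        for j in range(d_model):
            y[j] += h_n * W_d[n][j]
    return y

# Toy example: d_model = 2, d_ff = 3, two of three neurons active.
W_u = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0]]
W_d = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]
y = sparse_gated_ffn([1.0, 2.0], [(0, 1.0), (2, 0.5)], W_u, W_d)
```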
For training, the situation is more complex because sparsity patterns are highly non-uniform across tokens and layers — the maximum non-zeros per row can be orders of magnitude above the average, making a pure ELL layout brittle. The research team introduces a hybrid sparse format that dynamically routes rows either into a compact ELL matrix (for rows below a non-zero threshold) or into a dense backup matrix (for overflow rows). This allows efficient sparse gradient computation in the backward pass without requiring dense-to-dense matmuls for most rows. The team also releases kernels for the original non-gated transformer feedforward block; at the recommended sparsity level, the non-gated variant achieves an 11.2% inference speedup compared to 17.9% for the gated design.
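The row-routing step of the hybrid format might look like this in outline. The helper is hypothetical and assumes a fixed per-row non-zero threshold; the real implementation does this on-device during the backward pass.

```python
def route_rows(rows, max_nnz):
    """Route each activation row to a compact ELL slab (if it has at most
    max_nnz non-zeros) or to a dense backup matrix (overflow rows)."""
    ell, dense, ell_ids, dense_ids = [], [], [], []
    for r, row in enumerate(rows):
        nz = [(i, v) for i, v in enumerate(row) if v != 0.0]
        if len(nz) <= max_nnz:
            # Pad to fixed width so the ELL slab stays rectangular.
            nz += [(0, 0.0)] * (max_nnz - len(nz))
            ell.append(nz)
            ell_ids.append(r)
        else:
            dense.append(row)
            dense_ids.append(r)
    return ell, ell_ids, dense, dense_ids

# Row 1 has 4 non-zeros and overflows the threshold of 2.
ell, ell_ids, dense, dense_ids = route_rows(
    [[0.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 3.0, 4.0],
     [0.0, 0.0, 5.0, 0.0]],
    max_nnz=2)
```

The point of the split is that the common case (very sparse rows) stays compact and regular, while rare dense rows fall back to an ordinary matmul instead of blowing up the ELL width for everyone.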
Just ReLU and L1 Regularization
The sparsity induction strategy is deliberately minimal. The research team uses ReLU as the gate activation function and adds a simple L1 loss term on the hidden feedforward activations, controlled by a coefficient L1. No other architectural changes are required, and the team reports that adding L1 regularization did not affect other hyperparameters (learning rate, weight decay, optimizer settings).
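In pseudocode terms, the added term is just a scaled L1 norm of the post-ReLU hidden activations, summed into the cross-entropy loss. This is a minimal sketch; the exact reduction over tokens is an assumption, not spelled out in the article.

```python
def ffn_l1_penalty(hidden_acts, l1_coef=2e-5):
    """L1 term on post-ReLU feedforward activations.

    hidden_acts : list of per-token activation vectors (non-negative
                  after ReLU, so the L1 norm is just their sum).
    Averaging over tokens is an assumption about the reduction.
    """
    total = sum(sum(row) for row in hidden_acts)
    return l1_coef * total / len(hidden_acts)

# Two tokens, three hidden neurons: |h| sums to 6, averaged over 2 tokens.
penalty = ffn_l1_penalty([[1.0, 0.0, 3.0],
                          [0.0, 2.0, 0.0]])
```

In training this scalar would simply be added to the standard cross-entropy loss before the backward pass.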
Models were trained on the FineWeb dataset (a deduplicated FineWeb-Edu split) at Chinchilla-optimal token counts, from approximately 10B tokens for the 0.5B model up to 40B tokens for the 2B model, with a context length of 2048 and a batch size of 1M tokens.
Testing eight L1 coefficient values on a 1.5B-parameter model, they find that up to L1 = 3 × 10⁻⁵ there is essentially no drop in mean task accuracy across seven downstream benchmarks (ARC Easy/Challenge, HellaSwag, OpenBookQA, PIQA, WinoGrande, CommonsenseQA), with final cross-entropy increasing by less than 2% relative to the unregularized baseline. The recommended setting L1 = 2 × 10⁻⁵ reduces average non-zero activations from 911 per layer (in the unregularized 1.5B model with a feedforward hidden dimension of 5632) to just 29, roughly 99.5% sparsity, with no measurable downstream performance loss.
One important caveat: at L1 = 2 × 10⁻⁵, over 30% of neurons become permanently inactive (dead neurons) on average across layers. The research team explores two mitigation strategies, scheduling the L1 warmup and applying targeted reinitialization to dead gate-projection columns, and finds that the reinitialization approach maintains similar sparsity levels while slightly improving both downstream accuracy and efficiency (+19.1% inference speedup vs. +17.9% baseline). This is listed as a direction for future work.
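Dead neurons can be counted by checking which gate outputs never fire over a sample of tokens. This is an illustrative check, not the authors' diagnostic code.

```python
def dead_neurons(gate_acts_batches):
    """Return indices of neurons whose ReLU gate output is zero for
    every sampled token -- the 'dead neurons' counted above."""
    d_ff = len(gate_acts_batches[0])
    alive = [False] * d_ff
    for acts in gate_acts_batches:
        for n, v in enumerate(acts):
            if v > 0.0:
                alive[n] = True
    return [n for n, a in enumerate(alive) if not a]

# Three sampled tokens, three neurons: neuron 2 never fires.
dead = dead_neurons([[0.0, 1.0, 0.0],
                     [0.0, 0.5, 0.0],
                     [0.2, 0.0, 0.0]])
```

A targeted mitigation in the spirit of the paper would then reinitialize only the gate-projection columns at these indices.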
Measured Efficiency Gains
The efficiency results are reported on a single node of eight H100 PCIe GPUs, with a fixed sequence length of 2048 tokens. For the cross-scale comparison, the L1 coefficient is fixed at 2 × 10⁻⁵.
At smaller scales, sparsity delivers clear peak memory reductions during training:
| Model | Dense Peak Memory | Sparse Peak Memory | Change |
|---|---|---|---|
| 0.5B | 26.2 GB | 21.2 GB | −19.2% |
| 1B | 44.5 GB | 33.1 GB | −25.5% |
| 1.5B | 62.8 GB | 45.1 GB | −28.1% |
At 2B parameters, the sparse model uses a larger micro-batch (enabled by reduced activation memory at that scale), which results in higher peak GPU memory (46.7 → 57.1 GB) but faster training throughput (+21.9%). The efficiency gains on all metrics for the 2B model:
- Forward execution throughput: 87.8 → 106 input tokens/ms (+20.5%)
- Energy per token: 7.85 → 6.51 mJ (−17.0%)
- Training step throughput: 22.4 → 27.3 input tokens/ms (+21.9%)
Across the full 0.5B–2B range, mean task accuracy of sparse and non-sparse models remains statistically indistinguishable. Efficiency benefits grow with model scale: larger models naturally develop lower average non-zero counts (dropping from 39 at 0.5B to 24 at 2B), which means the sparse kernels skip a proportionally greater share of computation.
Training speedups are also observed on NVIDIA’s RTX PRO 6000 GPU, where the larger Streaming Multiprocessor count (188 vs. 114 on H100) allows sparse operations to run faster — suggesting these gains extend to less specialized hardware.
What the Sparsity Patterns Reveal
Sparsity is not uniform: the first two layers of a 28-layer 1.5B model are the least active, followed by a pronounced peak in non-zero activations across early-middle layers — consistent with prior work suggesting this is where much of LLM reasoning and knowledge retrieval occurs. Separately, the first tokens in an input sequence activate far more neurons than later tokens, with an exponential decrease thereafter. The research team observed an inverse Pearson correlation of −0.996 between each layer’s average non-zero count and its inference speedup contribution, confirming that the sparsest layers provide the greatest per-layer gains.
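The reported −0.996 is a standard sample Pearson coefficient computed over per-layer statistics. For reference, a generic implementation on toy data (not the authors' analysis code):

```python
def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Perfectly anti-correlated toy data, as in the layer-wise finding:
# higher average non-zero count, lower speedup contribution.
r = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```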
Check out the Paper, Repo and Technical details.
The post Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs appeared first on MarkTechPost.