Model & Shard Sizes
DeepSeek-V2-Lite has 64 routed experts per MoE layer. Partial training keeps K=8 of them trainable and holds them on GPU together with the 2 shared experts and the backbone. That puts 3.11 B parameters on GPU for training, of which 1.80 B are updated.
| Metric | Value | How it’s computed |
|---|---|---|
| Full model parameters | 15.71 B | all 64 routed + 2 shared experts + backbone |
| Total held on GPU for training (partial model) | 3.11 B | backbone + 2 shared + K=8 trainable experts (= full 15.71 B − the 56 parked routed experts) |
| · Backbone (attention + embeds + LM head + norms + gates + dense FFN) | 861.9 M | everything outside the routed MoE experts |
| · Shared-expert weights (held, not trained) | 449.8 M | n_shared=2 × 26 MoE layers × per-expert params |
| · Per-miner trainable shard | 1.80 B | K=8 of the 64 routed experts (per layer) × 26 MoE layers × per-expert params |
| Per-expert parameters | 8.7 M | gate + up + down (3 × hidden × moe_intermediate) |
| MoE layers | 26 | num_hidden_layers (27) − first_k_dense_replace (1) |
| Routed experts / layer (full model) | 64 | config.n_routed_experts — all present in the full model; partial training keeps K=8 of these on GPU and parks the other 56 on CPU |
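
The shard sizes in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the project's own accounting code: `hidden_size = 2048` and `moe_intermediate_size = 1408` are assumed from the published DeepSeek-V2-Lite config (the table only names the factors), the backbone count is copied from the table rather than derived, and the function names are hypothetical.

```python
# Assumed DeepSeek-V2-Lite config values; hidden and intermediate sizes are
# NOT stated in the table above and are taken from the published config.
HIDDEN_SIZE = 2048
MOE_INTERMEDIATE_SIZE = 1408
NUM_HIDDEN_LAYERS = 27
FIRST_K_DENSE_REPLACE = 1
N_ROUTED_EXPERTS = 64
N_SHARED_EXPERTS = 2
BACKBONE_PARAMS = 861.9e6  # copied from the table, not derived here


def per_expert_params(hidden=HIDDEN_SIZE, inter=MOE_INTERMEDIATE_SIZE):
    """gate + up + down projections, each hidden x moe_intermediate."""
    return 3 * hidden * inter


def moe_layers():
    """Layers that contain routed experts."""
    return NUM_HIDDEN_LAYERS - FIRST_K_DENSE_REPLACE


def routed_params(n_experts_per_layer):
    """Parameters in n routed experts per layer, across all MoE layers."""
    return n_experts_per_layer * moe_layers() * per_expert_params()


def resident_params(n_resident_routed):
    """Backbone + shared experts + the routed experts kept on GPU."""
    shared = N_SHARED_EXPERTS * moe_layers() * per_expert_params()
    return BACKBONE_PARAMS + shared + routed_params(n_resident_routed)


if __name__ == "__main__":
    print(f"per expert:      {per_expert_params() / 1e6:.2f} M")
    print(f"trainable shard: {routed_params(8) / 1e9:.2f} B")
    print(f"partial model:   {resident_params(8) / 1e9:.2f} B")
    print(f"full model:      {resident_params(N_ROUTED_EXPERTS) / 1e9:.2f} B")
```

Under these assumptions the totals agree with the table to the precision shown (1.80 B, 3.11 B and 15.71 B).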
Network Throughput (tokens/sec)
| Metric | Value | How it’s computed |
|---|---|---|
| Tokens per second (estimated) | 22,500 | A6000 tokens/sec for 2B-param training (7,500) × 3 |
| Hours per round | 1.747 | cycle_length_blocks (524) × 12 / 3600 |
| Tokens per round (assumed) | 141,480,000 | tokens_per_second × hours_per_round × 3600 |
| Rounds counted | 209.1 | window_hours / hours_per_round |
These figures are estimates: both the per-GPU A6000 rate (7,500 tokens/sec) and the ×3 multiplier for merged miner outputs are assumed constants.
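
The throughput rows follow from the two assumed constants and the cycle length. The sketch below restates that arithmetic; the constant and function names are illustrative, and `window_hours` is left as a parameter because its value is not given in the table.

```python
A6000_TOKENS_PER_SEC = 7_500  # assumed per-GPU rate for 2B-param training
MERGE_MULTIPLIER = 3          # assumed multiplier for merged miner outputs
CYCLE_LENGTH_BLOCKS = 524
SECONDS_PER_BLOCK = 12


def tokens_per_second():
    return A6000_TOKENS_PER_SEC * MERGE_MULTIPLIER


def seconds_per_round():
    return CYCLE_LENGTH_BLOCKS * SECONDS_PER_BLOCK


def hours_per_round():
    return seconds_per_round() / 3600


def tokens_per_round():
    # Uses exact seconds so the result is not affected by rounding the hours.
    return tokens_per_second() * seconds_per_round()


def rounds_counted(window_hours):
    """Number of rounds that fit in a reporting window of the given length."""
    return window_hours / hours_per_round()


if __name__ == "__main__":
    print(tokens_per_second())
    print(round(hours_per_round(), 3))
    print(tokens_per_round())
```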
Active Miners
On average, 225.2 distinct miners were active per cycle across 48 cycles, each running at least an A6000 GPU.
Active miners per cycle
Training Loss
One-line takeaway: our in-place ESFT reaches the same GSM8K accuracy as the C · ESFT-classic baseline (≈39.5%) while training ~4× faster; the masked-routing shard recovers performance once merged back into the full model.
About the variants
All three cells fine-tune K=8 routed experts of DeepSeek-V2-Lite on the same metamath + alpaca SFT mix, starting from the same base model and the same expert assignment. The only thing that changes is how the forward pass is set up during training — making the comparison a clean isolation of training-time routing strategy.
| Cell | Training-time forward pass | Memory on GPU | What it tests |
|---|---|---|---|
| A · in-place ESFT | Gate masked to the K=8 trainable experts only; the other 56 routed experts parked on CPU. | ≈3.11 B · ~5× smaller | The masked-routing baseline — quantifies the train↔eval gap. Lightest memory and fastest. |
| B · frozen-kept (N=16) | Gate masked to K=8 trainable + N=16 resident-but-frozen experts (the K+N routable set). | ≈6.7 B · ~2.3× smaller | Whether widening the routable set (without more trainable experts) lets the masked forward overlap natural routing. |
| C · ESFT-classic | Unmodified natural top-k routing over all 64 experts; no masking, every expert resident. | ≈15.7 B · full model | Whether matching the train and eval forwards makes the K=8 gains transfer. Highest memory and slowest. |
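
The difference between the three forward passes comes down to which experts the gate is allowed to pick. The following is a simplified sketch of gate masking for a single token, not the actual training code: `masked_top_k` is a hypothetical helper, `top_k` is left as a parameter, and the weighting of the selected experts' outputs is omitted.

```python
import math


def masked_top_k(gate_logits, routable, top_k):
    """Pick the top_k experts by gate logit, restricted to `routable`.

    gate_logits: one logit per routed expert for a single token.
    routable:    set of expert indices the gate may choose from
                 (K trainable experts for cell A, K+N for cell B,
                 every expert for cell C).
    """
    if top_k > len(routable):
        raise ValueError("top_k exceeds the number of routable experts")
    # Experts outside the routable set get -inf so they can never be selected.
    masked = [
        logit if i in routable else -math.inf
        for i, logit in enumerate(gate_logits)
    ]
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return order[:top_k]


if __name__ == "__main__":
    logits = [0.1, 2.0, 0.5, 1.5, 3.0, -1.0, 0.7, 0.2]  # toy 8-expert layer
    print(masked_top_k(logits, {0, 2, 3}, top_k=2))      # masked routing
    print(masked_top_k(logits, set(range(8)), top_k=2))  # natural routing
```

In the toy example the masked call can only return experts from `{0, 2, 3}`, while the unmasked call selects the globally highest logits. That mismatch is the train↔eval gap that cells A and B are meant to quantify.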
GPU memory footprint — what actually fits
All three partial-MoE cells train the same 1.8 B parameters and fit on a single A6000 (48 GB). A full fine-tune of the same model trains 15.7 B parameters and peaks near 126 GB, roughly 6× cell A’s ~21 GB peak and about 3× cell C’s ~44 GB, so it requires multiple datacenter GPUs.
| Setup | Trainable params | GPU peak | Fits on… |
|---|---|---|---|
| A · in-place ESFT | 1.8 B | ~21 GB | Single A6000 (48 GB) |
| B · frozen-kept (N=16) | 1.8 B | ~28–32 GB | Single A6000 |
| C · ESFT-classic | 1.8 B | ~44 GB | Single A6000 (just barely) |
| Full fine-tune | 15.7 B | ~126 GB | Multi-GPU FSDP / ZeRO-3, or 2+ × A100 80 GB |
Training loss
Each step is one optimizer update; lower is better.
A higher training loss for A and B is expected: each of these cells runs its training forward pass over only a slice of the model (A parks 56 of the 64 routed experts, about 80% of the model’s parameters, on CPU), so its loss curve naturally sits above a full-model run. That number is not the goal. What matters is eval performance: once the trained shard is merged back into the full model, does it recover or beat the base model’s performance on the held-out evals below?
- A · in-place ESFT
- B · frozen-kept (N=16)
- C · ESFT-classic
Evaluation — GSM8K-CoT exact-match (strict)
Per-cell GSM8K-CoT accuracy at each eval. Higher = better; step 0 is the base model.
- A · in-place ESFT
- B · frozen-kept (N=16)
- C · ESFT-classic
B · frozen-kept edges out A here: the 16 extra resident experts give the masked forward more of the natural routing to learn against, so its gains transfer best (42.5% GSM8K strict). Encouragingly, A · in-place ESFT lands on par with the C · ESFT-classic baseline (both ≈39.5%) while holding ~5× fewer parameters on GPU (≈3.11 B vs ≈15.7 B, or ~21 GB vs ~44 GB peak), so the cheapest variant gives up almost nothing on this eval.
Training cost — total wallclock
A is the efficiency winner. It finishes in 3h 07m, about 4× faster than C and in roughly half the time of B, because parking 56 of the 64 routed experts (about 80% of the model’s parameters) on CPU means far less compute per step. Yet its eval accuracy lands on par with the C baseline (≈39.5%), so it buys that 4× speedup and a ~5× smaller resident model at almost no quality cost.
Retention across general benchmarks (no-forgetting check)
Fine-tuning on a narrow math mix risks catastrophic forgetting — the model gets better at the target task but quietly loses general ability it had before. This check re-runs each trained cell on three broad benchmarks (MMLU, ARC-Challenge, HellaSwag) and compares against the base model: scores should hold roughly flat. Small deltas (within ±0.5 pp) mean the math gains came with no meaningful loss of general capability.
| Benchmark | Base (step 0) | A · in-place ESFT (step 500) | B · frozen-kept (N=16) (step 500) | C · ESFT-classic (step 500) |
|---|---|---|---|---|
| MMLU acc | 58.5% | 58.6% (+0.1 pp) | 58.3% (-0.2 pp) | 58.0% (-0.5 pp) |
| ARC-Challenge acc | 55.1% | 54.6% (-0.5 pp) | 54.6% (-0.5 pp) | 54.6% (-0.5 pp) |
| HellaSwag acc | 52.5% | 52.6% (+0.1 pp) | 52.4% (-0.1 pp) | 52.2% (-0.3 pp) |
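
The no-forgetting check amounts to subtracting the base score from each cell's step-500 score and comparing the result with the ±0.5 pp band. A small sketch of that check, using the numbers from the table (the dictionary layout and function names are illustrative, not the evaluation harness's actual interface):

```python
BASE = {"MMLU": 58.5, "ARC-Challenge": 55.1, "HellaSwag": 52.5}

STEP_500 = {
    "A · in-place ESFT": {"MMLU": 58.6, "ARC-Challenge": 54.6, "HellaSwag": 52.6},
    "B · frozen-kept (N=16)": {"MMLU": 58.3, "ARC-Challenge": 54.6, "HellaSwag": 52.4},
    "C · ESFT-classic": {"MMLU": 58.0, "ARC-Challenge": 54.6, "HellaSwag": 52.2},
}


def retention_deltas(base, scores):
    """Per-benchmark change in percentage points, rounded to one decimal."""
    return {name: round(scores[name] - base[name], 1) for name in base}


def within_tolerance(deltas, tol_pp=0.5):
    """True if every benchmark moved by at most tol_pp percentage points."""
    return all(abs(delta) <= tol_pp for delta in deltas.values())


if __name__ == "__main__":
    for cell, scores in STEP_500.items():
        deltas = retention_deltas(BASE, scores)
        print(cell, deltas, "OK" if within_tolerance(deltas) else "REGRESSED")
```

With the table's values, every cell stays inside the ±0.5 pp band on all three benchmarks.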