Subnet Status Review

§ 01 · Metric

Model & Shard Sizes

Full architecture

15.71B params

64 routed experts / layer

→−80%

Held on GPU for training

3.11B params

backbone + 2 shared + 8 trainable experts × 26 layers

DeepSeek-V2-Lite has 64 routed experts per MoE layer. Partial training keeps K=8 of them trainable in each layer and holds them on GPU together with the 2 shared experts and the backbone: 3.11 B parameters are held for training, of which 1.80 B are updated.

	Value	How it’s computed
Full model parameters	15.71 B	all 64 routed + 2 shared experts + backbone
Total held on GPU for training (partial model)	3.11 B	backbone + 2 shared + K=8 trainable experts (= full 15.71 B − the 56 parked routed experts)
· Backbone (attention + embeds + LM head + norms + gates + dense FFN)	861.9 M	everything outside the routed MoE experts
· Shared-expert weights (held, not trained)	449.8 M	n_shared=2 × 26 MoE layers × per-expert params
· Per-miner trainable shard	1.80 B	K=8 of the 64 routed experts (per layer) × 26 MoE layers × per-expert params
Per-expert parameters	8.7 M	gate + up + down (3 × hidden × moe_intermediate)
MoE layers	26	num_hidden_layers (27) − first_k_dense_replace (1)
Routed experts / layer (full model)	64	config.n_routed_experts — all present in the full model; partial training keeps K=8 of these on GPU and parks the other 56 on CPU
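The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, not the subnet's own tooling: the values hidden_size=2048 and moe_intermediate_size=1408 are assumptions about the DeepSeek-V2-Lite config (the table does not state them), and the backbone size is copied from the table rather than derived.

```python
# Parameter accounting for partial MoE training (illustrative sketch).
HIDDEN_SIZE = 2048             # assumption: not stated in the table
MOE_INTERMEDIATE_SIZE = 1408   # assumption: not stated in the table
NUM_HIDDEN_LAYERS = 27
FIRST_K_DENSE_REPLACE = 1
N_ROUTED_EXPERTS = 64
N_SHARED_EXPERTS = 2
K_TRAINABLE = 8
BACKBONE_PARAMS = 861.9e6      # copied from the table, not derived here

moe_layers = NUM_HIDDEN_LAYERS - FIRST_K_DENSE_REPLACE
# Each expert has gate, up, and down projections of shape hidden x intermediate.
per_expert = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE

shared = N_SHARED_EXPERTS * moe_layers * per_expert
trainable = K_TRAINABLE * moe_layers * per_expert
routed_all = N_ROUTED_EXPERTS * moe_layers * per_expert

held = BACKBONE_PARAMS + shared + trainable    # resident on GPU
full = BACKBONE_PARAMS + shared + routed_all   # full architecture
reduction = 1 - held / full

print(f"per expert : {per_expert / 1e6:.1f} M")
print(f"shared     : {shared / 1e6:.1f} M")
print(f"trainable  : {trainable / 1e9:.2f} B")
print(f"held       : {held / 1e9:.2f} B")
print(f"full       : {full / 1e9:.2f} B")
print(f"reduction  : {reduction:.0%}")
```

Under these assumptions the rounded results agree with the table rows (8.7 M, 449.8 M, 1.80 B, 3.11 B, 15.71 B) and with the −80% figure on the card.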

§ 02 · Metric

Network Throughput (tokens/sec)

Network throughput · across the window

22,500 tokens / sec

A6000 2B-param rate (7,500) × 3 · estimated

	Value	How it’s computed
Tokens per second (estimated)	22,500	A6000 tokens/sec for 2B-param training (7,500) × 3
Hours per round	1.747	cycle_length_blocks (524) × 12 / 3600
Tokens per round (assumed)	141,480,000	tokens_per_second × hours_per_round × 3600
Rounds counted	209.1	window_hours / hours_per_round

Estimated — both the per-GPU A6000 rate (7,500 tokens/sec) and the ×3 multiplier for merged miner outputs are assumed constants.
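The throughput figures follow from simple arithmetic, sketched here for checking. The variable names are illustrative, and the 12-second block time is an assumption read off the "× 12" factor in the table.

```python
# Throughput arithmetic from the table (illustrative sketch).
A6000_TOKENS_PER_SEC = 7_500   # assumed constant, per the report
MERGE_MULTIPLIER = 3           # assumed constant, per the report
CYCLE_LENGTH_BLOCKS = 524
SECONDS_PER_BLOCK = 12         # assumption: implied by the "x 12" factor

tokens_per_second = A6000_TOKENS_PER_SEC * MERGE_MULTIPLIER
seconds_per_round = CYCLE_LENGTH_BLOCKS * SECONDS_PER_BLOCK
hours_per_round = seconds_per_round / 3600
tokens_per_round = tokens_per_second * seconds_per_round

print(f"tokens/sec       : {tokens_per_second:,}")
print(f"hours per round  : {hours_per_round:.3f}")
print(f"tokens per round : {tokens_per_round:,}")
```

Because both inputs to tokens_per_second are assumed constants, every downstream number inherits that uncertainty.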

§ 03 · Metric

Active Miners

On average, 225.2 distinct miners were active per cycle across 48 cycles, each running at least an A6000 GPU.

Active miners per cycle

§ 04 · Metric

Training Loss

One-line takeaway: Our in-place ESFT reaches roughly the same GSM8K accuracy as the C · ESFT-classic baseline (both ≈39.5%) while training ~4× faster. The masked-routing shard recovers performance once merged back into the full model.

About the variants

All three cells fine-tune K=8 routed experts of DeepSeek-V2-Lite on the same metamath + alpaca SFT mix, starting from the same base model and the same expert assignment. The only thing that changes is how the forward pass is set up during training — making the comparison a clean isolation of training-time routing strategy.

Cell	Training-time forward pass	Memory on GPU	What it tests
A · in-place ESFT	Gate masked to the K=8 trainable experts only; the other 56 routed experts parked on CPU.	≈3.11 B · ~5× smaller	The masked-routing baseline — quantifies the train↔eval gap. Lightest memory and fastest.
B · frozen-kept (N=16)	Gate masked to K=8 trainable + N=16 resident-but-frozen experts (the K+N routable set).	≈6.7 B · ~2.3× smaller	Whether widening the routable set (without more trainable experts) lets the masked forward overlap natural routing.
C · ESFT-classic	Unmodified natural top-k routing over all 64 experts; no masking, every expert resident.	≈15.7 B · full model	Whether matching the train and eval forwards makes the K=8 gains transfer. Highest memory and slowest.
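The three cells differ only in which experts the gate may select during training. The sketch below illustrates that idea with a pure-Python masked top-k. It is a simplified stand-in, not the actual training code: the function name, the synthetic logits, and top-k = 6 are assumptions, and renormalizing over the chosen experts may differ from how the real gate weights its outputs.

```python
import math

def masked_top_k(gate_logits, routable, k):
    """Pick the top-k experts among `routable` and softmax their logits.

    Hypothetical helper: experts outside `routable` are given -inf so
    they can never be selected, which is the masking idea in cells A and B.
    Assumes `routable` is non-empty.
    """
    masked = [logit if i in routable else -math.inf
              for i, logit in enumerate(gate_logits)]
    ranked = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    chosen = [i for i in ranked[:k] if masked[i] > -math.inf]
    peak = max(masked[i] for i in chosen)           # for numerical stability
    exps = [math.exp(masked[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Synthetic gate logits for one token over 64 routed experts.
logits = [0.1 * i for i in range(64)]
TOP_K = 6  # assumption: experts activated per token

cell_a = set(range(8))    # A: K=8 trainable experts only
cell_b = set(range(24))   # B: K=8 trainable + N=16 frozen-kept
cell_c = set(range(64))   # C: natural routing over all 64

picked_a = [i for i, _ in masked_top_k(logits, cell_a, TOP_K)]
picked_b = [i for i, _ in masked_top_k(logits, cell_b, TOP_K)]
picked_c = [i for i, _ in masked_top_k(logits, cell_c, TOP_K)]
print(picked_a, picked_b, picked_c)
```

With these synthetic logits the natural choice (cell C) never overlaps the experts available to cell A, which is an exaggerated picture of the train↔eval gap the table describes.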

GPU memory footprint — what actually fits

All three partial-MoE cells train the same 1.8 B parameters and fit on a single A6000 (48 GB). A full fine-tune of the same model trains 15.7 B parameters and peaks near 126 GB, roughly 6× the ~21 GB peak of in-place ESFT, and so requires multiple datacenter GPUs.

Full fine-tune · GPU peak

126 GB

15.7 B trainable · needs multi-GPU FSDP / ZeRO-3

→−83%

In-place ESFT · GPU peak

21 GB

1.8 B trainable · fits one A6000 (48 GB)

Setup	Trainable params	GPU peak	Fits on…
A · in-place ESFT	1.8 B	~21 GB	Single A6000 (48 GB)
B · frozen-kept (N=16)	1.8 B	~28–32 GB	Single A6000
C · ESFT-classic	1.8 B	~44 GB	Single A6000 (just barely)
Full fine-tune	15.7 B	~126 GB	Multi-GPU FSDP / ZeRO-3, or 2+ × A100 80 GB
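As a quick check on the figures above, the following sketch recomputes the headline saving and tests each setup against a 48 GB card. The peak values are copied from the table (using the upper end of the 28–32 GB range for B); the dictionary and variable names are illustrative.

```python
# GPU peak memory per setup, in GB, copied from the table.
A6000_GB = 48
PEAK_GB = {
    "A · in-place ESFT": 21,
    "B · frozen-kept (N=16)": 32,   # upper end of the ~28-32 GB range
    "C · ESFT-classic": 44,
    "Full fine-tune": 126,
}

# Which setups fit on a single A6000?
fits = {name: peak <= A6000_GB for name, peak in PEAK_GB.items()}

# Headline comparison: full fine-tune versus in-place ESFT.
ratio = PEAK_GB["Full fine-tune"] / PEAK_GB["A · in-place ESFT"]
saving_pct = (1 - PEAK_GB["A · in-place ESFT"] / PEAK_GB["Full fine-tune"]) * 100

for name, ok in fits.items():
    print(f"{name:24s} {PEAK_GB[name]:>4} GB  fits one A6000: {ok}")
print(f"full / in-place ratio: {ratio:.1f}x, saving: {saving_pct:.0f}%")
```

Note that the 6× and −83% figures compare the full fine-tune with cell A specifically; against cell C (~44 GB) the gap is closer to 3×.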

Training loss

Each step is one optimizer update; lower is better.

A higher training loss for A and B is expected here: each trains through only a slice of the model (A parks 56 of the 64 routed experts per layer on CPU), so its loss curve naturally sits above a full-model run. The training loss is not the goal. What matters is held-out evaluation: once the trained shard is merged back into the full model, does it recover or beat the base model’s performance on the evals below?

Series plotted: A · in-place ESFT, B · frozen-kept (N=16), C · ESFT-classic

Evaluation — GSM8K-CoT exact-match (strict)

Per-cell GSM8K-CoT accuracy at each eval. Higher = better; step 0 is the base model.

Series plotted: A · in-place ESFT, B · frozen-kept (N=16), C · ESFT-classic

B · frozen-kept edges out A here: the 16 extra resident experts let the masked forward pass overlap more of the natural routing, so its gains transfer best (42.5% GSM8K strict). Encouragingly, A · in-place ESFT lands on par with the C · ESFT-classic baseline (both ≈39.5%) while holding ~5× fewer parameters on GPU (≈3.11 B vs ≈15.7 B), so the cheapest variant gives up almost nothing on this eval.

Training cost — total wallclock

A is the efficiency winner. It finishes in 3h 07m, about 4× faster than C and in roughly half the time of B, because parking 56 of the 64 routed experts per layer on CPU means far less compute per step. Its eval accuracy still lands on par with the C baseline (≈39.5%), so it buys that 4× speedup and a ~5× smaller resident model at almost no quality cost.

A · in-place ESFT

3h 07m

≈ 4× faster than C

B · frozen-kept (N=16)

5h 57m

16 kept experts resident

C · ESFT-classic

12h 59m

full routing · every expert on GPU
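The speedup claims can be checked directly from the three wallclock figures. This is a small sketch with a hypothetical to_minutes helper that parses the "Xh YYm" format used on the cards.

```python
def to_minutes(text):
    """Convert a duration like '3h 07m' to whole minutes (hypothetical helper)."""
    hours, minutes = text.split()
    return int(hours.rstrip("h")) * 60 + int(minutes.rstrip("m"))

# Total wallclock per cell, copied from the cards above.
wallclock = {
    "A": to_minutes("3h 07m"),
    "B": to_minutes("5h 57m"),
    "C": to_minutes("12h 59m"),
}

speedup_vs_c = wallclock["C"] / wallclock["A"]   # how much faster A is than C
fraction_of_b = wallclock["A"] / wallclock["B"]  # A's time as a share of B's

print(wallclock)
print(f"A vs C: {speedup_vs_c:.1f}x faster")
print(f"A takes {fraction_of_b:.0%} of B's time")
```

The ratios come out slightly above 4× against C and slightly above half of B's time, consistent with the rounded claims in the text.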

Retention across general benchmarks (no-forgetting check)

Fine-tuning on a narrow math mix risks catastrophic forgetting — the model gets better at the target task but quietly loses general ability it had before. This check re-runs each trained cell on three broad benchmarks (MMLU, ARC-Challenge, HellaSwag) and compares against the base model: scores should hold roughly flat. Small deltas (within ±0.5 pp) mean the math gains came with no meaningful loss of general capability.

Benchmark	Base (step 0)	A · in-place ESFT (step 500)	B · frozen-kept (N=16) (step 500)	C · ESFT-classic (step 500)
MMLU acc	58.5%	58.6% (+0.1 pp)	58.3% (-0.2 pp)	58.0% (-0.5 pp)
ARC-Challenge acc	55.1%	54.6% (-0.5 pp)	54.6% (-0.5 pp)	54.6% (-0.5 pp)
HellaSwag acc	52.5%	52.6% (+0.1 pp)	52.4% (-0.1 pp)	52.2% (-0.3 pp)
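The retention check amounts to comparing each cell's score with the base model and testing the difference against the ±0.5 pp tolerance. A minimal sketch using the values from the table follows; the data layout and names are illustrative.

```python
TOLERANCE_PP = 0.5  # tolerance stated in the text, in percentage points

# benchmark -> (base score, {cell: score at step 500}), copied from the table
SCORES = {
    "MMLU":          (58.5, {"A": 58.6, "B": 58.3, "C": 58.0}),
    "ARC-Challenge": (55.1, {"A": 54.6, "B": 54.6, "C": 54.6}),
    "HellaSwag":     (52.5, {"A": 52.6, "B": 52.4, "C": 52.2}),
}

# Round to one decimal so float noise does not push a -0.5 delta out of range.
deltas = {
    bench: {cell: round(score - base, 1) for cell, score in cells.items()}
    for bench, (base, cells) in SCORES.items()
}

within_tolerance = all(
    abs(delta) <= TOLERANCE_PP
    for per_cell in deltas.values()
    for delta in per_cell.values()
)

for bench, per_cell in deltas.items():
    print(bench, {cell: f"{d:+.1f} pp" for cell, d in per_cell.items()})
print("all within tolerance:", within_tolerance)
```

Every delta in the table sits inside the tolerance, though ARC-Challenge lands exactly on the −0.5 pp boundary for all three cells.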