Connito Blog § Network Telemetry · Status Review

Subnet Status Review

Where Subnet 102 stands this cycle — model and shard sizes, network throughput, active miners, and a head-to-head on in-place ESFT versus classic fine-tuning. Built from first principles, with every number traced to how it’s computed.

§ 01 · Metric

Model & Shard Sizes

Full architecture: 15.71 B params (64 routed experts / layer)

Held on GPU for training: 3.11 B params (backbone + 2 shared + 8 trainable experts × 26 layers)

DeepSeek-V2-Lite has 64 routed experts per MoE layer. Partial training keeps K=8 of them trainable in each layer and holds those, the 2 shared experts, and the backbone on GPU: 3.11 B parameters resident for training, of which 1.80 B are updated.

| Metric | Value | How it’s computed |
|---|---|---|
| Full model parameters | 15.71 B | all 64 routed + 2 shared experts + backbone |
| Total must hold for training (partial model) | 3.11 B | backbone + 2 shared + K=8 trainable experts (= full 15.71 B − the 56 parked routed experts) |
| · Backbone (attention + embeds + LM head + norms + gates + dense FFN) | 861.9 M | everything outside the routed MoE experts |
| · Shared-expert weights (held, not trained) | 449.8 M | n_shared=2 × 26 MoE layers × per-expert params |
| · Per-miner trainable shard | 1.80 B | K=8 of the 64 routed experts (per layer) × 26 MoE layers × per-expert params |
| Per-expert parameters | 8.7 M | gate + up + down (3 × hidden × moe_intermediate) |
| MoE layers | 26 | num_hidden_layers (27) − first_k_dense_replace (1) |
| Routed experts / layer (full model) | 64 | config.n_routed_experts; all present in the full model. Partial training keeps K=8 of these on GPU and parks the other 56 on CPU |
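The arithmetic in the table can be reproduced in a few lines. This is a minimal sketch, not the report's own tooling: `HIDDEN = 2048` and `MOE_INTERMEDIATE = 1408` are assumed from the public DeepSeek-V2-Lite config (they are not stated above), and the backbone count is copied from the table as a rounded figure.

```python
# Sketch of the shard-size arithmetic from the table.
# HIDDEN and MOE_INTERMEDIATE are assumptions (public DeepSeek-V2-Lite config),
# not values stated in this report.
HIDDEN = 2048
MOE_INTERMEDIATE = 1408
MOE_LAYERS = 27 - 1            # num_hidden_layers - first_k_dense_replace
N_ROUTED, N_SHARED, K = 64, 2, 8
BACKBONE = 861_900_000         # copied from the table (rounded to 0.1 M)

per_expert = 3 * HIDDEN * MOE_INTERMEDIATE             # gate + up + down
shared = N_SHARED * MOE_LAYERS * per_expert            # held, not trained
trainable = K * MOE_LAYERS * per_expert                # per-miner shard
parked = (N_ROUTED - K) * MOE_LAYERS * per_expert      # left on CPU
held = BACKBONE + shared + trainable                   # resident on GPU
full = held + parked                                   # whole model

for name, count in [("per expert", per_expert), ("shared", shared),
                    ("trainable", trainable), ("held", held), ("full", full)]:
    print(f"{name:>10}: {count / 1e9:.3f} B")
```

Under those assumptions the totals round to the figures in the table: 8.7 M per expert, 449.8 M shared, 1.80 B trainable, 3.11 B held, and 15.71 B for the full model.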

§ 02 · Metric

Network Throughput (tokens/sec)

Network throughput · across the window: 22,500 tokens / sec (A6000 2B-param rate (7,500) × 3 · estimated)

| Metric | Value | How it’s computed |
|---|---|---|
| Tokens per second (estimated) | 22,500 | A6000 tokens/sec for 2B-param training (7,500) × 3 |
| Hours per round | 1.747 | cycle_length_blocks (524) × 12 / 3600 |
| Tokens per round (assumed) | 141,480,000 | tokens_per_second × hours_per_round × 3600 |
| Rounds counted | 209.1 | window_hours / hours_per_round |

Estimated: both the per-GPU A6000 rate (7,500 tokens/sec) and the ×3 multiplier for merged miner outputs are assumed constants, not measurements.
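The formulas in the table chain together as follows. This is a small sketch under stated assumptions: the `× 12` factor is read here as 12 seconds per block, the function names are hypothetical, and `window_hours` is left as a parameter because the report does not state it.

```python
SECONDS_PER_BLOCK = 12      # the "x 12" in the table, read as seconds per block (assumption)
A6000_TOKENS_PER_SEC = 7_500  # assumed constant from the report
MERGE_MULTIPLIER = 3          # assumed constant from the report

def round_seconds(cycle_length_blocks: int) -> int:
    """Length of one round in seconds."""
    return cycle_length_blocks * SECONDS_PER_BLOCK

def tokens_per_round(tokens_per_second: int, cycle_length_blocks: int) -> int:
    """Tokens processed in one round at a constant rate."""
    return tokens_per_second * round_seconds(cycle_length_blocks)

def rounds_counted(window_hours: float, cycle_length_blocks: int) -> float:
    """How many rounds fit in an observation window given in hours."""
    return window_hours * 3600 / round_seconds(cycle_length_blocks)

network_tps = A6000_TOKENS_PER_SEC * MERGE_MULTIPLIER
print(network_tps)                                   # 22500
print(round(round_seconds(524) / 3600, 3))           # 1.747
print(tokens_per_round(network_tps, 524))            # 141480000
```

Working in whole seconds rather than fractional hours keeps the tokens-per-round figure an exact integer.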


§ 03 · Metric

Active Miners

An average of 225.2 distinct miners were active per cycle across 48 cycles, each running at least an A6000 GPU.

[Chart: Active miners per cycle. x-axis: Cycle (15,700 to 15,850); y-axis: Distinct active miners (200 to 250).]
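An average of this kind can be computed by counting unique miners within each cycle and then averaging the counts. The sketch below assumes a hypothetical record schema of `(cycle, miner_id)` pairs; the report does not describe how its telemetry is stored, and the toy data is illustrative only.

```python
from collections import defaultdict

def mean_distinct_miners(records):
    """Average number of distinct miners per cycle.

    `records` is an iterable of (cycle, miner_id) pairs (hypothetical schema).
    A miner seen several times in one cycle is counted once for that cycle.
    """
    per_cycle = defaultdict(set)
    for cycle, miner_id in records:
        per_cycle[cycle].add(miner_id)
    if not per_cycle:
        raise ValueError("no records to average")
    return sum(len(miners) for miners in per_cycle.values()) / len(per_cycle)

# Toy data: cycle 1 has miners {a, b}, cycle 2 has miners {a, c, d}.
toy = [(1, "a"), (1, "b"), (1, "a"), (2, "a"), (2, "c"), (2, "d")]
print(mean_distinct_miners(toy))  # 2.5
```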

§ 04 · Metric

Training Loss

One-line takeaway: Our in-place ESFT reaches about the same GSM8K accuracy as the C · ESFT-classic baseline (≈39.5%) while training ~4× faster; the masked-routing shard recovers performance once merged back into the full model.

About the variants

All three cells fine-tune K=8 routed experts of DeepSeek-V2-Lite on the same metamath + alpaca SFT mix, starting from the same base model and the same expert assignment. The only thing that changes is how the forward pass is set up during training — making the comparison a clean isolation of training-time routing strategy.

| Cell | Training-time forward pass | Memory on GPU | What it tests |
|---|---|---|---|
| A · in-place ESFT | Gate masked to the K=8 trainable experts only; the other 56 routed experts parked on CPU. | ≈3.11 B · ~5× smaller | The masked-routing baseline: quantifies the train↔eval gap. Lightest memory and fastest. |
| B · frozen-kept (N=16) | Gate masked to K=8 trainable + N=16 resident-but-frozen experts (the K+N routable set). | ≈6.7 B · ~2.3× smaller | Whether widening the routable set (without more trainable experts) lets the masked forward overlap natural routing. |
| C · ESFT-classic | Unmodified natural top-k routing over all 64 experts; no masking, every expert resident. | ≈15.7 B · full model | Whether matching the train and eval forwards makes the K=8 gains transfer. Highest memory and slowest. |
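The difference between the three forward passes comes down to which expert ids the gate is allowed to pick. The sketch below illustrates that idea for a single token in plain Python. It is not the training code: the `route` helper, the toy logits, the choice of expert ids, and `top_k=2` are all hypothetical, and renormalising weights over the selected experts is a simplification that may differ from the model's actual gate.

```python
import math

def route(logits, routable, top_k):
    """Pick the top_k experts for one token, considering only `routable` ids.

    Restricting the candidates is equivalent to setting every other gate
    logit to -inf before top-k. Weights are a softmax over the chosen logits.
    """
    candidates = sorted((i for i in range(len(logits)) if i in routable),
                        key=lambda i: logits[i], reverse=True)[:top_k]
    if not candidates:
        raise ValueError("routable set is empty")
    peak = max(logits[i] for i in candidates)
    exps = [math.exp(logits[i] - peak) for i in candidates]
    total = sum(exps)
    return candidates, [e / total for e in exps]

logits = [0.1 * i for i in range(64)]   # toy gate scores for one token
trainable = set(range(8))               # cell A: K=8 trainable experts
frozen_kept = set(range(8, 24))         # cell B: N=16 resident, frozen experts
everyone = set(range(64))               # cell C: natural routing over all 64

ids_a, w_a = route(logits, trainable, top_k=2)
ids_b, w_b = route(logits, trainable | frozen_kept, top_k=2)
ids_c, w_c = route(logits, everyone, top_k=2)
print(ids_a, ids_b, ids_c)  # [7, 6] [23, 22] [63, 62]
```

In this toy case the natural choice (experts 63 and 62) is outside both masked sets, which is the train↔eval mismatch the cells are designed to measure.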

GPU memory footprint — what actually fits

All three partial-MoE cells train the same 1.8 B parameters and fit on a single A6000 (48 GB). A full fine-tune of the same model trains 15.7 B parameters and peaks near 126 GB, roughly 6× the ~21 GB peak of in-place ESFT, so it requires multiple datacenter GPUs.

Full fine-tune · GPU peak: 126 GB (15.7 B trainable · needs multi-GPU FSDP / ZeRO-3)

In-place ESFT · GPU peak: 21 GB (1.8 B trainable · fits one A6000 (48 GB))
| Setup | Trainable params | GPU peak | Fits on… |
|---|---|---|---|
| A · in-place ESFT | 1.8 B | ~21 GB | Single A6000 (48 GB) |
| B · frozen-kept (N=16) | 1.8 B | ~28–32 GB | Single A6000 |
| C · ESFT-classic | 1.8 B | ~44 GB | Single A6000 (just barely) |
| Full fine-tune | 15.7 B | ~126 GB | Multi-GPU FSDP / ZeRO-3, or 2+ × A100 80 GB |
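To make "fits" concrete, the reported peaks can be compared against the 48 GB card directly. This is only a sketch over the numbers in the table: it uses the upper end of B's 28–32 GB range, and the `headroom` helper is a hypothetical name.

```python
A6000_GB = 48
FULL_FINE_TUNE_GB = 126

# Reported GPU peaks per cell; B uses the upper end of its 28-32 GB range.
peaks_gb = {"A": 21, "B": 32, "C": 44}

def headroom(peak_gb, capacity_gb=A6000_GB):
    """Free memory left on the card at peak usage."""
    return capacity_gb - peak_gb

for cell, peak in peaks_gb.items():
    ratio = FULL_FINE_TUNE_GB / peak
    print(f"{cell}: {headroom(peak)} GB headroom, full fine-tune peak is {ratio:.1f}x larger")
```

Cell C leaves only about 4 GB free, which is why the table marks it as fitting "just barely".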

Training loss

Each step is one optimizer update; lower is better.

A higher training loss for A and B is expected here: those cells train with only a slice of the model in the forward pass (A parks 56 of the 64 routed experts on CPU), so their loss curves naturally sit above a full-model run. That number is not the goal. What matters is the held-out evaluation: once the trained shard is merged back into the full model, does it recover or beat the base model’s performance on the evals below?

[Chart: training loss (y-axis, 1 to 6) by step (x-axis, 100 to 500). Series:]
  • A · in-place ESFT
  • B · frozen-kept (N=16)
  • C · ESFT-classic

Evaluation — GSM8K-CoT exact-match (strict)

Per-cell GSM8K-CoT accuracy at each eval. Higher = better; step 0 is the base model.

[Chart: GSM8K accuracy (y-axis, 0.32 to 0.44) by step (x-axis, 0 to 500). Series:]
  • A · in-place ESFT
  • B · frozen-kept (N=16)
  • C · ESFT-classic
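For readers unfamiliar with strict exact-match scoring, the sketch below shows one plausible form of it: a completion counts only if it ends with a fixed answer phrase whose number equals the reference. The regular expression, the "The answer is" phrase, and the function names are assumptions for illustration; the report does not specify the scorer it used.

```python
import re

# Hypothetical strict pattern: the completion must end with "The answer is <number>."
ANSWER_RE = re.compile(r"The answer is (-?[0-9][0-9,]*(?:\.[0-9]+)?)\.?\s*$")

def strict_match(completion: str, gold: str) -> bool:
    """True only if the final answer phrase is present and its number equals gold."""
    found = ANSWER_RE.search(completion.strip())
    if found is None:
        return False
    return found.group(1).replace(",", "") == gold.replace(",", "")

def accuracy(completions, golds) -> float:
    """Fraction of completions that pass the strict match."""
    hits = sum(strict_match(c, g) for c, g in zip(completions, golds))
    return hits / len(golds)

outputs = [
    "6 * 7 = 42. The answer is 42.",
    "Adding them up gives 1,200. The answer is 1,200.",
    "So the total is 15",              # right number, wrong format: no credit
]
print(accuracy(outputs, ["42", "1200", "15"]))
```

The third completion shows why a strict metric can understate ability: a correct number in the wrong format scores zero.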

B · frozen-kept edges out A here: the 16 extra resident experts give the masked forward more of the natural routing to learn against, so its gains transfer best (42.5% GSM8K strict). Encouragingly, A · in-place ESFT lands on par with the C · ESFT-classic baseline (both ≈39.5%) while holding ~5× fewer parameters on GPU (≈3.11 B vs ≈15.7 B); the cheapest variant gives up almost nothing on this eval.

Training cost — total wallclock

A is the efficiency winner. It finishes in 3h 07m, about 4× faster than C and in roughly half the time of B, because parking 56 of the 64 routed experts on CPU means far less compute per step. Its eval accuracy still lands on par with the C baseline (≈39.5%), so it buys that 4× speedup and a ~5× smaller resident model at almost no quality cost.

| Cell | Total wallclock | Note |
|---|---|---|
| A · in-place ESFT | 3h 07m | ≈ 4× faster than C |
| B · frozen-kept (N=16) | 5h 57m | 16 kept experts resident |
| C · ESFT-classic | 12h 59m | full routing · every expert on GPU |
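The speedup figures follow directly from the three wallclock times. A minimal sketch, with a hypothetical `to_minutes` helper that parses the report's "Xh YYm" format:

```python
import re

def to_minutes(text: str) -> int:
    """Convert a duration like '3h 07m' to whole minutes."""
    parts = re.fullmatch(r"(\d+)h (\d+)m", text)
    if parts is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(parts.group(1)) * 60 + int(parts.group(2))

wallclock = {"A": "3h 07m", "B": "5h 57m", "C": "12h 59m"}
minutes = {cell: to_minutes(t) for cell, t in wallclock.items()}

print(minutes)                                    # {'A': 187, 'B': 357, 'C': 779}
print(round(minutes["C"] / minutes["A"], 2))      # 4.17 -> "about 4x faster than C"
print(round(minutes["A"] / minutes["B"], 2))      # 0.52 -> "roughly half of B"
```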

Retention across general benchmarks (no-forgetting check)

Fine-tuning on a narrow math mix risks catastrophic forgetting — the model gets better at the target task but quietly loses general ability it had before. This check re-runs each trained cell on three broad benchmarks (MMLU, ARC-Challenge, HellaSwag) and compares against the base model: scores should hold roughly flat. Small deltas (within ±0.5 pp) mean the math gains came with no meaningful loss of general capability.

| Benchmark | Base (step 0) | A · in-place ESFT (step 500) | B · frozen-kept (N=16) (step 500) | C · ESFT-classic (step 500) |
|---|---|---|---|---|
| MMLU acc | 58.5% | 58.6% (+0.1 pp) | 58.3% (-0.2 pp) | 58.0% (-0.5 pp) |
| ARC-Challenge acc | 55.1% | 54.6% (-0.5 pp) | 54.6% (-0.5 pp) | 54.6% (-0.5 pp) |
| HellaSwag acc | 52.5% | 52.6% (+0.1 pp) | 52.4% (-0.1 pp) | 52.2% (-0.3 pp) |
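The retention check reduces to subtracting the base score from each cell's score and testing the result against the ±0.5 pp band. The sketch below uses the scores from the table; the helper names are hypothetical.

```python
TOLERANCE_PP = 0.5

base = {"MMLU": 58.5, "ARC-Challenge": 55.1, "HellaSwag": 52.5}
step_500 = {
    "A": {"MMLU": 58.6, "ARC-Challenge": 54.6, "HellaSwag": 52.6},
    "B": {"MMLU": 58.3, "ARC-Challenge": 54.6, "HellaSwag": 52.4},
    "C": {"MMLU": 58.0, "ARC-Challenge": 54.6, "HellaSwag": 52.2},
}

def deltas_pp(scores):
    """Per-benchmark change versus the base model, in percentage points."""
    # Rounding to one decimal avoids float noise such as -0.4999999.
    return {name: round(scores[name] - base[name], 1) for name in base}

def retained(scores):
    """True if every benchmark stays within the tolerance band."""
    return all(abs(d) <= TOLERANCE_PP for d in deltas_pp(scores).values())

for cell, scores in step_500.items():
    print(cell, deltas_pp(scores), "retained" if retained(scores) else "regressed")
```

All three cells pass, though ARC-Challenge sits exactly on the -0.5 pp boundary in each.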
End of report · subnet 102