# Under the Hood of Evaluation
This page explains how the validator evaluates miner submissions end-to-end — from downloading a checkpoint to writing a weight to the blockchain.
## The Proof-of-Loss Formula

Every miner submission is scored by:

w_i ∝ ReLU( L(Φ^(t)) − L(Φ^(t) + Δ_i) )

Where:

- `L(Φ^(t))` — the loss of the current global model on the held-out validation set
- `L(Φ^(t) + Δ_i)` — the loss after applying miner i's weight update
- `ReLU(·)` — clips negative values to zero (miners who worsen the model get w_i = 0)

The raw score is the loss reduction itself, clipped at zero. Miners who reduce loss more earn proportionally more.
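In code, the per-miner raw score is a one-liner. This is a sketch — `proof_of_loss_score` is an illustrative name, not necessarily the validator's actual function:

```python
def proof_of_loss_score(loss_before: float, loss_after: float) -> float:
    """ReLU of the loss reduction: miners who worsen the model score zero."""
    return max(0.0, loss_before - loss_after)
```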
## How `evaluator.py` Works: Step by Step

### 1. Load the Baseline
The validator loads the current global model checkpoint Φ^(t) — the model version that was served to miners at the start of this cycle.
### 2. Download the Miner's Checkpoint

For each committed miner submission, the validator:

- Reads the committed hash from the blockchain
- Downloads the checkpoint from the URL in the miner's `SignedModelSubmitMessage`
- Verifies `sha256(checkpoint_bytes) == committed_hash` — if they don't match, the submission is rejected immediately
- Verifies the ed25519 signature on the message
- Verifies the block number is within the valid submission window
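The hash and window checks can be sketched with only the standard library. The ed25519 signature check is elided here because it depends on the key library in use, and the window parameter names are illustrative, not taken from the validator config:

```python
import hashlib

def verify_submission(checkpoint_bytes: bytes, committed_hash: str,
                      block_number: int, window_start: int, window_end: int) -> bool:
    # 1. Hash check: the downloaded bytes must match the on-chain commitment.
    if hashlib.sha256(checkpoint_bytes).hexdigest() != committed_hash:
        return False
    # 2. Window check: the submission block must fall inside the valid window.
    if not (window_start <= block_number <= window_end):
        return False
    # (The ed25519 signature check over the serialized SignedModelSubmitMessage
    # would go here; it needs the miner's public key and a signing library.)
    return True
```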
### 3. Apply the Update

The validator constructs Φ^(t) + Δ_i by:

- Starting with the baseline model Φ^(t)
- Loading the miner's expert group checkpoint
- Replacing the corresponding expert parameters in the baseline with those from the checkpoint

This produces a full model with the miner's expert update applied and all other parameters unchanged.
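A minimal sketch of the merge, treating checkpoints as plain parameter dictionaries (stand-ins for framework state dicts; `apply_expert_update` is an assumed name, not the evaluator's actual API):

```python
def apply_expert_update(baseline_state: dict, expert_state: dict) -> dict:
    """Return a full parameter dict: the baseline, with the miner's expert
    parameters swapped in. All other parameters are left untouched."""
    merged = dict(baseline_state)  # shallow copy: never mutate the baseline
    for name, tensor in expert_state.items():
        if name not in merged:
            raise KeyError(f"unknown parameter in checkpoint: {name}")
        merged[name] = tensor
    return merged
```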
### 4. Evaluate Loss

Both the baseline and the updated model are evaluated on the held-out validation set D_val:

```python
# Pseudo-code for the evaluation
loss_before = evaluate(baseline_model, validation_set)
loss_after = evaluate(updated_model, validation_set)
```

Metric: cross-entropy loss (the basis of perplexity), computed with fp16 inference and `eval_batch_size` from the validator config.
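To make the pseudo-code concrete, here is a toy `evaluate` that computes mean cross-entropy. It assumes a stand-in model that maps a context to a dict of next-token probabilities — the real evaluator runs batched fp16 inference over a full model, but the metric is the same:

```python
import math

def evaluate(model, validation_set):
    """Mean token-level cross-entropy (in nats) over the validation set.
    `model(context)` is assumed to return next-token probabilities."""
    total_nll, total_tokens = 0.0, 0
    for context, target in validation_set:
        probs = model(context)
        total_nll += -math.log(probs[target])  # negative log-likelihood
        total_tokens += 1
    return total_nll / total_tokens
```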
The validation set is drawn from the same domain distribution as miners' training data. It must be:
- Genuinely held-out (miners have not seen it)
- Representative of the full target domain
- Consistent across cycles (validators don't change the evaluation set mid-training)
### 5. Compute Score

```python
w_i = max(0.0, loss_before - loss_after)  # ReLU
```

The raw score is then normalized across all miners in the group so that the weights sum to 1 before the outer optimization step.
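The normalization step might look like this (a sketch; the zero-total guard handles the case where every miner in the group is ReLU-clipped to zero):

```python
def normalize_weights(raw_scores: dict) -> dict:
    """Scale non-negative raw scores so they sum to 1. If every miner
    scored zero, return all zeros rather than dividing by zero."""
    total = sum(raw_scores.values())
    if total == 0:
        return {k: 0.0 for k in raw_scores}
    return {k: v / total for k, v in raw_scores.items()}
```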
### 6. Log and Store

The score is passed to `MinerScoreAggregator` for time-series tracking.
## MinerScoreAggregator

`MinerScoreAggregator` maintains a per-miner score history across cycles.
EMA smoothing: raw w_i scores are smoothed using an exponential moving average:

```python
smoothed_score = alpha * new_score + (1 - alpha) * previous_smoothed_score
```

where `alpha = score_ema_alpha` from the validator config (default: 0.9). A higher alpha puts more weight on recent scores.
Hotkey rotation detection: if a new hotkey begins submitting for an expert group that previously had a different hotkey, the aggregator detects this as a "rotation event" and resets the score history for that group. This prevents a miner from building up a score history under one hotkey and then transferring it to another.
Missing submission handling: if a registered miner fails to submit in a cycle, their w_i for that cycle is recorded as 0. This is used in the EMA calculation, gradually reducing the smoothed score if a miner becomes inactive.
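Putting the aggregator's three behaviors together — EMA smoothing, rotation reset, and zero-scoring missed cycles — a minimal sketch (illustrative class and method names, not the actual `MinerScoreAggregator` API):

```python
class ScoreHistory:
    """Sketch of per-group score tracking with EMA smoothing, hotkey
    rotation reset, and zero-scoring of missed cycles."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.hotkeys = {}   # expert group -> last hotkey seen
        self.smoothed = {}  # expert group -> smoothed score

    def record(self, group: str, hotkey: str, score: float) -> float:
        if self.hotkeys.get(group) not in (None, hotkey):
            self.smoothed.pop(group, None)  # rotation event: reset history
        self.hotkeys[group] = hotkey
        prev = self.smoothed.get(group, score)  # seed EMA with first score
        self.smoothed[group] = self.alpha * score + (1 - self.alpha) * prev
        return self.smoothed[group]

    def record_missing(self, group: str) -> float:
        """A registered miner that did not submit scores 0 this cycle."""
        prev = self.smoothed.get(group, 0.0)
        self.smoothed[group] = (1 - self.alpha) * prev  # EMA with score 0
        return self.smoothed[group]
```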
## The Alignment Principle
Validators deliberately evaluate on the same distribution miners train on. This is not a limitation — it is a design decision.
If validators evaluated on a separate public benchmark, miners would quickly discover and optimize for that benchmark specifically (Goodhart's Law). The Proof-of-Loss mechanism resists this: the only way to score well is to genuinely improve the model on actual domain data. There is no shortcut.
Figure: The evaluator.py pipeline. For each miner submission: verify hash → apply update → evaluate both models → compute score → pass to MinerScoreAggregator.
## Why Perplexity?
Cross-entropy loss (and its exponential, perplexity) is used as the evaluation metric because:
- It is a standard, interpretable measure of language model quality
- It is consistent across cycles and across different miners
- It applies to any target domain and tokenization, as long as both are held fixed across evaluations
- It directly reflects the model's predictive quality on unseen data
A model with lower perplexity on domain data has genuinely learned domain-specific patterns — it is not possible to fake this with submission tricks.
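The relationship between the two metrics is a simple exponential (with cross-entropy measured in nats):

```python
import math

def perplexity(cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy)
```

For example, a model that assigns uniform probability over two choices has cross-entropy ln(2) and perplexity 2 — it is "perplexed" between exactly two options.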