Under the Hood of Evaluation
This page explains how the validator evaluates miner submissions end-to-end — from receiving an uploaded checkpoint to writing a weight to the blockchain.
Raw Score Evaluation
Every miner submission is evaluated by calculating the absolute validation loss:

$$\mathcal{L}_i = \mathcal{L}_{\text{val}}\!\left(\Phi^{(t)} \oplus \Delta_i\right)$$

Where:
- $\mathcal{L}_i$ — the loss of the model after applying miner $i$'s expert weight update on the held-out validation set.
- $\Phi^{(t)}$ — the current global model checkpoint that was served to miners at the start of this cycle.
- $\Delta_i$ — miner $i$'s expert update payload.
Miners are ranked based on this raw validation loss over the evaluation set.
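A minimal sketch of this computation, assuming a PyTorch model whose forward pass returns logits and a miner payload delivered as a partial state dict (the function and argument names here are illustrative, not taken from evaluator.py):

```python
import copy

import torch
import torch.nn.functional as F


def raw_validation_loss(baseline_model, expert_update, val_loader, device="cuda"):
    """Apply miner i's expert update to a copy of the baseline Φ⁽ᵗ⁾ and
    return the mean cross-entropy loss over the held-out validation set."""
    # Deep-copy so the shared baseline checkpoint is never mutated.
    model = copy.deepcopy(baseline_model)
    # The payload only covers the miner's expert group; strict=False keeps
    # every other parameter at its baseline value.
    model.load_state_dict(expert_update, strict=False)
    model.to(device).eval()

    total_loss, batches = 0.0, 0
    with torch.no_grad():
        for input_ids, labels in val_loader:
            logits = model(input_ids.to(device))
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                labels.view(-1).to(device),
            )
            total_loss += loss.item()
            batches += 1
    return total_loss / max(batches, 1)
```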
How evaluator.py Works: Step by Step
1. Wait for Submissions Phase
Miners directly upload their checkpoints to the validator's FastAPI HTTP server (POST /submit-checkpoint). When the submission phase closes, run.py gathers all successful, validated miner checkpoint files from disk for evaluation.
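A minimal sketch of what the submission endpoint could look like, assuming the miner sends its UID, hotkey, and checkpoint file as multipart form data (the field names and on-disk layout are illustrative; only the POST /submit-checkpoint route comes from the description above):

```python
from pathlib import Path

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
SUBMISSION_DIR = Path("submissions")  # illustrative storage location


@app.post("/submit-checkpoint")
async def submit_checkpoint(
    uid: int = Form(...),
    hotkey: str = Form(...),
    checkpoint: UploadFile = File(...),
):
    """Persist a miner checkpoint during the submission phase so that run.py
    can gather it from disk once the phase closes."""
    SUBMISSION_DIR.mkdir(exist_ok=True)
    dest = SUBMISSION_DIR / f"uid_{uid}_{hotkey}.pt"
    dest.write_bytes(await checkpoint.read())
    return {"status": "accepted", "uid": uid}
```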
2. Load the Baseline
The validator ensures the baseline model is available in CPU memory as a template for evaluating miner shards.
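A sketch of that step, assuming the checkpoint path and model constructor are available to the validator (the names are illustrative):

```python
import torch


def load_baseline(model_cls, config, checkpoint_path):
    """Build the baseline model Φ⁽ᵗ⁾ once and keep it on CPU; evaluator workers
    deep-copy it per miner instead of reloading the checkpoint from disk."""
    model = model_cls(config)
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    return model  # stays on CPU until a worker moves its copy to the GPU
```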
3. Evaluate the Miner's Checkpoint (Asynchronously)
The validator enqueues all received miner checkpoints to be processed by asynchronous evaluator_worker tasks. For each miner:
- Load Checkpoint: The validator creates a deep copy of the baseline model and loads the miner's state dict into the expert group variables (load_model_from_path).
- Setup Dataloader: Creates a validation dataloader uniquely seeded by a combined validator seed to prevent miners from guessing the validation sequence.
- Run Evaluation: Calculates the cross-entropy loss (evaluate_model(..., max_eval_batches=50)) via fp16 inference.
- Record Score: The raw loss value is passed directly to the MinerScoreAggregator (see the sketch after this list).
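A condensed sketch of an evaluator_worker tying these steps together. load_model_from_path, get_dataloader, and evaluate_model refer to the helpers named above, but their signatures, the job format, and the seed derivation are assumptions:

```python
import copy


async def evaluator_worker(job_queue, baseline_model, aggregator, validator_seed):
    """Pull miner jobs off the queue, score each checkpoint, and record the raw loss."""
    while True:
        job = await job_queue.get()          # (uid, hotkey, checkpoint_path)
        if job is None:                      # sentinel: submission queue drained
            break
        uid, hotkey, path = job

        # Load Checkpoint: merge the miner's expert shard into a copy of Φ⁽ᵗ⁾.
        model = load_model_from_path(copy.deepcopy(baseline_model), path)

        # Setup Dataloader: seeded so miners cannot predict the validation sequence.
        val_loader = get_dataloader(seed=hash((validator_seed, uid)))

        # Run Evaluation: fp16 cross-entropy over a fixed budget of 50 batches.
        val_loss = evaluate_model(model, val_loader, max_eval_batches=50)

        # Record Score: raw loss goes straight into the aggregator's history.
        aggregator.add_score(uid=uid, hotkey=hotkey, score=val_loss)
        job_queue.task_done()
```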
4. Compute Final Subnet Weights
- The raw loss values are maintained across epochs in the MinerScoreAggregator.
- The validator retrieves the averaged historical scores for each UID.
- The smoothed scores are filtered: only the top models by evaluation loss are selected, and the resulting scores are distributed back to the chain as weights via subtensor.set_weights() (see the sketch after this list).
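A sketch of the weight-setting step. The top-k size, the inverse-loss weighting, and the aggregator accessors (get_metric, uids) are illustrative; only subtensor.set_weights() is taken from the description above:

```python
def compute_and_set_weights(aggregator, subtensor, wallet, netuid, top_k=8):
    """Convert smoothed historical losses into normalized weights and push them on-chain."""
    # Averaged (or EMA) loss per UID over the active sliding window.
    avg_loss = {
        uid: aggregator.get_metric(uid, metric="avg") for uid in aggregator.uids()
    }
    avg_loss = {uid: loss for uid, loss in avg_loss.items() if loss is not None}

    # Keep only the top-k models by evaluation loss (lower is better).
    best = sorted(avg_loss, key=avg_loss.get)[:top_k]

    # Lower loss -> larger weight; normalize so the weights sum to 1.
    inverse = {uid: 1.0 / avg_loss[uid] for uid in best}
    total = sum(inverse.values())
    uids = list(inverse)
    weights = [inverse[uid] / total for uid in uids]

    subtensor.set_weights(wallet=wallet, netuid=netuid, uids=uids, weights=weights)
```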
MinerScoreAggregator
MinerScoreAggregator maintains a per-miner score history across cycles in a unified state.
Rolling Averages/EMA: Instead of a single static score, the aggregator maintains an array of (timestamp, score) tuples per miner. During weight setting and gradient aggregation, the validator pulls an aggregated metric, usually an avg or ema (Exponential Moving Average) computed over the active sliding window of the miner's history.
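A minimal sketch of that history and the windowed avg/ema retrieval, assuming a sliding window measured in seconds and hypothetical accessor names (get_metric, uids):

```python
import time
from collections import defaultdict


class MinerScoreAggregator:
    """Per-UID time series of (timestamp, score) tuples with windowed aggregates."""

    def __init__(self, window_seconds=6 * 3600, ema_alpha=0.2):
        self.history = defaultdict(list)   # uid -> [(timestamp, score), ...]
        self.hotkeys = {}                  # uid -> hotkey last seen submitting
        self.window = window_seconds
        self.alpha = ema_alpha

    def uids(self):
        return list(self.history)

    def _windowed(self, uid):
        cutoff = time.time() - self.window
        return [score for ts, score in self.history[uid] if ts >= cutoff]

    def get_metric(self, uid, metric="avg"):
        scores = self._windowed(uid)
        if not scores:
            return None
        if metric == "avg":
            return sum(scores) / len(scores)
        # "ema": weight the most recent scores in the window more heavily.
        ema = scores[0]
        for score in scores[1:]:
            ema = self.alpha * score + (1 - self.alpha) * ema
        return ema
```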
Hotkey rotation detection: If a new hotkey begins submitting for a UID that previously had a different hotkey, the aggregator detects this rotation event and instantly resets the score history for that UID. This ensures a miner cannot build up a score history under one hotkey and then anonymously transfer/sell it to another.
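Continuing the sketch above, rotation detection can live in add_score, the method the evaluator worker calls when recording a loss (the reset semantics follow the description here; the implementation details are assumed):

```python
    def add_score(self, uid, hotkey, score):
        """Record a raw validation loss, resetting history on hotkey rotation."""
        previous = self.hotkeys.get(uid)
        if previous is not None and previous != hotkey:
            # A different hotkey now submits for this UID: wipe the old history
            # so reputation cannot be transferred or sold with the slot.
            self.history[uid] = []
        self.hotkeys[uid] = hotkey
        self.history[uid].append((time.time(), score))
```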
The Alignment Principle
Validators deliberately evaluate on the same distribution miners train on. This is not a limitation — it is a design decision.
If validators evaluated on a separate public benchmark, miners would quickly discover and optimize for that benchmark specifically (Goodhart's Law). The evaluation mechanism resists this: the only way to score well is to genuinely improve the global model over real domain data.
flowchart LR
Queue["Miner Job Queue<br/>(From Valid Submissions)"]
Aggregator[("MinerScoreAggregator<br/>(Time-Series History)")]
subgraph AsyncEvaluator ["evaluator_worker (Async Task)"]
direction TB
LoadModel["load_model_from_path()<br/>Merge minor checkpoint into base model Φ⁽ᵗ⁾"]
SetupData["get_dataloader()<br/>Initialize seeded validation dataset"]
RunEval["evaluate_model()<br/>Calculate fp16 cross-entropy loss"]
Score["aggregator.add_score()<br/>Push raw 'val_loss' for UID/Hotkey"]
LoadModel --> SetupData --> RunEval --> Score
end
Queue -->|Pulls Miner Job| LoadModel
Score --> |Stores Score| Aggregator

Figure: The evaluator.py pipeline. For each miner submission: load miner subset → evaluate full model → record loss → pass to MinerScoreAggregator.
Why Evaluated Loss?
Cross-entropy loss (and its exponential equivalent, perplexity) is used as the evaluation metric because:
- It is a standard, interpretable measure of language model quality.
- It is consistent across cycles and across different miners.
- It is robust to specific data domains and tokenization.
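For reference, perplexity is simply the exponential of the mean cross-entropy loss:

$$\mathrm{PPL} = \exp\left(\mathcal{L}_{\mathrm{CE}}\right)$$

so, for example, a drop in evaluation loss from 2.0 to 1.9 corresponds to perplexity falling from roughly 7.4 to 6.7.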
A model with a lower evaluation loss on domain data has genuinely learned domain-specific patterns — it is practically impossible to fake this progression without real training.