Under the Hood of Evaluation
This page explains how the validator evaluates miner submissions end-to-end — from receiving an uploaded checkpoint to writing a weight to the blockchain.
Raw Score Evaluation
Every miner submission is evaluated by calculating the absolute validation loss:

$$\mathcal{L}_i = \mathcal{L}_{\text{val}}\!\left(\Phi^{(t)} \oplus \Delta_i\right)$$

Where:
- $\mathcal{L}_i$ — the loss of the model after applying miner $i$'s expert weight update on the held-out validation set.
- $\Phi^{(t)}$ — the current global model checkpoint that was served to miners at the start of this cycle.
- $\Delta_i$ — miner $i$'s expert update payload.
Miners are ranked based on this raw validation loss over the evaluation set.
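A minimal sketch of this computation, assuming a PyTorch model whose forward pass returns logits and a miner payload delivered as a partial state dict (the function and argument names here are illustrative, not taken from evaluator.py):

```python
import copy

import torch
import torch.nn.functional as F


def raw_validation_loss(baseline_model, expert_update, val_loader, device="cuda"):
    """Apply miner i's expert update to a copy of the baseline Φ⁽ᵗ⁾ and
    return the mean cross-entropy loss over the held-out validation set."""
    # Deep-copy so the shared baseline checkpoint is never mutated.
    model = copy.deepcopy(baseline_model)
    # The payload only covers the miner's expert group; strict=False keeps
    # every other parameter at its baseline value.
    model.load_state_dict(expert_update, strict=False)
    model.to(device).eval()

    total_loss, batches = 0.0, 0
    with torch.no_grad():
        for input_ids, labels in val_loader:
            logits = model(input_ids.to(device))
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                labels.view(-1).to(device),
            )
            total_loss += loss.item()
            batches += 1
    return total_loss / max(batches, 1)
```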
How evaluator.py Works: Step by Step
1. Wait for Submissions Phase
Miners directly upload their checkpoints to the validator's FastAPI HTTP server (POST /submit-checkpoint). When the submission phase closes, run.py gathers all successful, validated miner checkpoint files from disk for evaluation.
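A minimal sketch of what the submission endpoint could look like, assuming the miner sends its UID, hotkey, and checkpoint file as multipart form data (the field names and on-disk layout are illustrative; only the POST /submit-checkpoint route comes from the description above):

```python
from pathlib import Path

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
SUBMISSION_DIR = Path("submissions")  # illustrative storage location


@app.post("/submit-checkpoint")
async def submit_checkpoint(
    uid: int = Form(...),
    hotkey: str = Form(...),
    checkpoint: UploadFile = File(...),
):
    """Persist a miner checkpoint during the submission phase so that run.py
    can gather it from disk once the phase closes."""
    SUBMISSION_DIR.mkdir(exist_ok=True)
    dest = SUBMISSION_DIR / f"uid_{uid}_{hotkey}.pt"
    dest.write_bytes(await checkpoint.read())
    return {"status": "accepted", "uid": uid}
```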
2. Load the Baseline
The validator ensures the baseline model is available in CPU memory as a template for evaluating miner shards.
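A sketch of that step, assuming the checkpoint path and model constructor are available to the validator (the names are illustrative):

```python
import torch


def load_baseline(model_cls, config, checkpoint_path):
    """Build the baseline model Φ⁽ᵗ⁾ once and keep it on CPU; evaluator workers
    deep-copy it per miner instead of reloading the checkpoint from disk."""
    model = model_cls(config)
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    return model  # stays on CPU until a worker moves its copy to the GPU
```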
3. Evaluate the Miner's Checkpoint (Asynchronously)
The validator enqueues all received miner checkpoints to be processed by asynchronous evaluator_worker tasks. For each miner:
- Load Checkpoint: The validator creates a deep copy of the baseline model and loads the miner's state dict into the expert group variables (load_model_from_path).
- Setup Dataloader: Creates a validation dataloader uniquely seeded by a combined validator seed to prevent miners from guessing the validation sequence.
- Run Evaluation: Calculates the cross-entropy loss (evaluate_model(..., max_eval_batches=50)) via fp16 inference.
- Record Score: The raw loss value is passed directly to the MinerScoreAggregator (see the sketch after this list).
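A condensed sketch of an evaluator_worker tying these steps together. load_model_from_path, get_dataloader, and evaluate_model refer to the helpers named above, but their signatures, the job format, and the seed derivation are assumptions:

```python
import copy


async def evaluator_worker(job_queue, baseline_model, aggregator, validator_seed):
    """Pull miner jobs off the queue, score each checkpoint, and record the raw loss."""
    while True:
        job = await job_queue.get()          # (uid, hotkey, checkpoint_path)
        if job is None:                      # sentinel: submission queue drained
            break
        uid, hotkey, path = job

        # Load Checkpoint: merge the miner's expert shard into a copy of Φ⁽ᵗ⁾.
        model = load_model_from_path(copy.deepcopy(baseline_model), path)

        # Setup Dataloader: seeded so miners cannot predict the validation sequence.
        val_loader = get_dataloader(seed=hash((validator_seed, uid)))

        # Run Evaluation: fp16 cross-entropy over a fixed budget of 50 batches.
        val_loss = evaluate_model(model, val_loader, max_eval_batches=50)

        # Record Score: raw loss goes straight into the aggregator's history.
        aggregator.add_score(uid=uid, hotkey=hotkey, score=val_loss)
        job_queue.task_done()
```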
4. Compute Final Subnet Weights
- The raw loss values are maintained across epochs in the MinerScoreAggregator.
- The validator retrieves the averaged historical scores for each UID.
- The smoothed scores are filtered: only the top models by evaluation loss are selected, and the resulting scores are distributed back to the chain as weights via subtensor.set_weights() (see the sketch after this list).
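A sketch of the weight-setting step. The top-k size, the inverse-loss weighting, and the aggregator accessors (get_metric, uids) are illustrative; only subtensor.set_weights() is taken from the description above:

```python
def compute_and_set_weights(aggregator, subtensor, wallet, netuid, top_k=8):
    """Convert smoothed historical losses into normalized weights and push them on-chain."""
    # Averaged (or EMA) loss per UID over the active sliding window.
    avg_loss = {
        uid: aggregator.get_metric(uid, metric="avg") for uid in aggregator.uids()
    }
    avg_loss = {uid: loss for uid, loss in avg_loss.items() if loss is not None}

    # Keep only the top-k models by evaluation loss (lower is better).
    best = sorted(avg_loss, key=avg_loss.get)[:top_k]

    # Lower loss -> larger weight; normalize so the weights sum to 1.
    inverse = {uid: 1.0 / avg_loss[uid] for uid in best}
    total = sum(inverse.values())
    uids = list(inverse)
    weights = [inverse[uid] / total for uid in uids]

    subtensor.set_weights(wallet=wallet, netuid=netuid, uids=uids, weights=weights)
```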
MinerScoreAggregator
MinerScoreAggregator maintains a per-miner score history across cycles in a unified state.
Rolling Averages/EMA: Instead of a single static score, the aggregator maintains an array of (timestamp, score) tuples per miner. During weight setting and gradient aggregation, the validator pulls an aggregated metric, usually an avg or ema (Exponential Moving Average) computed over the active sliding window of the miner's history.
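A minimal sketch of that history and the windowed avg/ema retrieval, assuming a sliding window measured in seconds and hypothetical accessor names (get_metric, uids):

```python
import time
from collections import defaultdict


class MinerScoreAggregator:
    """Per-UID time series of (timestamp, score) tuples with windowed aggregates."""

    def __init__(self, window_seconds=6 * 3600, ema_alpha=0.2):
        self.history = defaultdict(list)   # uid -> [(timestamp, score), ...]
        self.hotkeys = {}                  # uid -> hotkey last seen submitting
        self.window = window_seconds
        self.alpha = ema_alpha

    def uids(self):
        return list(self.history)

    def _windowed(self, uid):
        cutoff = time.time() - self.window
        return [score for ts, score in self.history[uid] if ts >= cutoff]

    def get_metric(self, uid, metric="avg"):
        scores = self._windowed(uid)
        if not scores:
            return None
        if metric == "avg":
            return sum(scores) / len(scores)
        # "ema": weight the most recent scores in the window more heavily.
        ema = scores[0]
        for score in scores[1:]:
            ema = self.alpha * score + (1 - self.alpha) * ema
        return ema
```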
Hotkey rotation detection: If a new hotkey begins submitting for a UID that previously had a different hotkey, the aggregator detects this rotation event and instantly resets the score history for that UID. This ensures a miner cannot build up a score history under one hotkey and then anonymously transfer/sell it to another.
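Continuing the sketch above, rotation detection can live in add_score, the method the evaluator worker calls when recording a loss (the reset semantics follow the description here; the implementation details are assumed):

```python
    def add_score(self, uid, hotkey, score):
        """Record a raw validation loss, resetting history on hotkey rotation."""
        previous = self.hotkeys.get(uid)
        if previous is not None and previous != hotkey:
            # A different hotkey now submits for this UID: wipe the old history
            # so reputation cannot be transferred or sold with the slot.
            self.history[uid] = []
        self.hotkeys[uid] = hotkey
        self.history[uid].append((time.time(), score))
```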
The Alignment Principle
Validators deliberately evaluate on the same distribution miners train on. This is not a limitation — it is a design decision.
If validators evaluated on a separate public benchmark, miners would quickly discover and optimize for that benchmark specifically (Goodhart's Law). The evaluation mechanism resists this: the only way to score well is to genuinely improve the global model over real domain data.
flowchart LR
Queue["Miner Job Queue<br/>(From Valid Submissions)"]
Aggregator[("MinerScoreAggregator<br/>(Time-Series History)")]
subgraph AsyncEvaluator ["evaluator_worker (Async Task)"]
direction TB
LoadModel["load_model_from_path()<br/>Merge minor checkpoint into base model Φ⁽ᵗ⁾"]
SetupData["get_dataloader()<br/>Initialize seeded validation dataset"]
RunEval["evaluate_model()<br/>Calculate fp16 cross-entropy loss"]
Score["aggregator.add_score()<br/>Push raw 'val_loss' for UID/Hotkey"]
LoadModel --> SetupData --> RunEval --> Score
end
Queue -->|Pulls Miner Job| LoadModel
Score --> |Stores Score| Aggregator

Figure: The evaluator.py pipeline. For each miner submission: load miner subset → evaluate full model → record loss → pass to MinerScoreAggregator.
Why Evaluated Loss?
Cross-entropy loss (and its exponential equivalent, perplexity) is used as the evaluation metric because:
- It is a standard, interpretable measure of language model quality.
- It is consistent across cycles and across different miners.
- It is robust to specific data domains and tokenization.
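For reference, perplexity is simply the exponential of the mean cross-entropy loss:

$$\mathrm{PPL} = \exp\left(\mathcal{L}_{\mathrm{CE}}\right)$$

so, for example, a drop in evaluation loss from 2.0 to 1.9 corresponds to perplexity falling from roughly 7.4 to 6.7.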
A model with a lower evaluation loss on domain data has genuinely learned domain-specific patterns — it is practically impossible to fake this progression without real training.