Under the Hood of Evaluation

This page explains how the validator evaluates miner submissions end-to-end — from downloading a checkpoint to writing a weight to the blockchain.

The Proof-of-Loss Formula

Every miner submission is scored by:

w_i ∝ ReLU( L(Φ^(t)) − L(Φ^(t) + Δ_i) )

Where:

  • L(Φ^(t)) — the loss of the current global model on the held-out validation set
  • L(Φ^(t) + Δ_i) — the loss after applying miner i's weight update
  • ReLU(·) — clips negative values to zero (miners who worsen the model get w_i = 0)

The raw score is the loss reduction itself, clipped at zero. Miners who reduce loss more earn proportionally more weight.
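As a concrete sketch, the formula reduces to a one-line function (the loss values below are made up for illustration):

```python
def proof_of_loss_score(loss_before: float, loss_after: float) -> float:
    """ReLU of the loss reduction: a miner that worsens the model scores zero."""
    return max(0.0, loss_before - loss_after)

# A miner that lowers validation loss from 2.31 to 2.27 earns a positive raw score.
improving = proof_of_loss_score(2.31, 2.27)

# A miner whose update raises the loss is clipped to exactly zero.
worsening = proof_of_loss_score(2.31, 2.40)
```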

How evaluator.py Works: Step by Step

1. Load the Baseline

The validator loads the current global model checkpoint Φ^(t) — the model version that was served to miners at the start of this cycle.

2. Download the Miner's Checkpoint

For each committed miner submission, the validator:

  1. Reads the committed hash from the blockchain
  2. Downloads the checkpoint from the URL in the miner's SignedModelSubmitMessage
  3. Verifies: sha256(checkpoint_bytes) == committed_hash — if they don't match, the submission is rejected immediately
  4. Verifies the ed25519 signature on the message
  5. Verifies the block number is within the valid submission window
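Steps 3 and 5 can be sketched as a single check. This is a minimal illustration, not the validator's actual code: the function name, window parameters, and the hex encoding of the committed hash are assumptions, and the ed25519 signature check (step 4) is deliberately omitted.

```python
import hashlib

def verify_submission(checkpoint_bytes: bytes,
                      committed_hash: str,
                      block_number: int,
                      window_start: int,
                      window_end: int) -> bool:
    """Reject a submission unless its hash and block number check out.

    The real validator also verifies the ed25519 signature on the
    SignedModelSubmitMessage; that step is not shown here.
    """
    # Step 3: the downloaded bytes must hash to the committed value.
    if hashlib.sha256(checkpoint_bytes).hexdigest() != committed_hash:
        return False
    # Step 5: the submission must fall inside the valid window.
    return window_start <= block_number <= window_end
```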

3. Apply the Update

The validator constructs Φ^(t) + Δ_i by:

  1. Starting with the baseline model Φ^(t)
  2. Loading the miner's expert group checkpoint
  3. Replacing the corresponding expert parameters in the baseline with those from the checkpoint

This produces a full model with the miner's expert update applied and all other parameters unchanged.
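The merge can be sketched with plain parameter dictionaries (in practice these would be tensor state dicts; the function name and the assumption that checkpoint keys are a subset of baseline keys are hypothetical):

```python
def apply_expert_update(baseline: dict, expert_ckpt: dict) -> dict:
    """Return a copy of the baseline with the miner's expert params swapped in.

    Assumes every parameter name in expert_ckpt also exists in the baseline;
    an unknown name means the checkpoint does not match the model layout.
    """
    merged = dict(baseline)  # shallow copy: all other parameters unchanged
    for name, value in expert_ckpt.items():
        if name not in merged:
            raise KeyError(f"unexpected parameter in checkpoint: {name}")
        merged[name] = value
    return merged
```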

4. Evaluate Loss

Both the baseline and the updated model are evaluated on the held-out validation set D_val:

# Pseudo-code for the evaluation
loss_before = evaluate(baseline_model, validation_set)
loss_after = evaluate(updated_model, validation_set)

The metric is cross-entropy loss (the basis of perplexity), computed with fp16 inference using the eval_batch_size from the validator config.
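The metric itself is straightforward once the model's token probabilities are in hand. A minimal sketch, assuming the per-token probabilities have already been extracted from a forward pass (the function names are illustrative):

```python
import math

def mean_cross_entropy(target_probs):
    """Mean negative log-likelihood of the held-out tokens.

    target_probs[i] is the probability the model assigned to the i-th
    ground-truth token in D_val; in the real validator these come from
    an fp16 forward pass batched by eval_batch_size.
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(loss):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(loss)
```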

The validation set is drawn from the same domain distribution as miners' training data. It must be:

  • Genuinely held-out (miners have not seen it)
  • Representative of the full target domain
  • Consistent across cycles (validators don't change the evaluation set mid-training)

5. Compute Score

w_i = max(0.0, loss_before - loss_after)  # ReLU

The raw score is then normalized across all miners in the group so weights sum to 1 before the outer optimization step.
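The normalization step can be sketched as follows (a hypothetical helper; the edge-case policy when no miner improves the model is an assumption):

```python
def normalize_scores(raw_scores: dict) -> dict:
    """Normalize raw ReLU scores so the group's weights sum to 1.

    If every raw score is zero (no miner improved the model), all weights
    stay zero rather than dividing by zero.
    """
    total = sum(raw_scores.values())
    if total == 0.0:
        return {miner: 0.0 for miner in raw_scores}
    return {miner: score / total for miner, score in raw_scores.items()}
```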

6. Log and Store

The score is passed to MinerScoreAggregator for time-series tracking.

MinerScoreAggregator

MinerScoreAggregator maintains a per-miner score history across cycles.

EMA smoothing: raw w_i scores are smoothed using an exponential moving average:

smoothed_score = alpha * new_score + (1 - alpha) * previous_smoothed_score

where alpha = score_ema_alpha from the validator config (default: 0.9). A higher alpha places more weight on recent scores.
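The smoothing update is a single step (the handling of the first cycle, when no history exists yet, is an assumption; the real aggregator may initialize differently):

```python
def ema_update(new_score: float, prev_smoothed, alpha: float = 0.9) -> float:
    """One EMA step; alpha corresponds to score_ema_alpha in the config.

    On the first cycle there is no history, so the raw score is used
    directly (an assumption made for this sketch).
    """
    if prev_smoothed is None:
        return new_score
    return alpha * new_score + (1 - alpha) * prev_smoothed
```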

Hotkey rotation detection: if a new hotkey begins submitting for an expert group that previously had a different hotkey, the aggregator detects this as a "rotation event" and resets the score history for that group. This prevents a miner from building up a score history under one hotkey and then transferring it to another.

Missing submission handling: if a registered miner fails to submit in a cycle, their w_i for that cycle is recorded as 0. This is used in the EMA calculation, gradually reducing the smoothed score if a miner becomes inactive.
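The three behaviors above — EMA smoothing, rotation reset, and missing-submission decay — combine into a small amount of per-group state. The class below is a hypothetical interface; the real MinerScoreAggregator differs.

```python
class ScoreHistory:
    """Minimal sketch of per-expert-group score tracking (illustrative only)."""

    def __init__(self, alpha: float = 0.9):
        self.alpha = alpha
        self.hotkey = None    # hotkey currently credited for this group
        self.smoothed = None  # EMA of raw w_i scores

    def record(self, hotkey: str, raw_score: float) -> float:
        # Rotation event: a different hotkey takes over the group,
        # so the accumulated history is discarded.
        if self.hotkey is not None and hotkey != self.hotkey:
            self.smoothed = None
        self.hotkey = hotkey
        if self.smoothed is None:
            self.smoothed = raw_score
        else:
            self.smoothed = (self.alpha * raw_score
                             + (1 - self.alpha) * self.smoothed)
        return self.smoothed

    def record_missing(self) -> float:
        # A missed cycle counts as a raw score of 0, decaying the EMA.
        if self.hotkey is None:
            return 0.0
        return self.record(self.hotkey, 0.0)
```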

The Alignment Principle

Validators deliberately evaluate on the same distribution miners train on. This is not a limitation — it is a design decision.

If validators evaluated on a separate public benchmark, miners would quickly discover and optimize for that benchmark specifically (Goodhart's Law). The Proof-of-Loss mechanism resists this: the only way to score well is to genuinely improve the model on actual domain data. There is no shortcut.

Figure: The evaluator.py pipeline. For each miner submission: verify hash → apply update → evaluate both models → compute score → pass to MinerScoreAggregator.

Why Perplexity?

Cross-entropy loss (and its exponential, perplexity) is used as the evaluation metric because:

  • It is a standard, interpretable measure of language model quality
  • It is consistent across cycles and across different miners
  • It remains comparable across miners and cycles as long as the tokenizer and evaluation data stay fixed
  • It directly reflects the model's predictive quality on unseen data

A model with lower perplexity on domain data has genuinely learned domain-specific patterns — it is not possible to fake this with submission tricks.