Optimization Guide — Maximize Your Reward

This is the highest-value page on this site for subnet engineers. Understanding how scoring works — and which levers actually move your score — is worth more than any hardware upgrade.

How Scoring Works: The Game You're Playing

Your reward each cycle is proportional to your Proof-of-Loss score:

w_i ∝ ReLU(L(Φ_t) − L(Φ_t + Δ_i))

In plain terms: validators measure the model's loss on held-out data before and after applying your update. Your reward is proportional to how much your update reduced the loss. If your update made things worse (or didn't change them), you get w_i = 0.
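The gate is just a clipped difference. A minimal sketch of the scoring rule (the function name and workflow here are illustrative, not the actual validator code):

```python
def proof_of_loss_score(loss_before: float, loss_after: float) -> float:
    """ReLU-gated loss reduction: positive only if the update helped."""
    return max(0.0, loss_before - loss_after)

# An update that lowers the held-out loss earns a positive score;
# a harmful or no-op update is clipped to zero by the ReLU gate.
helpful = proof_of_loss_score(2.0, 1.5)   # 0.5
harmful = proof_of_loss_score(2.0, 2.3)   # 0.0
```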

The Alignment Principle

Validators evaluate on the same data distribution that miners train on. This is intentional — it means you cannot game a separate benchmark. The only way to score well is to genuinely improve the model on the target domain.

The corollary: overfitting to a narrow slice of training data hurts you once the evaluation dataset shifts even slightly. Train on a diverse, representative sample of your domain.

What "Top-N Take-Most" Means for You

Multiple miners compete in the same expert group. Your score is absolute (based on your loss reduction), not purely relative. But miners ranked 1–N in the group share the reward pool, with higher-ranked miners earning more.

This means:

  • You don't need to beat everyone else — you need to genuinely reduce the model's loss
  • Climbing one rank within the top N increases your share of the pool, so marginal improvements still pay off
  • Being consistently good earns more than being occasionally great
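One way to picture the split described above. The 4:2:1 share schedule below is purely illustrative; the guide does not specify the real subnet's rank weights:

```python
def split_pool(scores: dict, pool: float, n: int = 3) -> dict:
    """Hypothetical top-N take-most split: rank miners by score, then
    give the top N geometrically decreasing shares of the reward pool.
    The 2**k weighting (4:2:1 for n=3) is an assumption for illustration."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    weights = [2 ** (n - 1 - i) for i in range(len(ranked))]
    total = sum(weights)
    return {miner: pool * w / total for (miner, _), w in zip(ranked, weights)}

# Four miners, pool of 70: only the top 3 earn, and rank matters a lot.
rewards = split_pool({"a": 0.5, "b": 0.9, "c": 0.2, "d": 0.1}, pool=70.0)
# → {"b": 40.0, "a": 20.0, "c": 10.0}; "d" earns nothing
```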

The Levers That Actually Move Your Score

1. Data Quality (Highest Leverage)

This is the most important variable. The Proof-of-Loss mechanism directly rewards loss reduction, and higher-quality training data produces stronger gradient updates, which in turn produce more loss reduction.

Concretely:

  • Clean > noisy: remove near-duplicates, malformed examples, and off-domain content
  • Domain-specific > generic: your data should be tightly aligned with the target domain; general-purpose data dilutes the gradient signal
  • Curated > scraped: a small dataset of hand-verified high-quality examples often outperforms a large dataset of scraped content
  • Diverse > narrow: covering the breadth of the domain prevents overfitting to a specific slice of the evaluation set

Spend time on data quality before spending money on compute.
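The "clean > noisy" lever is the easiest to automate. A minimal deduplication sketch; a real pipeline would add fuzzy matching (e.g. MinHash) on top of this, since hashing normalized text only catches trivial variants:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(examples: list) -> list:
    """Drop exact and near-exact duplicates by hashing normalized text,
    keeping the first occurrence of each distinct example."""
    seen, kept = set(), []
    for ex in examples:
        h = hashlib.sha256(normalize(ex).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept
```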

2. Training Duration

More gradient steps within the training window → better convergence → more loss reduction.

  • Set max_steps as high as you can within the ~30-block training window
  • Monitor your loss curves — if loss is still declining when the commit phase starts, you're undertrained
  • If loss plateaus before max_steps, you may have hit a data quality ceiling (more data would help more than more steps)
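A back-of-the-envelope way to pick max_steps from the window. The 12-second block time and 0.8-second step time below are assumptions; measure your own per-step wall time and your chain's actual block interval:

```python
def max_steps_for_window(blocks: int = 30, block_time_s: float = 12.0,
                         step_time_s: float = 0.8, safety: float = 0.85) -> int:
    """Estimate how many optimizer steps fit in the training window.
    `safety` leaves headroom for checkpointing and the commit phase.
    block_time_s and step_time_s are illustrative assumptions."""
    budget_s = blocks * block_time_s * safety
    return int(budget_s // step_time_s)

# With the assumed timings, ~382 steps fit in a 30-block window.
steps = max_steps_for_window()
```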

3. Learning Rate Tuning

Too high → unstable training (NaN gradients, loss spikes). Too low → slow convergence, low loss reduction per cycle.

Starting point: learning_rate: 3e-5 with cosine decay and 100-step warmup
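The warmup-plus-cosine schedule named above, written out explicitly (a self-contained sketch; your training framework almost certainly provides an equivalent scheduler):

```python
import math

def lr_at(step: int, max_steps: int,
          base_lr: float = 3e-5, warmup: int = 100) -> float:
    """Linear warmup to base_lr over `warmup` steps,
    then cosine decay from base_lr down to 0 at max_steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# lr_at(0, 1000)    → 0.0        (start of warmup)
# lr_at(100, 1000)  → 3e-5       (peak, end of warmup)
# lr_at(1000, 1000) → ~0.0       (fully decayed)
```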

Signs you need to lower the LR:

  • Loss oscillates instead of declining smoothly
  • NaN detection hook fires frequently
  • w_i scores are inconsistent cycle-to-cycle

Signs you can raise the LR:

  • Loss declines too slowly relative to training window length
  • w_i scores are low but consistent (not NaN/zero)
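The "loss oscillates" symptom can be turned into a simple numeric check on your logged loss history (a rough heuristic, not a standard diagnostic):

```python
def oscillation_fraction(losses: list) -> float:
    """Fraction of consecutive steps where loss went UP.
    Near 0.0 for a smooth decline; values approaching 0.5 suggest
    the learning rate is too high."""
    if len(losses) < 2:
        return 0.0
    ups = sum(1 for a, b in zip(losses, losses[1:]) if b > a)
    return ups / (len(losses) - 1)

smooth = oscillation_fraction([1.0, 0.9, 0.8, 0.7])        # 0.0
noisy = oscillation_fraction([1.0, 1.2, 0.9, 1.1, 0.8])    # 0.5
```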

4. Expert Group Selection

Not all expert groups are equally competitive. When choosing your group:

  • Check competition level: groups with fewer registered miners are less competitive
  • Check domain alignment: run a quick ESFT frequency analysis to find which expert group your data activates most strongly — this is your natural competitive advantage
  • Monitor your w_i scores: if your scores are consistently low despite good training, try a different group
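The "ESFT frequency analysis" above amounts to tallying which experts your data routes to. A sketch of the tallying step, assuming you have already collected per-token top-k expert IDs from a forward pass over your dataset (the routing collection itself is model-specific and not shown):

```python
from collections import Counter

def expert_affinity(token_expert_ids: list) -> dict:
    """Given top-k expert IDs routed per token, return normalized
    activation frequencies per expert, highest first. The expert group
    your data activates most strongly is your natural home."""
    counts = Counter(e for token in token_expert_ids for e in token)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.most_common()}

# Three tokens, top-2 routing each: expert 1 dominates.
freq = expert_affinity([[1, 2], [1, 3], [1, 2]])
# → expert 1 gets 0.5 of all activations
```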

5. Gradient Accumulation

If you're VRAM-limited and forced to use a small batch size, use gradient_accumulation_steps to maintain effective batch size:

batch_size: 2                  # fits in your VRAM
gradient_accumulation_steps: 8 # effective batch = 2 × 8 = 16

Larger effective batch sizes generally produce more stable gradients and better convergence.
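The accumulation pattern itself, stripped to its shape. Gradients are plain floats here; in a real loop they are tensors, and applying the accumulated gradient corresponds to optimizer.step() followed by zero_grad():

```python
def train_with_accumulation(micro_grads: list, accum_steps: int = 8) -> list:
    """Sum scaled micro-batch gradients and apply one optimizer step per
    `accum_steps` micro-batches. Dividing each micro-gradient by
    accum_steps makes the accumulated sum match a full-batch mean."""
    applied, running = [], 0.0
    for i, g in enumerate(micro_grads, start=1):
        running += g / accum_steps
        if i % accum_steps == 0:
            applied.append(running)  # optimizer.step() would consume this
            running = 0.0
    return applied

# 16 micro-batches at accum_steps=8 → exactly 2 optimizer steps.
steps = train_with_accumulation([1.0] * 16, accum_steps=8)
# → [1.0, 1.0]
```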

What Doesn't Help

Buying more GPUs (usually). Your expert subset is already small (~10% of total parameters). The marginal return from compute beyond "enough to train well in the cycle window" is small. Data quality and learning rate tuning have higher returns per dollar.

Submitting noise or copies. The ReLU gate in Proof-of-Loss zeroes out any update that doesn't reduce the validation loss. Random noise almost certainly increases loss. Copying another miner's submission exactly produces the same validation loss as the current model (Δ_i ≈ 0), yielding w_i ≈ 0. Neither earns anything.

Monitoring and Iteration

Enable W&B to track your performance across cycles:

wandb: true
wandb_project: blockzero-miner

Key metrics to track:

Metric                 What it tells you                       Target direction
train/loss             Training loss at cycle end              Decreasing across cycles
chain/w_i_score        Your Proof-of-Loss score                Increasing, consistent
chain/rank_in_group    Your rank among miners in your group    Decreasing (rank 1 = best)
train/loss_reduction   Δ loss this cycle                       Positive, increasing

Figure (reward-curve-optimization): Example w_i score curve over 20 cycles for a miner that improved data quality at cycle 8. The score increase is immediate and sustained, demonstrating that data quality is the dominant variable.

Quick Wins Checklist

Run through this when your scores are lower than expected:

  • Is training loss actually decreasing within cycles? (Check W&B loss curves)
  • Is max_steps high enough to reach convergence? (Loss should plateau rather than still be declining at submission)
  • Is your training data cleaned and domain-specific?
  • Is the learning rate stable? (No NaN events, smooth loss curve)
  • Is your expert group well-matched to your data? (Run ESFT frequency check)
  • Are you missing commit windows? (Check model_io.py logs for "Missed commit")