Four Rules We Wish We'd Had Before Building a Training Subnet
Building a Bittensor subnet for decentralized AI training is not like building a centralized ML system. The constraints are different, the failure modes are different, and the design patterns that work in one environment actively fail in the other. This is a distillation of what we've learned.
The Design Context
BlockZero is a training subnet on Bittensor. Miners train expert modules on domain-specific data. Validators evaluate training quality and distribute rewards. The model compounds across training cycles as the expert library grows.
The core challenge: everything must work in a setting where:
- Workers (miners) are anonymous and economically motivated
- Hardware is heterogeneous (consumer GPUs to data center GPUs)
- Network connections are variable quality
- Workers can go offline at any time
- The protocol must be manipulation-resistant without a trusted authority
These constraints eliminated most of the approaches we initially considered.
Rule 1: Design for Failure, Not for Success
The most reliable system design principle we've internalized: assume everything will fail, and design recovery into the core protocol rather than treating it as edge case handling.
What this means in practice:
Every training cycle, some miners will be offline. Some will have unstable connections. Some will miss the commit window. Some will submit checkpoints that fail hash verification. Some will submit updates that increase validation loss rather than decreasing it.
We don't try to prevent these failure modes. We design the protocol so that each failure mode has a graceful, automatic response:
- Miner offline → that expert group's slot is empty this cycle; next cycle they rejoin without disruption to others
- Missed commit window → miner simply doesn't participate this cycle; no global consequence
- Hash mismatch → submission rejected; miner receives no reward; global model unaffected
- Validation loss increase → ReLU clips weight to zero; submission excluded from aggregation
The protocol never blocks on any individual miner. The validator aggregates whatever legitimate submissions arrive and proceeds. This is fundamentally different from pipeline parallelism (where one offline worker stalls everyone) or quorum-based consensus (where too many missing voters halt progress).
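The failure responses above can be sketched as a single non-blocking intake step. This is an illustrative sketch, not BlockZero's actual code; the `Submission` fields and helper names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Submission:
    miner_id: str
    payload: bytes       # serialized expert update
    claimed_hash: str    # hash published during the commit phase

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def intake_cycle(registered_miners, submissions):
    """Accept whatever arrived and verified; never wait on a missing miner."""
    received = {s.miner_id: s for s in submissions}
    accepted, skipped = [], {}
    for miner in registered_miners:
        s = received.get(miner)
        if s is None:
            # Offline or missed the commit window: empty slot this cycle,
            # no global consequence; the miner can rejoin next cycle.
            skipped[miner] = "no-submission"
        elif sha256_hex(s.payload) != s.claimed_hash:
            # Hash mismatch: rejected, no reward, global model unaffected.
            skipped[miner] = "hash-mismatch"
        else:
            accepted.append(s)
    return accepted, skipped
```

Note that recovery never involves the failed party: a skipped miner simply has no entry this cycle, and aggregation proceeds over `accepted` alone.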
The anti-pattern we avoided: building elaborate recovery protocols that try to heal failures by involving the failed worker. Recovery protocols that require the failed party create recursive failure modes.
Rule 2: Incentives Are Protocol, Not Policy
In a permissionless network, you cannot enforce behavior through terms of service, social norms, or trust relationships. The only enforcement mechanism you have is the economic incentive structure.
Corollary: if your protocol allows a profitable deviation from the intended behavior, that deviation will eventually be discovered and exploited. Design the protocol assuming rational adversarial participants.
We applied this to our reward mechanism design. The naive approach — reward miners proportionally to their submission count, or to the size of their checkpoint update — creates obvious gaming strategies:
- Submission count → miners submit garbage at high frequency
- Update size → miners submit large, useless parameter perturbations
Proof-of-Loss eliminates these gaming strategies by making the reward directly proportional to verifiable contribution:
w_i ∝ ReLU( L(Φ^(t)) − L(Φ^(t) + Δ_i) )
The only way to get w_i > 0 is to submit an update that actually reduces validation loss. There's no shortcut.
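A minimal sketch of the ReLU-clipped weighting; normalizing the raw weights into payout shares is our assumption about how w_i is used downstream, not a detail stated above.

```python
def proof_of_loss_weights(base_loss: float, post_losses: dict) -> dict:
    """w_i ∝ ReLU(L(Φ) − L(Φ + Δ_i)), normalized over all miners.

    `post_losses` maps miner id → validation loss of the updated model.
    """
    raw = {m: max(0.0, base_loss - loss) for m, loss in post_losses.items()}
    total = sum(raw.values())
    # Useless or harmful updates (loss unchanged or worse) earn exactly zero.
    return {m: (w / total if total else 0.0) for m, w in raw.items()}
```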
We also applied this principle to the two-phase commit design. Without a commit phase, a rational miner would wait to see other submissions and submit a copy of the best one. The commit-before-submit ordering closes this attack: a miner can't change its commitment after seeing others', and forging a payload that matches a prior commitment is computationally infeasible.
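A commit-reveal round can be sketched as follows; the salted-hash scheme is an illustrative assumption, not the subnet's actual wire format.

```python
import hashlib
import os

def commit(update_bytes: bytes):
    """Commit phase: publish a salted hash before any submission is visible."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + update_bytes).hexdigest()
    return digest, salt  # digest is published now; salt stays private

def verify_reveal(digest: str, salt: bytes, update_bytes: bytes) -> bool:
    """Reveal phase: the validator checks the payload against the commitment."""
    return hashlib.sha256(salt + update_bytes).hexdigest() == digest
```

A copycat who sees a good update at reveal time has no matching commitment on record for it, so the copied submission fails verification.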
The rule: for every incentive mechanism you design, explicitly ask "what is the profit-maximizing strategy for a fully adversarial miner under this mechanism?" If the profit-maximizing strategy is to do the intended behavior (train well), the mechanism is good. If it's anything else, you have a problem.
Rule 3: Keep the Hub Frozen
In MoE architecture, the model has two types of parameters: the "hub" (attention layers, layer norms, embeddings, router) and the "spokes" (expert FFNs). The hub parameters are shared across all domains. The spoke parameters are domain-specific.
Our design rule: update only expert parameters during fine-tuning. The hub is always frozen.
This rule has three motivations:
Forgetting prevention: The attention layers encode the model's core reasoning machinery: its ability to follow instructions, maintain coherent multi-step reasoning, and structure output. Fine-tuning them on narrow domain data degrades these general capabilities; freezing them preserves the capabilities by construction.
Communication efficiency: The hub accounts for a large share of the model's total parameters. If we updated it, every miner would need to synchronize the hub in addition to the expert updates. Expert-only synchronization is what keeps the per-miner bandwidth requirement manageable (3-7GB per cycle rather than 60GB+).
Decomposability: If multiple training jobs run simultaneously on different domains, hub updates from different jobs would conflict. Expert updates don't conflict because different domains activate different experts. Expert isolation makes concurrent training safe; hub updates would make it unsafe.
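The bandwidth motivation can be illustrated with back-of-envelope arithmetic. The parameter counts below are made up for illustration; BlockZero's real sizes are not stated here.

```python
BYTES_PER_PARAM = 2            # assuming bf16/fp16 checkpoints
hub_params = 30e9              # hypothetical shared hub size
expert_params = 2e9            # hypothetical per-miner expert-group size

# Syncing the hub too means moving hub + expert weights every cycle;
# expert-only sync moves just the expert weights.
full_sync_gb = (hub_params + expert_params) * BYTES_PER_PARAM / 1e9
expert_sync_gb = expert_params * BYTES_PER_PARAM / 1e9
print(f"full sync:   {full_sync_gb:.0f} GB/cycle")    # hub + experts
print(f"expert sync: {expert_sync_gb:.0f} GB/cycle")  # experts only
```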
The practical implication: our training loop freezes all attention layers at the start of every training cycle and never unfreezes them. Router parameters are updated, but carefully — only the expert-routing weights, not the full attention mechanism.
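The freeze discipline reduces to a trainability predicate applied over parameter names at the start of each cycle. The naming scheme below (`attention.`, `experts.`, `router.expert_weights`) is hypothetical.

```python
HUB_PREFIXES = ("attention.", "layernorm.", "embedding.")

def is_trainable(param_name: str) -> bool:
    """Hub always frozen; expert FFNs trainable; router partly trainable."""
    if param_name.startswith(HUB_PREFIXES):
        return False                              # hub: never unfrozen
    if param_name.startswith("router."):
        # Only the expert-routing weights, not the full attention mechanism.
        return param_name == "router.expert_weights"
    return param_name.startswith("experts.")      # spokes: trainable
```

In a PyTorch training loop this would be applied as `p.requires_grad = is_trainable(name)` over `model.named_parameters()` before each cycle.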
Rule 4: Measure What You Pay For
Every reward mechanism should directly measure the thing you actually want to produce. Indirect proxies are gameable and create divergence between what you're paying for and what you're getting.
We want miners to produce better domain models. So we measure domain model improvement:
loss_improvement = L(Φ^(t)) − L(Φ^(t) + Δ_i)
This is not a proxy. It is the direct measurement of the thing we want. We evaluate the updated model on held-out domain data and observe whether it got better.
Compare to reward mechanisms based on indirect proxies:
- Compute time: easy to fake, doesn't measure output quality
- Data volume: easy to generate junk data at high volume
- Gradient norm: measures update magnitude, not improvement direction
- Public benchmark score: gameable by overfitting to the benchmark
The held-out validation set changes every cycle (different samples from the training distribution). This prevents overfitting to a fixed evaluation target. The only way to consistently score well is to consistently train well.
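One way to implement the per-cycle rotation, made deterministic so every validator evaluates on the same sample; the seeding scheme is our assumption.

```python
import random

def heldout_indices(cycle: int, pool_size: int, sample_size: int,
                    job_seed: int = 0) -> list:
    """Draw this cycle's held-out sample from the domain data pool.

    The same (job_seed, cycle) pair yields the same indices on every
    validator, while each new cycle reshuffles the evaluation target
    so miners cannot overfit to a fixed set.
    """
    rng = random.Random(job_seed * 1_000_003 + cycle)
    return sorted(rng.sample(range(pool_size), sample_size))
```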
The corollary: the metric you reward will be optimized. If your metric doesn't directly measure what you want, you'll get the metric optimized but not the underlying goal.
What We'd Do Differently
The data distribution problem: We assumed miners would have access to reasonable-quality domain data. In practice, data quality variance is enormous. Some miners have excellent curated datasets; others have web-scraped data that technically matches the domain but is heavily contaminated with noise. Our current protocol doesn't distinguish: both get evaluated on the same validation set, but training on high-quality data converges faster and produces larger Proof-of-Loss improvements. We're considering explicit data quality attestation mechanisms for the next protocol version.
Router annealing timing: We added Router Annealing (re-training the router after new experts are integrated) as a post-integration step. In hindsight, this should have been built into the core protocol from the beginning — router drift from new experts affects inference quality in ways that take several cycles to become visible. The right approach is periodic router recalibration as a first-class protocol step, not an optional remediation step.
Validator coordination latency: The inter-validator consensus protocol adds latency to the weight submission step. In our initial design, we underestimated how much latency this would introduce at the end of each cycle. The result is that weight submissions sometimes land in the next cycle's window. We've tuned the timing parameters to compensate, but a more principled solution would be to design the consensus protocol with explicit latency bounds.
The Meta-Lesson
Building for a decentralized, adversarial environment is fundamentally different from building for a trusted, centralized environment. The tools are different. The threat model is different. The failure modes are different.
The biggest mistake we see in new subnet designs is importing patterns from centralized ML infrastructure without questioning whether they hold in the decentralized setting. Data parallel gradient averaging, fixed benchmark evaluation, trust-on-join registration — these patterns work fine when all parties are aligned. In a permissionless network, they create systematic vulnerabilities.
The rules above are our version of "first principles for adversarial ML infrastructure." We expect to add to them as the network matures and we encounter failure modes we haven't anticipated yet.
The most honest thing we can say: this is hard, and we're still learning.