
What Nobody Tells You About Building a Training Subnet on Bittensor

7 min read
Research & Engineering

Post-mortems are more useful than press releases. This is our honest account of what we underestimated, what surprised us, and what we'd do differently if we were starting over. The ML architecture worked. The systems problems nearly killed us.

Background

We built BlockZero as a Bittensor subnet for decentralized MoE fine-tuning. The core research — TEFT, Proof-of-Loss, expert isolation — is described in the research blog series. This post is about the engineering experience: what was hard that we thought would be easy, and what turned out to be easier than expected.


The Three Things We Underestimated

1. The Blockchain Interaction Layer

We expected the Bittensor SDK to be a relatively thin integration layer — call commit(), call set_weights(), done. What we found was that the blockchain interaction layer is where most of the operational complexity lives.

The WebSocket connection to the subtensor is stateful and fragile. Concurrent writes from multiple threads corrupt the message stream and cause the connection to drop silently. We lost training cycles to this before we understood what was happening — the commit would appear to succeed (no exception thrown) but the subtensor had already closed the connection and was not receiving the message.

The fix — a global WebSocket lock that serializes all chain writes — was simple once we identified the problem. But identifying the problem took two weeks of debugging, because the failure mode was silent corruption rather than a thrown exception.

Lesson: treat the blockchain connection as you would treat a database with strict serialization requirements. Acquire a lock before every write. Log every transaction. Verify every on-chain read matches your expectations before acting on it.
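The global-lock fix can be sketched in a few lines. This is a hedged illustration, not the BlockZero implementation: `write_fn` stands in for whatever SDK write call you are serializing, and the logging/verification steps are noted as comments.

```python
import threading

# One process-wide lock for ALL chain writes. Concurrent writers on the
# same WebSocket interleave frames and corrupt the stream, so every
# write must go through this gate.
_chain_lock = threading.Lock()

def safe_chain_write(write_fn, *args, **kwargs):
    """Acquire the global lock, perform the write, return the result.

    `write_fn` is any chain-write callable (e.g. a commit or
    weight-setting call); this wrapper only guarantees serialization.
    """
    with _chain_lock:
        # In production: log the transaction here, and re-read the
        # on-chain state afterwards to confirm the write landed.
        return write_fn(*args, **kwargs)
```

Routing every write through one choke point also gives you a single place to attach transaction logging, which is what finally made the silent-drop failure visible for us.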

The commit window (5 blocks, ~60 seconds) is unforgiving. If your retry logic exhausts the window due to chain congestion or connection issues, you miss the cycle. We recommend building chain connection health monitoring into your startup sequence and alerting immediately if the subtensor connection is degraded.
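One concrete way to keep retries from silently outliving the window is to give the retry loop an explicit time budget. A minimal sketch, assuming a 12-second block time and a `commit_fn` that raises `ConnectionError` on transient failures (both assumptions, not values from the SDK):

```python
import time

BLOCK_TIME_S = 12                    # assumed block time
COMMIT_WINDOW_S = 5 * BLOCK_TIME_S   # ~60 s commit window

def commit_with_deadline(commit_fn, window_s=COMMIT_WINDOW_S, backoff_s=2.0):
    """Retry commit_fn, but never let retries outlive the commit window.

    Raises TimeoutError once the remaining budget is too small for
    another backoff-and-retry, so a missed cycle is loud, not silent.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        try:
            return commit_fn()
        except ConnectionError:
            remaining = window_s - (time.monotonic() - start)
            if remaining <= backoff_s:
                raise TimeoutError(
                    f"commit window exhausted after {attempts} attempts"
                )
            time.sleep(backoff_s)
```

The point of the explicit `TimeoutError` is observability: a missed cycle should show up in your alerting as a deadline failure, not as an apparently successful call that never landed.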

2. Checkpoint Management at Scale

Our initial checkpoint design was ad-hoc: save the checkpoint to local disk, serve it via HTTP for validator download, delete after the cycle. This worked in single-miner testing. It broke in production.

The issues:

  • Storage exhaustion: a 7GB checkpoint per cycle accumulates fast. We needed retention policies (keep last N, delete on validator confirmation).
  • Concurrent downloads: multiple validators fetching the same checkpoint simultaneously overwhelmed the default Python HTTP server, which handles requests serially. We needed proper content-addressed storage with caching.
  • Checkpoint integrity: a checkpoint corrupted mid-transfer (network fluke, disk error) would fail the validator's hash verification with no clear error. We needed end-to-end integrity checking (SHA-256 at upload, verify before serving).
  • Backend portability: local disk worked for single-machine deployments but didn't work for distributed setups or cloud instances with ephemeral storage.

We rewrote the checkpoint layer twice before it was solid. The current architecture uses fsspec as an abstraction layer (supporting local, S3, and IPFS backends), content-addressed file naming ({hotkey}_{block}_{hash[:8]}.pt), top-k retention by cycle, and a dedicated FastAPI server for serving with proper connection pooling.
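The content-addressed naming and integrity checks are simple to express. A sketch of the two pieces, following the `{hotkey}_{block}_{hash[:8]}.pt` scheme from above (function names are ours, not BlockZero's):

```python
import hashlib
from pathlib import Path

def checkpoint_name(hotkey: str, block: int, payload: bytes) -> str:
    """Content-addressed name: {hotkey}_{block}_{sha256[:8]}.pt.

    Because the name embeds the content hash, a re-uploaded or
    corrupted checkpoint can never silently shadow a good one.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return f"{hotkey}_{block}_{digest[:8]}.pt"

def verify_before_serving(path: Path, expected_sha256: str) -> bool:
    """End-to-end integrity check: recompute the full hash before
    serving, so a mid-transfer corruption fails here with a clear
    error instead of at the validator's hash verification."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256
```

Hashing at upload and re-verifying before serving brackets the whole storage layer, so any backend (local, S3, IPFS) can corrupt data without the corruption propagating to validators.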

Lesson: checkpoint management is a systems problem, not an ML problem. Treat it like you would treat any production data pipeline — with monitoring, retention policies, integrity checks, and failure handling.

3. Expert Group Competition Dynamics

We designed the expert group competition mechanism (multiple miners compete to train the same expert group; the winner gets primary reward) to encourage quality within each group. What we didn't anticipate was the strategic behavior it would elicit.

Miners optimized not just training quality but group selection. High-value groups (those most likely to be requested by customers) attracted more miners, creating hot spots of competition. Low-value groups were abandoned, creating coverage gaps. The distribution of miners across expert groups became highly skewed.

The skew had second-order effects: validators spent disproportionate time evaluating submissions for hot groups (more submissions to evaluate) while cold groups had too few submissions to produce a reliable aggregated update.

We introduced per-group miner cap recommendations and group value signals (based on historical customer usage) to guide miner allocation. We're still tuning this — it's fundamentally a market design problem, and market design is hard.
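The allocation signal can be as simple as a value-per-miner ratio. A deliberately crude sketch of the idea (the scoring function and inputs are illustrative assumptions, not our production mechanism):

```python
def recommend_group(group_value: dict, active_miners: dict) -> str:
    """Suggest the expert group with the best value-per-miner ratio.

    High-value but under-served groups rank first; the +1 keeps
    empty groups finite and favors filling coverage gaps.
    """
    return max(
        group_value,
        key=lambda g: group_value[g] / (active_miners.get(g, 0) + 1),
    )
```

Even this naive signal pushes against the hot-spot dynamic: a group's attractiveness falls as miners pile into it, which is exactly the between-group feedback the unguided market lacked.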

Lesson: any mechanism that creates within-group competition also creates between-group selection dynamics. Design for the between-group allocation problem, not just the within-group quality problem. The miners will figure out the between-group game faster than you expect.


The Three Things That Were Easier Than Expected

1. The Core Training Loop

We expected the distributed training loop — managing multi-GPU local training, gradient accumulation, expert selection, checkpoint serialization — to be a significant engineering effort. It was straightforward.

PyTorch's standard training infrastructure (DataLoader, mixed precision training, gradient checkpointing) worked without modification for the expert-selective training scenario. The "sparse" aspect of TEFT (freezing non-selected experts) is just param.requires_grad = False for the frozen parameters — standard PyTorch. The expert selection step is an offline analysis that runs before training begins.
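The freezing step really is that small. A minimal sketch, assuming the experts live in an `nn.ModuleList` indexed by expert id (the container and indexing are assumptions about the model layout):

```python
import torch.nn as nn

def freeze_non_selected_experts(experts: nn.ModuleList, selected: set) -> None:
    """Sparse training step in the TEFT style: only selected experts
    keep requires_grad=True, so autograd and the optimizer skip every
    frozen expert entirely."""
    for idx, expert in enumerate(experts):
        trainable = idx in selected
        for param in expert.parameters():
            param.requires_grad = trainable
```

Since the optimizer is then constructed over only the trainable parameters, optimizer state (e.g. Adam moments) shrinks proportionally, which is where much of the memory savings comes from.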

The training loop itself is simpler than a typical full fine-tuning loop because there are fewer parameters to track. We were pleasantly surprised by this.

2. The Proof-of-Loss Evaluation

We expected validator evaluation to be computationally expensive — re-running the model with each miner's update to measure the loss change.

In practice, the evaluation is fast because it's forward-pass-only (no backward pass, no gradient computation), and the validation set is small (a few hundred examples is enough for a reliable loss estimate). Evaluating a 7GB expert group checkpoint takes 2-3 minutes on a single A100. For 10-20 miners per group, that's 20-60 minutes of evaluation time — well within the 45-block cycle window.

The evaluation is also pleasingly robust. Random noise submissions almost never pass the ReLU filter — the loss change distribution for noise is reliably positive (noise increases loss). We've seen very few borderline cases.
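The forward-pass-only evaluation and the ReLU filter fit in a few lines. A hedged sketch of the shape of the computation (generic model/loss interfaces, not our validator code):

```python
import torch

@torch.no_grad()  # forward pass only: no autograd graph, no backward pass
def mean_loss(model, batches, loss_fn):
    """Average loss over a small held-out validation set."""
    losses = [loss_fn(model(x), y) for x, y in batches]
    return torch.stack(losses).mean().item()

def relu_score(loss_before: float, loss_after: float) -> float:
    """ReLU filter: only improvements (loss going down) earn a score.

    Noise submissions reliably increase loss, so they score exactly
    zero rather than dragging down honest miners' rewards.
    """
    return max(loss_before - loss_after, 0.0)
```

The `torch.no_grad()` decorator is what makes the evaluation cheap: no activations are retained for a backward pass, so memory and compute are a fraction of a training step.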

3. Bittensor's Weight Mechanism

We expected significant difficulty adapting to Bittensor's on-chain weight submission system. The set_weights() call was actually straightforward to integrate. The main complexity is timing (weight submissions must land within the tempo window) and stake-weighting (higher-stake validators have more influence on the consensus weight vector), both of which are well-documented.
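The stake-weighting is, at its core, a weighted average over validator weight vectors. A simplified illustration of that core idea only (Bittensor's actual Yuma consensus is considerably more involved, with clipping and bonding we omit here):

```python
def stake_weighted_average(weight_vectors, stakes):
    """Simplified stake-weighted average of per-validator weight vectors.

    Higher-stake validators pull the consensus vector toward their
    own scores in proportion to their stake.
    """
    total_stake = sum(stakes)
    n = len(weight_vectors[0])
    return [
        sum(w[i] * s for w, s in zip(weight_vectors, stakes)) / total_stake
        for i in range(n)
    ]
```

For example, a validator holding 75% of stake contributes 75% of the consensus vector, which is why validator stake distribution matters as much as validator scoring quality.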

The inter-validator consensus layer (hivemind P2P for aggregating validator scores before weight submission) required more engineering, but the underlying blockchain interface was clean.


What We'd Do Differently

Start with the checkpoint system. We built it last (because it seemed like infrastructure) and rebuilt it twice. It should have been designed and tested first, with clear requirements: retention policy, backend abstraction, serving infrastructure, integrity checks. Everything else depends on it.

Budget more time for chain interaction debugging. Blockchain operations have failure modes that are difficult to reproduce in testing because they depend on chain congestion, block timing, and subtensor connection state. Production is the only real test environment. Build robust logging and alerting from day one.

Design the miner allocation mechanism before launch. We launched with the assumption that miners would distribute naturally across expert groups. They didn't — they optimized for the highest-value groups. A mechanism design pass before launch would have anticipated this.

Run adversarial simulations. We modeled expected miner behavior; we didn't model adversarial miner behavior. A dedicated red-teaming exercise before launch would have surfaced the attack vectors we instead discovered in production and patched reactively.


The Honest Assessment

The hardest parts of building BlockZero were not the ML parts. The MoE architecture, TEFT protocol, and Proof-of-Loss mechanism worked roughly as designed. The hard parts were:

  • Building reliable infrastructure for a distributed system with adversarial participants
  • Understanding the failure modes of blockchain interactions under real network conditions
  • Anticipating the strategic behavior that economic incentives would elicit

If you're building on Bittensor, expect to spend as much engineering time on the systems layer as on the ML layer. The blockchain is not transparent infrastructure — it's a first-class engineering concern with its own failure modes, timing constraints, and performance characteristics.

We're sharing this because the Bittensor ecosystem gets better when builders learn from each other. If you're building a training subnet and want to compare notes, find us.