Why We Built BlockZero on Mixture-of-Experts (And Why Data Parallel Would Have Failed)
The single most important architectural decision in BlockZero's design was the choice of expert parallelism over data parallel or pipeline parallel training. This post explains why the other approaches would have failed — and why MoE expert parallelism is uniquely suited to decentralized networks.
The Three Strategies
Every distributed deep learning system chooses between three fundamental approaches to splitting work across workers:
Data parallelism: every worker holds the full model and trains on a different batch. Gradients are synchronized across all workers after each batch.
Pipeline parallelism: the model is split into sequential stages (layers 1–12 on worker A, layers 13–24 on worker B, etc.) and activations are passed forward through the pipeline.
Expert parallelism: the model's Mixture-of-Experts layers are split by expert, with each worker responsible for a subset of experts. Workers train independently and sync periodically.
For a centralized cluster with homogeneous hardware, fast interconnects, and controlled failure modes, all three can work. For a permissionless decentralized network — where workers have heterogeneous hardware, consumer-grade internet connections, variable availability, and no shared trust relationship — the calculus is completely different.
Why Data Parallel Fails in Permissionless Networks
Data parallelism has one fatal structural problem for our use case: every worker must hold the full model.
For a 70B-parameter model, that means 140GB of VRAM for the weights alone in fp16, before optimizer state and activations, which puts participation in multi-GPU A100/H100 territory. The moment you impose that requirement on a permissionless network, you've excluded the vast majority of potential participants. You end up with a small cluster of well-capitalized miners, not a distributed network.
The second problem is communication overhead. Data parallel training requires synchronizing gradients after every training step. In a 256-miner network processing 100 batches per minute, that's 25,600 gradient communication events per minute — continuous high-bandwidth traffic that is simply not viable over consumer internet connections with variable latency and throughput.
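A back-of-the-envelope calculation makes the load concrete. This is a sketch using the illustrative figures above (256 miners, 100 batches per minute, fp16 gradients); the function name is ours, not part of any real system:

```python
# Back-of-envelope check on the data-parallel sync load described above.
# All numbers are illustrative assumptions, not measurements.

def dp_sync_load(num_miners: int, batches_per_minute: int,
                 model_params: float, bytes_per_grad: int = 2):
    """Sync events per minute, and GB moved per sync for a full gradient."""
    events_per_minute = num_miners * batches_per_minute
    # Each sync moves one gradient value per parameter (fp16 = 2 bytes).
    gbytes_per_sync = model_params * bytes_per_grad / 1e9
    return events_per_minute, gbytes_per_sync

events, gb = dp_sync_load(num_miners=256, batches_per_minute=100,
                          model_params=70e9)
print(events)     # 25600 sync events per minute
print(round(gb))  # 140 GB of gradient data per sync
```

Every one of those syncs moves a gradient the size of the model itself, which is why consumer uplinks cannot keep up.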
BlockZero's closest comparison on Bittensor is SN3 (Templar), which uses data parallelism. The hardware barrier is real: you need to fit the full 70B model on your GPUs to participate. That requirement concentrates the subnet among a small number of operators with data-center hardware.
Why Pipeline Parallel Fails on Heterogeneous Networks
Pipeline parallelism solves the memory problem — the model is split across workers, so no single worker needs to hold everything. But it introduces a worse problem: bubble latency.
Bubble latency is the idle time in a pipeline when a stage is waiting for the output of the previous stage. In a centralized cluster with microsecond latency between nodes, the bubble is a manageable fraction of total computation time. The math changes completely on internet-connected nodes.
Consider a 4-stage pipeline where each stage takes 10 seconds of compute and the network latency between stages is 50ms: the bubble introduces negligible overhead. Consumer internet changes the picture twice over. Latency routinely sits around 50ms but can spike to 500ms or higher, and pipelined training moves many small microbatches between stages, so each hop's latency is paid far more often relative to compute. Suddenly the pipeline is frequently stalled on network operations rather than compute.
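The stall math can be sketched with a deliberately simplified model: strictly sequential stages with one fixed-latency hop between each adjacent pair. The 100ms-per-microbatch compute time in the second case is a hypothetical figure, not a measurement:

```python
# Fraction of one pipeline traversal spent waiting on the network rather
# than computing, in a simplified sequential-pipeline model: S stages,
# one fixed-latency hop between each pair of adjacent stages.

def latency_overhead(stages: int, stage_secs: float, hop_secs: float) -> float:
    compute = stages * stage_secs
    network = (stages - 1) * hop_secs
    return network / (compute + network)

# Datacenter-style case from above: 10 s stages, 50 ms hops.
print(f"{latency_overhead(4, 10.0, 0.050):.2%}")  # ~0.37% idle
# Microbatched training over consumer internet: 100 ms of compute per
# stage per microbatch (assumed), hops spiking to 500 ms.
print(f"{latency_overhead(4, 0.1, 0.5):.2%}")     # ~79% idle
```

The same 500ms hop that is invisible next to 10 seconds of compute dominates once per-hop compute shrinks to microbatch scale.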
The deeper problem: pipeline parallelism requires exactly N workers to be online and healthy simultaneously. In a permissionless network, you cannot guarantee N-of-N availability. One offline or slow worker breaks the pipeline for everyone.
SN9 (iota) on Bittensor uses pipeline parallelism. The architecture works, but the bubble latency and dependency requirements make it fragile in heterogeneous internet environments.
Why Expert Parallelism Works
The Mixture-of-Experts (MoE) architecture divides the model's feed-forward layers into multiple independent expert networks, with a router selecting which experts process each token. The key property: experts are independent. They don't depend on each other's outputs within a forward pass.
This independence maps perfectly onto decentralized training:
Each worker trains only their expert group. Memory requirement per worker = total_params / num_groups, not total_params. A 30B model split across 4 groups requires 7.5B parameters per worker — achievable on a single A100 or two A6000s. Adding more groups reduces per-worker requirements further.
Workers train truly independently. There are no inter-worker dependencies during the training phase. A worker can go offline mid-cycle, restart, and rejoin on the next cycle without breaking anyone else's training. This robustness is essential for a permissionless network.
Aggregate capacity scales with network size. Unlike data parallelism, where adding workers doesn't increase model capacity (they're all training the same model), adding expert groups in BlockZero literally increases the total number of experts in the model — improving its capacity per the MoE scaling law:
log L(N, Ê) = a log N + b log Ê + c log N log Ê + d

where L is the final loss, N is the dense parameter count, and Ê is the effective number of experts.
More miners → more expert groups → lower loss. The network's collective intelligence grows with participation.
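The two memory properties above, per-worker parameter count and expert-to-group assignment, can be sketched as follows. The group and expert counts are hypothetical, and the round-robin assignment is one simple scheme, not necessarily BlockZero's actual one:

```python
# Sketch of the expert-group split under an even partition of experts
# across worker groups. Counts here are illustrative.

def params_per_worker(total_params: float, num_groups: int) -> float:
    """Per-worker parameter count under an even expert split."""
    return total_params / num_groups

def assign_experts(num_experts: int, num_groups: int) -> dict:
    """Round-robin experts onto groups; each group trains independently."""
    groups = {g: [] for g in range(num_groups)}
    for e in range(num_experts):
        groups[e % num_groups].append(e)
    return groups

print(params_per_worker(30e9, 4) / 1e9)  # 7.5 (B params per worker)
print(assign_experts(8, 4))  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Because no group's expert list overlaps another's, each worker's training loop touches only its own slice of the parameters.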
DiLoCo: The Enabling Technology
Expert parallelism solves the architectural problem, but there's still the question of synchronization frequency. Even if each miner is training independently, they eventually need to sync their updates to build a shared model.
DiLoCo (Distributed Low-Communication training, Douillard et al. 2024) provided the key insight: workers can train locally for N steps and sync only once at the end, applying Nesterov momentum to the accumulated pseudo-gradient. Douillard et al. showed this achieves comparable performance to synchronous training while communicating approximately 500× less.
In practice, BlockZero miners sync approximately once per hour (one 45-block cycle). Compare this to data-parallel training, which would require syncing on every gradient step: thousands of times per hour.
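A minimal sketch of the outer synchronization step, following the DiLoCo recipe of averaging pseudo-gradients and applying Nesterov momentum. The outer learning rate and momentum values are illustrative, not BlockZero's actual hyperparameters:

```python
import numpy as np

# DiLoCo-style outer update (after Douillard et al. 2024): workers train
# locally for N inner steps, then the outer optimizer applies the averaged
# pseudo-gradient with Nesterov momentum. Hyperparameters are illustrative.

def diloco_outer_step(global_params, local_params_list, velocity,
                      outer_lr=0.7, momentum=0.9):
    """One sync: average pseudo-gradients, take a Nesterov momentum step."""
    # Pseudo-gradient: how far each worker drifted from the global params.
    pseudo_grads = [global_params - p for p in local_params_list]
    avg_grad = np.mean(pseudo_grads, axis=0)
    velocity = momentum * velocity + avg_grad
    # Nesterov look-ahead: step along the momentum-corrected direction.
    new_params = global_params - outer_lr * (momentum * velocity + avg_grad)
    return new_params, velocity

theta, v = np.zeros(3), np.zeros(3)
# Two workers that each drifted toward [1, 1, 1] during local training.
theta, v = diloco_outer_step(theta, [np.ones(3), np.ones(3)], v)
print(theta)  # global params move toward the workers' consensus direction
```

One call to this function per cycle replaces the thousands of per-step all-reduces that synchronous data parallelism would need.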
Figure: Communication events per hour for data parallel (synchronous) vs. expert parallel with DiLoCo async sync. The 500× reduction makes decentralized expert training viable over consumer internet.
Comparison Table
| | SN3 Templar | SN9 iota | BlockZero |
|---|---|---|---|
| Strategy | Data parallel | Pipeline parallel | Expert parallel |
| Memory per miner | Full model | Pipeline stage | total / num_groups |
| Sync frequency | Every step | Every step | Hourly |
| Worker independence | No | No | Yes |
| Scales with participants? | Compute only | Compute only | Compute + capacity |
The Downstream Implications
Choosing expert parallelism wasn't just an architecture decision — it determined everything downstream:
- The reward mechanism: Proof-of-Loss works because each expert group is independently evaluable. You can't independently evaluate data-parallel updates without re-running the full training.
- The hardware economics: consumer GPU participation is possible because of the memory scaling property. This determines the miner pool size, which determines model quality.
- The failure model: independent workers mean graceful degradation. One offline miner doesn't stall the network. This would not be true in pipeline parallelism.
- The business model: the expert library compounds because experts are modular and independently addressable. This wouldn't work if the model was trained as a single monolith.
The architecture decision made everything else possible.
References
- Douillard et al. (2024). DiLoCo: Distributed Low-Communication Training of Language Models. arXiv:2311.08105
- Zoph et al. (2022). Designing Effective Sparse Expert Models. arXiv:2202.08906
- DeepSeek-AI (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066
- Fedus et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961