The Distributed MoE Landscape: A Practical Survey of What Works and What Doesn't
Before building BlockZero, we surveyed the literature on distributed Mixture-of-Experts training. This is a practical synthesis of what we found — what methods exist, what problems they solve, where they fall short, and what gap BlockZero fills.
The State of the Field
Distributed MoE training is a young research area. Most of the foundational work on MoE architecture (Switch Transformers, GLaM, DeepSeekMoE) focuses on centralized training on high-end GPU clusters. The question of how to train MoE models on decentralized, heterogeneous hardware is newer and much less settled.
We organize the landscape into four milestones of increasing ambition.
Milestone 1: MoE Fundamentals
Switch Transformers (Fedus et al., 2022) established MoE as a practical approach to scaling: replace every other FFN layer in a dense transformer with a router + expert ensemble. Sparse activation (each token routes to 1–2 experts) keeps FLOPs constant while scaling parameter count. The key result: a 1.6T-parameter MoE model achieves better scaling efficiency than a dense model of comparable FLOPs.
DeepSeekMoE (DeepSeek-AI, 2024) introduced fine-grained expert specialization: more experts, each smaller, with higher top-K selection. The routing formula:
g_{ℓ,i}(x) = Softmax(TopK(x · Ψ_ℓ, k))_i
where Ψ_ℓ is the router weight matrix of layer ℓ: the top-k router logits are kept and renormalized with a softmax; all other experts receive zero gate.
The result: experts specialize more narrowly, and the model achieves higher accuracy at lower cost per parameter. DeepSeekMoE established that expert specialization is not accidental — it emerges reliably from training and can be encouraged architecturally.
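The routing formula above can be sketched in a few lines. This is a minimal NumPy illustration (shapes and variable names are ours, not DeepSeekMoE's code); `router_weights` plays the role of Ψ_ℓ:

```python
import numpy as np

def topk_softmax_gate(x, router_weights, k):
    """Fine-grained top-K gating: g_{l,i}(x) = Softmax(TopK(x · Psi_l, k))_i.

    x: (d,) token hidden state; router_weights: (d, n_experts).
    Returns a gate vector with exactly k nonzero entries that sum to 1.
    """
    logits = x @ router_weights                  # affinity score per expert
    topk = np.argsort(logits)[-k:]               # indices of the k largest logits
    gates = np.zeros_like(logits)
    shifted = logits[topk] - logits[topk].max()  # numerically stable softmax
    gates[topk] = np.exp(shifted) / np.exp(shifted).sum()
    return gates
```

Note that Switch-style routing is just the k=1 special case: all gate weight lands on a single expert.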
Designing Effective Sparse Expert Models (Zoph et al., 2022) provided practical guidance: how to choose the number of experts, the capacity factor, the auxiliary load-balancing loss. This is the engineering paper that made MoE models reliable to train.
Gap identified: These papers assume centralized training with fast interconnects. None addresses decentralized or heterogeneous settings.
Milestone 2: Selective Expert Adaptation
Once MoE architecture was established, researchers turned to a key observation: if experts specialize by domain, you can fine-tune for a domain by updating only the relevant experts. This is dramatically more efficient than full fine-tuning.
ESFT: Expert-Specialized Fine-Tuning (Wang et al., 2024) operationalized this. Run domain data through the model; measure expert activation frequencies; select the top-k most frequently activated experts per layer; freeze everything else. Empirical result: ESFT achieves comparable domain performance to full fine-tuning while updating ~12.5% of parameters and showing reduced catastrophic forgetting.
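The selection step described above (count activations, keep the top-k per layer, freeze the rest) can be sketched as follows. This is our own minimal illustration, not ESFT's implementation; `routing_records` is an assumed input format:

```python
from collections import Counter

def select_domain_experts(routing_records, k):
    """ESFT-style selection sketch: count how often each expert fires on
    domain data, then keep the k most frequently activated experts per layer.

    routing_records: iterable of (layer, expert_id) activation events
    collected by running domain data through the frozen model.
    Returns {layer: set of expert ids to fine-tune}; all others stay frozen.
    """
    counts = {}  # layer -> Counter of expert activations
    for layer, expert in routing_records:
        counts.setdefault(layer, Counter())[expert] += 1
    return {layer: {e for e, _ in c.most_common(k)} for layer, c in counts.items()}
```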
DES-MoE: Differential Expert Specialization (Li et al., 2025) addressed a weakness of ESFT: because selection is based on raw activation frequency, ESFT sometimes selected general-purpose experts (experts highly activated on all data, not just the target domain). Updating general experts risks degrading cross-domain capability. DES-MoE subtracts the general-domain activation profile:
I^(ℓ) = I^(ℓ)_freq \ TopK_{k'}( E_{x ∼ D_gen}[ activation_rate(expert_i, x) ] )
where TopK_{k'} returns the indices of the k' experts with the highest expected activation on general-domain data D_gen.
This filters out the universally-active experts and selects only domain-specific ones. Result: 89% reduction in catastrophic forgetting vs. standard fine-tuning.
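The differential filter is a small extension of frequency-based selection. A minimal sketch under our own naming assumptions (`domain_rate` and `general_rate` are precomputed activation rates, not DES-MoE's actual data structures):

```python
def differential_expert_selection(domain_rate, general_rate, k, k_general):
    """DES-MoE-style filtering sketch: start from the k experts most active
    on domain data, then drop any that are also among the k_general most
    active on general data -- those 'universal' experts are the ones whose
    update risks cross-domain degradation.

    domain_rate, general_rate: dict expert_id -> activation rate.
    """
    by_domain = sorted(domain_rate, key=domain_rate.get, reverse=True)[:k]
    universal = set(sorted(general_rate, key=general_rate.get, reverse=True)[:k_general])
    return [e for e in by_domain if e not in universal]
```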
Gap identified: Both ESFT and DES-MoE assume centralized training with trusted workers. Neither addresses quality gating in adversarial settings.
Milestone 3: Composable Expert Specialization
If individual domain experts can be trained selectively, can pre-trained experts from different domains be composed into a single model?
Branch-Train-MiX (BTX) (Sukhbaatar et al., 2024) tackled this directly. Start from a dense foundation model; branch into multiple copies; fine-tune each copy on a different domain; merge the fine-tuned experts back into a single MoE model. The merged model retains specialization from each branch.
BTX demonstrated that composable experts are possible: a merged model outperforms individual fine-tuned models on held-out domain tasks while preserving general capability. The key mechanism is that post-branch fine-tuning specializes different experts for different domains, and the router learns to direct tokens accordingly.
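The merge step can be sketched concretely. This is an illustrative simplification (real BTX merges full transformer checkpoints, not single weight matrices): each branch's fine-tuned FFN becomes one expert, and non-expert parameters such as attention are averaged across branches.

```python
import numpy as np

def merge_branches_btx(branch_ffn_weights, branch_attn_weights):
    """BTX-style merge sketch: stack each branch's fine-tuned FFN as one
    expert in the merged MoE layer; average attention weights back into a
    single shared set.

    branch_ffn_weights: list of (d_model, d_ff) arrays, one per domain branch.
    branch_attn_weights: list of (d_model, d_model) arrays.
    """
    experts = np.stack(branch_ffn_weights)          # (n_experts, d_model, d_ff)
    shared_attn = np.mean(branch_attn_weights, axis=0)
    # A freshly initialized router is then trained to dispatch tokens
    # to the domain experts.
    return experts, shared_attn
```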
FlexOLMo explored a similar direction with more flexibility: heterogeneous expert sizes and dynamic routing during composition. The goal is to allocate more capacity to high-frequency domains and less to rare ones.
BAM (Branch-Attend-MiX) added an important observation: attention layers should not be domain-specialized. Branching attention leads to inconsistent reasoning style across domains. BAM branches only the FFN (expert) layers, keeping attention shared — the same design principle that BlockZero uses.
Gap identified: BTX and FlexOLMo are designed for semi-centralized settings. They assume trusted workers and a central coordinator for the composition step. They don't address what happens when workers are anonymous and potentially adversarial.
Milestone 4: Decentralized Training
The most recent work attempts to train large models across geographically distributed, internet-connected nodes — with no shared trust relationship between workers.
DiLoCo (Douillard et al., 2024) is the enabling technology for decentralized training. The key insight: workers can run local optimization for H steps before synchronizing. The global model update uses Nesterov momentum on the accumulated pseudo-gradient:
Δ = Φ^(0) − Φ^(H) (pseudo-gradient)
Φ^(0) ← Φ^(0) − η_out · momentum(Δ)
Douillard et al. showed that infrequent synchronization (H=500 steps) achieves comparable performance to synchronous training with ~500× less communication. This makes internet-speed connections viable.
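The outer update above can be sketched as follows. This is a single-parameter-vector illustration with made-up hyperparameter values (`lr_out`, `beta` are placeholders, not the paper's tuned settings):

```python
import numpy as np

def diloco_outer_step(phi_global, phi_locals, momentum_buf, lr_out=0.7, beta=0.9):
    """DiLoCo outer step sketch: average the workers' pseudo-gradients
    Delta = phi^(0) - phi^(H), then apply Nesterov momentum to the result.

    phi_global: the global parameters phi^(0) at the start of the round.
    phi_locals: each worker's parameters phi^(H) after H local steps.
    """
    # Accumulated pseudo-gradient, averaged across workers.
    delta = np.mean([phi_global - phi_h for phi_h in phi_locals], axis=0)
    momentum_buf = beta * momentum_buf + delta
    # Nesterov look-ahead: step along momentum plus the fresh pseudo-gradient.
    new_global = phi_global - lr_out * (beta * momentum_buf + delta)
    return new_global, momentum_buf
```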
BTS (Branch-Train-Synchronize) applied DiLoCo-style reasoning to MoE: branch model copies, train them independently, and synchronize periodically. The communication savings compound: MoE's sparse activation and DiLoCo's infrequent synchronization together dramatically reduce per-step communication.
Gap identified: DiLoCo and BTS assume trusted workers. They average all submitted updates without quality gating. In a permissionless network, this assumption breaks: free-riding is profitable and quality degrades.
Where BlockZero Fits
TEFT, BlockZero's training protocol, combines Milestones 2, 3, and 4 and adds the quality gate that existing decentralized systems lack:
| System | Decentralized | Expert-selective | Quality gate | Forgetting prevention |
|---|---|---|---|---|
| Switch Transformers | ✗ | ✗ | N/A | N/A |
| ESFT | ✗ | ✓ | None | Partial |
| DES-MoE | ✗ | ✓ | None | 89% reduction |
| BTX | Partial | ✓ | None | Partial |
| DiLoCo | ✓ | ✗ | None | N/A |
| BTS | Partial | ✓ | None | Partial |
| TEFT (BlockZero) | ✓ | ✓ | Proof-of-Loss | Structural |
The Proof-of-Loss mechanism is the critical addition. It enables quality-gated aggregation without a trusted central evaluator. Each miner's update is individually evaluated for actual loss improvement; updates that fail the quality test are excluded and receive zero reward.
This makes the system viable in a trust-minimal environment while preserving the quality properties of Milestone 2 and 3 approaches.
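The gating logic can be illustrated with a short sketch. To be clear, this is our own simplified rendering of the idea, not BlockZero's actual protocol; the function names, the evaluation interface, and the zero-threshold acceptance rule are all assumptions:

```python
def proof_of_loss_gate(loss_fn, updates, eval_batch, min_improvement=0.0):
    """Illustrative Proof-of-Loss-style quality gate (assumed interface,
    not BlockZero's implementation): each miner's update is evaluated
    independently against a baseline; only updates that measurably reduce
    loss on the held-out batch are accepted, the rest earn zero reward.

    loss_fn(update, batch) -> loss after applying `update` (None = baseline).
    updates: dict miner_id -> candidate update.
    """
    baseline = loss_fn(None, eval_batch)
    accepted = []
    for miner_id, update in updates.items():
        improvement = baseline - loss_fn(update, eval_batch)
        if improvement > min_improvement:
            accepted.append(miner_id)  # eligible for aggregation and reward
    return accepted
```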
Open Problems
The field has made real progress, but substantial open problems remain:
Router drift: As the expert library grows and new domain experts are added, the router's routing decisions may become misaligned — it was trained on the original expert distribution. Router Annealing (periodic re-training of router weights) partially addresses this, but a principled approach to router updates in a growing library is still an open problem.
Multi-domain routing: A single token may require reasoning that draws on multiple domains (e.g., a legal question about financial regulations). Current top-K routing doesn't express multi-domain dependencies. Hierarchical routing or domain-conditioned routing are active research directions.
Expert retirement: An expert library that grows indefinitely accumulates stale experts — experts trained on outdated domain data or superseded by better-trained successors. Principled expert lifecycle management (when to retire, how to migrate usage) is not yet solved.
Cross-subnet composition: In the Bittensor ecosystem, if multiple subnets train domain-specific experts, could those experts be composed into a meta-model? The cross-subnet coordination problem involves both technical and economic questions that remain open.
BlockZero is actively working on several of these directions. We'll publish as results develop.
References
- Fedus et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
- DeepSeek-AI (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066
- Zoph et al. (2022). Designing Effective Sparse Expert Models. arXiv:2202.08906
- Wang et al. (2024). ESFT: Towards Efficient Fine-Tuning for Large Mixture-of-Experts Models. arXiv:2409.10878
- Li et al. (2025). DES-MoE: Dynamic Expert Specialization for Catastrophic Forgetting-Free MoE Adaptation.
- Sukhbaatar et al. (2024). Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. arXiv:2403.07816
- Douillard et al. (2024). DiLoCo: Distributed Low-Communication Training of Language Models. arXiv:2311.08105