What Is a Mixture-of-Experts Model?

A Mixture-of-Experts (MoE) model replaces the dense feed-forward block in a transformer with a collection of specialized subnetworks — called experts — coordinated by a routing mechanism.

Instead of activating every parameter for every token, an MoE model activates only a small subset of experts per token. This idea was introduced at scale in Shazeer et al., 2017 and later extended in Switch Transformers (Fedus et al., 2022), which demonstrated that sparse models could scale to trillions of parameters while keeping inference cost manageable.

Today, sparse MoE architectures underpin frontier systems such as DeepSeek-V3 and Qwen3.


The Core Mechanism

A traditional transformer layer contains a dense feed-forward network (FFN). In that design, every parameter in the FFN is activated for every token. No matter what the input is — code, math, dialogue, or poetry — the entire block participates in the computation.

A Mixture-of-Experts (MoE) layer changes this fundamentally.

Instead of one large feed-forward network, the layer contains many smaller subnetworks called experts. Each expert can specialize in different patterns or domains. Some may become stronger at mathematical reasoning, others at syntax, others at multilingual text. Alongside these experts sits a router — a lightweight gating network whose job is to decide which experts should process each token.

Formally, the model can be described as containing:

  • Shared parameters that process all tokens (attention layers, normalization layers, etc.),
  • A router that computes expert selection probabilities,
  • A collection of expert parameters, indexed by layer and expert number.

For an input token representation x at layer ℓ, the router computes a score for each expert and selects only the top-k experts. The remaining experts are ignored for that token.

The output of the layer is then a weighted combination of the selected experts:

y = \sum_{i=1}^{M} r_\ell(x)_i \cdot E_i(x; W_{\ell,i})

where M is the number of experts, r_ℓ(x)_i is the router's weight for expert i (zero for experts outside the top-k), and E_i(x; W_{ℓ,i}) is the output of expert i with parameters W_{ℓ,i}.

In practice, although there may be dozens or even hundreds of experts in a layer, only a small number — typically 1 or 2 — are active for any given token.

This sparse activation is the defining property of MoE models.
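The routing mechanism described above can be sketched in a few lines. This is a minimal illustrative implementation, not taken from any particular codebase: the expert shapes, the ReLU FFN form, and renormalizing the softmax over only the selected experts are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, M, k = 16, 32, 8, 2  # tiny illustrative sizes

# Each expert is a small two-layer FFN: W1 (d_model x d_ff), W2 (d_ff x d_model).
experts = [(rng.normal(size=(d_model, d_ff)) * 0.1,
            rng.normal(size=(d_ff, d_model)) * 0.1) for _ in range(M)]
W_router = rng.normal(size=(d_model, M)) * 0.1  # lightweight gating network

def moe_forward(x):
    """y = sum over top-k experts of r(x)_i * E_i(x), r renormalized over top-k."""
    logits = x @ W_router                   # one routing score per expert
    top = np.argsort(logits)[-k:]           # indices of the k highest scores
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                    # softmax over the selected experts only
    y = np.zeros_like(x)
    for w, i in zip(probs, top):
        W1, W2 = experts[i]
        y += w * (np.maximum(x @ W1, 0) @ W2)  # ReLU FFN expert, weighted
    return y, top

x = rng.normal(size=d_model)
y, active = moe_forward(x)
print(len(active))  # only k of the M experts ran for this token
```

Note that the M − k unselected experts contribute no computation at all, which is exactly the sparsity the equation expresses.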

Figure: MoE routing mechanism. The router scores each expert for an input token and selects the top-k (e.g., Expert 1 and Expert 2 out of N); the layer output is the weighted sum of the selected experts' outputs.

Why This Changes Scaling

The critical insight is that total parameter count and active compute are no longer the same thing.

In a dense model, increasing parameters directly increases inference cost. In an MoE model:

  • The total number of experts M can grow very large.
  • The number of active experts k per token remains small.

So the model’s capacity (total stored parameters) can increase dramatically, while the per-token compute remains roughly constant.
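The divergence between capacity and compute is simple arithmetic. The sizes below are illustrative assumptions (a hypothetical 64-expert layer with top-2 routing), not measurements of any specific model:

```python
d_model, d_ff = 4096, 14336   # hypothetical FFN dimensions
M, k = 64, 2                  # experts per layer, active experts per token

params_per_expert = 2 * d_model * d_ff   # two weight matrices per FFN expert
total_capacity = M * params_per_expert   # stored parameters grow with M
active_compute = k * params_per_expert   # per-token compute grows only with k

print(total_capacity // active_compute)  # -> 32: capacity is 32x the active compute
```

Doubling M doubles `total_capacity` while leaving `active_compute` untouched, which is the scaling asymmetry the section describes.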

Recent work such as Unified Scaling Laws for Routed Language Models shows that routed models scale predictably with respect to activated parameters rather than total parameters. In other words, performance improvements track how much compute is actually used per token, not how many parameters exist in storage.

This makes MoE a mathematically grounded scaling strategy — not just a parameter expansion trick.


Why This Structure Matters

The architectural separation between shared components, router, and experts creates clear modular boundaries.

Experts can specialize. Experts can be updated independently. Experts can be added, merged, or removed.

Dense models behave like a single brain.

MoE models behave like an ecosystem of specialists coordinated by a manager.

That structural modularity is what makes MoE not only efficient at scale — but uniquely suited for distributed and decentralized training systems.

Figure: Dense vs. MoE. In a dense model, all parameters are active and every token uses every weight (600B params = 600B-parameter compute per token). In an MoE model with 64 experts and top-2 routing, only 2/64 experts are active per token.

Why Sparse Activation Is the Key Property

Sparse activation creates the structural separation that makes decentralized training tractable:

  • Parameter capacity scales independently of compute. A 30B-parameter MoE model with top-2 routing out of 64 experts has FLOPs per token comparable to a ~3B dense model, because only 2/64 expert FFNs fire per token.
  • Expert parameters are modular. Each expert W_{ℓ,i} is a self-contained parameter block. Its gradients depend only on the tokens routed to it. This means experts can be updated independently without breaking the model's other capacities.
  • Domain specialization emerges naturally. Empirical analysis of deployed MoE models consistently shows that routing distributions are highly concentrated: for a given domain (e.g., mathematics, code, legal text), a small subset of experts receives the overwhelming majority of routing weight. The identity of those experts is domain-specific and consistent across layers.

This last property is the empirical foundation for TEFT — the training protocol BlockZero uses. If experts are already domain-specialized by pre-training, then fine-tuning for a new domain only requires updating the small subset of experts that activate on that domain's data.
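A hypothetical sketch of how such domain-specialized experts might be identified for fine-tuning, assuming (as the text states) that routing mass concentrates on a few experts per domain. The 90% mass threshold, the synthetic routing distribution, and all names here are illustrative assumptions, not TEFT's actual specification:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 64  # experts in the layer

# Average routing weight each expert received over a batch of domain data.
# Synthetic here: a Dirichlet draw with small concentration gives the kind of
# peaked distribution the text describes for domain-specific data.
routing_mass = rng.dirichlet(np.full(M, 0.05))

# Keep the smallest expert set that captures 90% of total routing mass.
order = np.argsort(routing_mass)[::-1]          # experts, heaviest first
cum = np.cumsum(routing_mass[order])
trainable = order[: int(np.searchsorted(cum, 0.90)) + 1]

print(len(trainable), "of", M, "experts selected for fine-tuning")
```

Under a concentrated routing distribution, `trainable` is a small fraction of M, so a fine-tuning pass touches only those parameter blocks.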


Scaling Laws for Routed Models

The efficiency advantage of MoE routing is not merely qualitative. From Unified Scaling Laws for Routed Language Models, the validation loss L of an MoE model with dense parameter count N and E experts follows:

log L(N, E) = a log N + b log Ê + c log N log Ê + d

where Ê is a saturating transformation of E that accounts for diminishing returns at large expert counts.
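The shape of this law can be made concrete with a small evaluation. The coefficients a, b, c, d and the saturating transform below are made-up illustrative values (the paper fits them empirically per task), so only the qualitative behavior matters here:

```python
import math

def e_hat(E, E_max=512.0):
    # One simple saturating transform (an assumption, not the paper's exact form):
    # behaves like E for small E, approaches E_max for large E.
    return 1.0 / (1.0 / E + 1.0 / E_max)

def log_loss(N, E, a=-0.08, b=-0.10, c=-0.003, d=2.0):
    # log L(N, E) = a log N + b log E_hat + c log N log E_hat + d
    lN, lE = math.log(N), math.log(e_hat(E))
    return a * lN + b * lE + c * lN * lE + d

# More experts at a fixed dense size N lowers the predicted loss:
print(log_loss(30e9, 64) < log_loss(30e9, 8))
```

With c negative, the cross-term makes the per-expert benefit grow with log N, which is the compounding effect described below.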

This law has two important implications for BlockZero:

  1. Adding experts reduces loss. As E grows, log L decreases logarithmically — more experts means better performance even if the dense backbone N stays fixed.
  2. Expert count and parameter count compound each other. The cross-term c log N log Ê means that the benefit of adding experts is larger when the dense model is also larger.

In practice, this means a modestly sized dense backbone (e.g., 30B parameters) augmented with many experts can match the performance of a much larger dense model. BlockZero's decentralized network is the mechanism for scaling E without requiring any single participant to hold the full model.

tip

The scaling law cross-term is the key reason BlockZero targets large base models like Qwen3-VL-30B: at frontier scale, each additional expert group trained by the network contributes more to overall model quality.


Why MoE Is a Natural Fit for Decentralization

The same properties that make MoE efficient at scale also make it structurally aligned with decentralized training.

1. Modular Parameter Boundaries

Experts are independent subnetworks. They can be trained, updated, or replaced without modifying the entire model. Research such as Expert-Specialized Fine-Tuning (Wang et al., 2024) demonstrates that only a subset of experts dominate routing for a given domain.

This means distributed participants can work on disjoint expert subsets without stepping on each other’s gradients.

2. Sparse Communication

Because only selected experts are active or updated, synchronization can be limited to a fraction of parameters. Low-communication training approaches such as DiLoCo (Douillard et al., 2024) show that infrequent synchronization can maintain convergence.

MoE sparsity reduces bandwidth requirements by orders of magnitude compared to dense data-parallel training.
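A back-of-envelope sketch of where those orders of magnitude come from, combining expert-sparse updates with DiLoCo-style infrequent synchronization. The expert counts and sync interval are illustrative assumptions, not measured figures:

```python
M, touched = 64, 4   # experts per layer vs. experts this worker's data activated
H = 100              # inner steps between synchronizations (DiLoCo-style interval)

sparsity_factor = M // touched   # only updated expert blocks go on the wire
frequency_factor = H             # sync once per H steps instead of every step

# Combined reduction vs. dense data-parallel training that syncs everything
# every step (multiplicative because the two savings are independent).
print(sparsity_factor * frequency_factor)
```

Here the combined factor is 16 × 100 = 1600x less communication than dense every-step synchronization, under these assumed numbers.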

3. Natural Task Specialization

Empirical analyses show that experts specialize in domains. Routing concentration results from Wang et al., 2024 and interpretability work like Hu et al., 2026 demonstrate that only a small fraction of experts dominate particular tasks.

In a decentralized network, this allows:

  • Domain-focused contributors
  • Specialized compute allocation
  • Parallel expert evolution

The model becomes an ecosystem of specialists rather than a monolithic weight tensor.

4. Composability

Branch-based systems such as Branch-Train-MiX (Sukhbaatar et al., 2024) show that independently trained experts can later be merged into a unified MoE model.

This composability makes decentralized coordination feasible: experts trained in isolation can be joined without retraining the entire foundation.
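The composition step can be sketched structurally. This is a hypothetical illustration of the Branch-Train-MiX idea only: the real method also has to produce a router (e.g., by training it after merging), whereas this sketch simply initializes one fresh:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 16, 32  # tiny illustrative sizes

def train_domain_ffn(seed):
    # Stand-in for a branch trained independently on one domain's data;
    # returns the two FFN weight matrices.
    r = np.random.default_rng(seed)
    return (r.normal(size=(d_model, d_ff)), r.normal(size=(d_ff, d_model)))

# Independently trained specialists (e.g., code, math, multilingual branches).
branches = [train_domain_ffn(s) for s in (10, 11, 12)]

# Compose: the branch FFNs become the experts of a single MoE layer, and a
# new router is added on top to arbitrate between them.
moe_layer = {
    "experts": branches,
    "router": rng.normal(size=(d_model, len(branches))),
}
print(len(moe_layer["experts"]), "branches merged into one MoE layer")
```

No branch's weights are modified by the merge itself, which is what makes isolated training followed by composition viable.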


The Bigger Picture

Dense models are monolithic. Updating them requires synchronizing everything.

MoE models are modular. Updating them can mean modifying only what matters.

Unified scaling theory shows routed models scale predictably and efficiently. Empirical results show expert specialization is real and structured. Together, these properties make MoE not just an architectural choice — but a natural substrate for distributed, composable, and decentralized AI systems.