
MoE and Catastrophic Forgetting: How Expert Isolation Gives You Domain Specialization Without Destroying General Capability

6 min read
Research & Engineering

Catastrophic forgetting is the central unsolved problem of continual learning. You train a model to be better at task B, and it forgets how to do task A. The more you specialize, the more you destroy. BlockZero's MoE architecture makes this tradeoff avoidable — by construction.

What Catastrophic Forgetting Actually Is

When a neural network learns, it encodes knowledge in its parameters. The problem is that the same parameters do double (or triple, or hundredfold) duty: they encode many pieces of knowledge simultaneously, distributed across the weight matrices in a way that is fundamentally entangled.

When you fine-tune on new data, gradient descent pushes the parameters to minimize loss on that data. But the push doesn't know what else those parameters were encoding. It just moves the weights in the direction that reduces the new loss. If that movement happens to conflict with what the weights needed to be to preserve old knowledge, the old knowledge is overwritten. Forgotten.
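The conflict can be seen in a toy example: a single shared weight that is optimal for task A gets dragged away the moment gradient descent optimizes it for task B. Everything here (targets, learning rate, step count) is illustrative, not from the article:

```python
# Toy illustration: one shared weight serves two tasks.
# Fine-tuning on task B drags the weight away from task A's optimum.

def task_loss(w, target_w):
    # Squared error between the shared weight and a task's ideal value.
    return (w - target_w) ** 2

w = 2.0            # weight after training on task A (task A's optimum: w = 2)
target_b = -1.0    # task B's optimum conflicts with task A's
lr = 0.1

loss_a_before = task_loss(w, 2.0)   # 0.0: task A is perfectly solved

for _ in range(50):                 # fine-tune on task B only
    grad = 2 * (w - target_b)       # d/dw of task B's loss
    w -= lr * grad

loss_a_after = task_loss(w, 2.0)    # task A performance has been destroyed
print(f"task A loss before: {loss_a_before:.3f}, after: {loss_a_after:.3f}")
```

The gradient steps never "see" task A, so nothing stops them from overwriting it; this is the mechanism the paragraph above describes, reduced to one parameter.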

The phenomenon was documented as early as 1989 (McCloskey & Cohen) and has been a research priority ever since. Despite decades of work — elastic weight consolidation (EWC), progressive neural networks, PackNet — no general solution exists for dense transformer models. You can mitigate forgetting; you cannot eliminate it.

Why LoRA Doesn't Fully Solve It

Low-rank adaptation methods like LoRA constrain parameter changes to a low-dimensional subspace. The intuition is that if you change fewer things, you overwrite less. And it's partially right — LoRA does reduce forgetting compared to full fine-tuning.

But the constraint is architectural, not semantic. LoRA doesn't know which parts of the model encode domain-A knowledge vs. domain-B knowledge. It just limits how much it changes anything. The result is a tradeoff: the more capacity you allocate to the new domain (higher rank), the more forgetting occurs. The less capacity (lower rank), the less forgetting but also less specialization. You can tune the dial but you cannot escape the fundamental tension.
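To make the capacity dial concrete, here is a back-of-the-envelope sketch of how many parameters a rank-r LoRA update trains relative to full fine-tuning of a single weight matrix. The dimensions are illustrative, not any particular model's:

```python
# Sketch: LoRA replaces a full d_out x d_in weight update with two
# low-rank factors, B (d_out x r) and A (r x d_in).

d_in, d_out = 4096, 4096

def lora_trainable_params(rank):
    # Trainable parameters in the low-rank factors B and A.
    return d_out * rank + rank * d_in

full = d_in * d_out   # full fine-tuning updates every weight in the matrix

for r in (4, 16, 64):
    frac = lora_trainable_params(r) / full
    print(f"rank {r:>2}: {lora_trainable_params(r):>9,} trainable params "
          f"({frac:.2%} of full fine-tuning)")
```

Raising the rank raises the fraction of the matrix that moves, which is exactly the forgetting-vs-specialization dial described above; no rank choice tells LoRA *which* directions are safe to move.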

For production use cases — where you need both strong specialization and preserved general capability — LoRA buys time but doesn't solve the problem.

The Geometric Intuition Behind Expert Isolation

Here's what makes MoE architecturally different.

In a standard dense transformer, every token passes through every parameter in the FFN layers. Every parameter is implicated in every kind of reasoning. The knowledge is fully mixed.

In an MoE transformer, each token passes through only 2–8 experts out of 64 per layer. The router decides which experts handle which tokens. Over millions of training steps, experts naturally specialize: some experts activate primarily on mathematical text, others on legal language, others on code, others on conversational text.
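A minimal sketch of top-k routing, with illustrative names and random router logits (in a real model, the logits come from a learned projection of the token's hidden state):

```python
import math
import random

# Minimal sketch of MoE top-k routing; not BlockZero's actual router.
NUM_EXPERTS, TOP_K = 64, 2

def route(logits, k=TOP_K):
    # Softmax over the router logits, then keep the k highest-weight experts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selected = route(logits)
print(selected)   # only 2 of 64 experts process this token
```

The token's forward pass touches only the returned experts; the other 62 contribute nothing and, during training, receive no gradient from this token.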

The crucial property: experts that specialize on domain A do not participate in processing domain B. Their parameters are not implicated in domain B's forward passes. When you fine-tune on domain B, the gradients flow through domain B's experts — and domain A's experts are untouched.

This is not a soft constraint like LoRA's low-rank projection. It is a structural property: the computation graph for domain A tokens does not include domain B experts.

The TEFT Approach: Making Isolation Explicit

TEFT (Targeted Expert Fine-Tuning) makes the isolation explicit and verifiable.

Before training begins, TEFT runs a one-time expert selection pass. For a target domain, it identifies which experts activate most consistently on that domain's data — these are the experts that already encode domain-relevant knowledge:

I^(ℓ)_freq = TopK_k( E_{x ~ D_new}[activation_rate(expert_i, x)] )

For BlockZero's DES-MoE variant, it goes further: it subtracts the general-domain activation profile, selecting experts that are specifically relevant to the new domain rather than generally active:

I^(ℓ) = I^(ℓ)_freq \ TopK_{k'}( E_{x ~ D_gen}[activation_rate(expert_i, x)] )

During training, only the selected experts are updated. The attention layers, layer norms, embeddings, and all non-selected experts are frozen. This is not a soft regularization — it is a hard freeze. Non-selected parameters do not receive gradients.

The result is that domain B fine-tuning updates live entirely in the subspace of domain B's experts. Domain A's experts cannot be affected because they are frozen.
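One way to implement the hard freeze is simply to skip the update for every parameter group outside the selected set. Names and values below are illustrative; in a real training framework the same effect is usually achieved by disabling gradient computation on the frozen tensors:

```python
# Sketch of a hard freeze: only selected experts' parameters are updated.
# Parameter groups and values are illustrative.

params = {"expert_0": 1.0, "expert_1": 1.0, "attention": 1.0}
selected = {"expert_1"}     # the only trainable group for this domain
grads = {"expert_0": 0.5, "expert_1": 0.5, "attention": 0.5}
lr = 0.1

for name in params:
    if name in selected:    # frozen groups receive no update at all
        params[name] -= lr * grads[name]

print(params)   # only expert_1 moved; expert_0 and attention are untouched
```

Because the frozen groups are never written, their post-training values are bit-identical to their pre-training values; no regularization strength needs tuning.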

What the Data Shows

Our pilot experiment with Qwen3-VL-30B on mathematics domain data directly tests this claim.

The catastrophic forgetting prediction: as math training progresses, the model's perplexity on general-domain text (Wikipedia) should increase. The model is "forgetting" general language as it specializes on math.

Figure (teft-forgetting-experiment): Wikipedia perplexity (proxy for general capability) throughout math domain training. With TEFT expert isolation, Wiki PPL remains stable at ~12.15 across all 1,600 training steps.

What we observed: Wikipedia PPL was 12.15 at step 0 and 12.15 at step 1,600. Flat line. No forgetting.

For comparison, standard full fine-tuning on the same math data would be expected to show Wiki PPL increasing meaningfully — a typical forgetting curve. TEFT prevents this by construction.

The GSM8k math reasoning benchmark was similarly stable at ~2.57 PPL throughout training — even as the training loss on the math dataset dropped from >10.0 to ~3.5 in the first 500 steps. Math capability improved while general capability was preserved.
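For readers unfamiliar with the metric: perplexity is the exponential of the mean cross-entropy loss (in nats), so a flat PPL curve is equivalent to a flat evaluation loss. The snippet below just restates the reported values through that definition:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
def perplexity(mean_nll):
    return math.exp(mean_nll)

# Eval losses consistent with the reported PPL values above.
wiki_loss = math.log(12.15)    # ~2.497 nats
gsm8k_loss = math.log(2.57)    # ~0.944 nats

print(f"Wiki:  loss {wiki_loss:.3f} nats -> PPL {perplexity(wiki_loss):.2f}")
print(f"GSM8k: loss {gsm8k_loss:.3f} nats -> PPL {perplexity(gsm8k_loss):.2f}")
```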

Implications for Production Deployment

The catastrophic forgetting result matters for a specific reason: it enables incremental expert library construction.

If fine-tuning caused forgetting, every new domain expert would degrade the model's general capability, and eventually the model would become useless for general reasoning. The library strategy would be self-defeating.

Because expert isolation prevents forgetting, the model can accumulate domain expertise without degradation. A model with financial, legal, medical, and code experts does not forget how to reason, follow instructions, or structure its output — those capabilities live in the frozen attention layers, which are never touched.

The expert library can grow indefinitely. Each addition enriches the model without destroying what's already there.

DES-MoE: The Literature Result

The DES-MoE paper (Li et al., 2025) provides the baseline comparison for catastrophic forgetting mitigation:

Method                            | Catastrophic Forgetting Reduction
----------------------------------|----------------------------------
Standard fine-tuning              | 0% (baseline)
LoRA                              | ~40–60% (varies by rank)
ESFT (frequency-based selection)  | ~60–70%
DES-MoE (differential selection)  | 89%

The 89% figure comes from subtracting general-purpose experts from the selected set. These experts, which activate frequently on all domains, are particularly risky to update — any change to them propagates into general capability. DES-MoE's differential selection excludes them, capturing almost all of the available isolation benefit.

The Deeper Point

Catastrophic forgetting is not a training hyperparameter problem. It is a representational problem: knowledge is stored in parameters that are shared across domains, so updating for one domain necessarily disturbs the other.

MoE architecture solves the representational problem directly: domain knowledge is stored in separate parameters. TEFT makes this property explicit and enforces it during training.

The result isn't a clever workaround. It's an architectural property that makes catastrophic forgetting structurally impossible for properly isolated experts.

References

  • McCloskey & Cohen (1989). Catastrophic Interference in Connectionist Networks.
  • Kirkpatrick et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS.
  • Li et al. (2025). DES-MoE: Dynamic Expert Specialization for Catastrophic Forgetting-Free MoE Adaptation.
  • Wang et al. (2024). ESFT: Towards Efficient Fine-Tuning for Large Mixture-of-Experts Models. arXiv:2409.10878