Literature Review: Modular and Composable MoE Systems
Modern Mixture-of-Experts (MoE) systems are the result of several years of research demonstrating that large language models can be sparse, modular, and composable.
The foundation was laid by Shazeer et al. (2017), who introduced the sparsely gated mixture-of-experts layer, followed by Switch Transformers (Fedus et al., 2022), which scaled sparse models to the trillion-parameter range. These ideas now underpin frontier architectures including DeepSeek-V3, Qwen3, and GPT-5 lineage models.
Selective Adaptation
One of the most important observations in recent work is that expert activation is highly domain-specific.
In Expert-Specialized Fine-Tuning (Wang et al., 2024), routing distributions are shown to be highly concentrated: for a given downstream task, a small subset of experts receives most of the routing probability mass. Fine-tuning only those task-relevant experts reduces training cost by up to ~90% while matching or exceeding full fine-tuning performance.
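The selection criterion behind this idea can be sketched in a few lines. The following numpy illustration is hypothetical (the function name, threshold, and toy data are assumptions, not the paper's procedure): estimate each expert's average routing mass over a task's tokens, then keep the smallest set of experts covering most of that mass.

```python
import numpy as np

def select_task_experts(router_logits, threshold=0.9):
    """Pick the smallest set of experts whose summed average routing
    mass on a task's tokens reaches `threshold` (illustrative criterion,
    in the spirit of expert-specialized fine-tuning)."""
    # router_logits: (num_tokens, num_experts) pre-softmax router scores
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    mass = probs.mean(axis=0)            # average routing mass per expert
    order = np.argsort(mass)[::-1]       # most-used experts first
    cum = np.cumsum(mass[order])
    k = int(np.searchsorted(cum, threshold)) + 1
    return sorted(order[:k].tolist())

# Toy task where experts 1 and 3 dominate the routing distribution:
rng = np.random.default_rng(0)
logits = rng.normal(0, 0.1, size=(1000, 8))
logits[:, 1] += 4.0
logits[:, 3] += 3.5
selected = select_task_experts(logits)   # a small subset, e.g. [1, 3]
```

Under this criterion, only the returned experts would receive parameter updates; the rest stay frozen.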
Similarly, Dynamic Expert Specialization (Li et al., 2025) introduces adaptive routing and gradient masking to prevent catastrophic forgetting when adapting to multiple domains.
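Gradient masking of this kind can be illustrated with a toy update rule (a simplified stand-in, not the paper's implementation): experts outside the active set receive no gradient, so their weights, and whatever earlier domains they encode, remain fixed.

```python
import numpy as np

def masked_sgd_step(expert_weights, expert_grads, active, lr=0.1):
    """Update only experts in `active`; gradients for all other experts
    are masked to zero (simplified stand-in for gradient masking)."""
    for i, (w, g) in enumerate(zip(expert_weights, expert_grads)):
        if i in active:
            w -= lr * g   # in-place update for selected experts only
        # masked experts are left untouched, preventing forgetting

experts = [np.ones(3) for _ in range(4)]
grads = [np.full(3, 0.5) for _ in range(4)]
masked_sgd_step(experts, grads, active={1, 3})
# experts 0 and 2 are unchanged; experts 1 and 3 moved to 0.95
```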
Together, these works establish that MoE models do not require full-model updates for domain adaptation.
Composable Expert Systems
Beyond selective updates, composability is the second major theme.
Branch-Train-MiX (Sukhbaatar et al., 2024) demonstrates that independently trained dense branches can be converted into experts and merged into a single MoE model.
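The merge step can be sketched roughly as follows. This is an illustrative numpy model, not the paper's code: the class names, dimensions, and top-k mixing rule are all assumptions. Each dense branch's feed-forward block is installed as one expert, and a freshly initialized router (trained afterwards in the actual method) chooses among them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class DenseBranch:
    """Stand-in for an independently trained dense feed-forward block."""
    def __init__(self, d, seed):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (d, 4 * d))
        self.w2 = rng.normal(0, 0.1, (4 * d, d))
    def __call__(self, x):
        return relu(x @ self.w1) @ self.w2

class MixedMoE:
    """Branch-Train-MiX-style merge (simplified): each dense branch's
    feed-forward becomes one expert; a new router mixes them."""
    def __init__(self, branches, d, seed=0):
        self.experts = branches
        rng = np.random.default_rng(seed)
        # Router is fresh; in the real method it is trained after merging.
        self.router = rng.normal(0, 0.1, (d, len(branches)))
    def __call__(self, x, top_k=1):
        scores = x @ self.router
        chosen = np.argsort(scores)[::-1][:top_k]
        weights = np.exp(scores[chosen])
        weights /= weights.sum()
        return sum(w * self.experts[i](x) for w, i in zip(weights, chosen))

d = 8
moe = MixedMoE([DenseBranch(d, s) for s in range(3)], d)
y = moe(np.ones(d), top_k=2)   # output has the same shape as the input
```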
FlexOLMo (Shi et al., 2025) enables independently trained experts on private datasets to be plugged into a shared model without retraining.
FFT-MoE (Hu et al., 2025) extends this approach to federated and heterogeneous hardware environments.
Across these systems, experts are treated as modular units that can be added, removed, or merged.
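The modular view above can be made concrete with a hypothetical expert registry (the names and structure are illustrative, not any system's API): adding an expert appends a router column, and removing one deletes that column, leaving the remaining experts untouched.

```python
import numpy as np

class ExpertPool:
    """Hypothetical registry treating experts as pluggable units."""
    def __init__(self, d):
        self.d = d
        self.experts = []
        self.router = np.zeros((d, 0))   # one column per expert
    def add(self, fn, router_col=None):
        # Plugging in an expert only appends a router column.
        if router_col is None:
            router_col = np.zeros(self.d)
        self.experts.append(fn)
        self.router = np.column_stack([self.router, router_col])
    def remove(self, i):
        # Opting out deletes the expert and its column; others unaffected.
        self.experts.pop(i)
        self.router = np.delete(self.router, i, axis=1)

pool = ExpertPool(4)
for s in range(3):
    pool.add(lambda x, s=s: x + s)
pool.remove(1)   # e.g. a dataset-level opt-out of one domain expert
```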
Expert Lifecycle Management
Research also shows that experts can be pruned or removed with minimal impact.
Task-specific pruning preserves approximately 99% of performance while keeping only a single expert per layer (Chen et al., 2022).
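In the single-expert limit, this style of pruning amounts to keeping the most-routed expert for the task and dropping the router entirely, so the MoE layer collapses into a dense feed-forward layer. A toy sketch of the idea (not the paper's procedure; the names and counts are illustrative):

```python
import numpy as np

def prune_to_single_expert(expert_fns, routing_counts):
    """Task-specific pruning (simplified): keep only the expert the
    router selects most often on the task; routing is removed."""
    keep = int(np.argmax(routing_counts))
    return keep, expert_fns[keep]   # the layer becomes a dense FFN

experts = [lambda x, s=s: x + s for s in range(4)]   # toy experts
counts = np.array([12, 803, 95, 90])                 # task routing stats
idx, layer = prune_to_single_expert(experts, counts)
# idx selects the dominant expert; `layer` replaces the whole MoE layer
```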
FlexOLMo demonstrates dataset-level opt-out by removing domain-specific experts while preserving unrelated performance.
The literature consistently shows that MoE models are naturally modular, sparse, and evolvable.