TEFT: Targeted Expert Fine-Tuning — How We Reduce Communication Overhead by Orders of Magnitude
This post introduces TEFT (Targeted Expert Fine-Tuning) — the protocol at the core of BlockZero — and explains how it achieves communication-efficient, quality-gated distributed MoE fine-tuning over a permissionless network. This is the research paper translated into blog form.
The Problem TEFT Solves
Standard distributed training synchronizes all model parameters at every sync step. For a 30B MoE model, that means transmitting 60GB of weights per sync per worker. Even with DiLoCo-style infrequent synchronization, the bandwidth requirement per cycle is enormous if you're syncing everything.
But you don't need to sync everything. In a Mixture-of-Experts model, each forward pass activates only a small subset of experts — typically 2 out of 64 per layer. For a specific domain (mathematics, legal text, code), the activated experts are consistent and predictable: the same small subset of experts activates repeatedly on math data.
This is the insight TEFT exploits: if the model already has domain-specific experts — experts that activate primarily on one domain — then fine-tuning for that domain only requires updating those experts. Everything else can stay frozen. And if you're only transmitting the expert updates, communication overhead drops proportionally.
Phase 1: Expert Selection
Before training begins, TEFT performs a one-time analysis to identify the experts most relevant to the target domain. Two approaches are supported:
ESFT: Frequency-Based Selection (Wang et al. 2024)
Run a sample of the target domain data through the model and measure how frequently each expert activates. Select the top-k most frequently activated experts per layer:
I^(ℓ)_freq = TopK_k( E_{x ~ D_new}[activation_rate(expert_i, x)] )
ESFT is conceptually simple and empirically effective. For single-domain adaptation, it reliably identifies the experts that encode domain-relevant knowledge.
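A minimal sketch of frequency-based selection, assuming routing decisions have already been logged as per-layer lists of chosen expert ids (the function and trace format here are illustrative, not BlockZero's actual API):

```python
from collections import Counter

def esft_select(routing_traces, k):
    """Select the top-k most frequently activated experts per layer.

    routing_traces: dict mapping layer index -> list of expert ids the
    router chose across sampled domain tokens (assumption: traces are
    collected by logging the router's top-k decisions).
    """
    selection = {}
    for layer, expert_ids in routing_traces.items():
        counts = Counter(expert_ids)
        # The most frequently activated experts encode domain knowledge.
        selection[layer] = sorted(e for e, _ in counts.most_common(k))
    return selection

# Toy trace: layer 0 routes math tokens mostly to experts 3 and 7.
traces = {0: [3, 7, 3, 7, 3, 1, 7, 3, 7, 2]}
print(esft_select(traces, k=2))  # {0: [3, 7]}
```

In practice the expectation over D_new is estimated from a few thousand sampled domain tokens rather than the full dataset.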
DES-MoE: Differential Selection (Li et al. 2025)
ESFT has one weakness: it may select general-purpose experts — experts that activate frequently on all inputs, not just the target domain. Updating these experts risks degrading general capability (catastrophic forgetting).
DES-MoE addresses this by subtracting the general-domain activation profile from the domain-specific profile:
I^(ℓ) = I^(ℓ)_freq \ TopK_{k'}( E_{x ~ D_gen}[activation_rate(expert_i, x)] )
By excluding experts that are equally prominent on general data, DES-MoE selects experts that are uniquely relevant to the new domain. These are the experts that, when updated, produce the most targeted domain adaptation with the least forgetting.
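The set difference can be sketched for a single layer as follows (activation-rate dicts and parameter names are assumptions for illustration):

```python
def des_moe_select(domain_freq, general_freq, k, k_prime):
    """Differential expert selection for one layer.

    domain_freq / general_freq: dicts mapping expert id -> activation
    rate on domain vs. general data (assumption: rates are measured the
    same way as in ESFT). Experts ranking highly on general data are
    excluded so general-purpose experts stay untouched.
    """
    top_domain = sorted(domain_freq, key=domain_freq.get, reverse=True)[:k]
    top_general = set(sorted(general_freq, key=general_freq.get, reverse=True)[:k_prime])
    # Set difference: keep domain experts that are NOT general-purpose.
    return [e for e in top_domain if e not in top_general]

domain = {0: 0.40, 1: 0.30, 2: 0.20, 3: 0.10}
general = {0: 0.55, 1: 0.05, 2: 0.05, 3: 0.35}
# Expert 0 fires on everything, so it is excluded despite ranking
# first on the domain data.
print(des_moe_select(domain, general, k=3, k_prime=1))  # [1, 2]
```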
In our pilot experiments, DES-MoE produced lower catastrophic forgetting than ESFT, at the cost of slightly slower initial convergence. For production training runs, DES-MoE is the recommended approach.
Phase 2: Sparse Local Training
With the expert selection map I_target fixed, each miner downloads the selected expert parameters from the validator and runs local inner optimization:
Φ^(t,H) ← InnerOpt(Φ^(t), D_local, H)
What "sparse" means here: only the parameters in I_target — the selected experts plus the router — are updated. All shared hub parameters (attention layers, layer norms, embedding matrices) are frozen. All non-selected expert parameters are frozen.
For Qwen3-VL-30B with 64 experts per layer and top-8 selection, this means:
- 8 of 64 expert FFNs per layer are updated (12.5% of expert parameters)
- All attention layers frozen (preserved general capability)
- All remaining experts frozen (isolated from domain update)
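The freeze rule above can be expressed as a predicate over parameter names. The Qwen-style naming scheme below (`layers.N.mlp.experts.M...`, `mlp.gate` for the router) is an assumption; real checkpoints may name modules differently:

```python
import re

def trainable(param_name, selected_experts):
    """Decide whether a parameter is updated under TEFT's sparse rule.

    param_name: dotted module path, e.g.
    'layers.3.mlp.experts.17.up_proj.weight' (hypothetical naming).
    selected_experts: dict layer -> set of expert ids from Phase 1.
    """
    m = re.match(r"layers\.(\d+)\.mlp\.experts\.(\d+)\.", param_name)
    if m:  # expert FFN: train only if selected for this layer
        layer, expert = int(m.group(1)), int(m.group(2))
        return expert in selected_experts.get(layer, set())
    if ".mlp.gate." in param_name:  # the router stays trainable
        return True
    return False  # attention, norms, embeddings: frozen hub

sel = {3: {17, 42}}
print(trainable("layers.3.mlp.experts.17.up_proj.weight", sel))  # True
print(trainable("layers.3.self_attn.q_proj.weight", sel))        # False
```

In a real training loop you would iterate over `model.named_parameters()` and set `p.requires_grad = trainable(name, sel)` before building the optimizer.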
Bandwidth implication: miners transmit only the 8-of-64 selected experts. In fp16, a typical expert group checkpoint is 3-7GB rather than 60GB for the full model. Communication overhead drops by roughly an order of magnitude compared to transmitting all expert parameters.
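A back-of-envelope check of these numbers (the model size and 8-of-64 selection come from this post; the expert-parameter fraction is an assumption typical of fine-grained MoE architectures):

```python
# Sanity-check the checkpoint-size claim.
total_params = 30e9      # 30B-parameter MoE (from the post)
bytes_per_param = 2      # fp16
expert_fraction = 0.90   # assumption: ~90% of params live in expert FFNs
selected = 8 / 64        # top-8 of 64 experts per layer (from the post)

full_model_gb = total_params * bytes_per_param / 1e9
expert_checkpoint_gb = full_model_gb * expert_fraction * selected
print(full_model_gb, round(expert_checkpoint_gb, 2))  # 60.0 6.75
```

6.75GB lands inside the 3-7GB range quoted above; the exact figure depends on how much of the model sits in expert FFNs versus the shared hub.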
Why freezing the hub matters: the attention layers encode the model's core reasoning machinery. They are what allows the model to follow instructions, structure its output, and reason across multiple steps. Updating them on narrow domain data risks the model losing these general capabilities. Expert isolation preserves them by design.
Phase 3: Proof-of-Loss Aggregation
After H inner steps, each miner submits their trained expert checkpoint. Here's where BlockZero diverges fundamentally from every other distributed training approach.
The standard approach: collect all submitted updates, average them, apply the average. This is what DiLoCo does, what BTX does, what FedAvg does.
The problem: in a permissionless network, you cannot trust that every submitted update is honest and high-quality. Some miners may submit random noise to collect rewards without actually training. Some may submit deliberately bad updates. A naive average integrates all of this.
TEFT's approach: Proof-of-Loss validation before any aggregation.
w_i ∝ ReLU( L(Φ^(t)) − L(Φ^(t) + Δ_i) )
For each miner i, the validator applies their update to the current model and measures the resulting loss on a held-out validation set. The miner's weight in the aggregation is proportional to how much their update actually reduced the loss.
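A minimal sketch of the weighting rule, with scalar losses standing in for full validation runs (the function name and normalization to sum-to-one are illustrative assumptions):

```python
def proof_of_loss_weights(base_loss, candidate_losses):
    """Compute aggregation weights w_i ∝ ReLU(L(Φ) − L(Φ + Δ_i)).

    base_loss: held-out validation loss of the current global model.
    candidate_losses: validation loss after applying each miner's update.
    Miners whose update did not reduce the loss get weight 0.
    """
    raw = [max(0.0, base_loss - loss) for loss in candidate_losses]  # ReLU
    total = sum(raw)
    if total == 0:  # no accepted updates this round
        return [0.0] * len(raw)
    return [r / total for r in raw]  # normalize so weights sum to 1

# Miner 2 submitted noise (loss went up) and is clipped to zero.
w = proof_of_loss_weights(3.50, [3.20, 3.40, 3.90])
print([round(x, 3) for x in w])  # [0.75, 0.25, 0.0]
```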
Three properties make this powerful:
Free-rider filtering: a miner who submits random noise will, with overwhelming probability, increase the validation loss. ReLU clips w_i = 0. They receive zero weight in the aggregation and zero reward. No manual blacklisting required — bad submissions self-identify.
Proportional reward: miners who produce better domain adaptation get more weight in the global update and more TAO reward. This creates direct incentive alignment: the best training produces the most reward.
Goodhart's Law resistance: the validation set is held-out data from the training distribution. There is no fixed public benchmark to memorize. The only way to score well is to actually train well on the domain.
After scoring, the accepted pseudo-gradients (all miners with w_i > 0) are aggregated using DiLoCo-style Nesterov momentum on the weighted average. This is the outer optimization step that produces the updated global model.
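The outer step can be sketched as follows, with flat lists of floats standing in for parameter tensors; the learning rate and momentum coefficient here are illustrative assumptions, not values from the post:

```python
def outer_step(phi, deltas, weights, momentum, lr=0.7, beta=0.9):
    """DiLoCo-style outer update on the weighted pseudo-gradient.

    phi: current global expert parameters (floats for clarity; real
    implementations operate on tensors). deltas: per-miner updates Δ_i.
    weights: Proof-of-Loss weights. lr/beta are assumed hyperparameters.
    """
    n = len(phi)
    # Weighted average of the accepted pseudo-gradients.
    g = [sum(w * d[j] for w, d in zip(weights, deltas)) for j in range(n)]
    # Nesterov momentum: update the buffer, then step along the
    # look-ahead direction beta*m + g.
    momentum = [beta * m + gj for m, gj in zip(momentum, g)]
    phi = [p + lr * (beta * m + gj) for p, m, gj in zip(phi, momentum, g)]
    return phi, momentum

phi, m = outer_step([0.0], [[1.0], [0.5]], [0.8, 0.2], [0.0])
print(round(phi[0], 6))  # 1.197
```

Miners with w_i = 0 simply contribute nothing to the weighted average, so rejected updates never touch the momentum buffer.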
Phase 4: Robust Expert Integration
Once the partial model Φ* has converged, it must be reintegrated into the full foundation model. Three strategies are available:
Direct Replacement: overwrite the selected expert parameters with the fine-tuned versions. Fast, effective when the inner loop used modest learning rates.
WiSE-FT (Weight-Space Ensemble): interpolate between the original and fine-tuned expert weights: Φ^(final) = (1-α)·Φ^(0) + α·Φ*. The mixing coefficient α controls the tradeoff between specialization and stability. Recommended when the inner loop ran aggressively.
Router Annealing: after Direct Replacement, freeze all expert parameters and fine-tune only the router on a mix of domain and general data. This recalibrates routing probabilities to properly direct domain tokens to the newly specialized experts. Best for large domain shifts where the router hasn't seen the new domain.
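The WiSE-FT interpolation above is a one-liner in weight space (lists of floats stand in for expert weight tensors here):

```python
def wise_ft(original, finetuned, alpha):
    """Weight-space ensemble of expert weights:
    final = (1 - alpha) * original + alpha * finetuned.

    alpha=1.0 keeps the fine-tuned experts unchanged; alpha=0.0 reverts
    to the pre-training weights. Intermediate values trade
    specialization against stability.
    """
    return [(1 - alpha) * o + alpha * f for o, f in zip(original, finetuned)]

# alpha = 0.5 splits the difference between stability and specialization.
print(wise_ft([1.0, 2.0], [3.0, 6.0], alpha=0.5))  # [2.0, 4.0]
```

Direct Replacement is the special case alpha = 1.0, which is why it is safe only when the inner loop used modest learning rates.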
Pilot Results: Mathematics Domain
To validate TEFT, we ran Qwen3-VL-30B-Instruct on the Nemotron-CC-Math-v1 dataset — 133B tokens of high-quality mathematical reasoning text.
Setup: ESFT selection with k=8 experts per layer, 1,600 training steps total.
| Metric | Result | What it means |
|---|---|---|
| Training loss at step 0 | >10.0 | Model has no prior math fine-tuning |
| Training loss at step 500 | ~3.5 | Rapid domain adaptation |
| GSM8k PPL throughout | ~2.57 (stable) | Math reasoning benchmark unaffected |
| Wiki PPL throughout | ~12.15 (stable) | General capability perfectly preserved |
The critical result is the Wikipedia PPL stability. A model with catastrophic forgetting would show Wiki PPL increasing as math training progresses. Instead, it flatlines at 12.15 across all 1,600 steps. Expert isolation works.
Figure: TEFT math pilot results. Training loss (top) drops rapidly. Both downstream perplexity metrics (bottom) remain stable throughout training.
How TEFT Compares to Related Work
| System | Decentralized | Expert-selective | Quality gate | Catastrophic forgetting |
|---|---|---|---|---|
| TEFT | ✓ | ✓ | Proof-of-Loss | Prevented by design |
| ESFT | ✗ | ✓ | None | Partial mitigation |
| DES-MoE | ✗ | ✓ | None | 89% reduction |
| BTX | Partially | ✗ | None | Not addressed |
| FlexOLMo | Partially | ✓ | None | Partial |
| DiLoCo | ✓ | ✗ (dense) | None | Not applicable (dense) |
TEFT is the only system that combines all four properties. The Proof-of-Loss mechanism is the key novelty: it enables quality gating without a centralized evaluator, making the system viable in a trust-minimal permissionless environment.
References
- Wang et al. (2024). ESFT: Towards Efficient Fine-Tuning for Large Mixture-of-Experts Models. arXiv:2409.10878
- Li et al. (2025). DES-MoE: Dynamic Expert Specialization for Catastrophic Forgetting-Free MoE Adaptation.
- Sukhbaatar et al. (2024). Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. arXiv:2403.07816
- Douillard et al. (2024). DiLoCo: Distributed Low-Communication Training of Language Models. arXiv:2311.08105