TEFT: Targeted Expert Fine-Tuning

Targeted Expert Fine-Tuning (TEFT) is the optimization framework behind BlockZero. It enables large Mixture-of-Experts (MoE) models to adapt to new domains without retraining the entire model — and without requiring centralized, high-bandwidth infrastructure.

The core idea is simple:

In an MoE model, only a small subset of experts are responsible for any given domain.
TEFT trains only those experts — and leaves everything else untouched.

This approach is grounded in prior work showing that expert activation patterns are highly domain-specific, and that most experts remain “cold” for a given task (Wang et al., 2024; Li et al., 2025a). Instead of synchronizing the full parameter set, TEFT isolates and updates only the sparse components that matter.

The result: dramatically lower communication cost, preserved general capability, and clean integration back into the base model.

TEFT Pipeline

  1. Identify — Analyze routing to find domain-relevant experts
  2. Train — Sparse local training on selected experts only
  3. Aggregate — Proof-of-Loss weighted scoring
  4. Reintegrate — Merge trained experts back into the foundation model

Step 1: Expert Identification

Before training begins, TEFT analyzes routing behavior on the target dataset to determine which experts are consistently activated.

This produces a static index set, $I_{\text{target}}$, representing the experts most relevant to the domain.

Only these experts are included in the training payload. All other experts remain frozen and are never transmitted across the network.

Because MoE routing distributions are typically highly concentrated (Wang et al., 2024), this subset is small relative to the total number of experts — often a single-digit percentage of the experts in each layer.

This is the first source of efficiency: bandwidth scales with relevant experts, not model size.
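The identification step can be sketched as a simple frequency analysis over recorded routing decisions. This is an illustrative minimal version, not the production algorithm: the function name `identify_target_experts`, the `top_fraction` cutoff, and the trace format are all assumptions for the sake of the example.

```python
from collections import Counter

def identify_target_experts(routing_traces, top_fraction=0.1):
    """Select the most frequently activated experts per layer.

    routing_traces: {layer_id: [expert_id, ...]} — expert activations
    recorded while routing the target dataset through the model.
    Returns {layer_id: sorted list of selected expert ids}, i.e. I_target.
    """
    target = {}
    for layer, activations in routing_traces.items():
        counts = Counter(activations)
        # Keep only the top fraction of distinct experts seen in this layer.
        k = max(1, int(len(counts) * top_fraction))
        target[layer] = sorted(e for e, _ in counts.most_common(k))
    return target

# Toy trace: layer 0 routes overwhelmingly to experts 3 and 7.
traces = {0: [3, 7, 3, 3, 7, 1, 3, 7, 7, 3] * 10 + [2, 5]}
print(identify_target_experts(traces, top_fraction=0.5))  # → {0: [3, 7]}
```

In a real MoE, the cutoff would more likely be based on cumulative routing probability mass rather than a fixed fraction, but the output is the same kind of object: a small, static index set per layer.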


Step 2: Sparse Local Training

Miners download only the selected expert parameters and perform local optimization on their partition of the dataset.

Let:

  • $\Phi^{(t)}$ denote the current partial model (only the selected experts),
  • $D_{\text{local}}$ denote a miner’s data shard.

After $H$ steps of local training:

$$\Phi^{(t,H)} \leftarrow \text{InnerOpt}(\Phi^{(t)}, D_{\text{local}}, H)$$

Crucially:

  • Shared hub parameters $\theta_H$ remain frozen
  • Non-selected experts remain frozen
  • Only parameters in $I_{\text{target}}$ are updated

This protects general reasoning ability while allowing domain-specific plasticity.

The miner then submits the weight displacement:

$$\Delta_i = \Phi^{(t,H)}_i - \Phi^{(t)}$$

rather than full gradients, so that $\Phi^{(t)} + \Delta_i$ recovers miner $i$'s trained parameters.
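A minimal sketch of the inner loop, with scalar "parameters" standing in for expert weight tensors. The displacement is taken as trained-minus-base, so adding it back to the base reproduces the trained parameters (consistent with the $\Phi^{(t)} + \Delta_i$ evaluation in Step 3). `sparse_local_train` and `local_grad` are illustrative names, not APIs from the source.

```python
def sparse_local_train(phi, i_target, local_grad, lr=0.1, steps=4):
    """InnerOpt sketch: `steps` (= H) rounds of SGD on selected experts only.

    phi:        {expert_id: param} — the partial model Phi^(t)
                (scalars here for illustration; tensors in practice).
    i_target:   set of expert ids chosen in Step 1.
    local_grad: callable (expert_id, param) -> gradient on D_local.
    Returns (Phi^(t,H), Delta) where Delta = Phi^(t,H) - Phi^(t).
    """
    trained = dict(phi)
    for _ in range(steps):
        for e in i_target:                 # all other experts stay frozen
            trained[e] -= lr * local_grad(e, trained[e])
    # Miners submit the displacement, not per-step gradients.
    delta = {e: trained[e] - phi[e] for e in i_target}
    return trained, delta

# Toy per-expert objective (p - 1)^2, gradient 2 * (p - 1).
phi = {0: 0.0, 1: 0.0}
trained, delta = sparse_local_train(phi, {0}, lambda e, p: 2.0 * (p - 1.0),
                                    lr=0.25, steps=1)
# Expert 0 moves toward 1.0; expert 1 is untouched and absent from delta.
```

The key property the sketch preserves: the payload (`delta`) only ever contains keys from $I_{\text{target}}$, so communication scales with the selected experts, not the model.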


Step 3: Proof-of-Loss Aggregation

Validators do not blindly average updates.

Instead, each submitted update is evaluated on a held-out validation set. The contribution weight for miner $i$ is proportional to the observed reduction in loss:

$$w_i \propto \operatorname{ReLU}\!\left( L(\Phi^{(t)}) - L(\Phi^{(t)} + \Delta_i) \right)$$

If an update does not reduce validation loss, its weight becomes zero.

This mechanism:

  • Filters noisy or adversarial updates
  • Rewards genuine improvement
  • Prevents free-rider behavior

The weighted updates are then aggregated using a DiLoCo-style outer optimization step (Douillard et al., 2024), which enables low-frequency synchronization with stable convergence.
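The scoring and aggregation step can be sketched as follows. This is a simplified model of the mechanism: it uses an outer learning rate of 1 and plain weighted averaging, whereas a DiLoCo-style outer step would feed the averaged displacement into an outer optimizer (e.g. Nesterov momentum). Function names are illustrative.

```python
def proof_of_loss_weights(phi, deltas, val_loss):
    """Score each miner's update by its ReLU'd held-out loss reduction."""
    base = val_loss(phi)
    raw = []
    for delta in deltas:
        candidate = {e: p + delta.get(e, 0.0) for e, p in phi.items()}
        raw.append(max(0.0, base - val_loss(candidate)))  # ReLU: no harm rewarded
    total = sum(raw)
    # If no update helped, every weight is zero and the round is skipped.
    return [w / total for w in raw] if total > 0.0 else [0.0] * len(deltas)

def outer_step(phi, deltas, weights):
    """Apply the weighted-average displacement to the partial model."""
    merged = dict(phi)
    for delta, w in zip(deltas, weights):
        for e, d in delta.items():
            merged[e] += w * d
    return merged

# One helpful miner, one adversarial miner; the target parameter is 1.0.
phi = {0: 0.0}
val_loss = lambda p: (p[0] - 1.0) ** 2
deltas = [{0: 1.0}, {0: -1.0}]
weights = proof_of_loss_weights(phi, deltas, val_loss)  # adversarial miner gets 0
new_phi = outer_step(phi, deltas, weights)
```

Because the adversarial update increases validation loss, its ReLU'd score is zero: it contributes nothing to the aggregate and earns nothing, which is exactly the free-rider protection described above.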


Step 4: Reintegration into the Foundation Model

Once the partial model converges, the trained experts are merged back into the original foundation model.

Because only a sparse subset was updated:

  • General knowledge remains intact
  • Unrelated experts are unaffected
  • Domain specialization is cleanly isolated

Empirical evidence from partial-model experiments shows that updating only domain-relevant experts can achieve strong in-domain convergence while maintaining stable out-of-domain perplexity (see feasibility results, Section 5).
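Reintegration itself is structurally simple: because expert parameters are disjoint modules, the merge is a keyed overwrite. A minimal sketch, assuming a flat `{(layer, expert_id): params}` table for the expert weights (the layout and the name `reintegrate` are assumptions, not the source's API):

```python
def reintegrate(foundation, trained, i_target):
    """Merge the converged partial model back into the foundation model.

    foundation: {(layer, expert_id): params} — full expert table (hub
                parameters were frozen throughout, so they need no merge).
    trained:    {(layer, expert_id): params} — converged partial model.
    i_target:   {layer: [expert_id, ...]} from Step 1.
    Only entries in I_target are overwritten; every other expert keeps its
    original weights, which is what preserves out-of-domain behavior.
    """
    merged = dict(foundation)
    for layer, experts in i_target.items():
        for e in experts:
            merged[(layer, e)] = trained[(layer, e)]
    return merged

base = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0}
adapted = reintegrate(base, {(0, 1): 9.0}, {0: [1]})
# Only expert (0, 1) changes; (0, 0) and (1, 0) are byte-identical to base.
```

No interpolation or conflict resolution is needed precisely because the non-selected experts were never touched: the merge is disjoint by construction.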


Why TEFT Works

TEFT leverages three empirical properties of modern MoE models:

  1. Expert sparsity — Only a small subset of experts are active per domain.
  2. Expert modularity — Experts can be updated independently.
  3. Sparse utilization — Many experts contribute minimally for any given task.

These properties are repeatedly observed across MoE literature, including ESFT (Wang et al., 2024), DES-MoE (Li et al., 2025a), and pruning analyses (Chen et al., 2022).

Instead of fighting MoE sparsity with dense synchronization, TEFT aligns with it.


The Result

TEFT transforms MoE adaptation from a monolithic training job into a sparse, modular, verifiable optimization process.

It achieves:

  • Orders-of-magnitude lower communication overhead
  • Protection against catastrophic forgetting
  • Incentive-aligned aggregation
  • Compatibility with decentralized infrastructure

Rather than asking every node to train the entire model, TEFT asks:

Which experts matter for this domain — and who can improve them?

That shift makes large-scale collaborative adaptation feasible.