Connito AI
Layer 3: Distributed System

Expert Partitioning

The expert partitioning system is what makes Connito's hardware requirements tractable for individual miners. This page explains how the model's expert parameters are divided into groups, how groups are assigned to miners, and what this means for hardware selection and training strategy.

How Memory Scales Per Miner

In standard data-parallel distributed training, every worker must hold the entire model in memory. For a 70B-parameter model, that means 70B × 2 bytes (fp16) = ~140GB of VRAM per worker — well beyond consumer hardware.

Connito's expert partitioning inverts this requirement. The model's expert parameters are divided into num_groups partitions. Each miner receives and trains only their assigned partition. The memory requirement per miner scales as:

miner_memory ≈ total_expert_params / num_groups + hub_params

Since hub parameters are frozen and can be kept in 8-bit precision or offloaded to CPU, the effective GPU memory cost per miner is approximately total_params / num_groups.

For a 30B-parameter model split across 4 expert groups, each miner requires roughly 7.5B parameters worth of GPU memory — achievable on a single A100 80GB or two A6000 48GB GPUs.

As the subnet grows and more groups are defined, the per-miner memory requirement decreases. A 100-group partition of Qwen3-VL-30B would require only ~300M parameters per miner — compatible with a single RTX 4090.
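The scaling formula above can be checked with back-of-envelope arithmetic. The parameter counts below (28B expert parameters, 2B hub parameters) are illustrative assumptions, not Connito's actual configuration:

```python
# Back-of-envelope per-miner memory estimate, following the formula above.
# Parameter counts here are illustrative assumptions, not Connito's config.

def miner_memory_gb(total_expert_params: float, num_groups: int,
                    hub_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (GB) for one miner's slice (fp16 by default)."""
    params = total_expert_params / num_groups + hub_params
    return params * bytes_per_param / 1e9

# A hypothetical 30B model with 28B expert params and 2B hub params:
print(miner_memory_gb(28e9, 4, 2e9))    # 18.0 GB in fp16 for slice + hub
print(miner_memory_gb(28e9, 100, 2e9))  # hub dominates at high group counts
```

Note that this is only the weight footprint; gradients and optimizer state for the trainable expert slice add to it, which is why the 4-group example in the text targets an A100-class card rather than exactly 15 GB of VRAM.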

How ExpertManager Works

The ExpertManager class handles all expert partitioning logic. It operates by scanning the model's state_dict keys to identify expert parameter blocks using naming conventions specific to each supported architecture.

For Qwen3-VL-30B and similar models, expert layers follow the pattern:

model.layers.{n}.mlp.experts.{k}.*

where {n} is the transformer layer index and {k} is the expert index within that layer. The ExpertManager groups these keys into logical expert groups by clustering expert indices across all layers.

For example, if a model has 28 transformer layers each with 64 experts, and num_groups = 8, the ExpertManager assigns experts 0–7 to Group 0, experts 8–15 to Group 1, and so on — consistently across all layers. Each group contains a complete vertical slice of the model: a consistent set of expert indices spanning the full depth of the network.
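The grouping step above can be sketched in a few lines. The key pattern follows the convention quoted earlier; the helper name and exact clustering rule are assumptions about how ExpertManager behaves, not its actual code:

```python
import re
from collections import defaultdict

# Minimal sketch of the grouping step described above. The regex follows the
# documented key pattern; the helper itself is an illustrative assumption.
EXPERT_KEY = re.compile(r"model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.")

def group_expert_keys(state_dict_keys, num_experts: int, num_groups: int):
    """Map each expert state_dict key to a group ID by expert index."""
    per_group = num_experts // num_groups            # e.g. 64 / 8 = 8
    groups = defaultdict(list)
    for key in state_dict_keys:
        m = EXPERT_KEY.match(key)
        if m is None:
            continue                                 # hub / non-expert param
        expert_idx = int(m.group(2))
        groups[expert_idx // per_group].append(key)  # same split every layer
    return dict(groups)

keys = [f"model.layers.{n}.mlp.experts.{k}.up_proj.weight"
        for n in range(28) for k in range(64)]
groups = group_expert_keys(keys, num_experts=64, num_groups=8)
print(len(groups))     # 8 groups
print(len(groups[0]))  # 28 layers x 8 experts = 224 keys in Group 0
```

Because the group ID depends only on the expert index, every group contains the same expert indices at every layer, giving the "complete vertical slice" described above.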

ExpertMapping Format

The ExpertMapping data structure records the assignment of expert group IDs to miners:

ExpertMapping = {
    "group_id": int,              # which expert group this entry describes
    "expert_indices": List[int],  # which expert indices (per layer) are in this group
    "layer_indices": List[int],   # which transformer layers contain MoE blocks
    "miner_hotkeys": List[str],   # hotkeys of miners assigned to this group
    "param_keys": List[str],      # exact state_dict keys for all params in this group
}

This mapping is produced once per model configuration and stored by the validator. When a miner registers with the subnet, their hotkey is associated with a group ID. The validator uses the ExpertMapping to construct the partial model slice sent to each miner at the start of each training cycle.
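The lookup the validator performs can be sketched as follows. Field names match the ExpertMapping format above; the slicing helper and the hotkey value are hypothetical, not Connito's actual implementation:

```python
# Hedged sketch: how a validator might use the ExpertMapping to extract the
# parameter slice for one miner. Helper and hotkey are illustrative only.

def slice_for_hotkey(full_state_dict: dict, mappings: list, hotkey: str) -> dict:
    """Return the sub-state_dict covering the group(s) a hotkey is assigned to."""
    keys = set()
    for entry in mappings:
        if hotkey in entry["miner_hotkeys"]:
            keys.update(entry["param_keys"])
    return {k: v for k, v in full_state_dict.items() if k in keys}

mappings = [{
    "group_id": 0,
    "expert_indices": [0, 1],
    "layer_indices": [0],
    "miner_hotkeys": ["5Fabc..."],  # hypothetical hotkey
    "param_keys": ["model.layers.0.mlp.experts.0.up_proj.weight",
                   "model.layers.0.mlp.experts.1.up_proj.weight"],
}]
full = {k: None for k in ["model.layers.0.mlp.experts.0.up_proj.weight",
                          "model.layers.0.mlp.experts.1.up_proj.weight",
                          "model.layers.0.mlp.experts.2.up_proj.weight"]}
print(sorted(slice_for_hotkey(full, mappings, "5Fabc...").keys()))
```

A hotkey not present in any mapping entry simply receives an empty slice, which is consistent with the validator rejecting submissions for unassigned groups.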

Group Competition

Multiple miners can be assigned to the same expert group simultaneously. This is a deliberate design choice:

  • Redundancy ensures quality. If one miner in a group submits low-quality updates, other miners in the same group may submit better updates. The Proof-of-Loss scoring selects the best.
  • Competition raises quality. Miners competing for the same group have a direct incentive to train more carefully — their score, and therefore their reward, is determined relative to other miners in the same group.
  • Load balancing. Popular groups (e.g., those with domain-aligned data) will attract more miners, increasing coverage and update diversity.

Within each expert group, the validator runs Proof-of-Loss on all submissions and uses the utility-weighted aggregation formula. Miners with w_i = 0 (degrading submissions) receive no reward for that round.
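The utility-weighted aggregation formula is defined elsewhere in these docs; the sketch below only illustrates the shape of the step described above — submissions with w_i = 0 are dropped and the rest are combined with normalized weights. Treat the specific weighting as an assumption, not Connito's exact formula:

```python
# Illustrative aggregation: drop w_i = 0 submissions, weighted-average the
# rest. The real utility-weighted formula is defined elsewhere in the docs.

def aggregate(submissions, weights):
    """Weighted average of per-parameter updates, skipping w_i = 0 miners."""
    kept = [(s, w) for s, w in zip(submissions, weights) if w > 0]
    if not kept:
        return None                       # no improving submission this round
    total = sum(w for _, w in kept)
    keys = kept[0][0].keys()
    return {k: sum(w * s[k] for s, w in kept) / total for k in keys}

subs = [{"p": 1.0}, {"p": 3.0}, {"p": 100.0}]
ws   = [1.0, 1.0, 0.0]                    # third miner degraded the loss
print(aggregate(subs, ws))                # {'p': 2.0}
```

The zero-weight path matters: a degrading submission contributes nothing to the merged update, matching the "no reward for that round" rule above.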

How Groups Are Assigned

Group assignment is deterministic based on miner hotkey — a miner's cryptographic identity on the Bittensor network. This means:

  1. A miner's group assignment is stable across training rounds; they do not need to re-negotiate their group each cycle.
  2. Miners can change their group preference by re-registering with a different group specification.
  3. The validator enforces group assignments: a miner submitting weights for a group they are not assigned to will have their submission rejected.
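One deterministic scheme consistent with the properties above is to hash the hotkey and reduce it modulo the group count. This is an assumption for illustration — the actual rule is not specified here, and a miner's explicit group preference (point 2) would override any default:

```python
import hashlib

# Illustrative deterministic assignment: hash the hotkey, mod num_groups.
# An assumption, not Connito's actual rule; explicit preferences override it.

def group_for_hotkey(hotkey: str, num_groups: int) -> int:
    digest = hashlib.sha256(hotkey.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

# Stable across calls and rounds: same hotkey always yields the same group.
g = group_for_hotkey("5GrwvaEF...", 8)   # hypothetical hotkey
assert g == group_for_hotkey("5GrwvaEF...", 8)
print(0 <= g < 8)   # True
```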

ESFT & DES-MoE Selection

Expert group assignment and expert selection (Phase 1 of TEFT) operate at two different levels of granularity:

  • Group assignment determines which slice of the model's expert parameters a miner is responsible for. This is a coarse partition across the full expert pool.
  • Expert selection (ESFT or DES-MoE) determines which experts within the assigned group are actually trained in a given round. A miner assigned to Group 2 might have 8 experts per layer in their group; ESFT might select only 3 of those 8 as the most domain-relevant.

This two-level structure provides flexibility: miners train a well-defined memory footprint (their group), but within that group, they apply domain-targeted fine-tuning to further minimize compute and improve specialization quality.

Two-Level Expert Selection

  Full Expert Pool (64 experts × 28 layers)
    → Group Assignment: coarse partition (8 experts/group)
    → ESFT Selection: fine-grained (3 of 8 most relevant)
    → Trained Experts: domain-specialized output

Implications for Miners

The partitioning design creates meaningful strategic choices:

Domain alignment. Miners with domain-specific data (e.g., a dataset of scientific papers) should prefer groups where the pre-trained experts already show activation on that domain. This can be determined by running a quick ESFT frequency analysis on a sample of the target data before registering.
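The "quick ESFT frequency analysis" mentioned above can be sketched as counting router selections over a data sample and summing them per group. The routing data below is mocked; in practice it would come from a forward pass of the pre-trained model over the miner's target dataset:

```python
from collections import Counter

# Hedged sketch of a pre-registration frequency analysis: count how often
# each expert is routed to on sample data, then rank groups by total hits.
# Routing is mocked here; real counts come from a model forward pass.

def expert_frequencies(routed_experts_per_token):
    """Count expert selections over a batch of routed tokens."""
    counts = Counter()
    for experts in routed_experts_per_token:   # top-k expert IDs per token
        counts.update(experts)
    return counts

def group_affinity(counts, per_group=8, num_groups=8):
    """Sum selection counts per expert group to rank groups for registration."""
    totals = [0] * num_groups
    for expert, n in counts.items():
        totals[expert // per_group] += n
    return totals

# Mocked routing: sample tokens mostly hit experts 8-15, i.e. Group 1.
routed = [[8, 12], [9, 15], [8, 10], [3, 9]]
counts = expert_frequencies(routed)
print(group_affinity(counts))   # Group 1 dominates for this sample
```

A miner would register for whichever group dominates this tally, on the reasoning given above that pre-trained experts already active on the target domain are the best candidates for specialization.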

Hardware capacity. Each group has a fixed memory footprint. Miners with more VRAM can register for larger groups (if variable group sizes are configured) or run multiple group assignments in parallel.

Competition assessment. Groups with fewer registered miners offer less competition for the Proof-of-Loss reward. Groups with many miners offer more redundancy but each individual miner's share of the reward is lower.

Data quality vs. compute. The Proof-of-Loss mechanism rewards loss reduction, not training duration. A miner with high-quality domain data may outperform a miner with more compute but lower-quality data. Data curation is a first-class strategy for maximizing reward.

The Connito subnet does not require any coordination between miners in the same group. Each miner trains independently and submits independently. The validator handles all aggregation. There is no peer-to-peer communication between miners — only between miners and validators.