
Subnet System Architecture

This page describes the operational architecture of the BlockZero subnet — the cycle of distributed training, evaluation, and model update that runs continuously on the Bittensor blockchain. It assumes familiarity with machine learning training loops but not with Bittensor; Bittensor-specific concepts are explained briefly where they appear.

System Overview

BlockZero operates as a Bittensor subnet — a specialized subnetwork within the Bittensor protocol consisting of three classes of participants:

  • Subnet Owner (SN owner): the central coordinator of the subnet. The SN owner controls expert group assignment — defining which expert partitions exist, which datasets they train on, and how they map to model layers — and translates customer requirements into the subnet's task and data design, turning business needs into expert group configurations. Operationally, the SN owner runs the phase service that provides cycle timing to all participants and bootstraps the DHT network for inter-validator communication.
  • Miners (workers): nodes that perform computation. In BlockZero, miners train expert subsets of the MoE model on domain data.
  • Validators: nodes that evaluate and aggregate miner contributions. Validators hold the full model, score miner submissions, and post the resulting quality scores to the Bittensor blockchain.

The subnet operates in repeating cycles of 45 blocks, where one Bittensor block is produced approximately every 12 seconds. Each full training cycle takes approximately 9 minutes.

Figure (subnet-cycle-diagram): The 45-block (~9-minute) BlockZero subnet cycle, showing the four phases and the roles of miners and validators in each.

The Four-Phase Cycle

The 45-block training cycle (~9 min) consists of four phases:

  • Distribute (~5 blocks)
  • Train (~30 blocks)
  • Commit (~5 blocks)
  • Evaluate & Merge (~5 blocks)

Phase 1: Distribute (~5 blocks)

Validators serve the current global model state to miners. Each miner is assigned to an expert group — a partition of the model's expert parameters. A miner may also opt into a different expert group by changing the task.expert_group_name field in its configuration file (e.g., setting it to "exp_math" or "exp_dummy"); the name resolves to the corresponding expert_groups/<name>/config.yaml, which contains the group's group_id, dataset, and training hyperparameters. Miners download only the parameters relevant to their group via HTTP from the validator's model endpoint.
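As a sketch of how that name-to-config resolution might work (the helper mirrors the path convention described above; it is not the subnet's actual code):

```python
from pathlib import Path

def resolve_expert_group(name: str, root: str = "expert_groups") -> Path:
    # task.expert_group_name -> expert_groups/<name>/config.yaml,
    # which holds the group's group_id, dataset, and hyperparameters.
    return Path(root) / name / "config.yaml"

print(resolve_expert_group("exp_math"))  # expert_groups/exp_math/config.yaml
```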

The download is a partial model $\Phi^{(t)}$ containing:

  • The selected expert weights $\{W_{\ell,i} \mid i \in I_{\text{target}}\}$ for the miner's assigned group (see TEFT: Targeted Expert Fine-Tuning for how $I_{\text{target}}$ is determined)
  • The router parameters $\Psi$ (needed to compute routing probabilities during training)

Shared hub parameters $\theta_H$ (attention, layer norms) are not transmitted by default; miners are initialized with the frozen hub from a prior checkpoint.
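A minimal sketch of the server-side filtering, assuming a parameter naming scheme like layers.<l>.experts.<i>.* (the scheme is an illustration, not BlockZero's actual state-dict layout):

```python
def partial_model(state_dict, target_expert_ids):
    """Filter a full state dict down to one miner's download.

    Keeps expert weights whose expert id is in target_expert_ids, plus all
    router parameters; shared hub params (attention, norms) are dropped.
    The "layers.<l>.experts.<i>.*" naming scheme is an assumed illustration.
    """
    keep = {}
    for name, tensor in state_dict.items():
        parts = name.split(".")
        if "experts" in parts:
            expert_id = int(parts[parts.index("experts") + 1])
            if expert_id in target_expert_ids:
                keep[name] = tensor
        elif "router" in parts:
            keep[name] = tensor  # router parameters are always included
    return keep
```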

Phase 2: Train (~30 blocks)

Miners run inner optimization independently, with no synchronization required. The training loop is:

$$\Phi^{(t,H)} \leftarrow \text{InnerOpt}\left(\Phi^{(t)}, D_{\text{local}}, H\right)$$

The inner optimizer uses AdamW with a cosine learning rate schedule, running in fp16 precision with a GradScaler for numerical stability. Only the parameters in $I_{\text{target}}$ are updated; shared hub parameters $\theta_H$ and non-selected experts remain frozen — consistent with the TEFT principle of sparse, targeted updates.

$H$ (the number of inner steps) is set by the subnet configuration and represents the amount of local training done before the miner must submit. In the current configuration, this corresponds to approximately 100 gradient steps per cycle.

The miner then computes the weight displacement (pseudo-gradient):

$$\Delta_i = \Phi^{(t)} - \Phi^{(t,H)}_i$$
Note: Miners are free to use any hardware and data pipeline, as long as they produce valid weight updates for their assigned expert group by the end of the training phase. The Bittensor protocol is hardware-agnostic.
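The displacement itself is a per-parameter subtraction; a minimal sketch over plain name-to-value maps (real implementations operate on tensors):

```python
def weight_displacement(phi_before, phi_after):
    """Pseudo-gradient: Delta_i = Phi^(t) - Phi_i^(t,H), per parameter.

    Plain name-to-value maps stand in for tensors here.
    """
    return {name: phi_before[name] - phi_after[name] for name in phi_before}

# One training phase that moved w from 1.5 to 0.5:
print(weight_displacement({"w": 1.5}, {"w": 0.5}))  # {'w': 1.0}
```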

Phase 3: Commit (~5 blocks)

Before submitting trained weights, each miner performs a two-phase commit to the blockchain:

  1. Hash phase: The miner computes a cryptographic hash of the weight displacement $\Delta_i$ and posts this hash to the blockchain. The hash is recorded on-chain but the weights themselves are not yet submitted.
  2. Submit phase: After the hash is committed, the miner submits the actual weights to the validator.
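A sketch of both phases using SHA-256, with sorted-JSON serialization as a stand-in for whatever canonical weight encoding the subnet actually uses:

```python
import hashlib
import json

def commit_hash(delta):
    """Phase 1: hash of the weight displacement, posted on-chain.

    Sorted-JSON serialization is an illustrative stand-in for the
    subnet's actual canonical encoding.
    """
    payload = json.dumps(delta, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_submission(onchain_hash, delta):
    """Phase 2 check: validator recomputes the hash of the revealed weights."""
    return commit_hash(delta) == onchain_hash

delta = {"experts.1.w": 0.25}
h = commit_hash(delta)                                  # posted to chain first
assert verify_submission(h, delta)                      # honest reveal passes
assert not verify_submission(h, {"experts.1.w": 0.9})   # swapped weights rejected
```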

Figure (two-phase-commit-sequence): The two-phase commit protocol. The miner posts a hash to the blockchain before revealing the weights, making it computationally infeasible to change the submission retroactively after seeing other miners' updates.

Why Two-Phase Commit Prevents Gaming

Without this protocol, a rational miner could observe other miners' submissions before posting their own, then copy the best-performing update to claim unearned reward. This is a form of front-running — well-known in blockchain contexts.

The hash-before-submit design prevents this: once a miner posts the hash, the content of their submission is fixed. Any attempt to submit different weights after observing competitors would produce a hash mismatch, and the validator would reject the submission. This guarantees that every submitted update reflects genuine independent training work.

Phase 4: Submit and Evaluate (~5 blocks)

Validators download all submitted weight updates, evaluate each one using Proof-of-Loss, synchronize gradients across validators, aggregate using the outer optimizer, and update the global model.

Scoring: For each miner $i$ with pseudo-gradient $\Delta_i$:

$$w_i \propto \text{ReLU}\left( L(\Phi^{(t)}) - L(\Phi^{(t)} - \Delta_i) \right)$$

The validator subtracts $\Delta_i$ from a local copy of the current model (recovering the miner's trained parameters, since $\Delta_i = \Phi^{(t)} - \Phi^{(t,H)}_i$) and measures loss on a held-out validation set $D_{\text{val}}$. The weight $w_i$ is the loss reduction — how much the miner's update improved the model. Miners whose updates increase loss receive $w_i = 0$.
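A toy sketch of the scoring rule, using a stand-in loss function and name-to-value maps for parameters; the candidate model is recovered by subtracting the displacement, since the pseudo-gradient is defined as before-minus-after:

```python
def proof_of_loss_score(loss_fn, phi, delta):
    # Recover the miner's trained model: Delta = old - new, so new = old - Delta.
    candidate = {k: phi[k] - delta[k] for k in phi}
    # Score = loss reduction, floored at zero (the ReLU).
    return max(0.0, loss_fn(phi) - loss_fn(candidate))

# Toy quadratic loss over a one-parameter "model".
loss = lambda params: sum(v * v for v in params.values())
print(proof_of_loss_score(loss, {"w": 1.0}, {"w": 0.5}))   # 0.75 (loss 1.0 -> 0.25)
print(proof_of_loss_score(loss, {"w": 1.0}, {"w": -1.0}))  # 0.0 (update made loss worse)
```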

Local gradient aggregation: Each validator populates the global model's gradients from the top-scoring miners' weight displacements, weighted equally:

$$\nabla_{\text{local}} = \frac{1}{N} \sum_{i=1}^{N} \Delta_i \quad \text{(over the } N \text{ miners with } w_i > 0\text{)}$$
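The equal-weight averaging of accepted updates can be sketched as follows (name-to-value maps again stand in for tensors):

```python
def aggregate_local(deltas, scores):
    """Average the displacements of accepted miners (those with w_i > 0)."""
    accepted = [d for d, w in zip(deltas, scores) if w > 0]
    n = len(accepted)
    keys = accepted[0].keys()
    return {k: sum(d[k] for d in accepted) / n for k in keys}

# Third miner scored 0, so only the first two contribute:
print(aggregate_local([{"w": 1.0}, {"w": 3.0}, {"w": 9.0}], [0.5, 0.2, 0.0]))
# {'w': 2.0}
```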

Inter-validator synchronization: Before running the outer optimizer, validators synchronize their locally aggregated gradients across all active validators using a decentralized allreduce (see Inter-Validator Merging). Each validator averages its local gradients with those of peer validators per expert group, ensuring all validators converge on the same global update.

Outer optimization: The DiLoCo-style outer optimizer applies Nesterov momentum SGD to the synchronized gradients:

$$\Phi^{(t+1)} \leftarrow \Phi^{(t)} - \alpha \cdot \text{OuterOpt}\left( \nabla_{\text{synced}} \right)$$
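A minimal sketch of one Nesterov-momentum outer step over scalar parameters; the lr and momentum defaults are illustrative, not the subnet's configuration:

```python
class OuterOpt:
    """DiLoCo-style outer optimizer: SGD with Nesterov momentum.

    Hyperparameter defaults here are illustrative only.
    """
    def __init__(self, lr=0.7, momentum=0.9):
        self.lr = lr
        self.mu = momentum
        self.velocity = {}  # per-parameter momentum buffer, persists across cycles

    def step(self, phi, grad):
        new_phi = {}
        for k, g in grad.items():
            v = self.mu * self.velocity.get(k, 0.0) + g
            self.velocity[k] = v
            # Nesterov look-ahead: apply the momentum-adjusted gradient.
            new_phi[k] = phi[k] - self.lr * (g + self.mu * v)
        return new_phi
```

With momentum set to 0 and lr to 1 this reduces to plain SGD, which is a useful sanity check on the update rule.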

Blockchain settlement: The validator calls set_weights() on the Bittensor blockchain, posting each miner's normalized score. The Bittensor protocol converts these scores into token emission weights — determining how the subnet's share of TAO rewards is distributed among miners.
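Before the set_weights() call, raw scores are typically normalized to sum to one; a minimal sketch (how Bittensor quantizes the values on-chain is outside its scope):

```python
def normalize_scores(scores):
    """Normalize per-miner scores so they sum to 1 before set_weights().

    scores maps miner uid -> Proof-of-Loss weight w_i.
    """
    total = sum(scores.values())
    if total == 0:
        return {uid: 0.0 for uid in scores}  # no miner improved the model
    return {uid: w / total for uid, w in scores.items()}

print(normalize_scores({1: 1.0, 2: 3.0}))  # {1: 0.25, 2: 0.75}
```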

Communication Efficiency

A key design goal of BlockZero is minimizing communication overhead. The DiLoCo approach achieves approximately 500× less communication than synchronous training baselines by:

  1. Sparse parameter transmission: Only the selected expert group parameters ($I_{\text{target}}$) are transmitted, not the full model.
  2. Infrequent synchronization: Miners synchronize only once per 45-block cycle (~9 minutes), not after every gradient step.
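As a back-of-envelope check on how those two factors compound into the ~500× figure, assuming one expert group holds roughly a fifth of the parameters (the fraction is an assumption for illustration; the ~100 steps per cycle comes from the training phase above):

```python
inner_steps_per_sync = 100  # ~100 gradient steps per 45-block cycle
sparse_fraction = 1 / 5     # assumed: one expert group ~ 20% of model params

# Relative to syncing the full model after every gradient step:
savings = inner_steps_per_sync / sparse_fraction
print(f"~{savings:.0f}x less communication")  # ~500x
```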

This architecture makes BlockZero practical over the open internet, where bandwidth is limited and latency is variable — conditions under which synchronous distributed training approaches collapse due to pipeline bubbles and communication stalls.

Validator Responsibilities

Validators maintain the full model in memory and are responsible for:

  • Serving model slices to miners (HTTP file server)
  • Running Proof-of-Loss evaluation on each submission
  • Aggregating accepted updates using the DiLoCo outer optimizer
  • Maintaining a checkpoint history for model distribution
  • Calling set_weights() to post miner scores on-chain
  • Inter-validator synchronization of aggregated gradients before outer optimization (see Inter-Validator Merging)
Warning: Both miners and validators require an NVIDIA A6000 (48 GB VRAM) or equivalent GPU. Validators hold the full model for Proof-of-Loss evaluation; miners only hold their assigned expert group.