
Subnet Architecture

This page describes the operational architecture of the Connito subnet — the cycle of distributed training, evaluation, and model update that runs continuously on the Bittensor blockchain. It assumes familiarity with machine learning training loops but not with Bittensor; Bittensor-specific concepts are explained briefly where they appear.

System Overview

Connito operates as a Bittensor subnet — a specialized subnetwork within the Bittensor protocol consisting of three classes of participants:

  • Subnet Owner (SN): the central coordinator of the subnet. The SN owner controls expert group assignment — defining which expert partitions exist, which datasets they train on, and how they map to model layers. The SN owner also bridges communication between customer requirements and the subnet's task and data design, translating business needs into expert group configurations. Operationally, the SN owner runs the phase service that provides cycle timing to all participants and bootstraps the DHT network for inter-validator communication.
  • Miners (workers): nodes that perform computation. In Connito, miners train expert subsets of the MoE model on domain data.
  • Validators: nodes that evaluate and aggregate miner contributions. Validators score miner submissions, post the resulting quality scores to the Bittensor blockchain, and aggregate the accepted weight updates from miners into the global model.

The subnet operates in repeating cycles of ~450 blocks, with one Bittensor block produced approximately every 12 seconds; a full training cycle therefore takes roughly 450 × 12 s ≈ 5,400 s, or about 90 minutes.
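For concreteness, the current phase can be derived from the block counter alone. A minimal Python sketch, using the approximate phase lengths quoted in the next section (treating the leftover blocks of the ~450-block cycle as buffer is an assumption here, not part of the spec):

    # Illustrative sketch: map a block number to the current phase of the cycle.
    # Phase lengths are the approximate values quoted on this page; blocks left
    # over after the four phases are treated as buffer (an assumption).
    CYCLE_BLOCKS = 450
    PHASES = [("distribute", 20), ("train", 300), ("commit", 16), ("submit_evaluate", 50)]

    def phase_for_block(block_number: int) -> str:
        offset = block_number % CYCLE_BLOCKS
        for name, length in PHASES:
            if offset < length:
                return name
            offset -= length
        return "buffer"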

The Four-Phase Cycle

Phase 1: Distribute (~20 blocks)

Validators serve the current global model state to miners. Miners download only the parameters relevant to their assigned group via HTTP from the validator's model endpoint.

The download is a partial model: $\Phi^{(t)}$ as served to the miner contains only the selected expert weights $\{W_{\ell,i} \mid i \in I_{\text{target}}\}$ for the miner's assigned group (see TEFT: Targeted Expert Fine-Tuning for how $I_{\text{target}}$ is determined).

Shared parameters (attention, layer norms, and experts outside the target set) are not transmitted by default; miners initialize them from the default parameters of a public model (Deepseek V2 Lite).
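As a sketch of what the validator-side parameter selection might look like (the state-dict key schema "experts.<i>." is a generic MoE convention assumed here, not Connito's documented layout):

    import re

    # Keep only the expert weights W_{l,i} with i in I_target. The key naming
    # assumed here ("...experts.<index>....") follows a generic MoE state-dict
    # layout; the actual schema may differ.
    def slice_expert_weights(state_dict: dict, target_experts: set) -> dict:
        pattern = re.compile(r"experts\.(\d+)\.")
        partial = {}
        for name, tensor in state_dict.items():
            match = pattern.search(name)
            if match and int(match.group(1)) in target_experts:
                partial[name] = tensor
        return partial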

Phase 2: Train (~300 blocks)

Miners run inner optimization independently, with no synchronization required. The training loop is:

$$\Phi^{(t,H)} \leftarrow \text{InnerOpt}(\Phi^{(t)}, D_{\text{local}}, H)$$

The inner optimizer uses AdamW with a cosine learning rate schedule. Only the parameters in $I_{\text{target}}$ are updated; shared parameters and non-selected experts remain frozen, consistent with the TEFT principle of sparse, targeted updates.

$H$ (the number of inner steps) is set by the subnet configuration and represents the amount of local training done before the miner must submit. In the current configuration, this corresponds to approximately 100 gradient steps per cycle.
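A minimal PyTorch sketch of the inner loop under these settings. `is_target_param` is a hypothetical helper that marks the assigned expert group's parameters, the model is assumed to return a `.loss` in Hugging Face style, and the learning rate is illustrative:

    import itertools
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import CosineAnnealingLR

    def inner_opt(model, local_loader, H, is_target_param, lr=1e-4):
        # Freeze everything except the assigned expert group (TEFT-style update).
        for name, param in model.named_parameters():
            param.requires_grad = is_target_param(name)
        trainable = [p for p in model.parameters() if p.requires_grad]

        optimizer = AdamW(trainable, lr=lr)                # inner optimizer per the spec
        scheduler = CosineAnnealingLR(optimizer, T_max=H)  # cosine LR schedule

        batches = itertools.cycle(local_loader)  # D_local, cycled so H can exceed its length
        for _ in range(H):  # H ~ 100 gradient steps per cycle in the current config
            loss = model(**next(batches)).loss  # assumes an HF-style model exposing .loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        return model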

Miners are free to use any hardware and data pipeline, as long as they produce valid weight updates for their assigned expert group by the end of the training phase. The Bittensor protocol is hardware-agnostic.

Phase 3: Commit (~16 blocks)

Before submitting trained weights, each miner performs a two-phase commit to the blockchain:

  1. Hash phase: The miner computes a cryptographic hash of the trained weights $\Phi_i^{(t,H)}$ and posts this hash to the blockchain. The hash is recorded on-chain, but the weights themselves are not yet submitted.
  2. Submit phase: After the hash is committed, the miner submits the actual weights to the validator.
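A sketch of the hash mechanics on both sides of the protocol (the serialization format and hash function are assumptions; only the hash-then-reveal ordering comes from this page):

    import hashlib
    import io
    import torch

    def weights_digest(weights: dict) -> str:
        # Serialize with sorted keys so the byte stream, and hence the hash,
        # is reproducible across the commit and verify sides.
        buffer = io.BytesIO()
        torch.save({k: weights[k] for k in sorted(weights)}, buffer)
        return hashlib.sha256(buffer.getvalue()).hexdigest()

    # Validator side: recompute the digest of the revealed weights and compare
    # it with the hash committed on-chain before accepting the submission.
    def verify_submission(weights: dict, committed_hash: str) -> bool:
        return weights_digest(weights) == committed_hash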

Why Two-Phase Commit

Without this protocol, a rational miner could observe other miners' submissions before posting their own, then copy the best-performing update to claim unearned reward. This is a form of front-running — well-known in blockchain contexts.

The hash-before-submit design prevents this: once a miner posts the hash, the content of their submission is fixed. Any attempt to submit different weights after observing competitors would produce a hash mismatch, and the validator would reject the submission. This guarantees that every submitted update reflects genuine independent training work.

Phase 4: Submit & Evaluate (~50 blocks)

Validators download all submitted weight updates, evaluate each one using Proof-of-Loss, synchronize gradients across validators, aggregate using the outer optimizer, and update the global model.

Scoring: For each worker $i$, the validator computes the validation loss improvement:

$$s_i = \text{ReLU}\left( L(\Phi^{(t)}) - L(\Phi^{(t,H)}_i) \right)$$

Updates are ranked by $s_i$, and only the top-$k$ improving updates are retained. Validation performance is used as a ranking signal: the aggregate update is formed from the top-performing submissions, not proportionally from every worker. If an update does not reduce validation loss, its score is zero and it is excluded.
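In code, the scoring and top-$k$ filter reduce to a few lines (a sketch; it assumes the validation losses $L(\Phi^{(t)})$ and $L(\Phi^{(t,H)}_i)$ have already been computed and are passed in as a uid-to-loss mapping):

    def score_submissions(base_loss, candidate_losses, k):
        # s_i = ReLU(L(Phi(t)) - L(Phi_i(t,H))): positive only when the
        # update actually reduced validation loss.
        scores = {uid: max(0.0, base_loss - loss)
                  for uid, loss in candidate_losses.items()}
        # Rank improving updates by score and keep the top-k; zero-score
        # (non-improving) updates are excluded entirely.
        improving = sorted((uid for uid in scores if scores[uid] > 0),
                           key=lambda uid: -scores[uid])
        return scores, improving[:k]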

Local gradient aggregation: The validator averages the weights from the top-$k$ workers (ranked by $s_i$) to produce a merged weight set, then computes the delta against the original parameters; this delta is used as the gradient in the outer optimizer update:

$$\bar{\Phi} = \frac{1}{k} \sum_{j=1}^{k} \Phi^{(t,H)}_j, \quad \Delta_{\text{agg}} = \Phi^{(t)} - \bar{\Phi}$$
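The corresponding aggregation step over per-parameter tensors, as a sketch (the top-$k$ list comes from the ranking above):

    def aggregate_topk(base, topk_updates):
        # Phi_bar: per-parameter mean of the top-k weight sets;
        # Delta_agg = Phi(t) - Phi_bar becomes the input to the outer optimizer.
        k = len(topk_updates)
        delta = {}
        for name, base_tensor in base.items():
            mean = sum(update[name] for update in topk_updates) / k
            delta[name] = base_tensor - mean
        return delta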

Inter-validator synchronization: Before running the outer optimizer, validators synchronize their locally aggregated gradients across all active validators using a decentralized allreduce. Each validator averages its local gradients with those of peer validators per expert group, ensuring all validators converge on the same global update.
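A sketch of the averaging semantics; torch.distributed is used purely as a stand-in, since the actual system runs a decentralized allreduce over the DHT rather than a static process group:

    import torch.distributed as dist

    def sync_gradients(delta, validator_group=None):
        # Average locally aggregated pseudo-gradients across validators so
        # every validator applies the same global update.
        world = dist.get_world_size(group=validator_group)
        for tensor in delta.values():
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=validator_group)
            tensor /= world
        return delta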

Outer optimization: The DiLoCo-style outer optimizer applies Nesterov momentum SGD to the synchronized gradients:

$$\Phi^{(t+1)} \leftarrow \Phi^{(t)} - \alpha \cdot \text{OuterOpt}\left( \nabla_{\text{synced}} \right)$$
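The outer step can be expressed with a stock Nesterov SGD optimizer by installing the synchronized delta as each parameter's gradient, a common DiLoCo-style pattern (the learning rate and momentum values here are illustrative, not Connito's configuration):

    from torch.optim import SGD

    def make_outer_optimizer(global_params, lr=0.7, momentum=0.9):
        # Nesterov momentum SGD, as in DiLoCo; lr/momentum are illustrative.
        return SGD(global_params.values(), lr=lr, momentum=momentum, nesterov=True)

    def outer_step(optimizer, global_params, synced_delta):
        # Install the synchronized delta as each parameter's gradient, then step.
        # The optimizer persists across cycles so its momentum buffer accumulates.
        for name, param in global_params.items():
            param.grad = synced_delta[name]
        optimizer.step()
        optimizer.zero_grad()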

Blockchain settlement: The validator calls set_weights() on the Bittensor blockchain, posting each miner's normalized score. The Bittensor protocol converts these scores into token emission weights — determining how the subnet's share of TAO rewards is distributed among miners.
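Sketched against the Bittensor Python SDK (the score normalization is from this page; wallet setup is omitted and the exact set_weights() argument handling may vary across SDK versions):

    import bittensor as bt

    def settle_scores(scores, netuid, wallet):
        # Normalize miner scores to weights summing to 1, then post on-chain.
        total = sum(scores.values()) or 1.0
        uids = list(scores)
        weights = [scores[uid] / total for uid in uids]
        bt.subtensor().set_weights(wallet=wallet, netuid=netuid,
                                   uids=uids, weights=weights)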

Communication Efficiency

A key design goal of Connito is minimizing communication overhead. The DiLoCo approach requires approximately 500× less communication than synchronous training baselines by:

  1. Sparse parameter transmission: Only the selected expert group parameters ($I_{\text{target}}$) are transmitted, not the full model.
  2. Infrequent synchronization: Miners synchronize only once per 450-block cycle (~90 minutes), not after every gradient step.
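As a rough consistency check using the numbers quoted on this page: synchronizing once per ~100 inner steps (Phase 2) cuts synchronization frequency by ~100× relative to per-step synchronous training; if a target expert slice is on the order of one fifth of the model (an illustrative figure, not from this page), sparse transmission contributes a further ~5×, and 100 × 5 = 500×.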

This architecture makes Connito practical over the open internet, where bandwidth is limited and latency is variable — conditions under which synchronous distributed training approaches collapse due to pipeline bubbles and communication stalls.

Validator Responsibilities

Validators maintain the full model in memory and are responsible for:

  • Serving model slices to miners (HTTP file server)
  • Running Proof-of-Loss evaluation on each submission
  • Aggregating accepted updates using the DiLoCo outer optimizer
  • Maintaining a checkpoint history for model distribution
  • Calling set_weights() to post miner scores on-chain
  • Inter-validator synchronization of aggregated gradients before outer optimization