Running Your Miner
BlockZero miners consist of two cooperating processes. Both must be running for a complete, reward-generating setup.
Starting Your Miner
Start each process in its own terminal (or in separate tmux/screen sessions):
Terminal A — Local Training:
env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1 \
python mycelia/miner/train.py \
--path /path/to/checkpoints/miner/<your hotkey>/<run name>/
Terminal B — Model I/O (chain communication):
python mycelia/miner/model_io.py \
--path /path/to/checkpoints/miner/<your hotkey>/<run name>/
Both processes must point at the same config directory (the --path argument). When --path is not provided, the default config from mycelia/config.py is used.
Use a separate directory per hotkey (e.g., hk1/, hk2/) to avoid mixing artifacts when running multiple miners.
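As a minimal sketch of the layout described above, the helper below builds the two command lines so that both processes receive the same --path and each hotkey gets its own directory. The function name `build_miner_commands` and the example values `hk1` and `run1` are hypothetical; only the script paths, the --path flag, and the environment variable come from the commands shown earlier.

```python
from pathlib import PurePosixPath


def build_miner_commands(checkpoint_root, hotkey, run_name):
    """Return (env_overrides, train_argv, model_io_argv) for one miner.

    Hypothetical helper: it only assembles the commands shown above and
    does not start anything.
    """
    # One directory per hotkey and run keeps artifacts of different miners apart.
    config_dir = PurePosixPath(checkpoint_root) / "miner" / hotkey / run_name
    env_overrides = {"TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS": "1"}
    train_argv = ["python", "mycelia/miner/train.py", "--path", str(config_dir)]
    model_io_argv = ["python", "mycelia/miner/model_io.py", "--path", str(config_dir)]
    return env_overrides, train_argv, model_io_argv


if __name__ == "__main__":
    env, train_cmd, io_cmd = build_miner_commands("/path/to/checkpoints", "hk1", "run1")
    print(env)
    print(" ".join(train_cmd))
    print(" ".join(io_cmd))
```

The two argument lists could be handed to `subprocess.Popen`, with the environment overrides merged into `os.environ` for the training process only.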
Two-Process Architecture
Figure: The two-process miner architecture. model_io.py manages all chain communication and timing; train.py handles GPU training. They communicate via shared checkpoints and the subprocess coordination layer.
Process 1: train.py — GPU Training Loop
train.py is the GPU-bound process. It:
- Runs the inner optimization loop (AdamW + cosine LR schedule, fp16 GradScaler)
- Loads the expert group from the latest checkpoint written by model_io.py
- Writes training checkpoints for model_io.py to pick up and submit
- Supports multi-GPU via torch.multiprocessing.spawn for DDP
Process 2: model_io.py — Chain Sync Daemon
model_io.py manages all interaction with the Bittensor blockchain and the validator. It runs three background threads:
| Thread | What it does | When it runs |
|---|---|---|
| Download thread | Downloads the latest expert group from the validator's HTTP endpoint | Phase 1 (Distribute) |
| Commit thread | Computes a hash of the current best checkpoint and posts it to the blockchain | Phase 3 (Commit) |
| Submit thread | Uploads the committed checkpoint to the validator server | Phase 4 (Submit) |
The daemon is phase-aware — it watches the chain block number and activates each thread only during the correct window in the 45-block cycle.
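To illustrate what the Commit thread's hashing step might look like, here is a minimal sketch that streams a checkpoint through SHA-256. The `sha256:` prefix is an assumption based on the sample log line further down; the function names are hypothetical, and the exact digest format that model_io.py posts to the chain is not specified here.

```python
import hashlib
import io


def digest_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large checkpoints never sit fully in memory."""
    hasher = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    # Assumed format: algorithm name, a colon, then the hex digest.
    return "sha256:" + hasher.hexdigest()


def checkpoint_digest(path):
    """Hash a checkpoint file on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return digest_stream(handle)


if __name__ == "__main__":
    # Stand-in bytes instead of a real multi-gigabyte checkpoint.
    print(digest_stream(io.BytesIO(b"example checkpoint bytes")))
```

Because the hash is posted before the upload, the checkpoint bytes must not change between the two steps; hashing the exact file that will be uploaded avoids a mismatch.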
What to Watch in Logs
model_io.py logs
[INFO] Chain sync ready, current block: 12340
[INFO] Entering DISTRIBUTE phase at block 12340
[INFO] Downloading expert group 0 from validator...
[INFO] Download complete: expert_group_0_step_1000.pt (7.2GB)
[INFO] Entering TRAIN phase at block 12345
[INFO] Entering COMMIT phase at block 12374
[INFO] Hash committed: sha256:abcdef... at block 12376
[INFO] Entering SUBMIT phase at block 12379
[INFO] Weights submitted: checkpoint_500_12379_group0.pt
[INFO] Submission accepted, evaluation_eta_blocks: 5
train.py logs
[INFO] Loading expert group from checkpoint...
[INFO] Starting training: step 0/500
[INFO] step 100/500 | loss: 8.234 | lr: 2.1e-5 | throughput: 12.3 tok/s
[INFO] step 200/500 | loss: 6.891 | lr: 1.8e-5
[INFO] step 500/500 | loss: 3.512 | lr: 8.4e-6
[INFO] Training cycle complete. Checkpoint saved: checkpoint_500_12379_group0.pt
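If you want to track progress outside W&B, the step lines can be parsed with a small regular expression. This is a sketch that assumes the log format matches the sample lines above exactly; the real output of train.py may differ, and `parse_step_line` is a hypothetical name.

```python
import re

# Matches lines such as:
# [INFO] step 100/500 | loss: 8.234 | lr: 2.1e-5 | throughput: 12.3 tok/s
STEP_LINE = re.compile(
    r"step (?P<step>\d+)/(?P<total>\d+) \| "
    r"loss: (?P<loss>[0-9.]+) \| "
    r"lr: (?P<lr>[0-9.eE+-]+)"
)


def parse_step_line(line):
    """Return step, total, loss, and lr from one log line, or None if it is not a step line."""
    match = STEP_LINE.search(line)
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "total": int(match["total"]),
        "loss": float(match["loss"]),
        "lr": float(match["lr"]),
    }


if __name__ == "__main__":
    sample = "[INFO] step 200/500 | loss: 6.891 | lr: 1.8e-5"
    print(parse_step_line(sample))
```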
The Cycle from Your Perspective
Each 45-block (~9 minute) cycle proceeds as follows:
- Blocks 0–5 (Distribute): model_io.py downloads your expert group from the validator. train.py waits for the download to complete.
- Blocks 5–35 (Train): train.py runs max_steps of gradient descent on your local domain data. model_io.py monitors the checkpoint directory.
- Blocks 35–40 (Commit): model_io.py takes the best checkpoint from train.py, computes its hash, and posts the hash to the blockchain.
- Blocks 40–45 (Submit/Evaluate): model_io.py uploads the checkpoint to the validator. The validator scores your submission via Proof-of-Loss and updates weights.
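The block windows above can be expressed as a small lookup. This sketch assumes each window is half-open (a boundary block belongs to the phase that starts there) and takes the first block of a cycle as a `cycle_start` parameter, since how the real daemon aligns cycles to chain height is not described here. All names are hypothetical.

```python
CYCLE_LENGTH = 45  # blocks per cycle, from the schedule above

# (first offset of the phase, phase name), in ascending order
PHASE_STARTS = [
    (0, "distribute"),
    (5, "train"),
    (35, "commit"),
    (40, "submit"),
]


def phase_at(block, cycle_start=0):
    """Return (phase_name, offset_in_cycle) for a chain block number."""
    offset = (block - cycle_start) % CYCLE_LENGTH
    name = PHASE_STARTS[0][1]
    for start, phase in PHASE_STARTS:
        if offset >= start:
            name = phase  # keep the last phase whose start has been reached
    return name, offset


if __name__ == "__main__":
    for block in (12340, 12345, 12375, 12380):
        print(block, phase_at(block, cycle_start=12340))
```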
Monitoring
Enable W&B logging for live training curves:
wandb: true
wandb_project: blockzero-miner
Key metrics to watch:
- train/loss: should decrease each cycle as your expert improves
- chain/w_i_score: your Proof-of-Loss score for the last cycle (higher = more reward)
- chain/submission_status: accepted, rejected, or missed
Stopping Gracefully
kill -SIGTERM <model_io_pid> # allows clean chain state save
kill -SIGTERM <train_pid> # allows checkpoint write
Avoid kill -9 (SIGKILL) — it can leave a WebSocket lock held on the chain connection.
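If you launch both processes from a Python supervisor, the same shutdown can be scripted. The sketch below sends SIGTERM to each process in the order given and waits for it to exit, never escalating to SIGKILL. The function name and the 30-second default timeout are assumptions, and the demo uses placeholder sleep processes rather than the real miner.

```python
import signal
import subprocess
import sys


def stop_gracefully(procs, timeout=30.0):
    """Send SIGTERM to each process in order; return exit codes (None if still running)."""
    exit_codes = []
    for proc in procs:
        if proc.poll() is None:  # only signal processes that are still alive
            proc.send_signal(signal.SIGTERM)
        try:
            exit_codes.append(proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            # Deliberately no SIGKILL here; leave the decision to the operator.
            exit_codes.append(None)
    return exit_codes


if __name__ == "__main__":
    # Placeholder children standing in for model_io.py and train.py.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    children = [subprocess.Popen(sleeper), subprocess.Popen(sleeper)]
    print(stop_gracefully(children, timeout=10.0))
```

Pass the model_io.py process first and the train.py process second to mirror the order of the kill commands above.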