Modular Models as Product Architecture

§ 01 · Context

The Future of AI Is Continuous Improvement: From model releases to reusable expertise

For a long time, AI teams have treated the model as the product. If the model needed to become better at math, the answer was to train the model. If it needed stronger coding, safer behavior, better tool use, or a new enterprise domain, the answer was still some version of the same thing: update the checkpoint, evaluate the new version, and ship it.

That made sense when the main story of AI progress was scale. Bigger models, more data, larger training runs, broader post-training. The model improved because the whole model changed. But as models become more capable, that assumption starts to feel too blunt.

Not every customer-facing product problem is a whole-model problem. A company may want to improve coding without weakening safety. A customer may need a private legal capability without a full model fork. A data partner may want its contribution included for some users and excluded for others. A model may need a new language or domain without disturbing everything it already knows.

Today, we are moving from improving models as monolithic systems to improving them as collections of reusable capabilities.

That is why modular models are starting to look less like an efficiency trick and more like product architecture. The important shift is not only that Mixture-of-Experts models activate fewer parameters per token. It is that expertise is becoming a first-class asset inside the model.

Across several recent research threads, the same pattern is becoming visible. LoRAHub and Microsoft’s LoRA library work treat adapters as reusable expert modules. BAR and Branch-Train-Stitch move that logic into model architecture, recombining specialists through routers or stitch layers. EMO goes further, asking whether experts can emerge during training itself. Together, they suggest that expertise is becoming something models can store, compose, govern, and improve over time.

This is the direction Connito is exploring: how to make modular expertise easier to identify, organize, and reuse. If models are becoming collections of capabilities, then the next infrastructure problem is not only training better experts. It is helping teams understand which experts exist, what they are good at, and how they can compound over time.

§ 02 · Research

LoRA Libraries as Modular Model Architecture

The first real “expert library” pattern we have seen in LLM architecture comes from LoRA: small, task-specific adapters that can be trained separately, stored, reused, and composed around a shared base model.

The most interesting shift in modular LLM research is that “model capability” no longer has to live inside one monolithic checkpoint. In “Towards Modular LLMs by Building and Reusing a Library of LoRAs,” Microsoft researchers frame LoRA adapters as reusable experts: train lightweight adapters for tasks, organize them into a library, and route new inputs to the most relevant modules. Their Model-Based Clustering (MBC) method groups tasks by similarity in LoRA parameters, while Arrow provides zero-shot routing, selecting useful adapters without retraining a router or requiring access to the original training data. This turns specialization into an architectural layer: capabilities can be added, clustered, reused, and composed rather than baked permanently into the base model.

How a LoRA expert library works (MBC + Arrow): adapters are trained separately, clustered by parameter similarity (MBC), and a zero-shot router (Arrow) selects the relevant adapters for a new input and composes them over a shared base model — no router retraining, no access to the original training data. Connito schematic, after Ostapenko et al. (Microsoft).
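To make the zero-shot routing step concrete, here is a minimal sketch in the spirit of Arrow: each adapter is summarized by the top right singular vector of its low-rank update, and an input is scored against those prototypes with no trained router. The shape convention, the toy adapters, and every name in the code (`arrow_prototype`, `arrow_route`, `toy_adapter`) are assumptions for illustration, not the authors' implementation, and the MBC clustering step is not shown.

```python
import numpy as np

def arrow_prototype(A, B):
    """Return the top right singular vector of the LoRA update B @ A.

    Convention assumed here: A is (r, d_in), B is (d_out, r), so the adapter
    adds B @ (A @ x) to the base layer's output.
    """
    _, _, vt = np.linalg.svd(B @ A, full_matrices=False)
    return vt[0]  # unit vector: the input direction this adapter amplifies most

def arrow_route(x, prototypes, top_k=2):
    """Score each adapter by |prototype . x| and softmax over the top_k."""
    scores = np.abs(prototypes @ x)  # abs: the sign of a singular vector is arbitrary
    top = np.argsort(scores)[::-1][:top_k]
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()

# Toy library (hypothetical): each adapter mostly responds to one input direction.
d = 8

def toy_adapter(main_dim, minor_dim):
    A = np.zeros((2, d))
    A[0, main_dim] = 3.0   # dominant direction
    A[1, minor_dim] = 0.1  # weak secondary direction
    B = np.eye(d)[:, :2]   # (d_out, r) with d_out = d
    return A, B

library = {
    "math": toy_adapter(0, 5),
    "code": toy_adapter(1, 6),
    "legal": toy_adapter(2, 7),
}
names = list(library)
prototypes = np.stack([arrow_prototype(A, B) for A, B in library.values()])

x = np.zeros(d)
x[1] = 2.0  # a hidden state aligned with the "code" adapter's dominant direction
top, weights = arrow_route(x, prototypes)
print([names[i] for i in top], weights.round(3))
```

Because the prototypes come from the adapter weights alone, adding a new adapter to the library only means computing one more singular vector; nothing else in the router has to change.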

MoLoRA extends this idea further. Instead of routing an entire request to a single adapter, it routes at the token level, allowing one response to draw on multiple specialized LoRAs. This matters for mixed-capability tasks: “write code to solve this equation” may need both mathematical reasoning and code-generation expertise. MoLoRA’s core claim is that specialization can beat scale: smaller models equipped with composable adapters can outperform larger general models on targeted benchmarks.
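A rough sketch of what token-level mixing can look like: every token gets its own gate over the adapters, so two tokens in the same response can draw on different LoRAs. The linear router `router_W`, the shape convention, and the function name `token_level_lora_mix` are hypothetical, chosen for illustration rather than taken from MoLoRA.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_lora_mix(X, W, adapters, router_W, top_k=1):
    """Mix LoRA adapters per token.

    X: (T, d_in) token hidden states; W: (d_out, d_in) frozen base weight;
    adapters: list of (A, B) with A (r, d_in) and B (d_out, r);
    router_W: (n_adapters, d_in), an assumed learned linear router.
    """
    gates = softmax(X @ router_W.T)  # (T, n_adapters)
    if top_k < gates.shape[1]:
        # Keep the top_k gates per token (ties may keep more), then renormalise.
        kth = np.sort(gates, axis=1)[:, -top_k][:, None]
        gates = np.where(gates >= kth, gates, 0.0)
        gates = gates / gates.sum(axis=1, keepdims=True)
    out = X @ W.T  # base model path, shared by every token
    for i, (A, B) in enumerate(adapters):
        out = out + gates[:, [i]] * (X @ A.T @ B.T)  # adapter i's gated update
    return out, gates

# Two tokens, two rank-1 adapters; the base weight is zero to isolate the adapters.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
W = np.zeros((4, 4))
adapter_math = (np.array([[1.0, 0.0, 0.0, 0.0]]), np.array([[1.0], [0.0], [0.0], [0.0]]))
adapter_code = (np.array([[0.0, 1.0, 0.0, 0.0]]), np.array([[0.0], [2.0], [0.0], [0.0]]))
router_W = np.array([[5.0, 0.0, 0.0, 0.0],
                     [0.0, 5.0, 0.0, 0.0]])

out, gates = token_level_lora_mix(X, W, [adapter_math, adapter_code], router_W)
print(gates.argmax(axis=1))  # which adapter each token used
```

In this toy setup the first token is served by the first adapter and the second token by the second, within a single forward pass.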

As product architecture, this suggests a new pattern: ship a stable base model, maintain a growing library of domain adapters, and use routing as the orchestration layer. The product surface becomes modular, extensible, and cheaper to update — closer to a plugin ecosystem than a single model release.

§ 03 · Research

Branching as Product Architecture

BAR, short for Branch-Adapt-Route, is a practical example of modular model development. Instead of treating every improvement as a whole-model update, BAR starts with an existing post-trained model, branches it into separate domain experts, adapts those experts independently, and then routes between them inside a shared Mixture-of-Experts system.

How BAR works: a dense base model M is branched into independently trained experts — math (Σ), code (</>) and safety — each a small two-FFN block sharing one Attention. They merge into a single MoE (anchor FFN + domain FFNs), and any one expert can be swapped for a new version (v2) without retraining the rest. Connito schematic, after BAR (Ai2).

The important shift is architectural. Math, code, tool use, and safety are not treated as generic benchmark categories. They become reusable capabilities that can be trained, evaluated, upgraded, and governed on their own timelines. The original model remains as an anchor expert, preserving general behavior, while new experts specialize around domains where the system needs to improve.
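The structure described above can be sketched as a single MoE layer holding an anchor FFN plus named domain FFNs, where one expert is replaced without touching the others. This is an illustrative sketch under stated assumptions, not BAR's code: `FFN`, `BranchedMoE`, and `swap` are hypothetical names, random weights stand in for trained experts, and the router is left as-is after a swap, which is a simplification.

```python
import numpy as np
from dataclasses import dataclass

@dataclass(eq=False)
class FFN:
    w_in: np.ndarray   # (d_hidden, d_model)
    w_out: np.ndarray  # (d_model, d_hidden)
    version: str = "v1"

    def __call__(self, x):
        return self.w_out @ np.maximum(self.w_in @ x, 0.0)  # ReLU feed-forward

@dataclass(eq=False)
class BranchedMoE:
    experts: dict       # name -> FFN; insertion order matches router rows
    router: np.ndarray  # (n_experts, d_model)

    def forward(self, x, top_k=2):
        names = list(self.experts)
        logits = self.router @ x
        top = np.argsort(logits)[::-1][:top_k]
        w = np.exp(logits[top] - logits[top].max())
        w = w / w.sum()
        return sum(wi * self.experts[names[i]](x) for wi, i in zip(w, top))

    def swap(self, name, new_ffn):
        """Replace one expert in place; all other experts are left untouched."""
        if name not in self.experts:
            raise KeyError(name)
        self.experts[name] = new_ffn  # existing key keeps its position in the dict

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

def random_ffn(version="v1"):
    return FFN(rng.normal(size=(d_hidden, d_model)),
               rng.normal(size=(d_model, d_hidden)), version)

moe = BranchedMoE(
    experts={name: random_ffn() for name in ["anchor", "math", "code", "safety"]},
    router=rng.normal(size=(4, d_model)),
)
x = rng.normal(size=d_model)
y_before = moe.forward(x)
math_before = moe.experts["math"]

moe.swap("code", random_ffn("v2"))  # upgrade only the code expert
y_after = moe.forward(x)
print({name: ffn.version for name, ffn in moe.experts.items()})
```

The point of the design is that the unit of change is one named expert, so an upgrade can be evaluated and rolled back at that granularity.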

In the paper’s 7B experiments, BAR reaches an average score of 49.1 across 19 benchmarks. That is stronger than continual post-training and other modular baselines, and surprisingly close to a much more expensive retraining pipeline with mid-training, which scores 50.5. Mid-training is the phase between general pretraining and final post-training, where a model is exposed to focused domain data such as code, math, legal, or scientific text so it develops stronger specialist knowledge before instruction tuning. BAR suggests that modular post-training can capture much of that benefit through separate experts, without forcing every capability back through the same full-model training pipeline.

Approach	Average score	Requirement
Full retrain	47.8	Retrain without extra mid-training
BAR	49.1	Branch experts, adapt separately, train router
Full retrain + mid-training	50.5	Retrain with additional domain mid-training
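A quick way to read the table is to ask how much of the mid-training gain BAR recovers relative to a plain full retrain. The short calculation below uses only the three scores in the table; the variable names are ours, and the "share recovered" framing is one possible reading of averaged benchmark scores, not a metric from the paper.

```python
scores = {
    "full_retrain": 47.8,
    "bar": 49.1,
    "full_retrain_mid_training": 50.5,
}

gain_bar = scores["bar"] - scores["full_retrain"]                        # BAR vs. plain retrain
gain_mid = scores["full_retrain_mid_training"] - scores["full_retrain"]  # mid-training vs. plain retrain
gap = scores["full_retrain_mid_training"] - scores["bar"]                # what BAR still leaves on the table
share = gain_bar / gain_mid

print(f"BAR gain over full retrain: {gain_bar:.1f} points")
print(f"Mid-training gain over full retrain: {gain_mid:.1f} points")
print(f"Remaining gap to mid-training: {gap:.1f} points")
print(f"Share of the mid-training gain recovered by BAR: {share:.0%}")
```

On these numbers BAR gains 1.3 points over the plain retrain, against 2.7 points for the mid-training pipeline, which is a little under half of that gain.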

The broader point is that a model roadmap does not have to be a sequence of monolithic checkpoint releases. A code expert can be upgraded without retraining the safety expert. A math expert can improve without disturbing tool use. BAR makes expertise look less like a hidden property of a checkpoint and more like infrastructure that can compound over time.

Branch-Train-Stitch from Meta AI pushes the same idea through a different mechanism. Instead of merging specialists through routing, it branches a seed model into independently trained experts, freezes them, and learns lightweight stitch layers that connect their representations back into one generalist system. BAR routes experts; BTS stitches them together.

§ 04 · Research

Emergent Modularity

But what if we don’t already know what the experts should be?

BAR starts with predefined capabilities: math, code, tool use, safety. EMO approaches the same modular future from the opposite direction. Instead of assigning experts to domains upfront, it asks whether useful expert structure can emerge during pretraining itself.

The intuition is simple. Tokens from the same document usually belong to the same broad context. A code file, a math proof, a scientific article, and a general web page each carry different patterns of knowledge. EMO uses this document-level structure as a weak signal. During training, the tokens of each document are routed within a small pool of experts chosen for that document, which encourages tokens from the same document to rely on similar expert groups. Over time, those groups begin to specialize around broader themes without needing manually labeled domains.

How EMO works: the router picks a small shared pool of experts for each whole document, and every token in that document routes only within it. No one labels the domains — coherent expert groups (math, code, biomedical…) emerge from the data, and related documents reuse overlapping pools. Connito schematic, after EMO (Ai2).
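The document-level constraint can be sketched in a few lines: pick one pool of experts for the whole document, then let each token choose its top-k experts only inside that pool. How EMO actually selects the pool is not described here, so averaging the router logits over the document's tokens is an assumption of this sketch, as are the function name and the sizes used.

```python
import numpy as np

def document_pool_routing(token_logits, pool_size, top_k):
    """Route the tokens of ONE document within a shared expert pool.

    token_logits: (T, E) router logits for the T tokens of a single document.
    Returns the pool (sorted expert ids) and the (T, top_k) experts per token.
    """
    doc_scores = token_logits.mean(axis=0)            # assumed document-level score
    pool = np.argsort(doc_scores)[::-1][:pool_size]   # experts shared by the document
    masked = np.full_like(token_logits, -np.inf)      # experts outside the pool are off
    masked[:, pool] = token_logits[:, pool]
    chosen = np.argsort(masked, axis=1)[:, ::-1][:, :top_k]
    return np.sort(pool), chosen

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 16))  # 6 tokens, 16 experts (toy sizes)
pool, chosen = document_pool_routing(logits, pool_size=4, top_k=2)
print("document pool:", pool)
print("experts per token:\n", chosen)
```

Tokens still make individual choices, but only among the document's pool, which is the pressure that lets coherent expert groups form.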

The result is a model that behaves less like one undifferentiated MoE and more like a system with recoverable expert subsets. In the paper’s experiments, EMO retains nearly full performance when using only part of its expert library: keeping 25% of experts leads to about a 1% absolute performance drop, while keeping 12.5% leads to about a 3% drop.

The result. Selective expert use, no fine-tuning (1T tokens). As the retained subset shrinks 128 → 8, the standard MoE (green) collapses toward random, while EMO (orange) stays close to full performance — about a 1% drop at 25% of experts and 3% at 12.5%. Connito chart, data from EMO (Ai2), Figure 3.

EMO expert subset used	Experts removed	Reported performance drop
100%	0%	0%
25%	75%	About 1% absolute
12.5%	87.5%	About 3% absolute
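For a sense of scale, with 128 experts the 25% and 12.5% rows correspond to 32 and 16 retained experts. The sketch below shows one way a subset could be chosen and enforced: count how often each expert is selected on sample tokens from a domain, keep the most used ones, and mask the rest at routing time. Selecting by usage counts is an assumption for illustration; the paper's own selection procedure is not covered in this article, and the function names are hypothetical.

```python
import numpy as np

def select_expert_subset(token_logits, keep_fraction, top_k=2):
    """Keep the experts most often chosen on a sample of domain tokens."""
    _, n_experts = token_logits.shape
    n_keep = max(top_k, int(round(n_experts * keep_fraction)))
    chosen = np.argsort(token_logits, axis=1)[:, -top_k:]       # per-token top_k
    usage = np.bincount(chosen.ravel(), minlength=n_experts)    # selection counts
    return np.sort(np.argsort(usage, kind="stable")[::-1][:n_keep])

def route_within_subset(token_logits, keep, top_k=2):
    """Route each token to its top_k experts among the retained ones only."""
    masked = np.full_like(token_logits, -np.inf)
    masked[:, keep] = token_logits[:, keep]
    return np.argsort(masked, axis=1)[:, -top_k:]

rng = np.random.default_rng(0)
domain_logits = rng.normal(size=(512, 128))  # random stand-in for real router logits

for fraction in (0.25, 0.125):
    keep = select_expert_subset(domain_logits, fraction)
    routed = route_within_subset(domain_logits, keep)
    print(f"{fraction:.1%} of 128 experts -> {len(keep)} kept")
```

With a standard MoE, masking this aggressively removes experts that tokens depend on; the reported EMO result is that the retained group still carries most of the capability.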

That matters because it changes what an MoE can be used for. The point is not only sparse activation during inference. It is the possibility of identifying smaller expert groups that carry useful capabilities. BAR builds an expert library deliberately; EMO shows that, under the right training pressure, part of that library can emerge from the data itself.

§ 05 · Connito

Connito’s Place in the Modular Stack

The research direction is becoming clear: models are starting to separate into reusable capabilities. The next question is how those capabilities become operational.

An expert is only useful if a team can find it, understand what it does, trust its performance, and decide when to reuse it. Without that layer, modularity risks becoming another internal training artifact: technically interesting, but hard to manage as a product system.

One way to build it up — compounding releases. Each version adds or upgrades exactly one expert and keeps the rest, so the served model's capability stack grows release over release. Connito schematic.
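As a data structure, a compounding release can be as simple as a manifest that maps each capability to an expert version, where the next release copies the previous manifest and changes exactly one entry. The `Release` type and helper names below are hypothetical, a sketch of the idea rather than Connito's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Release:
    version: int
    experts: dict  # capability name -> expert version label

def next_release(prev, capability, expert_version):
    """Add or upgrade exactly one expert; every other entry is carried over."""
    experts = dict(prev.experts)  # copy, so earlier releases stay unchanged
    experts[capability] = expert_version
    return Release(prev.version + 1, experts)

def changed(prev, new):
    """List the capabilities whose expert differs between two releases."""
    return sorted(k for k in new.experts if prev.experts.get(k) != new.experts[k])

r1 = Release(1, {"anchor": "v1"})
r2 = next_release(r1, "math", "v1")
r3 = next_release(r2, "code", "v1")
r4 = next_release(r3, "math", "v2")  # upgrade math; code and anchor untouched

for prev, new in [(r1, r2), (r2, r3), (r3, r4)]:
    print(f"release {new.version}: changed {changed(prev, new)} -> {new.experts}")
```

Keeping each release to a one-expert diff makes regressions easy to attribute: if a release behaves worse, there is a single changed component to inspect or roll back.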

This is where Connito fits into the modular model movement. Connito is building the infrastructure around expert discovery, training, validation, and reuse. Rather than treating every customer request as a new model project, Connito helps identify the specialist capability a task actually needs, route training through a distributed Bittensor network, evaluate the resulting expert, and fold successful work back into a growing capability library.

Parallel roadmaps. One lane per capability, each advancing on its own cadence; at each release marker the current experts compose into the single served model, so teams iterate independently and the product still ships as one model. Connito schematic.

The goal is not only to make individual models better. It is to make model improvement more cumulative. A legal expert, a coding expert, or a customer-specific domain expert should not vanish after one deployment. It should become something that can be tested, governed, combined with other experts, and improved as new demand appears.

For Connito, the opportunity is to turn modular model behavior into a repeatable product layer. The checkpoint still matters, but the durable asset is the system around it: the expert library, the validation process, and the network that keeps expanding what the library can do.