Fine-Tuning Friction — Regressions, Overbuilt Models, and LoRA’s Practical Limits
Even when companies commit to customizing their models, they quickly discover that fine-tuning is fragile. The process is not modular, not easily reusable, and not structurally isolated. Each adjustment to the model can introduce unintended side effects, making iteration slow and scaling across customers increasingly complex.
Customization Causes Regressions
Fine-tuning updates shared model weights. Those weights are not cleanly separated by capability — formatting, reasoning, safety behavior, and general knowledge overlap in parameter space. When you improve performance in one domain, you risk degrading performance elsewhere.
This phenomenon is well documented in continual learning research, where sequential task adaptation leads to performance degradation on previously learned capabilities (Gupta et al., 2023).
In production systems, this shows up as clean JSON outputs becoming malformed after domain tuning, previously stable reasoning chains becoming inconsistent, safety behavior shifting subtly, or general capability declining outside the tuned domain. When serving multiple customers, the problem multiplies. Each tuned variant requires independent evaluation, regression testing, deployment infrastructure, and monitoring. Operational overhead grows faster than the value customization provides.
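The structured-output regressions described above are cheap to gate for but must be gated per variant. A minimal sketch of such a check, where `generate` is a hypothetical stand-in for any tuned model's inference call (not an API from the text):

```python
import json

def check_json_outputs(generate, prompts):
    """Return the prompts whose model outputs are no longer valid JSON."""
    failures = []
    for p in prompts:
        try:
            json.loads(generate(p))
        except json.JSONDecodeError:
            failures.append(p)
    return failures

# Stand-in for a domain-tuned model that regressed on one prompt class.
def tuned_model(prompt):
    return '{"ok": true}' if "invoice" not in prompt else '{ok: true}'

print(check_json_outputs(tuned_model, ["summarize", "parse invoice"]))
# → ['parse invoice']
```

The check itself is trivial; the operational burden is that every tuned variant needs its own copy of this gate, its own prompt suite, and its own alerting.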
Fine-tuning does not isolate change — it propagates it.
Giant Models Are Overbuilt for Most Downstream Jobs
Modern frontier models are enormous. Meta’s Llama 4 Behemoth has been reported at roughly two trillion total parameters, and DeepSeek-V3 weighs in at 671B parameters to compete at the frontier.
However, most downstream applications do not require that full breadth of capability on every request.
Fine-tuning does not reduce model size. A 600B model remains a 600B model at inference.
To put this in practical terms:
- 600B parameters in FP16 require roughly 1.2 TB of memory just to load weights.
- Even using 80GB A100 or H100 GPUs, you would need at least 15 GPUs just to hold the model weights, before accounting for activations, KV cache, and batching overhead.
- Realistically, serving a 600B dense model requires 16–32 high-end GPUs per instance, depending on inference strategy and optimization.
- At typical cloud rates ($4+ per GPU-hour for A100/H100-class hardware), running a single always-on instance can cost thousands of dollars per day.
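The figures above are straightforward to verify. A back-of-envelope check, using the same illustrative $4/GPU-hour rate (not a quoted price):

```python
# Back-of-envelope serving footprint for a 600B-parameter dense model in FP16.
params = 600e9
bytes_per_param = 2                              # FP16 = 2 bytes per parameter
weights_tb = params * bytes_per_param / 1e12
gpus_for_weights = weights_tb * 1e12 / 80e9      # 80 GB per A100/H100
gpu_hour_usd = 4.0                               # illustrative cloud rate
daily_cost = 16 * gpu_hour_usd * 24              # 16-GPU instance, always on

print(f"{weights_tb:.1f} TB of weights")                 # 1.2 TB
print(f"{gpus_for_weights:.0f} GPUs just for weights")   # 15 GPUs
print(f"${daily_cost:,.0f}/day per instance")            # $1,536/day
```

Note this is the floor: activations, KV cache, and batching overhead push real deployments to the 16–32 GPU range cited above.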
This mismatch becomes especially painful in constrained environments such as robotics, IoT, and mobile systems, where memory, battery life, offline operation, and latency constraints are strict.
Sparse Mixture-of-Experts (MoE) architectures demonstrate that activating only a subset of parameters per token can preserve quality while dramatically reducing active compute; the sparsely-gated MoE work of Shazeer et al., 2017 showed that selective expert activation maintains performance without engaging the full parameter set.
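The efficiency argument is just parameter accounting. A toy illustration with made-up expert sizes (no real model’s configuration is implied):

```python
# Illustrative MoE accounting: total parameters vs. parameters active per token.
n_experts = 64
top_k = 2                # experts routed to per token
expert_params = 8e9      # parameters per expert (made-up)
shared_params = 30e9     # attention, embeddings, router (made-up)

total = shared_params + n_experts * expert_params
active = shared_params + top_k * expert_params

print(f"total: {total/1e9:.0f}B, active per token: {active/1e9:.0f}B")
# → total: 542B, active per token: 46B
```

A dense fine-tuned model offers no analogous split: every forward pass touches every parameter.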
Traditional fine-tuning offers no such structural efficiency.
Fine-tuning changes behavior — not footprint.
LoRA Helps — But It Has Practical Limits
Low-Rank Adaptation (LoRA) was introduced as a parameter-efficient method for fine-tuning large models by freezing base weights and training small low-rank adapters (Hu et al., 2021). It reduces training cost and mitigates some regression risk.
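Mechanically, LoRA replaces a full weight update with the low-rank product W′ = W + (α/r)·BA. A minimal NumPy sketch with illustrative dimensions (hidden size and rank are assumptions, not from the text):

```python
import numpy as np

d, r = 4096, 8                       # hidden size, adapter rank (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable, zero-init so W' == W at start

def lora_forward(x, alpha=16):
    # Base path plus the scaled low-rank correction (alpha/r) * x @ (BA)^T.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full = d * d
adapter = 2 * d * r                  # parameters in A and B combined
print(f"adapter is {adapter / full:.2%} of the full matrix")  # 0.39%
```

The tiny adapter footprint is the appeal; the low-rank constraint on BA is also exactly the capacity limit discussed next.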
However, LoRA’s limitations are structural.
LoRA is capacity-limited by design. Because it constrains updates to low-rank subspaces, the magnitude and dimensionality of behavior change it can represent are restricted. If the behavior shift required is large — such as deep domain knowledge transfer, complex reasoning adjustments, or highly specialized structured output — a small adapter may not be enough to reliably move the needle.
Research on intrinsic dimensionality and parameter-efficient tuning suggests that substantial adaptation may require expanding the effective parameter subspace, which erodes the efficiency advantage (Aghajanyan et al., 2021). In practice, increasing LoRA rank increases memory, compute cost, and optimization instability, approaching the complexity of full fine-tuning.
This creates a practical ceiling:
- Performance may plateau for highly specialized domains.
- Raising rank increases cost and complexity.
- Hyperparameters (rank, target layers, learning rate) significantly affect results.
- At scale, adapter versioning and compatibility management become an MLOps burden.
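The cost of raising rank is easy to quantify: trainable parameters grow linearly in r for each adapted d × d matrix, so the efficiency advantage erodes steadily (d = 4096 is illustrative):

```python
# Adapter size as rank grows, per adapted d x d weight matrix.
d = 4096
for r in (4, 16, 64, 256):
    adapter = 2 * d * r              # parameters in the rank-r factors A and B
    print(f"rank {r:>3}: {adapter/1e6:.2f}M params "
          f"({adapter / (d * d):.1%} of full)")
```

At rank 256 the adapter is already 12.5% of the full matrix per layer, and that is before counting optimizer state and the stability tuning the text mentions.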
LoRA reduces cost per update. It does not eliminate structural fragility.
The Pattern
Across regressions, oversized deployments, and adapter sprawl, the pattern is consistent: traditional fine-tuning assumes a monolithic architecture.
You modify a shared system and hope improvements in one dimension do not degrade another.
As customization scales across customers and domains, that assumption becomes brittle. What begins as a simple model adjustment turns into regression firefighting, variant management, and repeated retraining cycles.
The friction is not accidental.
It is structural.