LoRA (Low-Rank Adaptation)
LoRA fine-tunes a model by training small low-rank adapter matrices instead of all weights — a fraction of the memory and cost, nearly full-tune quality.
LoRA (Low-Rank Adaptation) is the technique that made fine-tuning affordable: instead of updating a model's billions of weights, it freezes them and trains small low-rank matrices injected alongside — typically well under 1% of parameters — capturing most of the quality of a full fine-tune.
The insight is that the weight change a fine-tune needs is approximately low-rank: for each adapted layer it can be represented as the product of two thin matrices. Training only those slashes GPU memory (no gradients or optimizer state for the frozen weights), produces megabyte-scale adapter artifacts instead of full model copies, and lets one base model serve many tasks by swapping adapters. QLoRA stacks quantization underneath, pairing a 4-bit frozen base with trainable adapters, which brings 7–70B-class fine-tuning onto single GPUs.
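To make the savings concrete, here is a minimal sketch of the arithmetic. The layer size (4096 x 4096), the rank (8), and the 7B parameter count are illustrative assumptions, not figures from any particular model, and the memory estimate covers stored weights only: it ignores quantization metadata, activations, and the adapters' own optimizer state.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameter counts for one adapted linear layer: (frozen W, trainable A and B)."""
    frozen = d_out * d_in                      # full weight matrix W, left untouched
    trainable = rank * d_in + d_out * rank     # A is rank x d_in, B is d_out x rank
    return frozen, trainable


def weight_memory_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical layer: 4096 x 4096 projection adapted at rank 8.
frozen, trainable = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(frozen, trainable, trainable / frozen)
# -> 16777216 65536 0.00390625

# Hypothetical 7B-parameter base stored at 16-bit vs. 4-bit.
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))
# -> 14.0 3.5
```

In this example the adapter adds about 0.4% of the layer's parameters, and quantizing the frozen base to 4-bit cuts its weight storage to a quarter of the 16-bit figure.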
In practice LoRA/QLoRA is the default for open-weight model tuning, with libraries like Unsloth optimizing the loop. The end-to-end procedure — dataset to adapter to eval — is packaged in the qlora-finetune-runner skill; whether you should be fine-tuning at all is the decision-tree guide's question.
Frequently asked questions
- What's the difference between LoRA and QLoRA?
  QLoRA is LoRA on top of a quantized base model: the frozen weights are loaded in 4-bit precision while the trainable adapters stay higher-precision. That cuts memory enough to fine-tune models in the 7–70B class on a single consumer or prosumer GPU, at a small quality cost that's usually acceptable.
- Do LoRA adapters change the original model?
  No. The base weights stay frozen. Training produces a small adapter file (often tens of megabytes) that's applied on top at inference, or merged into the weights for deployment. That's the operational win: one base model, many cheap task adapters, swappable per workload.
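The two deployment options in the last answer, applying the adapter on top at inference or merging it into the weights, give the same output. The sketch below illustrates that with tiny hand-written matrices in plain Python. The helper names and the `scaling` factor are assumptions for illustration (many LoRA implementations scale the update by a constant); this is not any library's API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def forward_with_adapter(W, B, A, x, scaling=1.0):
    """Apply the adapter on top: W x + scaling * B (A x). W is never modified."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return [[b + scaling * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def merge_adapter(W, B, A, scaling=1.0):
    """Return a new matrix W + scaling * (B A); the original W is left as-is."""
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy example: d_out = 2, d_in = 3, rank = 1 (values chosen by hand).
W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]   # frozen base weights
B = [[1.0], [2.0]]                       # d_out x rank
A = [[0.5, 0.0, 1.0]]                    # rank x d_in
x = [[1.0], [2.0], [3.0]]                # input column vector

on_the_fly = forward_with_adapter(W, B, A, x, scaling=2.0)
merged = matmul(merge_adapter(W, B, A, scaling=2.0), x)
print(on_the_fly, merged)
# -> [[12.0], [25.0]] [[12.0], [25.0]]
```

Keeping the adapter separate is what allows swapping it per workload; merging trades that flexibility for a single weight matrix with no extra computation at inference.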
Related
- Fine-Tuning: Fine-tuning continues training a pretrained model on your own examples, changing its weights to teach durable behavior, format, or domain style.
- Quantization: Quantization shrinks a model by storing weights in lower precision (8-, 4-, even 2-bit), cutting memory and speeding inference at a small accuracy cost.
- Qlora Finetune Runner: Run a QLoRA (4-bit LoRA) fine-tune of an open-weight model from a prepared dataset: set up the config, train memory-efficiently (e.g. with Unsloth/PEFT), watch for overfitting, save the adapter, and run a quick eval against the prepared split. Use when you have a clean dataset and want to execute a parameter-efficient fine-tune on a single GPU.
- Unsloth: An open-source library that makes LoRA/QLoRA fine-tuning of LLMs roughly 2x faster and far more memory-efficient, so you can fine-tune on a single GPU.
- Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval Splits: The dataset is the model. How to build a fine-tuning dataset that works, covering format, curation, cleaning, synthetic augmentation, and a leak-free eval split.