The Fundamental Distinction
The three techniques target different failure modes:
- Prompt engineering — the model has the capability but needs clearer instructions or examples. The knowledge is already in the weights.
- RAG — the model lacks access to specific facts, documents, or data that changes over time. The knowledge needs to be retrieved and injected at inference time.
- Fine-tuning — the model needs to learn a consistent style, domain vocabulary, output format, or behavior that cannot be adequately expressed in a prompt. The behavior needs to be baked into the weights.
These are not competing alternatives for the same problem — they solve different problems. In practice, the best systems often combine all three.
Start Here: The Decision Tree
| Condition | Recommended Approach |
|---|---|
| Model gives correct answers but wrong format or tone | Prompt engineering (few-shot examples, output format instructions) |
| Model doesn't know facts that change monthly | RAG — facts belong in a retrieval index, not weights |
| Model needs to know your private/internal documents | RAG — train-time data is expensive; retrieval is cheap |
| Model consistently fails on domain-specific tasks despite good prompts | Fine-tuning — the task pattern needs to be in the weights |
| You need a specific output structure (JSON schema, code style) enforced reliably | Fine-tuning (or structured outputs/function calling first) |
| Latency requirements preclude long prompts | Fine-tuning — compress few-shot examples into weights |
| Privacy: data cannot leave your infrastructure | Fine-tuning on a self-hosted open model |
Prompt Engineering: Always Try This First
Modern frontier models are dramatically underutilized through simple prompts. Before investing in RAG infrastructure or fine-tuning budgets, exhaust prompt engineering:
- Chain-of-thought: "Think step by step" measurably improves reasoning on math and logic tasks — sometimes by 20–30% on benchmarks.
- Few-shot examples: 3–5 input/output examples in the prompt dramatically shape output format and style with no training cost.
- Role assignment: "You are an expert Python code reviewer" improves code review quality — the model activates relevant knowledge.
- Explicit constraints: "Respond in JSON only. Do not include any prose." is more reliable than hoping the model defaults to structured output.
Budget 1–2 days of prompt iteration before deciding you need RAG or fine-tuning. Most format and tone problems dissolve with good few-shot examples.
RAG: When Knowledge Is External or Changing
RAG is the right choice when the information your model needs:
- Changes frequently (news, prices, support tickets, recent research)
- Exists in your private documents (contracts, wikis, codebases, customer data)
- Is too voluminous for a context window
- Requires citation (you need to surface which document the answer came from)
RAG does NOT improve: reasoning ability, domain-specific vocabulary usage, consistent output formatting, or tasks requiring implicit knowledge baked into language itself.
A common mistake: building a RAG pipeline because the model "doesn't know our product" when the actual problem is inconsistent tone and formatting — a prompt engineering or fine-tuning problem.
Fine-tuning: When Behavior Needs to Be in the Weights
Fine-tuning changes the model weights to embed a pattern or behavior. Use it when:
- You have hundreds to thousands of high-quality input/output examples
- The task is very specific (e.g., extract structured data from radiology reports)
- You need to compress long few-shot prompts into a smaller, faster model
- You need consistent behavior that prompt instructions alone cannot guarantee
LoRA and QLoRA: Fine-tuning Without Owning a Data Center
Full fine-tuning updates all model weights — impractical for 7B+ parameter models on a single GPU. LoRA (Low-Rank Adaptation) inserts trainable rank-decomposition matrices into the attention layers, leaving the original weights frozen. QLoRA further quantizes the frozen base model to 4-bit, making 70B fine-tuning possible on a single A100 80GB.
# QLoRA fine-tuning setup (peft + bitsandbytes)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="bfloat16"
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
quantization_config=bnb_config
)
peft_config = LoraConfig(
r=16, # rank: higher = more capacity, more params
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 8,033,669,120 || trainable%: 0.042LoRA with rank 16 adds ~3.4M trainable parameters to an 8B model (0.04% of total). Training cost: approximately $5–20 on cloud GPUs for a small dataset (1,000–10,000 examples), versus thousands of dollars for full fine-tuning.
Cost Comparison
| Approach | Setup Cost | Per-Request Cost | Data Required | Maintenance |
|---|---|---|---|---|
| Prompt engineering | Hours | ~$0.002 (gpt-4o-mini) | None | Low — update prompt in code |
| RAG pipeline | Days (infra setup) | Embedding + retrieval + LLM | Source documents | Medium — re-index on updates |
| Fine-tuning (LoRA) | $5–$50 one-time | Hosting fine-tuned model | 500–10,000 examples | High — retrain on distribution shift |
| Fine-tuning (full) | $500–$5,000+ | Hosting full model | 10,000+ examples | Very high |
The Hybrid Approach: RAG + Fine-tuning
The highest-performing production systems use both. Fine-tune for consistent behavior, tone, and domain vocabulary — then use RAG to inject current facts at inference time. The fine-tuned model knows how to answer; RAG tells it what to answer from.
A concrete example: a legal assistant fine-tuned on 5,000 legal document Q&A pairs for style and citation format, with a RAG index over the firm's case library for specific precedents. Neither alone is sufficient.
Full fine-tuning on a narrow dataset can cause "catastrophic forgetting" — the model loses general capability in pursuit of task-specific performance. Always evaluate on held-out general benchmarks after fine-tuning. LoRA mitigates this by leaving base weights frozen, making it safer for narrow task adaptation.