Fine-tuning vs Prompting vs RAG: A Decision Framework for Adapting LLMs

When to prompt, retrieve, or fine-tune: knowledge vs behavior, data needs, cost, privacy, SFT/LoRA/DPO — and why most teams start with prompt + RAG.

FEB 15, 2026 14 MIN READ

Part 6 of the Building AI Agents series. Previous: Context Engineering & Memory · Next: Evaluating LLMs & Agents.

You have a base model that is already capable. The hard product question is not which model to pick; it is how to adapt that model to your domain, your format, and your freshness requirements. Senior teams converge on three levers: prompt engineering, retrieval (RAG), and fine-tuning. Each changes a different part of the system; mixing them without a framework wastes GPU hours and creates brittle agents.

This post is a decision and strategy guide, not a training tutorial. For RAG architecture depth, see RAG Guide. For fine-tuning mechanics, see Fine-tuning LLM Basics.

Open the full demo: /tools/training-decision-demo/.

The three adaptation levers

Think of an LLM deployment as three separable concerns:

Lever	What it changes	When it takes effect	Typical cost profile
Prompt engineering	Instructions and in-context examples	Every request (inference)	Low upfront; scales with tokens
RAG	External knowledge injected at query time	Every request (retrieve + infer)	Medium upfront (index); ongoing retrieval + embedding
Fine-tuning	Model weights (behavior and optionally knowledge)	Once at train time; cheap at inference	High upfront (data + GPU); lower per-token if prompts shrink

                    ┌─────────────────────────────────────┐
                    │           BASE MODEL                │
                    │   (pre-trained general weights)     │
                    └─────────────────────────────────────┘
                           ▲           ▲           ▲
                           │           │           │
              ┌────────────┘           │           └────────────┐
              │                        │                        │
     PROMPT ENGINEERING              RAG                 FINE-TUNING
     (system + few-shot)      (retrieve → context)    (SFT / LoRA / DPO)
              │                        │                        │
     "How to respond"          "What facts to use"      "Default tendencies"

Callout: Prompting steers inference-time behavior; RAG supplies external facts; fine-tuning encodes persistent patterns into weights. Confusing “we need the model to know X” (knowledge) with “we need the model to always output Y” (behavior) is the root cause of most bad fine-tune proposals.

Knowledge vs behavior: the first fork

Before choosing a lever, classify the gap:

Gap type	Symptom	Wrong fix	Right starting point
Knowledge	Model lacks facts, docs, policies, product specs	Fine-tune on PDF dump	RAG or long-context prompt
Behavior	Wrong JSON shape, tone, classification label, tool-call style	Stuff 200 examples in every prompt forever	Few-shot prompt → SFT/LoRA if stable
Both	Enterprise agent over proprietary docs with strict format	Prompt-only spaghetti	Hybrid: RAG + fine-tuned formatter/router

Knowledge is what the model should cite, ideally with sources you can update without retraining. Behavior is how the model should act given any context: classification boundaries, refusal patterns, structured extraction schemas.

USER: "What's our refund policy for EU customers?"

KNOWLEDGE GAP     → model hallucinates or uses outdated train data
BEHAVIOR GAP      → model knows refunds exist but outputs prose instead of
                    required JSON {"eligible": bool, "reason": str}
BOTH              → needs retrieved policy doc AND structured output schema
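A behavior gap of this kind can be detected mechanically, which is part of why it is worth separating from a knowledge gap. The sketch below is a minimal check, assuming the two-field schema from the example above; the function name `has_behavior_gap` is hypothetical, and a knowledge gap would still need a comparison against the real policy text.

```python
import json

# Schema taken from the refund example above: field name -> expected type.
REQUIRED_FIELDS = {"eligible": bool, "reason": str}


def has_behavior_gap(raw_output: str) -> bool:
    """Return True when the output is not a JSON object matching REQUIRED_FIELDS."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return True  # prose or malformed JSON instead of the required object
    if not isinstance(data, dict):
        return True
    return any(
        key not in data or not isinstance(data[key], expected_type)
        for key, expected_type in REQUIRED_FIELDS.items()
    )


print(has_behavior_gap("Refunds are available within 30 days."))
print(has_behavior_gap('{"eligible": false, "reason": "outside_30_day_window"}'))
```

The first call reports a gap (prose output); the second does not.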

Parts 3–5 of this series covered prompt and context engineering. This post assumes you can already assemble system messages, memory, and tool schemas; the question is whether that is enough.

Decision framework

Use five axes in order. The interactive demo above walks the same tree.

1. Freshness

Update cadence	Recommendation bias
Daily / weekly (inventory, news, policies)	RAG — weights go stale immediately
Monthly / quarterly	RAG or hybrid; prompt if corpus is tiny
Static (historical, legal archive)	Prompt or fine-tune if behavior-stable

Callout: Fine-tuning encodes knowledge at train time. If your “ground truth” changes every sprint, you will either retrain constantly or ship lies.

2. Corpus size vs context window

If the relevant documents cannot reliably fit in your per-query budget (including memory and tool results), retrieval is mandatory. Long-context models help but do not replace search at 100K+ document scale.
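A rough way to test the "fits in the budget" condition is to estimate token counts before deciding. The sketch below is illustrative only: `fits_in_context` is a hypothetical helper, and the 4-characters-per-token ratio is a crude assumption that should be replaced with counts from your model's tokenizer.

```python
def fits_in_context(doc_char_counts, context_window_tokens, reserved_tokens,
                    chars_per_token=4.0):
    """Return True when the documents fit in what is left of the context window.

    reserved_tokens covers the system prompt, memory, tool results, and the
    space kept free for the model's answer.
    chars_per_token is a rough heuristic (an assumption), not a tokenizer.
    """
    estimated_doc_tokens = sum(doc_char_counts) / chars_per_token
    available_tokens = context_window_tokens - reserved_tokens
    return estimated_doc_tokens <= available_tokens


# Two short policy documents against a 32K window with 8K reserved.
print(fits_in_context([40_000, 20_000], 32_000, 8_000))
# One large manual against the same window.
print(fits_in_context([400_000], 32_000, 8_000))
```

When the check fails for the documents a typical query needs, that is the signal to retrieve rather than paste.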

3. Labeled data availability

Volume	Quality bar	Fine-tune viability
0–50 pairs	Any	Prompt / few-shot only
50–500	Human-reviewed	Marginal LoRA; validate hard
500+	Consistent format, edge cases covered	SFT / LoRA reasonable
5K+	Preference pairs or rankings	DPO / RLHF-style tuning

Quality beats quantity. Five hundred noisy ChatGPT-generated pairs often lose to fifty expert-labeled examples plus a strong prompt.

4. Latency and token economics

Repeating a 4K-token system prompt on every agent step adds cost and latency. Fine-tuning can compress instructions into weights, which is useful when you need sub-second tool routing at scale. RAG adds retrieval latency (10–200ms+ depending on the index) but avoids giant prompts.

5. Privacy and deployment

Sensitive data that cannot leave your VPC pushes you toward self-hosted embeddings, a local vector DB, and on-prem fine-tuning. Managed fine-tuning APIs may require sending training JSONL to a vendor, so read the data-processing terms.

DECISION CHECKLIST (in order)
─────────────────────────────
□ Classify gap: knowledge | behavior | both
□ Freshness: will weights be stale in < 1 month?
□ Corpus: fits in context per query?
□ Labeled data: count + quality sufficient for SFT?
□ Latency/cost: can you afford large prompts every step?
□ Privacy: can training data leave the boundary?
□ Run eval baseline BEFORE committing to fine-tune (→ Part 7)
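The checklist can be expressed as a small function, which helps keep the order of questions explicit. This is a sketch under stated assumptions, not the logic of the interactive demo: the `Situation` fields and `recommend` function are hypothetical names, and the 500-example threshold is borrowed from the labeled-data table above.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    gap: str                      # "knowledge", "behavior", or "both"
    stale_within_month: bool      # will weights be stale in < 1 month?
    corpus_fits_context: bool     # do relevant docs fit per query?
    labeled_examples: int         # count of reliable labeled pairs
    data_can_leave_boundary: bool


def recommend(s: Situation) -> list:
    """Return the levers to try, in the order the checklist suggests."""
    levers = ["prompt"]  # prompting is always the starting point
    needs_knowledge = s.gap in ("knowledge", "both")
    needs_behavior = s.gap in ("behavior", "both")

    if needs_knowledge and (s.stale_within_month or not s.corpus_fits_context):
        levers.append("rag")
    if needs_behavior and s.labeled_examples >= 500:
        # Only a candidate: the eval baseline still has to justify it.
        levers.append("fine-tune-candidate")
    if not s.data_can_leave_boundary:
        levers.append("self-hosted-only")
    return levers


print(recommend(Situation("both", True, False, 800, True)))
print(recommend(Situation("behavior", False, True, 20, True)))
```

The output is a shortlist to evaluate, not a decision; the last checklist item (run the eval baseline first) still applies.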

Prompt engineering: when it is enough

Start here almost always. Prompting covers:

System instructions and role
Few-shot exemplars (Part 3)
Structured output via JSON mode, grammar, or post-validation (Part 4)
Context assembly from memory (Part 5)

Strength	Limit
Hours to iterate	Context window ceiling
No training infra	Instruction-following drift at scale
Easy A/B in production	Cost grows with prompt length

Callout: If a 10-shot prompt with retrieved snippets hits your accuracy target in eval, stop. You do not need fine-tuning.

RAG: when retrieval is the answer

RAG solves dynamic, voluminous, or private knowledge without weight updates. The agent pattern from Part 5 (memory + tools + context) is often just RAG when “memory” is a vector index over docs.

Use RAG when:

Knowledge changes faster than you can retrain
You need citations for compliance or debugging
Corpus exceeds practical context (even with summarization)

Avoid treating RAG as “dump everything into the prompt”. For most teams, chunking, hybrid search, reranking, and query transformation matter more than the choice of embedding model. See the RAG Guide for pipeline detail.
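To make "hybrid search" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking. It is not necessarily what the RAG Guide uses; the document IDs are made up, and `k=60` is the conventional default for RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # e.g. from a BM25 index (hypothetical IDs)
vector_hits = ["b", "d", "a"]    # e.g. from an embedding index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that appear in both lists ("a" and "b") rise above documents that appear in only one, which is the behavior hybrid search is after.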

Fine-tuning: types and when each fits

Fine-tuning updates model parameters on your data. Overview of common approaches:

Method	What it does	Data needed	Typical use
SFT (Supervised Fine-Tuning)	Minimize loss on input→output pairs	500+ quality pairs	Format, extraction, classification
LoRA / QLoRA (PEFT)	Train small adapter matrices, freeze base	Same as SFT, less VRAM	Cost-efficient behavior adaptation
Full fine-tune	Update all weights	Large curated set	Rare; foundation-model teams
RLHF	Reward model + policy optimization	Human rankings, expensive	Alignment, complex preferences
DPO (Direct Preference Optimization)	Optimize preferred vs rejected outputs	Preference pairs	Style, safety, tone without full RL pipeline

SFT / LoRA pipeline (simplified)
────────────────────────────────
curated JSONL  →  tokenize  →  train adapters  →  merge/export
     │                                              │
     └─ hold-out eval set (NEVER train on this) ────┘
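The "train small adapter matrices, freeze base" row can be made concrete with a parameter count. For a frozen weight of shape d_out × d_in, a LoRA adapter of rank r trains two matrices with r × (d_in + d_out) parameters in total. The sketch below only does that arithmetic; the layer shape and rank are illustrative assumptions, not values from any particular model, and real memory savings also depend on optimizer state and quantization.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


# Illustrative shapes only (assumption): one 4096 x 4096 projection, rank 8.
adapter = lora_trainable_params(4096, 4096, rank=8)
frozen = full_params(4096, 4096)
print(adapter, frozen, adapter / frozen)
```

For this layer the adapter has 65,536 trainable parameters against 16,777,216 frozen ones, roughly 0.4% of the layer.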

When fine-tuning shines

Stable behavior that prompts cannot reliably enforce (strict JSON, domain-specific tool grammar)
High QPS where shaving 2K tokens per request pays for training
Edge distribution: your inputs look nothing like generic web text

When fine-tuning fails

Teaching facts that change (use RAG)
Small, dirty datasets: the model memorizes noise
Catastrophic forgetting: overtrain and the model loses general capability
Skipping eval: you ship a model that regresses on out-of-domain prompts

For training hyperparameters and LoRA rank selection, defer to Fine-tuning LLM Basics.

Data preparation and pitfalls

Fine-tuning success is roughly 80% data curation. A single training record for the refund task looks like this:

{
  "messages": [
    {"role": "system", "content": "Extract refund eligibility as JSON."},
    {"role": "user", "content": "Order #8821, delivered 45 days ago, EU."},
    {"role": "assistant", "content": "{\"eligible\": false, \"reason\": \"outside_30_day_window\"}"}
  ]
}
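Records like the one above are worth validating before any GPU time is spent. The following sketch checks one JSONL line; `validate_record` is a hypothetical helper, and the rule that the final assistant message must itself be valid JSON is an assumption specific to this extraction task, not a general requirement of chat-format training data.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL training line (empty if clean)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing non-empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role")
        elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")

    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message must be from the assistant")
    else:
        content = last.get("content")
        try:
            # Task-specific assumption: the training target is a JSON string.
            json.loads(content if isinstance(content, str) else "")
        except json.JSONDecodeError:
            problems.append("assistant target is not valid JSON")
    return problems
```

Running this over every line and refusing to train while any line reports problems is a cheap guard against the label-inconsistency pitfall.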

Pitfall	Symptom	Mitigation
Label inconsistency	Model outputs random formats	Style guide + adjudication
Train/eval leakage	Inflated offline scores	Strict document-level splits
Synthetic data pollution	Gibberish on real inputs	Cap synthetic ratio; human spot-check
Overfitting small sets	Perfect on train, fails in prod	Regularization, early stop, more real data
Catastrophic forgetting	General reasoning degrades	Lower LR, LoRA not full FT, mix general data

Callout: Never fine-tune on your eval set. Part 7 covers building eval harnesses that survive adaptation changes.

Evaluation before and after

Adaptation without measurement is guessing. Minimum bar:

Baseline: best prompt (+ RAG if applicable) on a frozen eval set
Hypothesis: “fine-tune will improve JSON validity from 92% to 98%”
Compare: same eval, same sampling params (Part 2), same context budget
Regression check: general capability slice (reasoning, refusal, safety)

EVAL LOOP
─────────
prompt-only baseline  →  score
        ↓
+RAG baseline         →  score  (did retrieval help knowledge?)
        ↓
+SFT candidate        →  score  (did weights help behavior?)
        ↓
ship winner + monitor drift in production
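One step of this loop can be sketched for the JSON-validity hypothesis mentioned above. The helper names and the `min_gain` threshold are assumptions for illustration; a real harness would score more than one metric and would also run the regression slice.

```python
import json


def json_validity_rate(outputs):
    """Fraction of model outputs that parse as JSON."""
    if not outputs:
        raise ValueError("need at least one output to score")
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


def compare(baseline_outputs, candidate_outputs, min_gain=0.02):
    """Score both systems on the same frozen eval prompts and decide."""
    base = json_validity_rate(baseline_outputs)
    cand = json_validity_rate(candidate_outputs)
    return {
        "baseline": base,
        "candidate": cand,
        # min_gain is a hypothetical bar: the gain needed to justify a switch.
        "ship_candidate": (cand - base) >= min_gain,
    }


baseline = ['{"a": 1}', "oops", '{"a": 2}', "{}"]
candidate = ['{"a": 1}', '{"a": 3}', '{"a": 2}', "{}"]
print(compare(baseline, candidate))
```

Both lists must come from the same eval prompts with the same sampling parameters, otherwise the comparison says nothing about the adaptation itself.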

Link forward: Evaluating LLMs & Agents.

When NOT to fine-tune

Most production agents do not need custom weights on day one. Do not fine-tune when:

You have not exhausted prompt + RAG on a proper eval
The problem is missing documents, not missing weights
You have < 100 reliable examples
Requirements change weekly; you will live in retrain hell
A vendor JSON mode / structured output solves the format gap

DEFAULT STACK FOR MOST TEAMS
────────────────────────────
1. Strong system prompt + few-shot
2. RAG over authoritative docs
3. Structured output + validation retry loop
4. Fine-tune ONLY after eval proves prompt ceiling
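Step 3 of the stack, the validation retry loop, might look like the sketch below. `call_model` is a hypothetical callable standing in for whatever client you use (prompt in, text out); no real vendor API is assumed, and the feedback wording is only an example.

```python
import json
from typing import Callable


def generate_validated(call_model: Callable[[str], str], prompt: str,
                       required_keys: set, max_attempts: int = 3) -> dict:
    """Call the model until it returns a JSON object with the required keys."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Tell the model what went wrong and try again.
            feedback = f"\n\nPrevious output was not valid JSON ({exc.msg}). Return JSON only."
            continue
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
        feedback = f"\n\nPrevious output must contain the keys {sorted(required_keys)}."
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Usage with a fake model that needs two corrections before it complies.
replies = iter(["sorry", '{"eligible": true}', '{"eligible": false, "reason": "late"}'])
result = generate_validated(lambda p: next(replies), "Extract refund eligibility.",
                            {"eligible", "reason"})
print(result)
```

If a loop like this reaches your validity target in eval, the format gap is closed without touching weights.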

Hybrid patterns that win in production

Real agents combine levers:

Pattern	Architecture	Example
RAG + prompt	Retrieve docs; prompt enforces format and guardrails	Support bot with citations
RAG + SFT	Fine-tuned extractor/router; RAG supplies facts	Medical coding assistant
SFT + prompt overrides	Weights for core task; system prompt for policy updates	Classifier with seasonal promo rules in prompt
Multi-model	Small fine-tuned router + large general reasoner	Cost-optimized agent swarm

HYBRID AGENT (common enterprise)
────────────────────────────────
User query
    │
    ├─► Retriever ──► top-k chunks (RAG)
    │
    ├─► Fine-tuned intent router (LoRA)
    │
    └─► General LLM + system prompt + tool schemas
              │
              ▼
        validated structured response
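The diagram can be read as a thin orchestration function around three components. The sketch below injects them as plain callables so that it stays independent of any vendor or framework; `retrieve`, `route_intent`, and `generate` are hypothetical stand-ins, and the calls run sequentially here for simplicity even though the diagram shows them as parallel branches.

```python
from typing import Callable, Iterable


def answer(query: str,
           retrieve: Callable[[str], Iterable[str]],
           route_intent: Callable[[str], str],
           generate: Callable[[str], str],
           top_k: int = 4) -> str:
    """Combine retrieved chunks and a routed intent into one prompt for the LLM."""
    chunks = list(retrieve(query))[:top_k]          # RAG supplies the facts
    intent = route_intent(query)                    # small fine-tuned router
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    prompt = f"Intent: {intent}\nContext:\n{context}\nQuestion: {query}"
    return generate(prompt)                         # general LLM + system prompt


# Stub components, only to show the data flow.
out = answer(
    "What's our refund policy for EU customers?",
    retrieve=lambda q: ["EU refunds within 30 days.", "US refunds within 14 days."],
    route_intent=lambda q: "refund_policy",
    generate=lambda prompt: prompt,  # echo instead of a real model call
)
print(out)
```

The response from `generate` would then pass through the same structured-output validation as any other model output.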

The demo recommends hybrids when you answer “both” to the knowledge-vs-behavior question, or when you have labeled data and a changing corpus.

Cost and effort comparison

Relative ranking on a 1–5 scale (1 = lowest), with qualitative labels where a number does not apply:

Dimension	Prompt	RAG	Fine-tuning
Upfront engineering	1	3	4–5
Ongoing operational cost	2–4 (tokens)	3–4 (index + tokens)	2 (inference) + retrain cycles
Time to first good result	Hours	Days–weeks	Weeks
Knowledge freshness	Poor (static in prompt)	Excellent	Poor unless + RAG
Behavior consistency	Moderate	N/A for format	Excellent

Fine-tuning is a capital expense; prompting and RAG are mostly operating expenses. Run the ROI math against your query volume before committing GPUs.
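The ROI math can start as a simple break-even count: how many requests it takes for the tokens saved by a shorter prompt to repay the training spend. The figures below are placeholders, not real prices, and the sketch ignores retrain cycles and any difference in per-token price for serving a fine-tuned model.

```python
import math


def break_even_requests(training_cost_usd: float,
                        tokens_saved_per_request: int,
                        usd_per_million_input_tokens: float) -> int:
    """Requests needed before prompt-token savings repay the training cost."""
    saved_usd_per_million_requests = tokens_saved_per_request * usd_per_million_input_tokens
    if saved_usd_per_million_requests <= 0:
        raise ValueError("fine-tuning must save tokens for a break-even to exist")
    return math.ceil(training_cost_usd * 1_000_000 / saved_usd_per_million_requests)


# Placeholder numbers (assumptions): 2,000 USD of data and GPU spend,
# 2K prompt tokens removed per request, 1 USD per million input tokens.
print(break_even_requests(2000, 2000, 1.0))
```

With these placeholder inputs the break-even is 1,000,000 requests; compare that figure with your actual monthly volume before committing.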

Key takeaways

Classify the gap: knowledge (RAG), behavior (prompt → fine-tune), or both (hybrid).
Prompt + RAG first; fine-tune only when eval proves a ceiling.
Fine-tuning types: SFT/LoRA for format and style; DPO/RLHF for preferences; none of them for fresh facts.
Quality labeled data and held-out eval are non-negotiable.
Production winners are usually hybrids: RAG for facts, weights or prompts for behavior.

Next: how to measure whether any of this worked, in Evaluating LLMs & Agents.