jvinhit//lab

Fine-tuning vs Prompting vs RAG: A Decision Framework for Adapting LLMs

When to prompt, retrieve, or fine-tune: knowledge vs behavior, data needs, cost, privacy, and SFT/LoRA/DPO, plus why most teams start with prompt + RAG.

Part 6 of the Building AI Agents series. Previous: Context Engineering & Memory · Next: Evaluating LLMs & Agents.

You have a base model that is already capable. The hard product question is not which model to pick; it is how to adapt it to your domain, your format, and your freshness requirements. Senior teams converge on three levers: prompt engineering, retrieval (RAG), and fine-tuning. Each changes a different part of the system; mixing them without a framework wastes GPU hours and creates brittle agents.

This post is a decision and strategy guide, not a training tutorial. For RAG architecture depth, see RAG Guide. For fine-tuning mechanics, see Fine-tuning LLM Basics.

Open the full demo: /tools/training-decision-demo/.


The three adaptation levers

Think of an LLM deployment as three separable concerns:

Lever | What it changes | When it takes effect | Typical cost profile
--- | --- | --- | ---
Prompt engineering | Instructions and in-context examples | Every request (inference) | Low upfront; scales with tokens
RAG | External knowledge injected at query time | Every request (retrieve + infer) | Medium upfront (index); ongoing retrieval + embedding
Fine-tuning | Model weights (behavior and optionally knowledge) | Once at train time; cheap at inference | High upfront (data + GPU); lower per-token if prompts shrink

                    ┌─────────────────────────────────────┐
                    │           BASE MODEL                │
                    │   (pre-trained general weights)     │
                    └─────────────────────────────────────┘
                           ▲           ▲           ▲
                           │           │           │
              ┌────────────┘           │           └────────────┐
              │                        │                        │
     PROMPT ENGINEERING              RAG                 FINE-TUNING
     (system + few-shot)      (retrieve → context)    (SFT / LoRA / DPO)
              │                        │                        │
     "How to respond"          "What facts to use"      "Default tendencies"

Callout: Prompting steers inference-time behavior; RAG supplies external facts; fine-tuning encodes persistent patterns into weights. Confusing “we need the model to know X” (knowledge) with “we need the model to always output Y” (behavior) is the root cause of most bad fine-tune proposals.


Knowledge vs behavior: the first fork

Before choosing a lever, classify the gap:

Gap type | Symptom | Wrong fix | Right starting point
--- | --- | --- | ---
Knowledge | Model lacks facts, docs, policies, product specs | Fine-tune on PDF dump | RAG or long-context prompt
Behavior | Wrong JSON shape, tone, classification label, tool-call style | Stuff 200 examples in every prompt forever | Few-shot prompt → SFT/LoRA if stable
Both | Enterprise agent over proprietary docs with strict format | Prompt-only spaghetti | Hybrid: RAG + fine-tuned formatter/router

Knowledge is what the model should cite, ideally with sources you can update without retraining. Behavior is how the model should act given any context: classification boundaries, refusal patterns, structured extraction schemas.

USER: "What's our refund policy for EU customers?"

KNOWLEDGE GAP     → model hallucinates or uses outdated training data
BEHAVIOR GAP      → model knows refunds exist but outputs prose instead of
                    required JSON {"eligible": bool, "reason": str}
BOTH              → needs retrieved policy doc AND structured output schema

Parts 3–5 of this series covered prompt engineering, output control, and context engineering. This post assumes you can already assemble system messages, memory, and tool schemas; the question is whether that is enough.


Decision framework

Use five axes in order. The interactive demo above walks the same tree.

1. Freshness

Update cadence | Recommendation bias
--- | ---
Daily / weekly (inventory, news, policies) | RAG; weights go stale immediately
Monthly / quarterly | RAG or hybrid; prompt if corpus is tiny
Static (historical, legal archive) | Prompt, or fine-tune if behavior is stable

Callout: Fine-tuning encodes knowledge at train time. If your “ground truth” changes every sprint, you will retrain constantly or ship lies.

2. Corpus size vs context window

If relevant documents cannot fit reliably in your per-query budget (including memory and tool results), retrieval is mandatory. Long-context models help but do not replace search at 100K+ document scale.
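
One rough way to check the fit is sketched below. It estimates tokens from character counts (about 4 characters per token is a common heuristic for English text and is an assumption here, not a real tokenizer) and compares the total against what is left of the context window after the system prompt, memory, tool results, and response are reserved. The function names, the safety margin, and the example numbers are all illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (heuristic only)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_budget(
    documents: list[str],
    context_window: int,
    reserved_tokens: int,
    safety_margin: float = 0.8,
) -> bool:
    """Return True if all documents fit in what is left of the context window.

    reserved_tokens covers the system prompt, memory, tool results, and the
    expected response. safety_margin keeps headroom because the estimate is rough.
    """
    available = int((context_window - reserved_tokens) * safety_margin)
    needed = sum(estimate_tokens(doc) for doc in documents)
    return needed <= available


if __name__ == "__main__":
    # Hypothetical corpus: two documents of very different size.
    docs = ["policy text " * 500, "product spec " * 2000]
    print(fits_in_budget(docs, context_window=128_000, reserved_tokens=8_000))
```

If this check fails for the documents a typical query needs, that is a signal to retrieve a subset rather than send everything. For a real decision, swap the heuristic for your model's actual tokenizer.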

3. Labeled data availability

Volume | Quality bar | Fine-tune viability
--- | --- | ---
0–50 pairs | Any | Prompt / few-shot only
50–500 | Human-reviewed | Marginal LoRA; validate hard
500+ | Consistent format, edge cases covered | SFT / LoRA reasonable
5K+ | Preference pairs or rankings | DPO / RLHF-style tuning

Quality beats quantity. Five hundred noisy ChatGPT-generated pairs often lose to fifty expert-labeled examples plus a strong prompt.

4. Latency and token economics

Repeated 4K-token system prompts on every agent step add cost and latency. Fine-tuning can compress instructions into weights, which is useful when you need sub-second tool routing at scale. RAG adds retrieval latency (10–200ms+ depending on index) but avoids giant prompts.

5. Privacy and deployment

Sensitive data that cannot leave your VPC pushes you toward self-hosted embeddings, a local vector DB, and on-prem fine-tuning. Managed fine-tuning APIs may require sending training JSONL to a vendor; read the data-processing terms.

DECISION CHECKLIST (in order)
─────────────────────────────
□ Classify gap: knowledge | behavior | both
□ Freshness: will weights be stale in < 1 month?
□ Corpus: fits in context per query?
□ Labeled data: count + quality sufficient for SFT?
□ Latency/cost: can you afford large prompts every step?
□ Privacy: can training data leave the boundary?
□ Run eval baseline BEFORE committing to fine-tune (→ Part 7)
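
The checklist can be encoded as a small function, which helps when you want the same answer from every reviewer. This is a simplified sketch, not the logic of the demo: it covers only the gap type, freshness, corpus size, and labeled-data items, it reuses the 500-example threshold from the labeled-data table, and the `Situation` fields are hypothetical names chosen for illustration. Latency and privacy still need a human decision.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    gap: str                    # "knowledge", "behavior", or "both"
    stale_within_month: bool    # will facts change in under a month?
    corpus_fits_context: bool   # do relevant docs fit the per-query budget?
    labeled_examples: int       # count of reviewed input/output pairs
    prompt_hit_ceiling: bool    # has eval shown prompt (+ RAG) is not enough?


def recommend(s: Situation) -> list[str]:
    """Return the levers to try, in order, following the checklist above."""
    levers = ["prompt"]  # prompting is almost always the starting point

    needs_knowledge = s.gap in ("knowledge", "both")
    if needs_knowledge and (s.stale_within_month or not s.corpus_fits_context):
        levers.append("rag")

    needs_behavior = s.gap in ("behavior", "both")
    enough_data = s.labeled_examples >= 500  # threshold from the labeled-data table
    if needs_behavior and enough_data and s.prompt_hit_ceiling:
        levers.append("fine-tune")

    return levers
```

Note that fine-tuning is only suggested when all three conditions hold: a behavior gap, enough labeled data, and an eval that shows the prompt ceiling.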

Prompt engineering: when it is enough

Start here almost always. Prompting covers:

  • System instructions and role
  • Few-shot exemplars (Part 3)
  • Structured output via JSON mode, grammar, or post-validation (Part 4)
  • Context assembly from memory (Part 5)

Strength | Limit
--- | ---
Hours to iterate | Context window ceiling
No training infra | Instruction-following drift at scale
Easy A/B in production | Cost grows with prompt length

Callout: If a 10-shot prompt with retrieved snippets hits your accuracy target in eval, stop. You do not need fine-tuning.


RAG: when retrieval is the answer

RAG solves dynamic, voluminous, or private knowledge without weight updates. The agent pattern from Part 5 (memory + tools + context) often is RAG when “memory” is a vector index over docs.

Use RAG when:

  • Knowledge changes faster than you can retrain
  • You need citations for compliance or debugging
  • Corpus exceeds practical context (even with summarization)

Avoid treating RAG as “dump everything into the prompt”. For most teams, chunking, hybrid search, reranking, and query transformation matter more than the choice of embedding model. See the RAG Guide for pipeline detail.
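
To make "hybrid search" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The ranked lists are assumed to come from your own keyword and vector retrievers (not shown), the document ids are made up, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the result is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # hypothetical keyword (BM25-style) results
    vector_hits = ["b", "c", "d"]    # hypothetical embedding-search results
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

RRF uses only rank positions, so it needs no score normalization between the two retrievers. That makes it a low-effort baseline to try before training a reranker.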


Fine-tuning: types and when each fits

Fine-tuning updates model parameters on your data. Overview of common approaches:

Method | What it does | Data needed | Typical use
--- | --- | --- | ---
SFT (Supervised Fine-Tuning) | Minimize loss on input→output pairs | 500+ quality pairs | Format, extraction, classification
LoRA / QLoRA (PEFT) | Train small adapter matrices, freeze base | Same as SFT, less VRAM | Cost-efficient behavior adaptation
Full fine-tune | Update all weights | Large curated set | Rare; foundation-model teams
RLHF | Reward model + policy optimization | Human rankings, expensive | Alignment, complex preferences
DPO (Direct Preference Optimization) | Optimize preferred vs rejected outputs | Preference pairs | Style, safety, tone without full RL pipeline

SFT / LoRA pipeline (simplified)
────────────────────────────────
curated JSONL  →  tokenize  →  train adapters  →  merge/export
     │                                              │
     └─ hold-out eval set (NEVER train on this) ────┘
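
A quick bit of arithmetic shows why the "train adapters" step is cheap. For a frozen weight matrix of shape d_out × d_in, LoRA trains two small matrices of shapes d_out × r and r × d_in, so the trainable count is r · (d_in + d_out) instead of d_in · d_out. The sketch below computes both for a single matrix; the 4096 × 4096 size and rank 8 are illustrative values, not a recommendation.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (frozen_params, trainable_adapter_params) for one weight matrix."""
    frozen = d_in * d_out              # the original matrix, left untouched
    adapter = rank * (d_in + d_out)    # the two low-rank adapter matrices
    return frozen, adapter


if __name__ == "__main__":
    frozen, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
    # prints: 65,536 trainable vs 16,777,216 frozen (0.39%)
    print(f"{adapter:,} trainable vs {frozen:,} frozen ({adapter / frozen:.2%})")
```

The ratio grows linearly with rank, so a higher rank buys capacity at a proportional cost in trainable parameters.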

When fine-tuning shines

  • Stable behavior that prompts cannot reliably enforce (strict JSON, domain-specific tool grammar)
  • High QPS where shaving 2K tokens per request pays for training
  • Edge distribution: your inputs look nothing like generic web text

When fine-tuning fails

  • Teaching facts that change (use RAG)
  • Small, dirty datasets: the model memorizes noise
  • Catastrophic forgetting: overtrain and the model loses general capability
  • Skipping eval: you ship a model that regresses on out-of-domain prompts

For training hyperparameters and LoRA rank selection, defer to Fine-tuning LLM Basics.


Data preparation and pitfalls

Fine-tuning success is mostly data curation (80% as a rule of thumb). A single training record in chat format looks like this:

{
  "messages": [
    {"role": "system", "content": "Extract refund eligibility as JSON."},
    {"role": "user", "content": "Order #8821, delivered 45 days ago, EU."},
    {"role": "assistant", "content": "{\"eligible\": false, \"reason\": \"outside_30_day_window\"}"}
  ]
}
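
Records like this are worth checking mechanically before any GPU time is spent. The sketch below assumes one JSON record per line (JSONL) in the chat format shown above, and it assumes the assistant turn must itself be a JSON object with a fixed key set. `validate_record` and its messages are hypothetical helpers, not part of any vendor tooling.

```python
import json


def validate_record(line: str, required_keys: set[str]) -> list[str]:
    """Return the problems found in one JSONL training line (empty list if clean)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    if not all(isinstance(m, dict) for m in messages):
        return ["every message must be an object"]

    last = messages[-1]
    if last.get("role") != "assistant":
        return ["last message must come from the assistant"]

    # The training target is a JSON string inside the assistant content.
    try:
        target = json.loads(last.get("content", ""))
    except (json.JSONDecodeError, TypeError):
        return ["assistant content is not valid JSON"]

    if not isinstance(target, dict) or set(target) != required_keys:
        return [f"assistant JSON keys differ from {sorted(required_keys)}"]
    return []
```

Running this over the whole file and counting problems by type gives an early read on label inconsistency before training starts.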

Pitfall | Symptom | Mitigation
--- | --- | ---
Label inconsistency | Model outputs random formats | Style guide + adjudication
Train/eval leakage | Inflated offline scores | Strict document-level splits
Synthetic data pollution | Gibberish on real inputs | Cap synthetic ratio; human spot-check
Overfitting small sets | Perfect on train, fails in prod | Regularization, early stop, more real data
Catastrophic forgetting | General reasoning degrades | Lower LR, LoRA not full FT, mix general data

Callout: Never fine-tune on your eval set. Part 7 covers building eval harnesses that survive adaptation changes.
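
One way to enforce a strict document-level split is to assign each source document to train or eval by hashing its id, so every example derived from the same document lands on the same side and the assignment never changes between runs. This is a sketch under assumptions: each example is a dict with a hypothetical `doc_id` field, and a 10% eval fraction is only a placeholder.

```python
import hashlib


def split_for(doc_id: str, eval_fraction: float = 0.1) -> str:
    """Assign a source document to 'train' or 'eval' deterministically."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


def split_examples(
    examples: list[dict], eval_fraction: float = 0.1
) -> tuple[list[dict], list[dict]]:
    """Split examples so no document contributes to both train and eval."""
    train: list[dict] = []
    held_out: list[dict] = []
    for example in examples:
        side = split_for(example["doc_id"], eval_fraction)
        (held_out if side == "eval" else train).append(example)
    return train, held_out
```

Because the split depends only on the document id, adding new documents later does not move existing ones across the boundary, which keeps the eval set frozen.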


Evaluation before and after

Adaptation without measurement is guessing. Minimum bar:

  1. Baseline: best prompt (+ RAG if applicable) on a frozen eval set
  2. Hypothesis: “fine-tune will improve JSON validity from 92% → 98%”
  3. Compare: same eval, same sampling params (Part 2), same context budget
  4. Regression check: general capability slice (reasoning, refusal, safety)

EVAL LOOP
─────────
prompt-only baseline  →  score

+RAG baseline         →  score  (did retrieval help knowledge?)

+SFT candidate        →  score  (did weights help behavior?)

ship winner + monitor drift in production
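
The loop above can be sketched in a few lines. The example uses JSON validity as the task metric, matching the hypothesis in step 2, and treats the regression check as a floor on a general-capability score. The variant names, the 0.02 tolerance, and the idea of a single general score per variant are assumptions made for illustration; a real harness would track several metrics.

```python
import json


def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue
        valid += 1
    return valid / len(outputs)


def pick_winner(
    task_scores: dict[str, float],
    general_scores: dict[str, float],
    baseline: str = "prompt-only",
    max_general_drop: float = 0.02,
) -> str:
    """Best task score among variants that pass the regression check.

    A variant is eligible only if its general-capability score stays within
    max_general_drop of the baseline's score.
    """
    floor = general_scores[baseline] - max_general_drop
    eligible = [name for name in task_scores if general_scores[name] >= floor]
    return max(eligible, key=lambda name: task_scores[name])
```

With this rule, a fine-tuned candidate that wins on the task metric but loses too much general capability is rejected in favor of the next best variant.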

Link forward: Evaluating LLMs & Agents.


When NOT to fine-tune

Most production agents never need custom weights on day one. Do not fine-tune when:

  • You have not exhausted prompt + RAG on a proper eval
  • The problem is missing documents, not missing weights
  • You have < 100 reliable examples
  • Requirements change weekly: you will live in retrain hell
  • A vendor JSON mode / structured output solves the format gap

DEFAULT STACK FOR MOST TEAMS
────────────────────────────
1. Strong system prompt + few-shot
2. RAG over authoritative docs
3. Structured output + validation retry loop
4. Fine-tune ONLY after eval proves prompt ceiling
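
Step 3 of the stack is often the piece that removes the need for fine-tuning, so here is a minimal sketch of a validation retry loop. The `generate` argument stands in for whatever function calls your model and returns its text; it is a placeholder, not a real client API. The required-key check and the retry wording are likewise illustrative.

```python
import json
from typing import Callable


def generate_validated(
    generate: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model until it returns a JSON object with the required keys."""
    last_error = "no attempts made"
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if isinstance(parsed, dict) and required_keys.issubset(parsed):
                return parsed
            last_error = f"missing keys, expected {sorted(required_keys)}"
        # Feed the error back so the next attempt can correct itself.
        current_prompt = (
            f"{prompt}\n\nPrevious reply was rejected ({last_error}). "
            "Reply with JSON only."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Logging how often the loop needs a second attempt gives you the JSON validity number to use as the eval baseline before any fine-tune is proposed.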

Hybrid patterns that win in production

Real agents combine levers:

Pattern | Architecture | Example
--- | --- | ---
RAG + prompt | Retrieve docs; prompt enforces format and guardrails | Support bot with citations
RAG + SFT | Fine-tuned extractor/router; RAG supplies facts | Medical coding assistant
SFT + prompt overrides | Weights for core task; system prompt for policy updates | Classifier with seasonal promo rules in prompt
Multi-model | Small fine-tuned router + large general reasoner | Cost-optimized agent swarm

HYBRID AGENT (common enterprise)
────────────────────────────────
User query
    │
    ├─► Retriever ──► top-k chunks (RAG)
    │
    ├─► Fine-tuned intent router (LoRA)
    │
    └─► General LLM + system prompt + tool schemas
              │
              ▼
        validated structured response

The demo recommends hybrids when you answer "both" for the knowledge-vs-behavior gap, or when you have labeled data and a changing corpus.


Cost and effort comparison

Relative ranking (1 = lowest):

Dimension | Prompt | RAG | Fine-tuning
--- | --- | --- | ---
Upfront engineering | 1 | 3 | 4–5
Ongoing operational cost | 2–4 (tokens) | 3–4 (index + tokens) | 2 (inference) + retrain cycles
Time to first good result | Hours | Days–weeks | Weeks
Knowledge freshness | Poor (static in prompt) | Excellent | Poor unless + RAG
Behavior consistency | Moderate | N/A for format | Excellent

Fine-tuning is a capital expense; prompting and RAG are mostly operating expenses. Run the ROI math against your query volume before committing GPUs.
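
The ROI math can start as a simple break-even count: how many requests it takes for the tokens saved by a shorter prompt to pay back the one-time training cost. The sketch below makes strong simplifying assumptions: a flat price per 1K input tokens, identical inference pricing for the base and fine-tuned model, and no retrain cycles. All parameter names and numbers are placeholders to replace with your own.

```python
import math


def break_even_requests(
    training_cost: float,
    tokens_saved_per_request: int,
    price_per_1k_tokens: float,
) -> int:
    """Requests needed before prompt-token savings cover the training cost."""
    saving_per_request = tokens_saved_per_request / 1000 * price_per_1k_tokens
    if saving_per_request <= 0:
        raise ValueError("fine-tuning must save tokens for a break-even to exist")
    return math.ceil(training_cost / saving_per_request)


if __name__ == "__main__":
    # Hypothetical inputs: one-time training spend, 2K tokens shaved per request.
    needed = break_even_requests(
        training_cost=2_000.0,
        tokens_saved_per_request=2_000,
        price_per_1k_tokens=0.0005,
    )
    print(f"break-even after about {needed:,} requests")
```

Compare the result with your real monthly volume. If the break-even sits beyond the point where requirements are likely to change, the retrain cycles ignored here will probably erase the saving.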


Key takeaways

  • Classify the gap: knowledge (RAG), behavior (prompt → fine-tune), or both (hybrid).
  • Prompt + RAG first; fine-tune only when eval proves a ceiling.
  • Fine-tuning types: SFT/LoRA for format and style; DPO/RLHF for preferences; not for fresh facts.
  • Quality labeled data and held-out eval are non-negotiable.
  • Production winners are usually hybrids: RAG for facts, weights or prompts for behavior.

Next: how to measure whether any of this worked, in Evaluating LLMs & Agents.


The Building AI Agents series

  1. Tokens & Context Windows
  2. Sampling: temperature, top_p, top_k
  3. Prompt Engineering for Agents
  4. Stopping Criteria & Output Control
  5. Context Engineering & Memory
  6. Fine-tuning vs Prompting vs RAG (current)
  7. Evaluating LLMs & Agents
  8. Choosing a Model
  9. Function Calling & Tool Use
  10. Agent Patterns: ReAct, Reflection, Planning