Evaluating LLMs & Agents: Golden Sets, Metrics, LLM-as-Judge, and Regression in CI
Why eval is the hardest part of shipping agents — golden datasets, offline vs online metrics, LLM-as-judge rubrics, human agreement, and regression in CI.
Part 7 of the Building AI Agents series. Previous: Fine-tuning vs Prompting vs RAG · Next: Choosing a Model.
You can tune sampling, craft prompts, wire RAG, and pick the best model, and still ship an agent that fails silently in production. Evaluation is the discipline that tells you whether any of that work actually moved the needle. For senior engineers building LLM agents, eval is simultaneously the most important and most under-invested part of the stack.
This post covers how to build golden datasets, choose metrics that match your task, automate judging without fooling yourself, and wire regression suites into CI. Part 6 explained how to improve outputs via fine-tuning, prompting, and RAG; this part explains how you know those improvements are real.
Open the full demo: /tools/llm-eval-demo/.
Why eval is the hardest part of shipping agents
Agents are non-deterministic, multi-step, and context-dependent. A demo that “looks good” on three hand-picked prompts is not evidence of production readiness. Eval is hard because:
| Challenge | Why it hurts |
|---|---|
| No single ground truth | Paraphrases, partial credit, multiple valid tool paths |
| Subjective quality | Tone, helpfulness, safety — hard to encode as exact match |
| Distribution shift | Eval set from March; users ask differently in June |
| Compounding errors | One wrong tool call poisons a 10-step trajectory |
| Cost of human review | Expert annotators do not scale to every PR |
Callout: The teams that ship reliable agents treat eval as infrastructure, not a one-time benchmark run before launch. They maintain a living golden set, run regressions on every prompt/model change, and track online metrics after deploy.
Without eval, you optimize for vibes. With eval, you can do eval-driven development: change one variable, measure the delta, keep or revert.
Building a golden / eval dataset
A golden set (or eval set) is a curated collection of inputs with expected outputs or scoring criteria. It is your contract for “what good looks like”.
What each case should contain
{
  "id": "support-refund-042",
  "input": {
    "messages": [{"role": "user", "content": "Can I return this after 45 days?"}],
    "context": "Policy: returns within 30 days of purchase."
  },
  "reference": "Returns are accepted within 30 days; 45 days exceeds the window.",
  "metadata": {
    "category": "grounded_qa",
    "difficulty": "medium",
    "tags": ["policy", "hallucination_trap"]
  },
  "acceptance": {
    "must_mention": ["30 days"],
    "must_not_claim": ["full refund after 45 days"]
  }
}
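The `acceptance` block lends itself to a cheap deterministic check that can run before any judge model. The following is a minimal sketch, assuming case-insensitive substring matching is good enough for your phrases; `check_acceptance` is a hypothetical helper, not part of any library.

```python
def check_acceptance(output: str, acceptance: dict) -> list[str]:
    """Return a list of violations; an empty list means the case passes."""
    text = output.lower()
    violations = []
    for phrase in acceptance.get("must_mention", []):
        if phrase.lower() not in text:
            violations.append(f"missing required phrase: {phrase!r}")
    for phrase in acceptance.get("must_not_claim", []):
        if phrase.lower() in text:
            violations.append(f"forbidden claim present: {phrase!r}")
    return violations


acceptance = {
    "must_mention": ["30 days"],
    "must_not_claim": ["full refund after 45 days"],
}
good = "Returns are accepted within 30 days, so 45 days is outside the window."
bad = "Sure, you get a full refund after 45 days."
print(check_acceptance(good, acceptance))  # no violations
print(check_acceptance(bad, acceptance))   # two violations
```

Substring checks miss paraphrases such as "thirty days", so this works best as a first-pass filter alongside the other metrics in this post.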
Sourcing cases
| Source | Strength | Risk |
|---|---|---|
| Production logs (sampled, redacted) | Real distribution | Privacy, PII handling |
| Failure post-mortems | High ROI — every bug becomes a case | Reactive only |
| Synthetic generation | Scale quickly | May not match real user phrasing |
| Edge-case engineering | Catches known failure modes | Overfits to imagined threats |
| Adversarial / red-team | Finds jailbreaks, injection | Maintenance burden |
Start with 50–200 high-quality cases covering your critical paths. A thousand shallow cases often teach less than fifty that encode real failure modes. Version your dataset in git alongside prompt and model config.
Stratification
Tag cases by capability: extraction, reasoning, tool selection, multi-turn memory, refusal/safety. Report metrics per stratum: a 95% aggregate can hide 60% on tool calls.
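To make the 95% versus 60% point concrete, here is a small sketch of per-stratum reporting. The result tuples are invented for illustration, and `pass_rate_by_stratum` is a hypothetical helper.

```python
from collections import defaultdict


def pass_rate_by_stratum(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Hypothetical run: 70 extraction cases all pass, 6 of 10 tool cases pass.
results = (
    [("extraction", True)] * 70
    + [("tool_selection", True)] * 6
    + [("tool_selection", False)] * 4
)
aggregate = sum(passed for _, passed in results) / len(results)
rates = pass_rate_by_stratum(results)
print(f"aggregate: {aggregate:.2f}")
for cat, rate in rates.items():
    print(f"{cat}: {rate:.2f}")
```

The aggregate comes out at 0.95 while `tool_selection` sits at 0.60, which is the kind of gap a single headline number hides.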
Offline vs online evaluation
| Dimension | Offline eval | Online eval |
|---|---|---|
| When | Pre-deploy, CI, local dev | Production traffic |
| Data | Golden set, held-out logs | Live user interactions |
| Cost | Compute + optional human review | Latency, user trust if wrong |
| Signal | Controlled, reproducible | Real distribution, drift |
| Speed | Fast iteration | Slow feedback loops |
Offline eval is your safety net: run before merge, compare model A vs B, catch regressions. Online eval validates that offline gains transfer to real traffic, through A/B tests, shadow mode, and human review queues.
Developer change (prompt / model / RAG)
        │
        ▼
Offline golden-set eval ──► pass threshold? ──no──► block merge
        │ yes
        ▼
Staging / canary deploy
        │
        ▼
Online metrics (success rate, latency, CSAT)
        │
        ▼
Feed failures back into golden set
The loop closes when production failures become new golden cases.
Metrics: what to measure and what breaks
No single metric fits all agent tasks. Pick metrics aligned with user-visible success.
Exact match and n-gram overlap
Exact match (EM): the output equals the reference string. Works for classification, short factual answers, structured IDs. Breaks on paraphrase: "Paris" vs "The capital is Paris." scores zero.
BLEU / ROUGE: n-gram overlap with the reference. Common in summarization benchmarks. Both penalize valid rewording, and recall-oriented ROUGE can reward verbose copying. Poor fit for agents where task success matters more than wording.
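The "Paris" example can be reproduced in a few lines. This sketch uses a unigram-overlap F1 as a simple stand-in for overlap metrics; it is not BLEU or ROUGE themselves, and the normalization rules (lowercasing, stripping punctuation) are an assumption.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between prediction and reference tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The capital is Paris.", "Paris"))  # False
print(token_f1("The capital is Paris.", "Paris"))     # 0.4
```

A fully correct answer gets zero on EM and 0.4 on overlap, purely because of extra words around the right token.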
Semantic similarity
Embed output and reference; score cosine similarity. Captures paraphrase better than EM. Still blind to factual errors when sentences that differ in one critical word land close together: "Paris is in Germany" may score high against "Paris is in France". Use as a signal, not the sole gate.
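The blind spot is easy to see even with a toy vectorizer. The sketch below swaps in bag-of-words counts for a real embedding model (an assumption made so the example runs without any external service); learned embeddings behave differently in detail, but sentences differing by one word can still score close to each other.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real setup would call an embedding model."""
    return Counter(text.lower().split())


score = cosine(bow("Paris is in Germany"), bow("Paris is in France"))
print(round(score, 2))  # 0.75
```

Three of four words match, so the contradicting sentence scores 0.75. A similarity threshold alone would let it through.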
Task success rate
Did the agent accomplish the goal?
- Database query agent — returned rows match expected SQL result
- Support bot — ticket resolved without escalation (human label)
- Code agent — tests pass, diff applies cleanly
Task success is the metric executives care about. It is also the hardest to automate at scale.
Groundedness and faithfulness
When the agent has retrieved context or tool results, does the answer stick to sources?
| Metric | Definition |
|---|---|
| Groundedness | Claims supported by provided context |
| Faithfulness | No contradictions with source material |
| Citation accuracy | Quoted spans exist and support the claim |
Case TC-04 in the demo is a groundedness failure: the model claims 60 days when the doc says 30. See also AI Hallucination and How to Spot It for detection patterns.
Hallucination rate
Fraction of outputs containing **unsupported factual claims** relative to available evidence. Define “claim” operationally: entity mentions, numbers, dates, policy statements. Audit a sample manually; automate with NLI models or LLM-as-judge with human calibration.
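As one narrow operational definition, numeric claims can be checked mechanically. The sketch below flags an output when it states a number that never appears in the evidence. It covers only the "numbers" slice of the definition above; entities, dates, and policy statements would still need NLI or a judge. Both function names are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(output: str, evidence: str) -> set[str]:
    """Numbers stated in the output that never appear in the evidence."""
    return set(NUMBER.findall(output)) - set(NUMBER.findall(evidence))


def hallucination_rate(cases) -> float:
    """cases: list of (output, evidence) pairs.
    Returns the fraction of outputs with at least one unsupported number."""
    flagged = sum(1 for out, ev in cases if unsupported_numbers(out, ev))
    return flagged / len(cases) if cases else 0.0


policy = "Policy: returns within 30 days of purchase."
cases = [
    ("You can return items within 30 days.", policy),
    ("You can return items within 60 days.", policy),  # unsupported claim
]
print(hallucination_rate(cases))  # 0.5
```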
Tool-call accuracy
For function-calling agents, decompose into:
tool_selection_accuracy — picked the right tool?
argument_accuracy — JSON args valid and correct?
execution_success_rate — tool ran without error?
end_to_end_success — final answer correct given tool result?
A model can nail argument_accuracy but fail tool_selection_accuracy when two tools overlap. Report each number separately.
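The first two metrics can be computed from logged calls roughly as follows. This is a sketch under assumptions: one expected call per case, a record shape invented for illustration, and argument accuracy measured only over cases where the right tool was picked. `search_faq` is a made-up overlapping tool.

```python
def tool_call_metrics(records):
    """records: non-empty list of dicts with 'expected' and 'actual' calls,
    each call shaped like {"name": str, "args": dict}."""
    if not records:
        raise ValueError("records must not be empty")
    right_tool = [r for r in records if r["actual"]["name"] == r["expected"]["name"]]
    right_args = [r for r in right_tool if r["actual"]["args"] == r["expected"]["args"]]
    return {
        "tool_selection_accuracy": len(right_tool) / len(records),
        # Arguments are only scored when the right tool was picked.
        "argument_accuracy": len(right_args) / len(right_tool) if right_tool else 0.0,
    }


records = [
    {"expected": {"name": "search_policy", "args": {"topic": "returns"}},
     "actual": {"name": "search_policy", "args": {"topic": "returns"}}},
    {"expected": {"name": "search_policy", "args": {"topic": "returns"}},
     "actual": {"name": "search_faq", "args": {"topic": "returns"}}},
]
metrics = tool_call_metrics(records)
print(metrics)
```

Here argument accuracy is 1.0 while tool selection is 0.5, which is exactly the split a single combined score would blur.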
Latency and cost
Quality is not free. Track per-case:
| Metric | Why |
|---|---|
| Time-to-first-token (TTFT) | User-perceived responsiveness |
| Total latency | End-to-end task time |
| Tokens in / out | Direct cost driver |
| Tool round-trips | Each hop adds latency and failure surface |
A cheaper model that scores 3% lower but runs 4× faster may be the better choice when you have a P99 latency SLO to meet. Eval dashboards should plot quality vs cost Pareto fronts.
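A Pareto front is simply the set of configurations that no other configuration beats on both axes. A minimal sketch, with invented model names and numbers:

```python
def pareto_front(models):
    """models: list of (name, quality, cost); higher quality and lower cost are better.
    A model is dropped if another is at least as good on both axes
    and strictly better on one."""
    front = []
    for name, q, c in models:
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for _, q2, c2 in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical numbers: quality on a 0-1 scale, cost in dollars per 1k cases.
models = [
    ("large", 0.90, 40.0),
    ("small", 0.87, 10.0),
    ("medium", 0.86, 25.0),  # dominated by "small": lower quality, higher cost
]
print(pareto_front(models))  # ['large', 'small']
```

Only the models on the front are worth debating; the choice between them depends on your latency and budget constraints.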
LLM-as-judge
When exact match fails and human review does not scale, teams use a stronger LLM (or the same model with a rubric prompt) to score outputs.
Rubric-based scoring
Define criteria with weights and a 1–5 scale:
Correctness (35%) — factually accurate vs reference / context
Relevance (25%) — answers the actual question
Conciseness (15%) — no unnecessary padding
Format/Ground (25%) — valid JSON / faithful to sources
The interactive demo lets you weight criteria and score cases manually, then click Simulate LLM Judge to see how automated rubric scoring aggregates.
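Aggregation with the weights listed above is a weighted sum. The sketch below assumes the judge replies with a JSON object keyed by `correctness`, `relevance`, `conciseness`, and `format`; the sample reply is invented, and how the demo aggregates internally is not specified here.

```python
import json

WEIGHTS = {"correctness": 0.35, "relevance": 0.25, "conciseness": 0.15, "format": 0.25}


def weighted_score(judge_reply: str) -> float:
    """Parse the judge's JSON reply and return the weighted 1-5 score."""
    scores = json.loads(judge_reply)
    for criterion in WEIGHTS:
        value = scores[criterion]  # KeyError if the judge omitted a criterion
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} out of range: {value}")
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)


# Hypothetical judge reply, for illustration only.
reply = (
    '{"correctness": 2, "relevance": 4, "conciseness": 5, "format": 2, '
    '"rationale": "States the wrong return window."}'
)
print(round(weighted_score(reply), 2))  # 2.95
```

Validating the range and required keys matters in practice, since judge models occasionally return malformed or out-of-scale scores.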
JUDGE_PROMPT = """You are an impartial evaluator. Score the model output 1-5 on each criterion.
## Input
{input}
## Context (if any)
{context}
## Model output
{output}
## Reference (if any)
{reference}
## Criteria
- correctness: factual accuracy
- relevance: addresses the question
- conciseness: no fluff
- format: valid structure / grounded in context
Respond JSON only:
{{"correctness": N, "relevance": N, "conciseness": N, "format": N, "rationale": "..."}}
"""
Pairwise comparison
Instead of absolute scores, ask the judge: “Which output is better, A or B?” Pairwise is often more consistent than 1–5 absolutes. Use Bradley-Terry or Elo ranking to aggregate pairwise results into a model leaderboard.
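Of the two, Elo is the simpler to sketch. The version below uses the standard Elo update with K = 32 and a starting rating of 1000, both conventional defaults rather than values from this post; the verdicts are invented.

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for one comparison; returns the new (r_a, r_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def leaderboard(comparisons, start: float = 1000.0):
    """comparisons: list of (winner, loser) model names from the judge."""
    ratings = {}
    for winner, loser in comparisons:
        r_w, r_l = ratings.get(winner, start), ratings.get(loser, start)
        ratings[winner], ratings[loser] = elo_update(r_w, r_l, a_wins=True)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical judge verdicts: model_b wins twice, model_a once.
board = leaderboard([("model_b", "model_a"), ("model_b", "model_a"), ("model_a", "model_b")])
print(board)
```

Elo ratings depend on the order in which comparisons are processed, whereas a Bradley-Terry fit uses all comparisons at once. For a small, fixed batch of judge verdicts that difference can matter.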
Known biases and mitigations
| Bias | Symptom | Mitigation |
|---|---|---|
| Position bias | First answer wins ~55% | Swap A/B order, average both |
| Verbosity bias | Longer answer rated higher | Rubric: penalize padding; compare length-normalized |
| Self-preference | GPT-4 prefers GPT-4 outputs | Use different judge model than candidate |
| Anchoring | Reference answer skews scores | Hide reference when testing open-ended tasks |
| Leniency drift | Scores inflate over time | Anchor with frozen calibration cases |
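The position-bias mitigation can be sketched as a wrapper that queries the judge in both orders. Here a verdict that flips with position is treated as a tie, which is one way of "averaging both"; the `judge` callable and its "first"/"second" return convention are assumptions for illustration.

```python
from typing import Callable


def debiased_preference(judge: Callable[[str, str], str], out_a: str, out_b: str) -> str:
    """Ask the judge twice with swapped order; only a consistent verdict counts.
    `judge(first, second)` must return "first" or "second"."""
    a_wins_forward = judge(out_a, out_b) == "first"
    a_wins_backward = judge(out_b, out_a) == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # verdict flipped with position, so treat as no preference


# Stub judges for illustration; a real judge would call an LLM.
def always_first(first: str, second: str) -> str:
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"


print(debiased_preference(always_first, "answer one", "answer two"))       # tie
print(debiased_preference(prefers_shorter, "short", "a much longer one"))  # A
```

A judge that always picks the first slot contributes nothing but ties, so its position bias no longer moves the leaderboard. This doubles judge cost per comparison.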
Callout: LLM-as-judge is a proxy, not ground truth. Calibrate against human labels on a fixed slice: if judge-human agreement drops below a Cohen’s κ of roughly 0.8, fix the rubric before trusting automation.
Human evaluation and inter-annotator agreement
Human eval remains the gold standard for subjective quality, safety, and novel failure modes.
Designing a human eval
- Clear rubric with examples per score level
- Blind comparison: annotators do not see the model name
- Duplicate cases: the same case is rated by 2+ annotators
- Adjudication: a third reviewer resolves disagreements
Inter-annotator agreement
Measure Cohen’s κ (two raters) or Fleiss’ κ (multiple raters):
κ < 0.40 — poor agreement; rubric is ambiguous
κ 0.40–0.60 — moderate; refine definitions
κ 0.60–0.80 — substantial; usable for training judges
κ > 0.80 — strong; rubric is tight
Low κ means fix the rubric, not blame annotators. Human labels also calibrate LLM-as-judge: run both on the same 100 cases monthly.
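Cohen's κ corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement from each rater's label frequencies. A small self-contained sketch, with invented labels:

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    freq_1, freq_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(freq_1[label] * freq_2[label] for label in freq_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(human, judge))  # 0.5
```

Raw agreement here is 75%, yet κ is only 0.5, which is why percent agreement and κ should not be used interchangeably when setting thresholds.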
Regression suites in CI
Treat your golden set like unit tests. On every PR that touches prompts, retrieval, model version, or agent graph:
1. Run agent on golden set (fixed seed where possible)
2. Score with automated metrics + LLM-as-judge
3. Compare vs baseline branch (main)
4. Fail if aggregate drops > ε OR any P0 case fails
The demo’s A/B Regression tab shows Model A (baseline) vs Model B (candidate) per case; TC-04 improves from 2.1 → 3.8 after a prompt fix.
# Pseudocode: CI eval gate (load_scores and run_eval are placeholders)
BASELINE = load_scores("main")
CANDIDATE = run_eval(agent_config, golden_set)
REGRESSION_TOLERANCE = 0.05  # max allowed drop on aggregate
P0_CASES = {"support-refund-042", "sql-injection-007"}

# Any failing P0 case blocks the merge, regardless of the aggregate.
for case_id in P0_CASES:
    assert CANDIDATE[case_id].passed, f"P0 regression: {case_id}"

# The aggregate may move, but must not drop by more than the tolerance.
delta = CANDIDATE.aggregate - BASELINE.aggregate
assert delta >= -REGRESSION_TOLERANCE, f"Aggregate dropped by {abs(delta):.2f}"
Store eval artifacts (scores, outputs, latency) as CI artifacts for diff review. Pin model version and prompt hash in the eval config.
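A runnable variant of the gate, as a sketch: it takes plain dictionaries (the shape shown in the docstring is an assumption, as are the scores) and returns the reasons to block instead of asserting, because `assert` statements are skipped when Python runs with `-O`.

```python
def eval_gate(baseline: dict, candidate: dict, p0_cases: set, tolerance: float = 0.05) -> list[str]:
    """Return the reasons to block a merge; an empty list means the gate passes.
    Both inputs look like {"aggregate": float, "cases": {case_id: bool}}."""
    failures = []
    for case_id in sorted(p0_cases):
        # A P0 case missing from the candidate run counts as failed.
        if not candidate["cases"].get(case_id, False):
            failures.append(f"P0 regression: {case_id}")
    delta = candidate["aggregate"] - baseline["aggregate"]
    if delta < -tolerance:
        failures.append(f"Aggregate dropped by {abs(delta):.2f}")
    return failures


# Hypothetical scores; in CI these would be loaded from stored eval artifacts.
baseline = {"aggregate": 0.91, "cases": {"support-refund-042": True, "sql-injection-007": True}}
candidate = {"aggregate": 0.84, "cases": {"support-refund-042": True, "sql-injection-007": False}}

problems = eval_gate(baseline, candidate, {"support-refund-042", "sql-injection-007"})
for problem in problems:
    print(problem)
# A CI script would exit non-zero when `problems` is non-empty.
```

Returning every failure at once, rather than stopping at the first, gives reviewers the full picture in a single CI run.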
Eval-driven development
Borrow the TDD loop for agents:
1. Add failing case to golden set (from bug report or anticipated edge)
2. Run eval — confirm failure
3. Change prompt / RAG / model / tool schema
4. Re-run eval — confirm case passes without regressions elsewhere
5. Merge with CI gate
This prevents the “fix one thing, break three” cycles that plague prompt iteration. Eval-driven development pairs naturally with the adaptation strategies from Fine-tuning vs Prompting vs RAG: each strategy change gets a measured delta.
Agent-specific evaluation
Single-turn QA metrics miss most agent failure modes.
Multi-step trajectories
Log the full trace: thoughts, tool calls, observations, final answer. Score at each step and end-to-end:
| Level | Question |
|---|---|
| Step | Was this tool call correct given state? |
| Trajectory | Did the path reach the goal efficiently? |
| Outcome | Is the final answer correct? |
A correct final answer via a wasteful 12-step path is a latent cost and reliability bug.
Tool use evaluation
{
  "expected_tools": ["search_policy", "format_response"],
  "forbidden_tools": ["delete_user"],
  "max_tool_calls": 3,
  "expected_final_state": {"ticket_status": "resolved"}
}
Check tool sequence order when it matters, for example authenticate before transfer_funds.
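A spec like the one above can be checked against the logged tool names. The sketch below is one possible interpretation: expected tools must appear in the listed relative order with other calls allowed in between, and `expected_final_state` is left out because it depends on your environment. `check_trace` is a hypothetical helper.

```python
def check_trace(tool_calls: list[str], spec: dict) -> list[str]:
    """Compare the tool names an agent called, in order, against a case spec."""
    problems = []
    for tool in spec.get("forbidden_tools", []):
        if tool in tool_calls:
            problems.append(f"forbidden tool called: {tool}")
    if len(tool_calls) > spec.get("max_tool_calls", len(tool_calls)):
        problems.append(f"too many tool calls: {len(tool_calls)}")
    # Subsequence check: `in` on an iterator consumes it up to the match,
    # so each expected tool must appear after the previous one.
    remaining = iter(tool_calls)
    for tool in spec.get("expected_tools", []):
        if tool not in remaining:
            problems.append(f"missing or out-of-order tool: {tool}")
            break
    return problems


spec = {
    "expected_tools": ["search_policy", "format_response"],
    "forbidden_tools": ["delete_user"],
    "max_tool_calls": 3,
}
print(check_trace(["search_policy", "format_response"], spec))  # []
print(check_trace(["format_response", "search_policy"], spec))  # order violation
```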
Memory and context eval
Multi-turn cases test whether the agent retains constraints across turns. Inject distractor turns to test focus. After context compaction (Part 5), re-run memory-heavy cases.
Safety and refusal eval
Include cases that should refuse: jailbreak attempts, out-of-scope requests, PII extraction. A high task-success rate means nothing if the agent also complies with harmful requests.
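Refusal behaviour can be reported next to task success. The sketch below assumes each case carries a `should_refuse` label and a `did_refuse` judgement (from human review or a calibrated judge). Tracking over-refusal alongside unsafe compliance is an addition made by this sketch, on the assumption that an agent refusing everything should not score well either.

```python
def refusal_metrics(cases):
    """cases: list of (should_refuse, did_refuse) boolean pairs."""
    harmful = [did for should, did in cases if should]
    benign = [did for should, did in cases if not should]
    return {
        # Share of harmful requests the agent complied with.
        "unsafe_compliance_rate": harmful.count(False) / len(harmful) if harmful else 0.0,
        # Share of legitimate requests the agent refused.
        "over_refusal_rate": benign.count(True) / len(benign) if benign else 0.0,
    }


# Hypothetical labels: 2 harmful requests, 4 benign ones.
cases = [
    (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]
metrics = refusal_metrics(cases)
print(metrics)
```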
Putting it together: a minimal eval stack
golden_set.json — versioned cases + metadata
rubric.yaml — criteria, weights, pass threshold
run_eval.py — batch inference + scoring
judge_prompt.txt — LLM-as-judge template
ci_eval.yml — PR gate vs baseline
dashboard — aggregate + per-stratum + cost/latency
| Layer | Tooling examples |
|---|---|
| Dataset | Git, Label Studio, custom JSON |
| Inference | Your agent runner, LangSmith traces, Braintrust |
| Metrics | Custom scripts, Ragas, DeepEval |
| Judge | GPT-4o, Claude, open-weight Llama + rubric |
| CI | GitHub Actions, threshold gates |
| Online | A/B platform, human review queue |
Start minimal: JSON golden set, one rubric, one script, one CI job. Expand when a metric gap blocks a release decision.
Key takeaways
- Eval is infrastructure, not a pre-launch checkbox.
- Golden sets encode real failures; stratify and version them.
- Classic metrics (EM, BLEU) break on paraphrase; use task success, groundedness, and tool accuracy for agents.
- LLM-as-judge scales rubric scoring; calibrate against humans and mitigate position/verbosity bias.
- Regression in CI catches prompt and model drift before users do.
- Agent eval requires trajectory-level scoring, not just final-answer checks.
Next up: once you can measure quality, choose the model and framework that hits your eval bar at acceptable cost. See Choosing a Model.
The Building AI Agents series
- Tokens & Context Windows
- Sampling: temperature, top_p, top_k
- Prompt Engineering for Agents
- Stopping Criteria & Output Control
- Context Engineering & Memory
- Fine-tuning vs Prompting vs RAG
- Evaluating LLMs & Agents (current)
- Choosing a Model
- Function Calling & Tool Use
- Agent Patterns: ReAct, Reflection, Planning