Choosing an LLM for Agents: A Durable Framework Beyond Leaderboards
A senior engineer framework for model selection — capability tiers, context, modality, cost, privacy, tool use — plus routing, cascades, and why benchmarks lie.
Part 8 of the Building AI Agents series. Previous: Evaluating LLMs & Agents · Next: Function Calling & Tool Use.
Model leaderboards go stale before you finish reading them. Vendors ship new checkpoints weekly, pricing shifts, and your agent workload is nothing like MMLU. The durable skill is not memorizing which model is #1 today; it is building a selection framework you can re-run every quarter.
This post gives you that framework: capability tiers, hard constraints, eval-driven shortlists, and production patterns like routing and cascades. For a living catalog of specific models, see LLM Models Comparison Guide and Open Source LLM Ecosystem; this article stays at the decision layer.
Open the full demo: /tools/model-selector-demo/.
Start with constraints, not hype
Before comparing benchmarks, write down what your agent must satisfy. These are pass/fail gates: a model that scores 95 on a public leaderboard but cannot run in your VPC is disqualified.
| Constraint category | Questions to answer |
|---|---|
| Data & privacy | Can prompts leave your network? PII, HIPAA, SOC2? |
| Latency & UX | Sub-second first token? Streaming required? |
| Modality | Text only, or vision/audio for documents and screenshots? |
| Context length | Max input per turn — 8k, 128k, 1M tokens? |
| Output shape | JSON schema, tool calls, free-form prose? |
| Cost envelope | Budget per task, per user/day, or per 1M tokens? |
| Ops maturity | Managed API vs self-host GPUs vs hybrid? |
Callout: Constraints typically eliminate most candidates (around 80%) before you open a single benchmark page. Treat “we might need vision someday” as a soft preference, not a hard gate, unless the product already ships image inputs.
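The pass/fail gates can be sketched as a simple filter that runs before any benchmark comparison. This is a minimal illustration, not a definitive implementation: the candidate names, field names, and thresholds below are all hypothetical placeholders for values you would record yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical label, not a real model
    runs_in_vpc: bool
    supports_vision: bool
    max_context_tokens: int
    p95_ttft_ms: int         # measured time-to-first-token, p95


@dataclass(frozen=True)
class HardConstraints:
    require_vpc: bool
    require_vision: bool
    min_context_tokens: int
    max_ttft_ms: int


def failed_gates(c: Candidate, hc: HardConstraints) -> list[str]:
    """Return every gate the candidate fails; an empty list means it passes."""
    failures = []
    if hc.require_vpc and not c.runs_in_vpc:
        failures.append("data_privacy")
    if hc.require_vision and not c.supports_vision:
        failures.append("modality")
    if c.max_context_tokens < hc.min_context_tokens:
        failures.append("context_length")
    if c.p95_ttft_ms > hc.max_ttft_ms:
        failures.append("latency")
    return failures


def shortlist(candidates, hc):
    """Keep only candidates that pass every hard gate."""
    return [c for c in candidates if not failed_gates(c, hc)]


# Assumed example data: four made-up candidates and one set of constraints.
CANDIDATES = [
    Candidate("frontier-api", False, True, 1_000_000, 900),
    Candidate("mid-api", False, True, 128_000, 400),
    Candidate("open-mid-selfhost", True, False, 128_000, 500),
    Candidate("open-small-selfhost", True, False, 8_000, 150),
]
CONSTRAINTS = HardConstraints(
    require_vpc=True, require_vision=False,
    min_context_tokens=32_000, max_ttft_ms=800,
)
SURVIVORS = shortlist(CANDIDATES, CONSTRAINTS)
print([c.name for c in SURVIVORS])  # ['open-mid-selfhost']
```

Returning the list of failed gates, rather than a bare boolean, keeps a record of why each candidate was dropped, which helps when a constraint is relaxed later.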
Capability tiers: think in classes, not SKUs
The market clusters into a handful of representative tiers that age slowly even as individual model names churn. Map your workload to a tier first; pick a specific checkpoint second.
┌─────────────────────────────────────────────────────────────────┐
│ TIER │ TYPICAL USE IN AGENTS │
├────────────────────┼────────────────────────────────────────────┤
│ Frontier / large │ Hard reasoning, ambiguous specs, codegen │
│ Mid / balanced │ Default production agent loop, tool use │
│ Small / fast │ Routing, classification, simple extract │
│ Reasoning model │ Math, planning, verify-before-act steps │
│ Open-weight │ VPC-only, fine-tune, air-gapped deploy │
└────────────────────┴────────────────────────────────────────────┘
Frontier / large models maximize general capability at the highest cost and latency. Use them when errors are expensive: legal review, complex codegen, multi-file refactors.
Mid / balanced is the workhorse tier for most agent products. Tool-calling quality, structured output, and instruction-following are usually “good enough” at 3–10× lower cost than frontier.
Small / fast models shine in routing and prefilter roles: intent detection, safety classifiers, summarizing logs before escalation. Never assume a small model can replace a mid tier on multi-step tool loops without measuring.
Reasoning models (extended thinking, chain-of-thought baked in) trade latency and cost for hard logic. They are poor defaults for chat UX but excellent as escalation targets when a cheaper model fails verification.
Open-weight / self-host tiers buy data sovereignty and fine-tune freedom at the cost of GPU ops, quantization tradeoffs, and uneven tool-calling. See Open Source LLM Ecosystem for deployment patterns.
Context length: need vs afford
Long context is not free even when advertised. Many providers bill input tokens linearly; attention cost grows super-linearly on some architectures. A 500k-token dump into a 1M window can cost more than summarizing in chunks with a cheaper model.
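A back-of-the-envelope comparison makes the 500k-token claim concrete. The sketch below uses entirely hypothetical prices and chunk sizes (none come from a real price list) and counts only the tokens that differ between the two approaches; the final answer's output tokens are ignored because both approaches pay for them.

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Linear token billing: tokens times price per million."""
    return tokens / 1_000_000 * usd_per_million


# Hypothetical prices in USD per 1M tokens.
BIG_INPUT_PRICE = 3.00     # long-context model, input
SMALL_INPUT_PRICE = 0.25   # cheap summarizer, input
SMALL_OUTPUT_PRICE = 1.00  # cheap summarizer, output

DOC_TOKENS = 500_000
CHUNK_TOKENS = 10_000      # assumed chunk size
SUMMARY_TOKENS = 500       # assumed summary length per chunk


def one_shot_cost() -> float:
    """Send the whole document to the long-context model."""
    return token_cost(DOC_TOKENS, BIG_INPUT_PRICE)


def chunked_cost() -> float:
    """Summarize each chunk cheaply, then send only the summaries to the big model."""
    n_chunks = -(-DOC_TOKENS // CHUNK_TOKENS)  # ceiling division
    summary_total = n_chunks * SUMMARY_TOKENS
    summarize_in = token_cost(DOC_TOKENS, SMALL_INPUT_PRICE)
    summarize_out = token_cost(summary_total, SMALL_OUTPUT_PRICE)
    final_in = token_cost(summary_total, BIG_INPUT_PRICE)
    return summarize_in + summarize_out + final_in


print(f"one shot: ${one_shot_cost():.3f}")
print(f"chunked:  ${chunked_cost():.3f}")
```

Under these assumed numbers the chunked path is several times cheaper, but it also discards detail. Whether the summaries preserve enough information is a question for your eval set, not for the arithmetic.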
| Signal | Prefer long-context tier | Prefer chunk + RAG / summarize |
|---|---|---|
| Whole-repo reasoning in one shot | ✓ | |
| Mostly retrieval over fixed corpus | | ✓ |
| Latency-sensitive chat | | ✓ |
| Legal doc “read everything” audit | ✓ | |
| Recurring same large context | | ✓ (cache / index) |
Callout: Context window size is a ceiling, not a quality guarantee. Models can lose “needle” accuracy in the middle of huge prompts, so validate with your own long-input evals. Part 1 covered token economics in depth: Tokens & Context Windows.
Modality: vision, audio, and agent UX
Multimodal agents are no longer “text model + OCR wrapper” in 2026. Native vision models understand layout, charts, and UI screenshots, which is critical for browser agents and document workflows.
When evaluating modality fit:
- Hard requirement: If users upload images or PDFs rendered as pages, filter to vision-capable tiers early.
- Audio: Real-time voice agents add streaming ASR/TTS latency budgets separate from LLM latency.
- Structured docs: Tables and forms often need vision or specialized parsers; do not assume markdown conversion preserves semantics.
Reasoning vs non-reasoning models
Standard chat models answer in one forward pass per token. Reasoning models allocate extra compute (internal chain-of-thought, self-consistency, or search) before emitting the user-visible answer.
| Dimension | Non-reasoning (Mid/Frontier chat) | Reasoning tier |
|---|---|---|
| Latency | Lower, predictable streaming | High, bursty |
| Cost per task | Token-linear | Often 5–20× for hard problems |
| Best for | Tool loops, extraction, dialogue | Proof, planning, ambiguous math |
| Agent pattern | Default loop | Escalation after failed verify |
Use reasoning models surgically: a router sends easy steps to the mid tier and hard steps to the reasoning tier, never the reverse by default. Sampling knobs from Part 2 still apply: low temperature for verification steps.
Latency, throughput, and SLA math
Agent latency is the sum of many LLM calls plus tool I/O. Model choice affects every layer:
User message
→ router model (small, ~100ms)
→ planner model (mid, ~800ms)
→ tool execution (variable)
→ synthesizer model (mid, ~600ms)
→ optional verifier (small or reasoning)
= perceived latency budget
Time-to-first-token (TTFT) matters for streaming UX. Tokens per second matters for long codegen. Batch APIs trade latency for lower cost (discounts of around 50% are common), which is fine for offline eval pipelines and wrong for interactive agents.
Pick tier per step, not per product: the same agent may call small for classify, mid for act, reasoning for repair.
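The latency budget can be checked with simple addition before any model is swapped. In this sketch the router, planner, and synthesizer figures come from the diagram above; the tool-execution time, the verifier time, and the 2-second budget are assumptions added for illustration.

```python
# (step name, tier, typical latency in ms). Tool and verifier figures are assumed.
PIPELINE = [
    ("router", "small", 100),
    ("planner", "mid", 800),
    ("tool_execution", None, 400),
    ("synthesizer", "mid", 600),
]
OPTIONAL_VERIFIER = ("verifier", "small", 150)
BUDGET_MS = 2_000  # assumed product requirement


def total_latency_ms(steps):
    """Steps run sequentially, so typical latencies add up."""
    return sum(ms for _, _, ms in steps)


def slowest_step(steps):
    """Name of the step that contributes the most latency."""
    return max(steps, key=lambda step: step[2])[0]


WITHOUT_VERIFIER = total_latency_ms(PIPELINE)
WITH_VERIFIER = total_latency_ms(PIPELINE + [OPTIONAL_VERIFIER])

print(WITHOUT_VERIFIER, WITHOUT_VERIFIER <= BUDGET_MS)  # 1900 True
print(WITH_VERIFIER, WITH_VERIFIER <= BUDGET_MS)        # 2050 False
print(slowest_step(PIPELINE))                           # planner
```

With these numbers the verifier pushes the loop over budget, so the first candidate for a faster tier is the planner step, not the whole product. Tail latencies need real measurement; adding typical values only gives a first estimate.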
Cost per token, and per successful task
List price is a lower bound. Real cost includes retries, overlong prompts, failed tool parses, and escalation.
| Cost driver | Mitigation |
|---|---|
| Input-heavy prompts | Prompt compression, caching, retrieve don’t stuff |
| Multi-step loops | Step caps, cheaper model for drafts |
| Reasoning tax | Gate with verifier; don’t run reasoning-tier models on every turn |
| Output bloat | Stopping criteria (Part 4), max_tokens discipline |
Normalize comparisons to cost per successful task on your eval set, not cost per 1M tokens in isolation. Deep patterns: LLM Cost Optimization Patterns.
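A small worked example shows why the normalization matters. The two eval runs below are invented numbers, and `EvalRun` is a hypothetical record type; the total cost is assumed to already include retries and escalations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    model: str              # hypothetical tier label
    tasks: int
    successes: int
    total_cost_usd: float   # assumed to include retries and escalations


def cost_per_successful_task(run: EvalRun) -> float:
    """Total spend divided by tasks that actually succeeded."""
    if run.successes == 0:
        return float("inf")
    return run.total_cost_usd / run.successes


RUNS = [
    EvalRun("small-tier", tasks=100, successes=55, total_cost_usd=2.20),
    EvalRun("mid-tier", tasks=100, successes=90, total_cost_usd=2.70),
]

for run in RUNS:
    per_task = run.total_cost_usd / run.tasks
    per_success = cost_per_successful_task(run)
    print(f"{run.model}: per task ${per_task:.3f}, per success ${per_success:.3f}")

CHEAPEST = min(RUNS, key=cost_per_successful_task).model
print(CHEAPEST)  # mid-tier
```

In this made-up data the small tier is cheaper per task but more expensive per success, because almost half of its spend buys failures.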
Open-weight vs proprietary
| Factor | Proprietary API | Open-weight self-host |
|---|---|---|
| Time to first agent | Hours | Days–weeks (GPU, quant, serving) |
| Data residency | Vendor DPA / region | Full control |
| Fine-tuning | Often limited / expensive | Full weights, LoRA, distillation |
| Tool calling | Usually mature | Model-dependent, test hard |
| Model churn | Vendor deprecates versions | You control upgrade cadence |
Hybrid is common: proprietary for prototyping, open-weight for regulated production, or an open small model for routing plus proprietary for hard steps.
Fine-tunability: when weights beat prompts
Fine-tuning helps when you have many examples of a narrow behavior that prompting cannot stabilize: tone, schema quirks, domain jargon, tool selection bias. It hurts when the task drifts weekly or you lack eval data. Part 6 covers the tradeoff space: Fine-tuning vs Prompting vs RAG.
Open-weight tiers dominate fine-tune scenarios; proprietary APIs increasingly offer adapter-style tuning on mid tiers. Always evaluate fine-tuned models against the same eval harness you use for model selection.
Tool-calling and structured output quality
Agents live or die on reliable function calls and schema-valid JSON. Capability benchmarks rarely measure this; your eval must.
Checklist when shortlisting models for tool use:
- Parallel tool calls supported?
- Strict JSON / `response_format` / grammar constraints?
- Behavior when tool returns error: retry or hallucinate?
- Multi-turn tool loops stable at temperature 0.1–0.3?
- Native vs prompt-wrapped tool protocols (compare latency)
Part 9 goes deep on implementation: Function Calling & Tool Use.
Example agent tool schema fragment:
{
  "name": "search_docs",
  "description": "Semantic search over internal wiki",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "limit": { "type": "integer", "minimum": 1, "maximum": 20 }
    },
    "required": ["query"]
  }
}
Run 50–200 tool-call scenarios per candidate model; report parse success rate and downstream task success separately.
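Reporting the two rates separately can be sketched as follows. This is a hand-rolled check for the `search_docs` schema above using only the standard library, not a general JSON Schema validator. The `{"name": ..., "arguments": {...}}` wrapper is an assumed output shape (real providers differ), and the four scenarios are invented.

```python
import json


def parse_search_docs_call(raw: str):
    """Return validated arguments, or None if the call does not match the schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or call.get("name") != "search_docs":
        return None
    args = call.get("arguments")
    if not isinstance(args, dict) or not isinstance(args.get("query"), str):
        return None  # "query" is required and must be a string
    if "limit" in args:
        limit = args["limit"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int):
            return None
        if not 1 <= limit <= 20:
            return None
    return args


def score(scenarios):
    """scenarios: list of (raw_model_output, downstream_task_succeeded)."""
    n = len(scenarios)
    parsed = sum(1 for raw, _ in scenarios if parse_search_docs_call(raw) is not None)
    succeeded = sum(1 for _, ok in scenarios if ok)
    return {"parse_success_rate": parsed / n, "task_success_rate": succeeded / n}


# Invented scenarios: (raw model output, did the downstream task succeed?)
SCENARIOS = [
    ('{"name": "search_docs", "arguments": {"query": "vpn setup", "limit": 5}}', True),
    ('{"name": "search_docs", "arguments": {"query": "vpn setup", "limit": 50}}', False),
    ('{"name": "search_docs", "arguments": {"query": "holiday policy"}}', False),
    ("Sure! I will search the docs.", False),
]
REPORT = score(SCENARIOS)
print(REPORT)  # {'parse_success_rate': 0.5, 'task_success_rate': 0.25}
```

The third scenario is the reason to keep the metrics apart: the call is schema-valid, yet the task still fails, which points at query quality or tool choice instead of output formatting.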
Benchmarks: useful signal, dangerous default
Public benchmarks (MMLU, HumanEval, MATH, etc.) rank general knowledge and exam skills, not your agent’s ticket-triage workflow. Three systemic problems:
- Contamination: training data overlaps test sets; scores inflate.
- Gaming: vendors optimize for leaderboard tasks.
- Distribution shift: your users, tools, and failure modes differ.
Callout: A model that is +2 points on MMLU is not evidence it will parse your `create_invoice` tool correctly. Treat public benchmarks as orientation, not selection.
Build your own evals. Part 7 is the playbook: Evaluating LLMs & Agents. Minimum viable selection harness:
1. 30–50 golden tasks from production logs (redacted)
2. Metrics: success, tool accuracy, latency p95, cost per task
3. Run all tier candidates with identical prompts + tools
4. Blind review of failures — model vs prompt vs tool bug
5. Pick winner on Pareto frontier (quality × cost × latency)
Re-run when vendors ship new checkpoints or your task mix shifts.
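Step 5 of the harness, picking from the Pareto frontier, can be sketched in a few lines. A candidate is dropped only if some other candidate is at least as good on every metric and strictly better on at least one. The model labels and scores below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str             # hypothetical label
    success_rate: float    # higher is better
    cost_per_task: float   # lower is better
    p95_latency_s: float   # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (
        a.success_rate >= b.success_rate
        and a.cost_per_task <= b.cost_per_task
        and a.p95_latency_s <= b.p95_latency_s
    )
    strictly_better = (
        a.success_rate > b.success_rate
        or a.cost_per_task < b.cost_per_task
        or a.p95_latency_s < b.p95_latency_s
    )
    return no_worse and strictly_better


def pareto_frontier(results):
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Invented eval results.
RESULTS = [
    Result("frontier", 0.94, 0.120, 6.0),
    Result("mid-a", 0.90, 0.030, 2.5),
    Result("mid-b", 0.86, 0.035, 3.0),   # worse than mid-a on all three metrics
    Result("small", 0.62, 0.004, 0.8),
]
FRONTIER = [r.model for r in pareto_frontier(RESULTS)]
print(FRONTIER)  # ['frontier', 'mid-a', 'small']
```

The frontier narrows the field but does not pick a winner. Choosing among the survivors still depends on the hard constraints and on how much each failure costs the product.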
Model routing and fallbacks
Routing sends each request to the cheapest tier that can handle it. Signals for routers:
- Intent classification (small model or embeddings)
- Estimated complexity (token count, tool count, user tier)
- Confidence from previous step
- Explicit user mode (“fast” vs “thorough”)
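The signals above can be combined in a rule-based router, sketched here as one possible starting point. Everything in it is an assumption made for illustration: the intent names, the thresholds, the field names, and the order in which rules fire. A production router might replace the rules with a small classifier.

```python
from dataclasses import dataclass
from typing import Optional

ROUTER_POLICY_VERSION = "router-v1"  # hypothetical; bump whenever rules change
SIMPLE_INTENTS = {"greeting", "faq", "status_check"}  # assumed intent labels


@dataclass(frozen=True)
class Request:
    intent: str                      # from a small classifier or embeddings
    prompt_tokens: int
    expected_tool_calls: int
    prev_step_confidence: float      # 0.0 to 1.0, from the previous step
    user_mode: Optional[str] = None  # "fast", "thorough", or None


def route(req: Request) -> str:
    """Return the cheapest tier expected to handle the request."""
    if req.user_mode == "thorough":
        return "frontier"
    simple = (
        req.intent in SIMPLE_INTENTS
        and req.expected_tool_calls == 0
        and req.prompt_tokens <= 2_000
    )
    if simple:
        return "small"
    hard = req.prev_step_confidence < 0.5 or req.expected_tool_calls > 3
    if hard and req.user_mode != "fast":
        return "frontier"
    return "mid"


print(route(Request("greeting", 40, 0, 0.9)))                  # small
print(route(Request("refund_request", 1_200, 2, 0.8)))         # mid
print(route(Request("refund_request", 1_200, 5, 0.8)))         # frontier
print(route(Request("refund_request", 1_200, 5, 0.8, "fast"))) # mid
```

Keeping the policy as a pure function with an explicit version string makes it easy to replay logged requests against a new policy before shipping it.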
Fallbacks handle provider outages and quality cliffs:
Primary: Mid tier (Vendor A)
↓ timeout / 5xx
Fallback: Mid tier (Vendor B)
↓ repeated tool-parse failure
Escalate: Frontier tier
↓ still failing
Degrade: Human handoff + log for eval
Never silently switch tiers without logging; otherwise eval drift will mystify you. Version the router policy like any other code.
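The fallback chain with logging can be sketched as a loop over tiers. This is a simplified version: it moves to the next tier on the first failure, whereas the chain above escalates only after repeated tool-parse failures. The exception classes, the stub model functions, and the policy version string are all hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("model_fallback")
POLICY_VERSION = "fallback-v1"  # hypothetical; version it like any other code


class ProviderError(Exception):
    """Stands in for a timeout or 5xx from a vendor."""


class ToolParseError(Exception):
    """Stands in for a tool call that failed schema validation."""


def run_with_fallback(prompt: str, chain: list[tuple[str, Callable[[str], str]]]) -> dict:
    """Try each tier in order; log every switch; hand off to a human at the end."""
    attempts = []
    for tier, call in chain:
        try:
            answer = call(prompt)
        except (ProviderError, ToolParseError) as exc:
            reason = type(exc).__name__
            logger.warning("policy=%s tier=%s failed: %s", POLICY_VERSION, tier, reason)
            attempts.append((tier, reason))
            continue
        attempts.append((tier, "ok"))
        return {"answer": answer, "served_by": tier,
                "attempts": attempts, "policy": POLICY_VERSION}
    return {"answer": None, "served_by": "human_handoff",
            "attempts": attempts, "policy": POLICY_VERSION}


# Stub models that simulate the failure modes in the diagram.
def vendor_a(prompt: str) -> str:
    raise ProviderError("503")


def vendor_b(prompt: str) -> str:
    raise ToolParseError("invalid JSON")


def frontier(prompt: str) -> str:
    return "handled: " + prompt


RESULT = run_with_fallback(
    "refund order 123",
    [("mid/vendor-a", vendor_a), ("mid/vendor-b", vendor_b), ("frontier", frontier)],
)
print(RESULT["served_by"])  # frontier
print(RESULT["attempts"])
```

Returning the attempt history alongside the answer means every production response carries the tier decision with it, so later eval runs can be split by the tier that actually served the request.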
Cascades: cheap first, expensive on demand
A cascade runs a fast cheap model first, then re-runs with a stronger model only when verification fails. Classic pattern:
Small → draft answer or tool plan
Verifier (rules + small LLM) → pass?
yes → return
no → Mid re-run with verifier feedback
still fail? → Reasoning tier
Cascades can cut average cost by 40–70% on mixed-difficulty workloads when verification is cheap and accurate. The verifier is the linchpin, so invest in it.
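The cascade control flow fits in a short loop. The sketch below is illustrative only: the three model functions are stubs that return canned strings, and the verifier is a single regular-expression rule standing in for the "rules + small LLM" verifier described above.

```python
import re
from typing import Callable, Optional

# (task, verifier_feedback) -> answer
Model = Callable[[str, Optional[str]], str]


def verify(answer: str) -> Optional[str]:
    """Rule-based verifier: None means pass, otherwise feedback for the next tier."""
    if re.fullmatch(r"total: \d+", answer):
        return None
    return "Answer must look like 'total: <integer>'."


def cascade(task: str, tiers: list[tuple[str, Model]],
            verifier: Callable[[str], Optional[str]]) -> tuple[Optional[str], list[str]]:
    """Run tiers cheapest first; stop at the first answer that passes verification."""
    feedback = None
    tried = []
    for name, model in tiers:
        tried.append(name)
        answer = model(task, feedback)   # stronger tiers see the verifier feedback
        feedback = verifier(answer)
        if feedback is None:
            return answer, tried
    return None, tried                   # every tier failed verification


# Stub models with canned outputs (hypothetical behavior).
def small_model(task: str, feedback: Optional[str]) -> str:
    return "total: twelve"


def mid_model(task: str, feedback: Optional[str]) -> str:
    return "total: 12" if feedback else "total: about 12"


def reasoning_model(task: str, feedback: Optional[str]) -> str:
    return "total: 12"


ANSWER, TRIED = cascade(
    "sum the invoice lines",
    [("small", small_model), ("mid", mid_model), ("reasoning", reasoning_model)],
    verify,
)
print(ANSWER, TRIED)  # total: 12 ['small', 'mid']
```

The reasoning tier is never called in this run, which is where the savings come from. A verifier that passes bad answers ends the cascade early with a wrong result, and one that rejects good answers pays for stronger tiers needlessly.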
Combine cascades with caching (Part 5 memory patterns) and prompt templates (Part 3) for compounding savings.
A repeatable selection workflow
Use this checklist every quarter or before major agent features ship:
| Step | Action |
|---|---|
| 1 | Document hard constraints (privacy, latency, modality, context) |
| 2 | Map workload to tier (frontier / mid / small / reasoning / open) |
| 3 | Build shortlist from comparison guide |
| 4 | Run private eval harness (Part 7) — 30+ golden tasks |
| 5 | Measure cost per successful task, not list price |
| 6 | Prototype routing + cascade in staging |
| 7 | Log tier decisions in production for continuous re-eval |
CONSTRAINTS → TIER → SHORTLIST → EVAL → ROUTING DESIGN → SHIP → RE-EVAL
Common anti-patterns
- Frontier everywhere: burns budget; the mid tier can handle most agent steps (roughly 70%).
- Leaderboard shopping: optimizes irrelevant skills.
- Ignoring tool-call eval: chat quality ≠ agent quality.
- Single vendor lock-in: no fallback when an API deprecates a version.
- Static choice: a model picked at a hackathon and never revisited.
Key takeaways
- Select by tier and constraints first, specific checkpoint second.
- Long context, vision, and reasoning are paid upgrades; use them only when eval proves the need.
- Public benchmarks orient; your evals decide (Part 7).
- Production agents use routing, fallbacks, and cascades, not one model for every step.
- Re-run selection when checkpoints, pricing, or task mix changes.
Next up: once you have a model shortlist, the agent loop depends on reliable tool invocation. See Function Calling & Tool Use.
The Building AI Agents series
- Tokens & Context Windows
- Sampling: temperature, top_p, top_k
- Prompt Engineering for Agents
- Stopping Criteria & Output Control
- Context Engineering & Memory
- Fine-tuning vs Prompting vs RAG
- Evaluating LLMs & Agents
- Choosing a Model (current)
- Function Calling & Tool Use
- Agent Patterns: ReAct, Reflection, Planning