Filter/Tag

#evaluation

1 entries

2026-02-23 ai llm agents

Evaluating LLMs & Agents: Golden Sets, Metrics, LLM-as-Judge, and Regression in CI

Why eval is the hardest part of shipping agents — golden datasets, offline vs online metrics, LLM-as-judge rubrics, human agreement, and regression in CI.