Evaluating LLMs & Agents: Golden Sets, Metrics, LLM-as-Judge, and Regression in CI
Why evaluation is the hardest part of shipping agents: golden datasets, offline vs. online metrics, LLM-as-judge rubrics, human agreement, and regression testing in CI.