LLM & Agent Evaluation

Score a golden dataset with weighted rubrics, simulate an LLM-as-judge, and compare two model versions A/B. Everything runs offline, with no API keys.

Educational demo. Scores are computed locally from your rubric weights and 1–5 ratings. "Simulate LLM judge" fills in canned scores; it does not call a real model.
Golden Dataset (5 cases)
Rubric Weights

Weights auto-normalize to 100%. Pass threshold: weighted avg ≥ 3.5
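
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the rubric names (`accuracy`, `helpfulness`, `safety`) and the function names are hypothetical, and only the 1–5 rating scale, the weight normalization, and the 3.5 pass threshold come from the text.

```python
from typing import Dict

PASS_THRESHOLD = 3.5  # from the page: weighted avg >= 3.5 passes


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw weights so they sum to 1.0 (i.e. 100%)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one rubric weight must be positive")
    return {name: w / total for name, w in weights.items()}


def weighted_score(ratings: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted average of 1-5 ratings using normalized rubric weights."""
    norm = normalize_weights(weights)
    for name in norm:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"rating for {name!r} must be between 1 and 5")
    return sum(norm[name] * ratings[name] for name in norm)


def passes(score: float) -> bool:
    """A case passes when its weighted average meets the threshold."""
    return score >= PASS_THRESHOLD


# Hypothetical rubric: raw weights need not sum to 100.
weights = {"accuracy": 2, "helpfulness": 1, "safety": 1}
ratings = {"accuracy": 4, "helpfulness": 3, "safety": 5}
score = weighted_score(ratings, weights)
print(f"{score:.2f} pass={passes(score)}")
```

Normalizing inside `weighted_score` means callers can pass raw slider values without worrying about whether they sum to 100.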

Case Detail & Scoring
Aggregate Results
Regression view. Model A (baseline) vs Model B (candidate) on the same golden cases. Run this in CI to catch quality drops before deploy.
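
A regression check of this kind might look like the sketch below. It assumes each model already has one weighted score per golden case. The `TOLERANCE` value, the status labels, and all function names are assumptions made for illustration; the page does not specify how the demo decides a case's status.

```python
from dataclasses import dataclass
from typing import Dict, List

TOLERANCE = 0.25  # hypothetical: smaller score changes count as noise


@dataclass
class CaseDelta:
    case_id: str
    score_a: float  # baseline (Model A)
    score_b: float  # candidate (Model B)
    delta: float    # score_b - score_a
    status: str     # "improved", "unchanged", or "regressed"


def compare(scores_a: Dict[str, float], scores_b: Dict[str, float],
            tolerance: float = TOLERANCE) -> List[CaseDelta]:
    """Build one comparison row per golden case shared by both models."""
    rows = []
    for case_id in sorted(scores_a):
        a, b = scores_a[case_id], scores_b[case_id]
        delta = b - a
        if delta < -tolerance:
            status = "regressed"
        elif delta > tolerance:
            status = "improved"
        else:
            status = "unchanged"
        rows.append(CaseDelta(case_id, a, b, delta, status))
    return rows


def has_regression(rows: List[CaseDelta]) -> bool:
    """True if any case dropped by more than the tolerance."""
    return any(row.status == "regressed" for row in rows)


def mean_delta(rows: List[CaseDelta]) -> float:
    """Aggregate A/B view: average score change across all cases."""
    return sum(row.delta for row in rows) / len(rows)


# Made-up scores for three hypothetical cases.
baseline = {"case-1": 4.0, "case-2": 3.5, "case-3": 4.5}
candidate = {"case-1": 4.5, "case-2": 3.5, "case-3": 3.0}
rows = compare(baseline, candidate)
for row in rows:
    print(f"{row.case_id}  {row.score_a:.2f}  {row.score_b:.2f}  {row.delta:+.2f}  {row.status}")
print("regression found:", has_regression(rows))
```

In a CI job, a wrapper script could exit with a non-zero status when `has_regression(rows)` is true, which would fail the build before deploy.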
Per-Case Comparison
Case | Model A | Model B | Δ | Status
Aggregate A/B