Score a golden dataset with weighted rubrics, simulate LLM-as-judge, and compare model versions A/B — all offline, no API keys.
Educational demo. Scores are computed locally from your rubric weights and 1–5 ratings. "Simulate LLM judge" fills canned scores — not a real model call.
Golden Dataset (5 cases)
Rubric Weights
Weights auto-normalize to 100%. Pass threshold: weighted average ≥ 3.5 on the 1–5 scale.
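As a rough illustration of the scoring rule above, here is a minimal Python sketch of weight normalization and the pass check. The criterion names (accuracy, clarity, safety), the function names, and the sample ratings are hypothetical and are not taken from the demo. Only the 1–5 rating scale and the 3.5 threshold come from the text.

```python
PASS_THRESHOLD = 3.5  # from the text: weighted average >= 3.5 passes


def normalize_weights(weights):
    """Scale raw rubric weights so they sum to 1.0 (i.e. 100%)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one rubric weight must be positive")
    return {name: w / total for name, w in weights.items()}


def weighted_score(ratings, weights):
    """Weighted average of 1-5 ratings, using normalized weights."""
    norm = normalize_weights(weights)
    for name in norm:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"rating for {name!r} must be between 1 and 5")
    return sum(norm[name] * ratings[name] for name in norm)


def passes(ratings, weights, threshold=PASS_THRESHOLD):
    """True when the case meets the pass threshold."""
    return weighted_score(ratings, weights) >= threshold


# Hypothetical rubric: raw weights need not sum to 100.
weights = {"accuracy": 5, "clarity": 3, "safety": 2}
# Hypothetical ratings for one golden case.
ratings = {"accuracy": 4, "clarity": 3, "safety": 5}

score = weighted_score(ratings, weights)
print(f"score={score:.2f} pass={passes(ratings, weights)}")
```

With these made-up numbers the normalized weights are 0.5, 0.3, and 0.2, so the score is roughly 0.5·4 + 0.3·3 + 0.2·5 ≈ 3.9, which clears the threshold.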
Case Detail & Scoring
Aggregate Results
Regression view. Model A (baseline) vs Model B (candidate) on the same golden cases. Run this in CI to catch quality drops before deploy.
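One way such a CI gate could look is sketched below, assuming each model already has a weighted score per golden case. The function names, the case IDs, the scores, and the rule for what counts as a regression (any case that passed under A and fails under B, or a drop in the mean score) are assumptions made for illustration, not the demo's actual logic.

```python
from statistics import mean

PASS_THRESHOLD = 3.5  # same threshold the rubric panel states


def regression_report(baseline, candidate, threshold=PASS_THRESHOLD,
                      max_mean_drop=0.0):
    """Compare per-case scores for Model A (baseline) and Model B (candidate)."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both models must be scored on the same golden cases")
    # Cases that passed under the baseline but fail under the candidate.
    newly_failing = sorted(
        case for case in baseline
        if baseline[case] >= threshold and candidate[case] < threshold
    )
    mean_delta = mean(candidate.values()) - mean(baseline.values())
    return {
        "mean_delta": mean_delta,
        "newly_failing": newly_failing,
        "regressed": bool(newly_failing) or mean_delta < -max_mean_drop,
    }


def ci_exit_code(report):
    """Non-zero exit code when the candidate regressed, for use in a CI step."""
    return 1 if report["regressed"] else 0


# Made-up scores for five golden cases (illustrative only).
model_a = {"case-1": 4.2, "case-2": 3.8, "case-3": 3.6, "case-4": 4.5, "case-5": 3.1}
model_b = {"case-1": 4.4, "case-2": 3.2, "case-3": 3.9, "case-4": 4.5, "case-5": 3.3}

report = regression_report(model_a, model_b)
print(report["newly_failing"], ci_exit_code(report))
```

In this made-up data the candidate's mean score is slightly higher, yet case-2 drops below the threshold, so the gate still flags a regression. A CI script could pass the result of `ci_exit_code` to `sys.exit` to fail the build.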