LLM & Agent Evaluation

Score a golden dataset with weighted rubrics, simulate an LLM-as-judge, and compare model versions A/B, all offline with no API keys.

Educational demo. Scores are computed locally from your rubric weights and 1–5 ratings. "Simulate LLM judge" fills in canned scores; it does not call a real model.
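One way a canned judge like this could work is a plain lookup table, sketched below. The case IDs, criterion names, ratings, and the `simulate_judge` helper are hypothetical and are not taken from the demo's actual code.

```python
# Hypothetical canned ratings (1-5) keyed by case ID, then by rubric criterion.
CANNED_RATINGS = {
    "case-1": {"correctness": 4, "relevance": 5, "tone": 4},
    "case-2": {"correctness": 2, "relevance": 3, "tone": 4},
}

# Assumed fallback for any case or criterion with no canned value.
DEFAULT_RATING = 3


def simulate_judge(case_id, criteria):
    """Return one canned 1-5 rating per criterion; no model is called."""
    canned = CANNED_RATINGS.get(case_id, {})
    return {name: canned.get(name, DEFAULT_RATING) for name in criteria}


criteria = ["correctness", "relevance", "tone"]
known = simulate_judge("case-2", criteria)
unknown = simulate_judge("case-99", criteria)
```

Because the output depends only on the table, repeated runs give identical ratings. That makes the rest of the scoring pipeline easy to test, but says nothing about how a real judge model would rate the same cases.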

Golden Dataset (5 cases)

Rubric Weights

Weights are auto-normalized to sum to 100%. A case passes when its weighted average rating is ≥ 3.5.
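The normalization and pass check can be sketched as follows. The 3.5 threshold and the 1–5 rating scale come from the page. The criterion names, example values, and function names are illustrative assumptions.

```python
PASS_THRESHOLD = 3.5  # from the page: weighted average >= 3.5 passes


def normalize_weights(weights):
    """Scale raw weights so they sum to 1.0 (i.e. 100%)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def weighted_score(ratings, weights):
    """Weighted average of 1-5 ratings using normalized weights."""
    norm = normalize_weights(weights)
    for name in norm:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"rating for {name!r} must be between 1 and 5")
    return sum(norm[name] * ratings[name] for name in norm)


def passes(ratings, weights, threshold=PASS_THRESHOLD):
    """True when the weighted average meets the pass threshold."""
    return weighted_score(ratings, weights) >= threshold


# Hypothetical rubric: raw weights need not sum to 100.
weights = {"correctness": 5, "relevance": 3, "tone": 2}
ratings = {"correctness": 4, "relevance": 3, "tone": 5}

score = weighted_score(ratings, weights)  # 0.5*4 + 0.3*3 + 0.2*5, about 3.9
```

Since the weights are rescaled before use, entering 5/3/2 or 50/30/20 produces the same score.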

Case Detail & Scoring

Aggregate Results

Regression view. Model A (baseline) vs Model B (candidate) on the same golden cases. Run this in CI to catch quality drops before deploy.
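A minimal sketch of such a regression check is shown below. It assumes each model already has one weighted score per golden case. The tolerance value, status labels, case IDs, example scores, and function names are hypothetical; the demo's own rules may differ.

```python
from statistics import mean

# Assumed noise tolerance: score changes within this band count as unchanged.
TOLERANCE = 0.25


def compare_cases(scores_a, scores_b, tolerance=TOLERANCE):
    """Build one row per case with the delta (B minus A) and a status label."""
    rows = []
    for case_id in sorted(scores_a):
        a, b = scores_a[case_id], scores_b[case_id]
        delta = b - a
        if delta < -tolerance:
            status = "regressed"
        elif delta > tolerance:
            status = "improved"
        else:
            status = "unchanged"
        rows.append({"case": case_id, "a": a, "b": b, "delta": delta, "status": status})
    return rows


def ci_exit_code(rows):
    """Return 1 if any case regressed, else 0, for use as a CI exit status."""
    return 1 if any(r["status"] == "regressed" for r in rows) else 0


# Illustrative weighted scores only, not real results.
scores_a = {"case-1": 4.0, "case-2": 3.5, "case-3": 4.5}
scores_b = {"case-1": 4.5, "case-2": 3.0, "case-3": 4.5}

rows = compare_cases(scores_a, scores_b)
aggregate_delta = mean(scores_b.values()) - mean(scores_a.values())
exit_code = ci_exit_code(rows)
```

In this example the aggregate delta is zero because one improvement cancels one regression. That is why the sketch gates on per-case status rather than on the mean alone. A CI job could pass `exit_code` to `sys.exit` to fail the build.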

Per-Case Comparison

Case	Model A	Model B	Δ	Status

Aggregate A/B