Evaluation is the control system that prevents silent quality drift, hallucination creep, and erosion of user trust. Most assistant failures trace back to weak or ad‑hoc evaluation: unrepresentative test sets, missing retrieval metrics, or lack of regression gating. A robust program combines offline benchmarking, online telemetry, and structured human review, continuously feeding prioritization decisions (content gaps vs. retrieval tuning vs. prompt refinement). This pillar defines the assets, metrics, automation patterns, and governance needed to industrialize quality.
Why Evaluation Matters
Key risks without discipline:
- Silent Drift: Retrieval precision declines after ingestion changes.
- Hallucination Escalation: Subtle prompt edits reduce refusal rigor.
- Coverage Blind Spots: New product areas launch without test queries.
- Over‑optimization: Prompts tuned for demo examples rather than real usage.
Evaluation creates early detection loops and objective release gates.
Test Set Construction
Gold query set design:
| Dimension | Examples | Notes |
|---|---|---|
| Intent Type | how-to, config, troubleshooting, comparison | Balance across distribution |
| Difficulty | simple lookup vs multi‑step reasoning | Calibrate score variance |
| Freshness Sensitivity | pricing, release notes | Detect staleness quickly |
| Risk Category | legal, security, limits | Ensure refusal correctness |
| Locale | en, es, fr | Validate fallback behavior |
Sources: anonymized production queries (filtered & paraphrased), product docs, support tickets. Refresh quarterly; retire queries referencing deprecated features.
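A minimal sketch of what a gold-query record might look like when it captures these dimensions; the field names and the `expected_chunk_ids` / `expected_behavior` conventions are illustrative assumptions, not a prescribed schema.

```python
# Illustrative gold-query record; field names are assumptions chosen to
# mirror the dimensions in the table above.
from dataclasses import dataclass, field

@dataclass
class GoldQuery:
    query_id: str
    text: str
    intent_type: str                 # how-to, config, troubleshooting, comparison
    difficulty: str                  # "simple" or "multi-step"
    freshness_sensitive: bool        # pricing, release notes, etc.
    risk_category: str | None        # legal, security, limits, or None
    locale: str                      # en, es, fr
    expected_chunk_ids: list[str] = field(default_factory=list)  # gold evidence
    expected_behavior: str = "answer"  # "answer" or "refuse"
    references_deprecated_feature: bool = False  # flag for quarterly retirement
```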
Retrieval Metrics
Core metrics:
- Recall@k (evidence present in top k)
- Precision@k (relevant / k)
- MRR / nDCG (rank quality)
- Coverage (unique pages surfaced per N queries)
- Redundancy Rate (proportion of duplicate source chunks)
Collect per intent_type & locale to uncover skew.
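A minimal sketch of the per-query computations, assuming retrieval returns an ordered list of chunk ids and each gold query carries a set of relevant chunk ids (e.g., the hypothetical `expected_chunk_ids` field above); aggregate the per-query values grouped by intent_type and locale.

```python
# Per-query retrieval metrics; chunk-id lists/sets are assumed inputs.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant chunk; 0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```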
Generation Metrics
Primary:
| Metric | Definition | Collection Mode |
|---|---|---|
| Faithfulness | Share of claims supported by retrieved evidence (complement of the unsupported-claim ratio) | Human + model critique |
| Completeness | Required facts present | Checklist scoring |
| Helpfulness | Overall utility 1–5 | Human rating |
| Citation Accuracy | Citations map to supporting chunks | Automated + spot check |
| Refusal Appropriateness | Correct refusals / total refusals | Manual sample |
| Latency P95 | 95th-percentile end-to-end response time | Telemetry |
Automated LLM grading can triage at scale, but maintain a human-scored calibration baseline.
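As one example of the automated portion, a hedged sketch of a citation-accuracy check; the `[chunk:<id>]` citation markup is an assumption about how answers reference their sources, not an established convention.

```python
import re

# Assumed citation markup: answers cite evidence as [chunk:<id>].
CITATION_RE = re.compile(r"\[chunk:([\w-]+)\]")

def citation_accuracy(answer: str, retrieved_chunk_ids: set[str]) -> float:
    """Fraction of citations that point at a chunk actually retrieved for this query."""
    cited = CITATION_RE.findall(answer)
    if not cited:
        return 0.0  # uncited answers get routed to human spot check
    supported = sum(1 for chunk_id in cited if chunk_id in retrieved_chunk_ids)
    return supported / len(cited)
```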
Tooling & Automation
Pipeline:
- Nightly Batch: Run gold queries through retrieval + generation; persist trace.
- Scoring Job: Compute metrics; compare against control thresholds.
- CI Gate: Block merges touching retrieval or prompt code if the metric delta exceeds the allowed band (see the gate sketch below).
- Diff Dashboard: Highlight queries with largest negative movement.
- Alerting: Slack/Webhook if faithfulness drops >3pp or Recall@10 drops >5pp.
Store raw traces (query, chunk_ids, scores, answer, model_version) for audit.
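A sketch of the CI gate step, assuming the nightly batch and scoring job persist aggregate metrics as JSON; the file layout, key names, and the reuse of the 3pp/5pp alerting bands as merge-blocking bands are assumptions.

```python
# CI gate sketch: fail the build if candidate metrics drop below the baseline
# by more than the allowed band. Paths and metric keys are placeholders.
import json
import sys

ALLOWED_DROP = {"faithfulness": 0.03, "recall_at_10": 0.05}  # absolute bands (pp)

def gate(baseline_path: str, candidate_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    failures = []
    for metric, band in ALLOWED_DROP.items():
        delta = candidate[metric] - baseline[metric]
        if delta < -band:
            failures.append(f"{metric}: {delta:+.3f} exceeds allowed drop of {band}")
    for failure in failures:
        print(f"GATE FAIL {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```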
Human Review Loop
Draw a weekly sample stratified by: low citation count, high latency, refusals, and newly added intent types. Dual-review 10% of items to measure inter‑rater agreement (target Cohen’s kappa >0.75). Provide a rubric card with definitions, borderline examples, and refusal templates.
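A dependency-free sketch for tracking the agreement target, assuming both reviewers label the same items with categorical verdicts (e.g., pass/fail).

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Two-rater Cohen's kappa over paired categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```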
Drift & Decay Monitoring
Drift detectors:
- Freshness Gap: % of gold queries whose top evidence chunk age > freshness SLA.
- Embedding Version Mix: Cross‑version chunk ratio (should trend down after migrations).
- Retrieval Score Shift: Mean top score delta week‑over‑week.
- Query Distribution Shift: New intent clusters lacking gold coverage.
Escalate if any two drift indicators breach thresholds simultaneously.
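A hedged sketch of that escalation rule; the indicator names and threshold values are illustrative placeholders, not recommended settings.

```python
# Escalate when two or more drift indicators breach their thresholds at once.
# Indicator names and thresholds below are placeholders.
DRIFT_THRESHOLDS = {
    "freshness_gap_pct": 0.10,          # share of gold queries past freshness SLA
    "cross_version_chunk_ratio": 0.05,  # chunks still on the old embedding version
    "uncovered_intent_cluster_pct": 0.05,
    "mean_top_score_delta": -0.02,      # week-over-week drop in mean top score
}

def should_escalate(indicators: dict[str, float]) -> bool:
    def breached(name: str, threshold: float) -> bool:
        value = indicators[name]
        # Negative thresholds are "drop" limits; positive ones are ceilings.
        return value <= threshold if threshold < 0 else value >= threshold
    breaches = [name for name, t in DRIFT_THRESHOLDS.items() if breached(name, t)]
    return len(breaches) >= 2
```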
Governance & Reporting
Artifacts:
- Monthly Quality Report (metrics vs targets, regression incidents, remediation actions)
- Change Log (model_version, prompt_version, embedding_version, major ingestion updates)
- Risk Register (open issues: hallucination sources, stale coverage)
- Audit Bundle (sample traces, scoring guidelines, reviewer training docs)
Map report KPIs to internal SLAs (e.g., Faithfulness Error <5%).
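A small sketch of that mapping as checkable configuration; apart from the <5% faithfulness-error example above, the KPI names and targets are placeholders.

```python
# KPI -> SLA mapping; "max" means the value must stay at or below the target,
# "min" means at or above. All numbers except faithfulness_error_rate are placeholders.
SLA_TARGETS = {
    "faithfulness_error_rate": {"target": 0.05, "direction": "max"},
    "recall_at_10":            {"target": 0.85, "direction": "min"},
    "citation_accuracy":       {"target": 0.95, "direction": "min"},
    "latency_p95_seconds":     {"target": 3.0,  "direction": "max"},
}

def meets_sla(kpi: str, value: float) -> bool:
    spec = SLA_TARGETS[kpi]
    return value <= spec["target"] if spec["direction"] == "max" else value >= spec["target"]
```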
Optimization Cycle
Lifecycle:
- Identify: Metric anomaly or qualitative feedback.
- Hypothesize: Retrieval tuning vs content gap vs prompt change.
- Experiment: Branch config; run A/B on gold + shadow traffic.
- Validate: Require a statistically significant improvement (e.g., a >3pp Recall@10 gain with no faithfulness regression; see the bootstrap sketch after this list).
- Roll Out: Increment version, update changelog.
- Post‑Monitor: 72h heightened watch for regressions.
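For the Validate step, a paired-bootstrap sketch, assuming per-query Recall@10 values are available for both the control and candidate configurations; the 3pp minimum gain mirrors the example above.

```python
import random

def gain_exceeds_threshold(control: list[float], candidate: list[float],
                           min_gain: float = 0.03, iters: int = 10_000,
                           alpha: float = 0.05) -> bool:
    """Paired bootstrap: does the mean per-query gain reliably exceed min_gain?"""
    deltas = [c - b for b, c in zip(control, candidate)]
    below = 0
    for _ in range(iters):
        resample = random.choices(deltas, k=len(deltas))  # sample with replacement
        if sum(resample) / len(resample) <= min_gain:
            below += 1
    # Accept only if fewer than alpha of resampled means fall at or below the threshold.
    return below / iters < alpha
```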
Key Takeaways
- Treat evaluation assets (gold sets, rubrics) as production code.
- Retrieval + generation metrics must be separated to localize regressions.
- Automated nightly runs catch silent drift before users do.
- Human review remains essential for faithfulness & nuance.
- Governance turns raw metrics into organizational trust.