Evaluation is the control system that prevents silent quality drift, hallucination creep, and erosion of user trust. Most assistant failures trace back to weak or ad‑hoc evaluation: unrepresentative test sets, missing retrieval metrics, or lack of regression gating. A robust program combines offline benchmarking, online telemetry, and structured human review, continuously feeding prioritization decisions (content gaps vs. retrieval tuning vs. prompt refinement). This pillar defines the assets, metrics, automation patterns, and governance needed to industrialize quality.
Why Evaluation Matters
Key risks without discipline:
- Silent Drift: Retrieval precision declines after ingestion changes.
- Hallucination Escalation: Subtle prompt edits reduce refusal rigor.
- Coverage Blind Spots: New product areas launch without test queries.
- Over‑optimization: Prompts tuned for demo examples rather than real usage.
Evaluation creates early detection loops and objective release gates.
Test Set Construction
Gold query set design:
| Dimension | Examples | Notes |
|---|---|---|
| Intent Type | how-to, config, troubleshooting, comparison | Balance across distribution |
| Difficulty | simple lookup vs multi‑step reasoning | Calibrate score variance |
| Freshness Sensitivity | pricing, release notes | Detect staleness quickly |
| Risk Category | legal, security, limits | Ensure refusal correctness |
| Locale | en, es, fr | Validate fallback behavior |
Sources: anonymized production queries (filtered & paraphrased), product docs, support tickets. Refresh quarterly; retire queries referencing deprecated features.
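A minimal sketch of what a gold-query record might look like when it captures these dimensions; the field names and the `expected_chunk_ids` / `expected_behavior` conventions are illustrative assumptions, not a prescribed schema.

```python
# Illustrative gold-query record; field names are assumptions chosen to
# mirror the dimensions in the table above.
from dataclasses import dataclass, field

@dataclass
class GoldQuery:
    query_id: str
    text: str
    intent_type: str                 # how-to, config, troubleshooting, comparison
    difficulty: str                  # "simple" or "multi-step"
    freshness_sensitive: bool        # pricing, release notes, etc.
    risk_category: str | None        # legal, security, limits, or None
    locale: str                      # en, es, fr
    expected_chunk_ids: list[str] = field(default_factory=list)  # gold evidence
    expected_behavior: str = "answer"  # "answer" or "refuse"
    references_deprecated_feature: bool = False  # flag for quarterly retirement
```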
Retrieval Metrics
Core metrics:
- Recall@k (evidence present in top k)
- Precision@k (relevant / k)
- MRR / nDCG (rank quality)
- Coverage (unique pages surfaced per N queries)
- Redundancy Rate (proportion of duplicate source chunks)
Collect per intent_type & locale to uncover skew.
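A minimal sketch of the per-query computations, assuming retrieval returns an ordered list of chunk ids and each gold query carries a set of relevant chunk ids (e.g., the hypothetical `expected_chunk_ids` field above); aggregate the per-query values grouped by intent_type and locale.

```python
# Per-query retrieval metrics; chunk-id lists/sets are assumed inputs.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant chunk; 0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```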
Generation Metrics
Primary:
| Metric | Definition | Collection Mode |
|---|---|---|
| Faithfulness | Share of claims supported by retrieved evidence (complement of the unsupported-claim ratio) | Human + model critique |
| Completeness | Required facts present | Checklist scoring |
| Helpfulness | Overall utility 1–5 | Human rating |
| Citation Accuracy | Citations map to supporting chunks | Automated + spot check |
| Refusal Appropriateness | Correct refusals / total refusals | Manual sample |
| Latency P95 | 95th-percentile end-to-end response time | Telemetry |
Automated LLM grading can triage at scale, but maintain a human-scored calibration baseline.
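As one example of the automated portion, a hedged sketch of a citation-accuracy check; the `[chunk:<id>]` citation markup is an assumption about how answers reference their sources, not an established convention.

```python
import re

# Assumed citation markup: answers cite evidence as [chunk:<id>].
CITATION_RE = re.compile(r"\[chunk:([\w-]+)\]")

def citation_accuracy(answer: str, retrieved_chunk_ids: set[str]) -> float:
    """Fraction of citations that point at a chunk actually retrieved for this query."""
    cited = CITATION_RE.findall(answer)
    if not cited:
        return 0.0  # uncited answers get routed to human spot check
    supported = sum(1 for chunk_id in cited if chunk_id in retrieved_chunk_ids)
    return supported / len(cited)
```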
Tooling & Automation
Pipeline:
- Nightly Batch: Run gold queries through retrieval + generation; persist trace.
- Scoring Job: Compute metrics; compare against control thresholds.
- CI Gate: Block merges touching retrieval or prompt code if the metric delta exceeds the allowed band (see the gate sketch below).
- Diff Dashboard: Highlight queries with largest negative movement.
- Alerting: Slack/Webhook if faithfulness drops >3pp or Recall@10 drops >5pp.
Store raw traces (query, chunk_ids, scores, answer, model_version) for audit.
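A sketch of the CI gate step, assuming the nightly batch and scoring job persist aggregate metrics as JSON; the file layout, key names, and the reuse of the 3pp/5pp alerting bands as merge-blocking bands are assumptions.

```python
# CI gate sketch: fail the build if candidate metrics drop below the baseline
# by more than the allowed band. Paths and metric keys are placeholders.
import json
import sys

ALLOWED_DROP = {"faithfulness": 0.03, "recall_at_10": 0.05}  # absolute bands (pp)

def gate(baseline_path: str, candidate_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    failures = []
    for metric, band in ALLOWED_DROP.items():
        delta = candidate[metric] - baseline[metric]
        if delta < -band:
            failures.append(f"{metric}: {delta:+.3f} exceeds allowed drop of {band}")
    for failure in failures:
        print(f"GATE FAIL {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```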
Human Review Loop
Draw a weekly sample stratified by: low citation count, high latency, refusals, and newly added intent types. Dual-review 10% of items to measure inter‑rater agreement (target Cohen’s kappa >0.75). Provide a rubric card with definitions, borderline examples, and refusal templates.
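A dependency-free sketch for tracking the agreement target, assuming both reviewers label the same items with categorical verdicts (e.g., pass/fail).

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Two-rater Cohen's kappa over paired categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```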
Drift & Decay Monitoring
Drift detectors:
- Freshness Gap: % of gold queries whose top evidence chunk age > freshness SLA.
- Embedding Version Mix: Cross‑version chunk ratio (should trend down after migrations).
- Retrieval Score Shift: Mean top score delta week‑over‑week.
- Query Distribution Shift: New intent clusters lacking gold coverage.
Escalate if any two drift indicators breach thresholds simultaneously.
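A hedged sketch of that escalation rule; the indicator names and threshold values are illustrative placeholders, not recommended settings.

```python
# Escalate when two or more drift indicators breach their thresholds at once.
# Indicator names and thresholds below are placeholders.
DRIFT_THRESHOLDS = {
    "freshness_gap_pct": 0.10,          # share of gold queries past freshness SLA
    "cross_version_chunk_ratio": 0.05,  # chunks still on the old embedding version
    "uncovered_intent_cluster_pct": 0.05,
    "mean_top_score_delta": -0.02,      # week-over-week drop in mean top score
}

def should_escalate(indicators: dict[str, float]) -> bool:
    def breached(name: str, threshold: float) -> bool:
        value = indicators[name]
        # Negative thresholds are "drop" limits; positive ones are ceilings.
        return value <= threshold if threshold < 0 else value >= threshold
    breaches = [name for name, t in DRIFT_THRESHOLDS.items() if breached(name, t)]
    return len(breaches) >= 2
```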
Governance & Reporting
Artifacts:
- Monthly Quality Report (metrics vs targets, regression incidents, remediation actions)
- Change Log (model_version, prompt_version, embedding_version, major ingestion updates)
- Risk Register (open issues: hallucination sources, stale coverage)
- Audit Bundle (sample traces, scoring guidelines, reviewer training docs)
Map report KPIs to internal SLAs (e.g., Faithfulness Error <5%).
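A small sketch of that mapping as checkable configuration; apart from the <5% faithfulness-error example above, the KPI names and targets are placeholders.

```python
# KPI -> SLA mapping; "max" means the value must stay at or below the target,
# "min" means at or above. All numbers except faithfulness_error_rate are placeholders.
SLA_TARGETS = {
    "faithfulness_error_rate": {"target": 0.05, "direction": "max"},
    "recall_at_10":            {"target": 0.85, "direction": "min"},
    "citation_accuracy":       {"target": 0.95, "direction": "min"},
    "latency_p95_seconds":     {"target": 3.0,  "direction": "max"},
}

def meets_sla(kpi: str, value: float) -> bool:
    spec = SLA_TARGETS[kpi]
    return value <= spec["target"] if spec["direction"] == "max" else value >= spec["target"]
```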
Optimization Cycle
Lifecycle:
- Identify: Metric anomaly or qualitative feedback.
- Hypothesize: Retrieval tuning vs content gap vs prompt change.
- Experiment: Branch config; run A/B on gold + shadow traffic.
- Validate: Require a statistically significant improvement (e.g., a >3pp Recall@10 gain with no faithfulness regression; see the bootstrap sketch after this list).
- Roll Out: Increment version, update changelog.
- Post‑Monitor: 72h heightened watch for regressions.
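For the Validate step, a paired-bootstrap sketch, assuming per-query Recall@10 values are available for both the control and candidate configurations; the 3pp minimum gain mirrors the example above.

```python
import random

def gain_exceeds_threshold(control: list[float], candidate: list[float],
                           min_gain: float = 0.03, iters: int = 10_000,
                           alpha: float = 0.05) -> bool:
    """Paired bootstrap: does the mean per-query gain reliably exceed min_gain?"""
    deltas = [c - b for b, c in zip(control, candidate)]
    below = 0
    for _ in range(iters):
        resample = random.choices(deltas, k=len(deltas))  # sample with replacement
        if sum(resample) / len(resample) <= min_gain:
            below += 1
    # Accept only if fewer than alpha of resampled means fall at or below the threshold.
    return below / iters < alpha
```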
Key Takeaways
- Treat evaluation assets (gold sets, rubrics) as production code.
- Retrieval + generation metrics must be separated to localize regressions.
- Automated nightly runs catch silent drift before users do.
- Human review remains essential for faithfulness & nuance.
- Governance turns raw metrics into organizational trust.