Metrics translate raw interaction traces into actionable quality and impact signals. Without a layered framework, teams chase vanity stats (total sessions) while missing silent regressions: faithfulness drift, containment collapse in a single segment, rising fallback loops. This article defines a balanced scorecard spanning retrieval performance, generation quality, support outcomes, business impact, and diagnostics, plus instrumentation design and target setting.
Metric Framework Overview
Five layers:
- Retrieval Performance (Can we surface the right evidence?)
- Generation Quality (Do answers reflect evidence & user intent?)
- Support Outcomes (Are users self‑resolving?)
- Business Impact (Does this drive activation / retention?)
- Diagnostics (Why did failures occur?)
Each higher layer depends on the stability of the layers beneath it.
Retrieval Metrics
| Metric | Definition | Goal (Phase 1) | Notes |
|---|---|---|---|
| Recall@5 | % of queries with at least one gold evidence chunk in the top 5 | >70% | Requires a labeled gold set |
| Precision@5 | Relevant chunks in top 5 / 5 | >65% | Noise control |
| Coverage | Unique pages referenced / total prioritized pages | >85% | Gap detection |
| Redundancy Rate | % of retrieved chunks that duplicate another chunk | <30% | Tune chunk overlap |
| Freshness Age P50 | Median days since chunk update | <14 days | Content ops signal |
| Retrieval Latency P95 | Retrieval-stage latency (ms) | <350 ms | UX budget |
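As an illustration, here is a minimal sketch of how Recall@5 and Precision@5 could be computed against a labeled gold set. The record shape (`gold_chunk_ids`, `retrieved_chunk_ids`) is an assumption for the example, not a prescribed schema.

```python
# Minimal sketch: Recall@5 / Precision@5 over a labeled gold set.
# Record shape (gold_chunk_ids, retrieved_chunk_ids) is an illustrative assumption.

def recall_at_k(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """1.0 if any gold evidence chunk appears in the top-k results, else 0.0."""
    return 1.0 if gold & set(retrieved[:k]) else 0.0

def precision_at_k(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are gold-relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in gold) / len(top_k) if top_k else 0.0

def evaluate_retrieval(gold_set: list[dict], k: int = 5) -> dict[str, float]:
    """Aggregate hit rate (Recall@k) and mean Precision@k across the gold set."""
    recalls = [recall_at_k(set(q["gold_chunk_ids"]), q["retrieved_chunk_ids"], k) for q in gold_set]
    precisions = [precision_at_k(set(q["gold_chunk_ids"]), q["retrieved_chunk_ids"], k) for q in gold_set]
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        f"precision@{k}": sum(precisions) / len(precisions),
    }
```

Rerun this against the same gold set whenever chunking, embeddings, or ranking change, so retrieval regressions are caught before they surface as generation failures.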
Generation Metrics
| Metric | Definition | Collection |
|---|---|---|
| Faithfulness Error Rate | Unsupported claim % | Human + model critique |
| Completeness Score | Required facts present (checklist) | Human review |
| Helpfulness | 1–5 rating | User / internal rater |
| Citation Accuracy | Correct citations / total citations | Automated check + human sample |
| Refusal Appropriateness | Proper refusals / total refusals | Human sample |
| First Token Latency P95 | Time to first streamed token | Telemetry |
| Full Answer Latency P95 | End-to-end response time | Telemetry |
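Citation accuracy is one of the few generation metrics that can be partially automated. Below is a hedged sketch under the assumption that a citation counts as correct only when the cited chunk was part of the retrieved context for that answer; the `Answer` shape is illustrative.

```python
# Sketch: automated citation-accuracy check. A citation is treated as correct
# only if the cited chunk id appears in the context actually passed to the model.
# The Answer shape below is an illustrative assumption, not a fixed schema.
from dataclasses import dataclass

@dataclass
class Answer:
    answer_id: str
    cited_chunk_ids: list[str]
    context_chunk_ids: list[str]

def citation_accuracy(answers: list[Answer]) -> float:
    """Correct citations / total citations across a batch of answers."""
    total = correct = 0
    for a in answers:
        context = set(a.context_chunk_ids)
        total += len(a.cited_chunk_ids)
        correct += sum(1 for c in a.cited_chunk_ids if c in context)
    return correct / total if total else 1.0
```

This only verifies that citations point at real retrieved context; a periodic human sample is still needed to confirm the cited chunk actually supports the claim.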
Support Outcomes
| Metric | Definition | Insight |
|---|---|---|
| Containment Rate | % of sessions resolved without escalation | Deflection strength |
| Assisted Resolution Time | Resolution time with an AI draft vs fully manual handling | Efficiency delta |
| Escalation Rate | Escalated sessions / total | Complexity mix |
| CSAT Delta | Post‑resolution CSAT vs baseline | Experience impact |
| Multi‑Turn Depth | Avg turns per resolved session | Engagement & complexity |
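A small sketch of how these outcome rates might be derived from session-level records; the field names (`resolved`, `escalated`, `turns`) are assumptions about how sessions are labeled.

```python
# Sketch: support-outcome rates from session-level records.
# Field names (resolved, escalated, turns) are illustrative assumptions.
def support_outcomes(sessions: list[dict]) -> dict[str, float]:
    total = len(sessions)
    if total == 0:
        return {}
    contained = sum(1 for s in sessions if s["resolved"] and not s["escalated"])
    escalated = sum(1 for s in sessions if s["escalated"])
    resolved = [s for s in sessions if s["resolved"]]
    return {
        "containment_rate": contained / total,
        "escalation_rate": escalated / total,
        "multi_turn_depth": sum(s["turns"] for s in resolved) / len(resolved) if resolved else 0.0,
    }
```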
Business Impact
| Metric | Definition | Example Use |
|---|---|---|
| Activation Assist | % new users resolving onboarding blockers | Onboarding success |
| Conversion Influence | Assistant sessions occurring shortly before a plan upgrade | Attribution indicator |
| Retention Correlation | Churn rate difference between assistant-using and non-using cohorts | Renewal predictor |
| Support Cost per Resolved | Total support spend / resolved sessions | Efficiency trend |
| Net Savings | Modeled cost reduction (see CS automation article) | ROI justification |
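Support Cost per Resolved is simple arithmetic, but it is worth pinning down so everyone divides by the same denominator. The figures below are hypothetical placeholders, not benchmarks.

```python
# Sketch: support cost per resolved session. Numbers are hypothetical placeholders.
monthly_support_spend = 42_000.00   # total support spend for the period (USD)
resolved_sessions = 3_500           # assistant- and agent-resolved sessions combined

cost_per_resolved = monthly_support_spend / resolved_sessions
print(f"Support cost per resolved session: ${cost_per_resolved:.2f}")  # $12.00
```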
Diagnostic Metrics
| Metric | Definition | Failure Signal |
|---|---|---|
| Fallback Rate | % responses using generic fallback template | Retrieval gap |
| Low Citation Count Rate | % of answers with fewer than 2 citations | Context insufficiency |
| Refusal Rate | % queries refused | Over‑strict guardrail (if high) |
| Guardrail Trigger Types | Distribution (PII, injection, policy) | Policy tuning |
| Prompt Version Drift | Sessions by prompt version | Rollout integrity |
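These rates fall out of the same answer-level telemetry described in the next section. A minimal sketch, assuming per-answer flags for fallback use, refusal, and citation count:

```python
# Sketch: diagnostic rates from answer-level telemetry.
# Flags (fallback_used, refusal_flag, citation_count) are illustrative assumptions.
def diagnostic_rates(answers: list[dict]) -> dict[str, float]:
    n = len(answers)
    if n == 0:
        return {}
    return {
        "fallback_rate": sum(1 for a in answers if a["fallback_used"]) / n,
        "low_citation_rate": sum(1 for a in answers if a["citation_count"] < 2) / n,
        "refusal_rate": sum(1 for a in answers if a["refusal_flag"]) / n,
    }
```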
Instrumentation Stack
Event schema suggestions:
- query_issued { query_id, session_id, user_tier, locale, tokens }
- retrieval_completed { query_id, candidates:[{chunk_id, score, source}], latency_ms }
- answer_generated { query_id, answer_id, model_version, prompt_version, token_count, latency_ms, citation_count, refusal_flag }
- feedback_submitted { answer_id, rating, reason_codes[] }
- escalation_created { session_id, reason, time_from_first_query_ms }
All events share correlation_id for trace joining.
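One way to keep these payloads honest is to type them at the edge. A sketch using `TypedDict`; the field names mirror the schema above, and the shared correlation_id is what makes trace joining possible.

```python
# Sketch: typed payloads mirroring the event schema above. TypedDict keeps the
# shapes explicit without dictating storage; every event carries correlation_id.
from typing import TypedDict

class RetrievalCandidate(TypedDict):
    chunk_id: str
    score: float
    source: str

class QueryIssued(TypedDict):
    correlation_id: str
    query_id: str
    session_id: str
    user_tier: str
    locale: str
    tokens: int

class RetrievalCompleted(TypedDict):
    correlation_id: str
    query_id: str
    candidates: list[RetrievalCandidate]
    latency_ms: int

class AnswerGenerated(TypedDict):
    correlation_id: str
    query_id: str
    answer_id: str
    model_version: str
    prompt_version: str
    token_count: int
    latency_ms: int
    citation_count: int
    refusal_flag: bool

# feedback_submitted and escalation_created follow the same pattern.
```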
Benchmarking & Targets
Set phase gates:
- Launch Gate: Faithfulness Error <10%, Containment >35%.
- Scale Gate: Faithfulness Error <7%, Containment >50%, P95 Full Latency <2.5s.
- Optimization Gate: Faithfulness Error <5%, Containment >60%, Precision@5 >70%.
Track variance by segment (locale, tier, intent cluster) to surface hidden regressions.
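Gate checks can be automated so releases are blocked mechanically rather than by judgment calls. The sketch below encodes the Scale Gate thresholds from the list above; the metric key names are assumptions.

```python
# Sketch: evaluating current metrics against a phase gate. Thresholds mirror the
# Scale Gate above; metric key names are illustrative assumptions.
SCALE_GATE = {
    "faithfulness_error_rate": ("max", 0.07),
    "containment_rate": ("min", 0.50),
    "full_answer_latency_p95_s": ("max", 2.5),
}

def gate_report(metrics: dict[str, float], gate: dict) -> tuple[bool, list[str]]:
    """Return (passed, breaches) for a set of aggregate metrics."""
    breaches = []
    for name, (direction, threshold) in gate.items():
        value = metrics[name]
        if direction == "max" and value > threshold:
            breaches.append(f"{name}={value:.3f} exceeds {threshold}")
        if direction == "min" and value < threshold:
            breaches.append(f"{name}={value:.3f} below {threshold}")
    return (not breaches, breaches)
```

Running the same report per segment (locale, tier, intent cluster) is what surfaces the hidden regressions that aggregate numbers mask.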
Continuous Improvement Loop
Loop:
- Detect anomaly (metric breach or downward trend)
- Root cause classify: retrieval, content gap, prompt, model, guardrail
- Form hypothesis & proposed change
- Run controlled experiment / offline benchmark
- Deploy behind flag; monitor leading indicators
- Promote or roll back; update the changelog
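The detect step can be partially automated. A minimal sketch for a higher-is-better metric such as daily containment rate, flagging either a hard floor breach or a sustained dip versus a trailing baseline; the window sizes and 5% tolerance are assumptions to tune.

```python
# Sketch: anomaly detection for the "detect" step, for a higher-is-better metric
# (e.g. daily containment rate). Window sizes and tolerance are illustrative.
def detect_anomalies(daily_values: list[float], floor: float, trend_tolerance: float = 0.05) -> list[str]:
    if not daily_values:
        return []
    findings = []
    current = daily_values[-1]
    if current < floor:
        findings.append(f"breach: {current:.3f} below floor {floor:.3f}")
    # Compare the last 7 days with everything before as a trailing baseline.
    recent_window = daily_values[-7:]
    baseline_window = daily_values[:-7]
    if baseline_window:
        baseline = sum(baseline_window) / len(baseline_window)
        recent = sum(recent_window) / len(recent_window)
        if baseline > 0 and (baseline - recent) / baseline > trend_tolerance:
            findings.append(f"downward trend: {recent:.3f} vs baseline {baseline:.3f}")
    return findings
```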
Key Takeaways
- Layer metrics—don’t conflate retrieval and generation.
- Containment without faithfulness is hollow; faithfulness without containment lacks ROI.
- Diagnostic events enable targeted remediation over guesswork.
- Segment analysis reveals regressions masked in aggregate.
- Treat target gates as quality contracts, not aspirations.