Why Vibe Checks Fail
Manual review of LLM outputs has three fatal properties at scale:
- Inconsistency: Human raters agree on obvious failures and disagree on everything in the middle. Inter-rater reliability (Cohen's κ) on LLM output quality is often below 0.5, which is moderate agreement at best (κ = 0 means agreement no better than chance, κ = 1 means perfect agreement).
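- Cost: At 10,000 daily requests, reviewing just 1% of outputs means 100 careful reviews per day. Assuming around 5 minutes each, that is roughly 8 person-hours daily, and 99% of traffic still goes unexamined.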
- Latency: You cannot run human review in CI before shipping a prompt change. Regressions reach production before anyone notices.
Systematic evaluation replaces subjective impression with measurable signals. The goal isn't a single number — it's a battery of metrics that together surface different failure modes.
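To make the inconsistency point concrete, here is a minimal sketch of Cohen's κ for two raters, written from the standard definition (observed agreement corrected for the agreement expected by chance). The rater labels are made-up illustration data, not measurements from a real study.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: two raters judging the same ten outputs.
rater_a = ["pass"] * 5 + ["fail"] * 5
rater_b = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # raw agreement is 60%, kappa is 0.20
```

The two raters agree on 6 of 10 items, which sounds acceptable until the chance correction is applied. That gap is why raw agreement percentages overstate how reliable manual review is.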
The Evaluation Stack
Layer 1: Structural / Format Checks (Free, Instant)
The fastest evals check that outputs are structurally valid before touching semantics:
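```python
import json


def eval_format(output: str, expected_schema: dict) -> dict:
    """Cheap structural checks: valid JSON, required keys present, no refusal."""
    results = {"valid_json": False, "schema_match": False, "no_refusal": True}
    try:
        parsed = json.loads(output)
        results["valid_json"] = True
        # Only a JSON object can satisfy a key-based schema; a bare list,
        # string, or number parses fine but must not count as a match.
        results["schema_match"] = isinstance(parsed, dict) and all(
            k in parsed for k in expected_schema
        )
    except json.JSONDecodeError:
        pass  # leave valid_json and schema_match as False

    # Case-sensitive substring match; extend the list for your model's refusals.
    refusal_phrases = ["I cannot", "I'm unable to", "As an AI"]
    results["no_refusal"] = not any(p in output for p in refusal_phrases)
    return results
```
Structural checks should run on 100% of production outputs and feed a real-time alert: a spike in invalid JSON means your prompt broke, likely due to a model update or context overflow.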
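One way to turn per-output checks into a real-time alert is a rolling window over recent results. The sketch below is an assumption about how you might wire it, not a prescribed design: `FormatErrorMonitor`, its window size, threshold, and minimum sample count are all hypothetical names and values to tune for your traffic.

```python
from collections import deque


class FormatErrorMonitor:
    """Tracks the invalid-output rate over the last `window` responses."""

    def __init__(self, window: int = 1000, threshold: float = 0.01, min_samples: int = 100):
        self.results = deque(maxlen=window)  # True means the output failed the check
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, is_valid: bool) -> bool:
        """Record one output; return True if the alert should fire."""
        self.results.append(not is_valid)
        if len(self.results) < self.min_samples:
            return False  # too few samples for a stable rate
        return self.error_rate() > self.threshold

    def error_rate(self) -> float:
        """Share of failed outputs in the current window."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

In use, you would call `monitor.record(result["valid_json"])` after each structural check and page someone when it returns `True`. The minimum sample count keeps a single bad response right after a deploy from triggering a false alarm.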
Layer 2: Reference-Based Metrics
For tasks with a known correct answer, compare model output against a reference:
| Metric | What It Measures | Good For | Limitation |
|---|---|---|---|
| Exact Match (EM) | String equality | Classification, extraction | Zero credit for near-correct |
| F1 token overlap | Token-level precision/recall | Extractive QA (SQuAD) | Ignores paraphrase |
| ROUGE-L | Longest common subsequence | Summarization | Ignores meaning, rewards surface word overlap |
| BERTScore | Contextual embedding similarity | Generation quality | Compute cost, correlation with human varies by task |
| BLEU | n-gram precision | Translation (legacy) | Punishes valid paraphrase harshly |
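The two cheapest rows of the table, Exact Match and token-level F1, fit in a few lines. This is a sketch modeled on the normalization commonly used for SQuAD-style scoring (lowercase, strip punctuation and English articles); the function names are my own, and your task may need different normalization.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The cat sat on the mat.", "A cat sat on a rug."))        # 0.75
print(token_f1("A feline rested on a rug.", "The cat sat on the mat."))  # 0.25
```

The second call shows the limitation listed in the table: a valid paraphrase shares only the word "on" with the reference, so token overlap scores it poorly. The embedding-based BERTScore example that follows treats the same paraphrase pair much more generously.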
```python
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A feline rested on a rug."]
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean():.3f}")  # → 0.891 despite different wording
```
Layer 3: LLM-as-Judge
For open-ended generation (summarization, explanation, code review), there is no single reference answer. LLM-as-judge uses a strong model to evaluate outputs on dimensions like correctness, helpfulness, and faithfulness.
```python
import json

EVAL_PROMPT = """You are evaluating an AI assistant's answer to a question.

Question: {question}
Reference context: {context}
Assistant's answer: {answer}

Rate the answer on three dimensions (1-5 each):
1. Faithfulness: Is every claim in the answer supported by the context? (5 = fully grounded)
2. Completeness: Does the answer address the question fully? (5 = comprehensive)
3. Conciseness: Is the answer appropriately brief without being incomplete? (5 = ideal length)

Respond in JSON: {{"faithfulness": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""

# Call a strong judge model (separate from the model being evaluated).
# call_llm, question, retrieved_chunks, and model_output are placeholders
# for your own client wrapper and pipeline variables.
judge_response = call_llm(EVAL_PROMPT.format(
    question=question,
    context=retrieved_chunks,
    answer=model_output,
))
# Raises json.JSONDecodeError if the judge strays from pure JSON.
scores = json.loads(judge_response)
```
Judge models exhibit measurable biases: preference for longer answers (verbosity bias), preference for their own outputs when used to self-evaluate (self-enhancement bias), and sensitivity to answer position when comparing two options (position bias). Mitigate by calibrating judge scores against human labels, using a judge model different from the one being evaluated, and averaging across multiple judge calls with shuffled answer order.
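For pairwise comparisons, the simplest position-bias mitigation is to ask twice with the answers swapped and only trust a verdict that survives the swap. The sketch below assumes a hypothetical judge callable that takes `(question, first_answer, second_answer)` and returns `"first"` or `"second"`; adapt the signature to whatever your judge wrapper actually returns.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" or "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answers swapped; accept only a consistent verdict."""
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B shown first

    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"  # verdict flipped with position, so treat it as noise
```

A judge that always prefers whichever answer it sees first produces `"tie"` on every pair, which is the desired outcome: its verdicts carried no information about the answers. The cost is two judge calls per comparison.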
Layer 4: RAG-Specific Metrics with RAGAS
RAG systems need evaluation at each pipeline stage. RAGAS measures four distinct properties:
- Faithfulness: Are all claims in the answer entailed by the retrieved context? (Detects hallucination against the context.)
- Answer Relevance: Does the answer actually address the question? (Detects tangential responses.)
- Context Precision: What fraction of the retrieved chunks were actually useful? (Diagnoses over-retrieval.)
- Context Recall: Were all relevant chunks retrieved? (Diagnoses under-retrieval.)
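To build intuition for the two retrieval-side properties, here is a toy set-based version that assumes you have hand-labeled which chunk IDs are relevant for a question. This is not how RAGAS computes its scores: RAGAS estimates these properties with model-based judgments rather than labeled IDs, so its numbers will not match this calculation. The function name and chunk IDs are illustrative.

```python
def context_precision_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of one retrieval result against labeled relevant chunks."""
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    # Precision: share of retrieved chunks that were useful (low value suggests over-retrieval).
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of useful chunks that were retrieved (low value suggests under-retrieval).
    recall = len(hits) / len(relevant_ids) if relevant_ids else 1.0
    return precision, recall


precision, recall = context_precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"})
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Half of the retrieved chunks were noise, and one relevant chunk was missed entirely. The two numbers point at different fixes: tighter filtering or reranking for precision, a larger top-k or better embeddings for recall. The RAGAS call itself looks like this: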
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# model_answer and chunk1..chunk3 are placeholders for your pipeline's output.
# Column names vary between ragas releases; check the docs for your installed version.
data = {
    "question": ["What is the refund policy?"],
    "answer": [model_answer],
    "contexts": [[chunk1, chunk2, chunk3]],  # retrieved chunks
    "ground_truth": ["Refunds are processed within 5 business days."],
}
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Example output:
# {'faithfulness': 0.94, 'answer_relevancy': 0.87,
#  'context_precision': 0.78, 'context_recall': 0.91}
```
Building an Eval Dataset
The most valuable 2 hours you can invest in LLM evaluation: build a golden dataset of 100–500 representative (question, expected_answer) pairs. This dataset becomes:
- The regression suite that runs in CI when you change prompts
- The calibration set for LLM-as-judge scores
- The benchmark for comparing model versions
Seed it with: your hardest support tickets, edge cases that previously caused failures, and randomly sampled production queries (with human-reviewed answers). Grow it by adding every new failure mode discovered in production.
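Growing the dataset is easier when adding a case is a one-line call. The helper below is a sketch under assumptions: it stores cases as a JSON list, skips questions that are already present, and tags each case with where it came from. The function name and the `source` field are hypothetical, and the layout should be extended with whatever other fields your harness reads.

```python
import json
from pathlib import Path


def add_golden_case(path: Path, question: str, expected_answer: str, source: str) -> bool:
    """Append a case to the golden file unless the question is already present."""
    cases = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    wanted = question.strip().lower()
    if any(case["question"].strip().lower() == wanted for case in cases):
        return False  # duplicate question; keep the existing entry
    cases.append({
        "question": question,
        "expected_answer": expected_answer,
        "source": source,  # hypothetical tag, e.g. "support_ticket" or "production_failure"
    })
    path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return True
```

Tagging the source pays off later: when a regression appears, you can see whether it hits sampled production traffic or only the hand-picked edge cases.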
Wiring Evals into CI
```python
# pytest-style eval harness
import json
from pathlib import Path

# rag_query, judge_faithfulness, and is_valid_json are placeholders:
# import them from your own application and eval code.

GOLDEN = json.loads(Path("evals/golden.json").read_text())
FAITHFULNESS_THRESHOLD = 0.85
FORMAT_PASS_RATE = 0.99


def test_faithfulness_regression():
    scores = []
    for item in GOLDEN:
        answer = rag_query(item["question"])
        score = judge_faithfulness(answer, item["context"])
        scores.append(score)
    avg = sum(scores) / len(scores)
    assert avg >= FAITHFULNESS_THRESHOLD, (
        f"Faithfulness {avg:.3f} below threshold {FAITHFULNESS_THRESHOLD}"
    )


def test_format_compliance():
    passed = sum(1 for item in GOLDEN if is_valid_json(rag_query(item["question"])))
    rate = passed / len(GOLDEN)
    assert rate >= FORMAT_PASS_RATE, f"Format pass rate {rate:.3f} below {FORMAT_PASS_RATE}"
```
Production Monitoring Signals
Evaluation doesn't end at CI. Production needs continuous monitoring of:
- Refusal rate: Sudden spike → prompt broke or model was updated. Target <0.5% for non-edge-case queries.
- Format error rate: JSON parse failures, missing required fields. Target <0.1%.
- Latency p95/p99: Context length creep silently increases inference time. Set alerts on p95 > 3s.
- User correction rate: If users edit or re-submit after an AI answer, that's an implicit negative signal. Track it.
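The signals above can be checked together over a batch of request logs. This sketch uses the thresholds from the list (0.5% refusals, 0.1% format errors, p95 latency of 3 seconds); the `RequestLog` record and its field names are assumptions about what your logging captures, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    """Hypothetical per-request record; adapt the fields to your own logging."""
    refused: bool
    format_error: bool
    latency_s: float
    user_corrected: bool


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def user_correction_rate(logs: list[RequestLog]) -> float:
    """Share of requests where the user edited or re-submitted afterwards."""
    return sum(r.user_corrected for r in logs) / len(logs) if logs else 0.0


def check_signals(logs: list[RequestLog]) -> dict[str, float]:
    """Return only the signals that breach their threshold, with observed values."""
    if not logs:
        return {}
    n = len(logs)
    observed = {
        "refusal_rate": sum(r.refused for r in logs) / n,
        "format_error_rate": sum(r.format_error for r in logs) / n,
        "latency_p95_s": p95([r.latency_s for r in logs]),
    }
    limits = {"refusal_rate": 0.005, "format_error_rate": 0.001, "latency_p95_s": 3.0}
    return {name: value for name, value in observed.items() if value > limits[name]}
```

Running `check_signals` on each hour's or day's logs gives an empty dict when everything is healthy, which makes it easy to wire into an alerting job. User correction rate has no threshold in the list above, so it is reported separately and best tracked as a trend.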