LLM Evaluation That Actually Works: Beyond Vibe Checks

Shipping LLM features without systematic evaluation is how teams discover hallucination bugs in production. "I read 50 outputs and they seemed good" is not a testing strategy. This guide covers the evaluation stack from fast automated metrics through human-calibrated LLM-as-judge to production monitoring — and shows how to wire them into CI.

Why Vibe Checks Fail

Manual review of LLM outputs has three fatal properties at scale:

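Inconsistency: Human raters agree on obvious failures and disagree on everything in the middle. Inter-rater reliability (Cohen's κ) on LLM output quality is often below 0.5, which is only weak-to-moderate agreement beyond chance (κ = 0 is chance level, κ = 1 is perfect agreement).
Cost: At 10,000 daily requests, reviewing even 1% of outputs means 100 reviews per day. Assuming about 5 minutes per careful review, that is roughly 8 person-hours every day, and the cost grows linearly with traffic.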
Latency: You cannot run human review in CI before shipping a prompt change. Regressions reach production before anyone notices.
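
To make the κ figure concrete, here is a minimal sketch of Cohen's kappa for two raters. The rater labels are invented for illustration; the point is that raw agreement can look acceptable while agreement beyond chance stays low.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two reviewers on the same eight outputs.
rater_a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad"]
rater_b = ["good", "bad", "bad", "good", "good", "good", "bad", "good"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # → kappa = 0.25
```

Here the raters agree on 5 of 8 items (62.5%), yet κ is only 0.25 because half of that agreement would be expected by chance.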

Systematic evaluation replaces subjective impression with measurable signals. The goal isn't a single number — it's a battery of metrics that together surface different failure modes.

The Evaluation Stack

Layer 1: Structural / Format Checks (Free, Instant)

The fastest evals check that outputs are structurally valid before touching semantics:

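import json

def eval_format(output: str, expected_schema: dict) -> dict:
    """Cheap structural checks: parseable JSON, required keys present, no refusal text."""
    results = {"valid_json": False, "schema_match": False, "no_refusal": True}
    try:
        parsed = json.loads(output)
        results["valid_json"] = True
        # Only a JSON object can satisfy a key-based schema; a bare list or string cannot.
        results["schema_match"] = isinstance(parsed, dict) and all(
            k in parsed for k in expected_schema
        )
    except json.JSONDecodeError:
        pass
    # Naive substring match; extend the phrase list to fit your model's refusal style.
    refusal_phrases = ["I cannot", "I'm unable to", "As an AI"]
    results["no_refusal"] = not any(p in output for p in refusal_phrases)
    return results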

Structural checks should run on 100% of outputs in production as a real-time alert — a spike in invalid JSON means your prompt broke, likely due to a model update or context overflow.
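
One way to turn that into a real-time alert is a rolling window over recent check results. The sketch below is an assumption about how you might wire it, not a prescribed design: the class name, window size, alert rate, and minimum sample count are all hypothetical and should be tuned to your traffic.

```python
from collections import deque

class FormatFailureMonitor:
    """Tracks the failure rate of structural checks over the most recent outputs."""

    def __init__(self, window_size: int = 1000, alert_rate: float = 0.01, min_samples: int = 100):
        self.results = deque(maxlen=window_size)  # oldest results drop off automatically
        self.alert_rate = alert_rate
        self.min_samples = min_samples

    def record(self, is_valid: bool) -> bool:
        """Record one check result; return True if the alert should fire."""
        self.results.append(is_valid)
        if len(self.results) < self.min_samples:
            return False  # too little data to call it a spike
        failure_rate = self.results.count(False) / len(self.results)
        return failure_rate > self.alert_rate
```

Feed it the `valid_json` value from each format check and page someone when `record` returns True.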

Layer 2: Reference-Based Metrics

For tasks with a known correct answer, compare model output against a reference:

Metric	What It Measures	Good For	Limitation
Exact Match (EM)	String equality	Classification, extraction	Zero credit for near-correct
F1 token overlap	Token-level precision/recall	Extractive QA (SQuAD)	Ignores paraphrase
ROUGE-L	Longest common subsequence	Summarization	Rewards surface word overlap, ignores meaning
BERTScore	Contextual embedding similarity	Generation quality	Compute cost, correlation with human varies by task
BLEU	n-gram precision	Translation (legacy)	Punishes valid paraphrase harshly

from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A feline rested on a rug."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean():.3f}")  # high (around 0.9) despite different wording; exact value depends on the model version
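
The two cheapest metrics in the table need no dependencies. The sketch below follows the common SQuAD-style convention of lowercasing and stripping punctuation and articles before comparing; the function names are my own, and the normalization rules are an assumption you may want to adapt to your task.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # → True
print(token_f1("The cat sat on the mat.", "A feline rested on a rug."))  # → 0.25
```

The second call shows the paraphrase limitation from the table: the two sentences mean nearly the same thing but share only the token "on".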

Layer 3: LLM-as-Judge

For open-ended generation (summarization, explanation, code review), there is no single reference answer. LLM-as-judge uses a strong model to evaluate outputs on dimensions like correctness, helpfulness, and faithfulness.

EVAL_PROMPT = """You are evaluating an AI assistant's answer to a question.

Question: {question}
Reference context: {context}
Assistant's answer: {answer}

Rate the answer on three dimensions (1-5 each):
1. Faithfulness: Is every claim in the answer supported by the context? (5 = fully grounded)
2. Completeness: Does the answer address the question fully? (5 = comprehensive)
3. Conciseness: Is the answer appropriately brief without being incomplete? (5 = ideal length)

Respond in JSON: {{"faithfulness": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""

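# Call a strong judge model (separate from the model being evaluated).
# call_llm is a placeholder for your own client wrapper that returns the reply text.
judge_response = call_llm(EVAL_PROMPT.format(
    question=question,
    context="\n\n".join(retrieved_chunks),  # assumes a list of strings; join so the prompt gets text, not a list repr
    answer=model_output
))
scores = json.loads(judge_response)  # raises if the judge replies with anything but JSON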
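
Judge models do not always return clean JSON, so a bare `json.loads` on the reply can crash an eval run or let out-of-range scores through. A hedged sketch of a stricter parser follows; `parse_judge_scores` and `DIMENSIONS` are hypothetical names, and the rule of extracting the outermost braces is an assumption that suits the flat JSON object requested by the prompt above.

```python
import json
from typing import Optional

DIMENSIONS = ("faithfulness", "completeness", "conciseness")

def parse_judge_scores(raw: str) -> Optional[dict]:
    """Return validated 1-5 scores, or None if the reply is unusable."""
    # Tolerate text around the JSON object by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    scores = {}
    for dim in DIMENSIONS:
        value = parsed.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[dim] = value
    scores["reasoning"] = str(parsed.get("reasoning", ""))
    return scores
```

Counting how often this returns None is itself a useful signal about judge reliability.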

⚠️ LLM-as-Judge Biases

Judge models exhibit measurable biases: a preference for longer answers (verbosity bias), a preference for their own outputs when used to self-evaluate (self-enhancement bias), and sensitivity to answer position when comparing two options (position bias). Mitigate these by calibrating judge scores against human labels, using a judge model different from the one being evaluated, and averaging across multiple judge calls with shuffled answer order.
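
For pairwise comparisons, the simplest form of order shuffling is to ask the judge twice with the answers swapped and accept a winner only when both orderings agree. The sketch below assumes a hypothetical judge callable that takes (question, first_answer, second_answer) and returns "first" or "second"; that interface is mine, not a library API.

```python
from typing import Callable

# Hypothetical judge interface: returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]

def debiased_preference(judge: PairwiseJudge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after querying the judge in both answer orders."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the ordering, so treat it as position bias rather than signal.
    return "tie"
```

A judge that always picks whichever answer comes first will produce only ties under this scheme, which makes the bias visible instead of silently inflating one side.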

Layer 4: RAG-Specific Metrics with RAGAS

RAG systems need evaluation at each pipeline stage. RAGAS measures four distinct properties:

Faithfulness: Are all claims in the answer entailed by the retrieved context? (Detects hallucination against the context.)
Answer Relevance: Does the answer actually address the question? (Detects tangential responses.)
Context Precision: What fraction of the retrieved chunks were actually useful? (Diagnoses over-retrieval.)
Context Recall: Were all relevant chunks retrieved? (Diagnoses under-retrieval.)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

data = {
    "question": ["What is the refund policy?"],
    "answer": [model_answer],
    "contexts": [[chunk1, chunk2, chunk3]],  # retrieved chunks
    "ground_truth": ["Refunds are processed within 5 business days."]
}
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
# {'faithfulness': 0.94, 'answer_relevancy': 0.87,
#  'context_precision': 0.78, 'context_recall': 0.91}
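
To build intuition for the two retrieval metrics, here is a simplified set-based version computed from human-labelled chunk IDs. This is not how RAGAS computes its scores (it relies on LLM judgments and its own formulas); the function name and the chunk IDs are assumptions for illustration only.

```python
def context_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs against labelled relevant IDs."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    # Precision: share of retrieved chunks that were useful (low value suggests over-retrieval).
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant chunks that were retrieved (low value suggests under-retrieval).
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall

# Hypothetical example: four chunks retrieved, three chunks labelled relevant.
precision, recall = context_precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"})
print(f"precision={precision:.2f} recall={recall:.2f}")  # → precision=0.50 recall=0.67
```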

Building an Eval Dataset

The most valuable 2 hours you can invest in LLM evaluation: build a golden dataset of 100–500 representative (question, expected_answer) pairs. This dataset becomes:

The regression suite that runs in CI when you change prompts
The calibration set for LLM-as-judge scores
The benchmark for comparing model versions

Seed it with: your hardest support tickets, edge cases that previously caused failures, and randomly sampled production queries (with human-reviewed answers). Grow it by adding every new failure mode discovered in production.
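
Growing the dataset is easier when adding a case is a single function call. The following is a minimal sketch that assumes the golden set is stored as one JSON list; the `GoldenExample` fields, the `source` tag, and the duplicate check on normalized question text are all hypothetical choices.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class GoldenExample:
    question: str
    expected_answer: str
    source: str  # e.g. "support_ticket", "production_failure", "random_sample"

def add_example(path: Path, example: GoldenExample) -> bool:
    """Append the example unless its question is already present. Returns True if added."""
    examples = json.loads(path.read_text()) if path.exists() else []
    key = example.question.strip().lower()
    if any(e["question"].strip().lower() == key for e in examples):
        return False  # skip duplicates so repeated failures do not skew averages
    examples.append(asdict(example))
    path.write_text(json.dumps(examples, indent=2, ensure_ascii=False))
    return True
```

Tagging each example with its source lets you later report scores separately for hard cases and random production samples.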

Wiring Evals into CI

# pytest-style eval harness
# rag_query, judge_faithfulness, and is_valid_json are placeholders for your own pipeline code.
from pathlib import Path
import json

# Each golden item is expected to carry at least "question" and "context" keys.
GOLDEN = json.loads(Path("evals/golden.json").read_text())
FAITHFULNESS_THRESHOLD = 0.85
FORMAT_PASS_RATE = 0.99

def test_faithfulness_regression():
    scores = []
    for item in GOLDEN:
        answer = rag_query(item["question"])
        # judge_faithfulness is assumed to return a score in [0, 1]
        scores.append(judge_faithfulness(answer, item["context"]))
    avg = sum(scores) / len(scores)
    assert avg >= FAITHFULNESS_THRESHOLD, f"Faithfulness {avg:.3f} below threshold {FAITHFULNESS_THRESHOLD}"

def test_format_compliance():
    passed = sum(1 for item in GOLDEN if is_valid_json(rag_query(item["question"])))
    rate = passed / len(GOLDEN)
    assert rate >= FORMAT_PASS_RATE, f"Format pass rate {rate:.3f} below threshold {FORMAT_PASS_RATE}"

Production Monitoring Signals

Evaluation doesn't end at CI. Production needs continuous monitoring of:

Refusal rate: Sudden spike → prompt broke or model was updated. Target <0.5% for non-edge-case queries.
Format error rate: JSON parse failures, missing required fields. Target <0.1%.
Latency p95/p99: Context length creep silently increases inference time. Set alerts on p95 > 3s.
User correction rate: If users edit or re-submit after an AI answer, that's an implicit negative signal. Track it.
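
The first three signals can be checked from a batch of request logs with a few lines of code. This sketch uses the targets quoted above as thresholds; the `RequestRecord` shape, the signal names, and the nearest-rank percentile method are assumptions, and a real deployment would more likely compute these in its metrics system.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestRecord:
    latency_s: float
    refused: bool
    format_error: bool

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_signals(records: list[RequestRecord]) -> list[str]:
    """Return the names of the signals that breach their thresholds."""
    if not records:
        return []
    n = len(records)
    alerts = []
    if sum(r.refused for r in records) / n >= 0.005:       # refusal target: below 0.5%
        alerts.append("refusal_rate")
    if sum(r.format_error for r in records) / n >= 0.001:  # format error target: below 0.1%
        alerts.append("format_error_rate")
    if percentile([r.latency_s for r in records], 95) > 3.0:  # alert on p95 above 3 seconds
        alerts.append("latency_p95")
    return alerts
```

Run it over a fixed time window (for example the last hour of requests) so that a small batch does not trigger alerts on a single outlier.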

Tools-Hut
