The Prompt Anti-Patterns That Kill Teams
Before the solution, the problem taxonomy:
- Hardcoded strings: Prompt buried in application code. Changing it requires a full deploy. No change history, no review, no rollback.
- Database prompts with no version history: Updated by a product manager through an admin UI. No diff, no audit trail, no way to know why quality degraded last Tuesday.
- Shared prompts across environments: The same prompt string in staging and production. A "test" prompt edit accidentally ships to users.
- No regression testing: A prompt change that improves one use case silently breaks five others. No one knows until user complaints arrive.
Prompts as Files in Version Control
The simplest improvement: move prompts out of code and into files, then track them with git.
prompts/
├── summarizer/
│   ├── system.txt      # System prompt
│   ├── user.jinja2     # User prompt template (Jinja2 for variable injection)
│   └── metadata.json   # Model, temperature, max_tokens settings
├── classifier/
│   ├── system.txt
│   └── user.jinja2
└── extraction/
    ├── system.txt
    ├── user.jinja2
    └── tools.json      # Tool/function definitions

# metadata.json: version and model config alongside the prompt
{
  "version": "2.3.1",
  "model": "gpt-4o-mini",
  "temperature": 0.2,
  "max_tokens": 512,
  "last_modified": "2026-06-14",
  "changelog": "Added explicit JSON output instruction, reduced hallucination rate 12%"
}

# user.jinja2: Jinja2 template for dynamic prompt assembly
Summarize the following article in {{ max_words }} words or fewer.
Focus on: {{ focus_areas | join(', ') }}.
Output language: {{ output_lang | default('English') }}.
Article:
{{ article_text }}
Summary:

# Python loader
import json
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

def load_prompt(name: str, variables: dict) -> dict:
    """Assemble the chat messages and model settings for the prompt `name`."""
    base = Path("prompts") / name  # resolved relative to the working directory
    env = Environment(loader=FileSystemLoader(str(base)))
    system = (base / "system.txt").read_text(encoding="utf-8")
    user = env.get_template("user.jinja2").render(**variables)
    config = json.loads((base / "metadata.json").read_text(encoding="utf-8"))
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "model": config["model"],
        "temperature": config["temperature"],
        "max_tokens": config["max_tokens"],
    }

Testing Prompts: The Eval Suite
Every prompt file should have a corresponding test file:
evals/
├── summarizer/
│   ├── cases.jsonl          # test cases: {input, expected_properties}
│   └── test_summarizer.py
└── classifier/
    ├── cases.jsonl
    └── test_classifier.py

# evals/summarizer/cases.jsonl (one JSON object per line)
{"id":"s001", "input":{"max_words":50,"article_text":"..."}, "must_not_contain":["I cannot"], "max_word_count":60}
{"id":"s002", "input":{"max_words":100,"article_text":"..."}, "must_not_contain":["As an AI"], "max_word_count":110}

# evals/summarizer/test_summarizer.py
import json
from pathlib import Path
import pytest
# Assumed project modules (names are illustrative): the loader shown earlier
# and a thin wrapper that sends the request and returns the reply text.
from prompt_loader import load_prompt
from llm_client import call_llm

CASES = [json.loads(line)
         for line in Path("evals/summarizer/cases.jsonl").read_text(encoding="utf-8").splitlines()
         if line.strip()]  # skip blank lines

@pytest.mark.parametrize("case", CASES, ids=[c["id"] for c in CASES])
def test_summarizer(case):
    prompt = load_prompt("summarizer", case["input"])
    output = call_llm(**prompt)
    word_count = len(output.split())
    assert word_count <= case["max_word_count"], \
        f"Output too long: {word_count} words (limit: {case['max_word_count']})"
    for phrase in case.get("must_not_contain", []):
        assert phrase not in output, f"Output contained forbidden phrase: '{phrase}'"

CI/CD Pipeline for Prompts
A PR touching any file under prompts/ should trigger the full eval suite before merge:
# .github/workflows/prompt-eval.yml
name: Prompt Evaluation
on:
  pull_request:
    paths: ['prompts/**', 'evals/**']
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # needed to post the summary comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.12'}
      - run: pip install -r requirements.txt
      - run: pytest evals/ -v --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Post eval summary to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: 'Prompt eval complete. See Actions tab for results.'
            })

Running LLM evals in CI has a real cost (API calls) and is non-deterministic (any temperature above 0 produces different outputs on each run). Mitigate both: set temperature to 0 in CI so outputs are as repeatable as the API allows; cache responses by (prompt_hash + model + input_hash) so re-runs don't re-bill; and set a budget alert on the CI service account. A 500-case eval suite at gpt-4o-mini prices costs ~$0.05 per run, so run it on every PR without hesitation.
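The caching idea above can be sketched as a small file-backed cache keyed by (prompt_hash + model + input_hash). This is a minimal sketch, not a definitive implementation: `cache_key`, `cached_call`, and `cache_dir` are hypothetical names, and `call_llm` is assumed to be a callable that accepts `messages` and `model` as keyword arguments and returns the response text.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def _digest(obj) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(obj, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(messages: list, model: str, variables: dict) -> str:
    """Combine prompt hash, model name, and input hash into one cache key."""
    prompt_hash = _digest(messages)
    input_hash = _digest(variables)
    return _digest([prompt_hash, model, input_hash])


def cached_call(cache_dir: Path, messages: list, model: str, variables: dict,
                call_llm: Callable[..., str], **params) -> str:
    """Return the cached response if present; otherwise call the model and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(messages, model, variables)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["output"]
    # Cache miss: this is the only branch that costs an API call.
    output = call_llm(messages=messages, model=model, **params)
    path.write_text(json.dumps({"output": output}), encoding="utf-8")
    return output
```

The key deliberately ignores sampling parameters, so this sketch is only safe when the eval suite always runs at the same temperature. In CI, the cache directory would also need to be persisted between runs for the cache to save anything.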
Environment-Specific Prompt Deployment
Separate prompt versions across environments to prevent staging changes from affecting production:
# config.py
import os

PROMPT_ENV = os.getenv("PROMPT_ENV", "production")  # staging | production | dev

def get_prompt_version(name: str) -> str:
    # Read from environment-specific config or feature flag system
    versions = {
        "dev": {"summarizer": "main"},            # always use latest from branch
        "staging": {"summarizer": "v2.4.0-rc1"},  # pre-release candidate
        "production": {"summarizer": "v2.3.1"},   # stable release
    }
    return versions[PROMPT_ENV][name]

A/B Testing Prompts
Once evals pass in CI, validate in production with a controlled rollout:
- Deploy new prompt version to 5% of traffic (feature flag or hash-based routing)
- Log all prompt versions alongside responses in your observability system
- Compare format error rate, refusal rate, user correction rate, and latency between versions
- If new version is statistically better (run for ≥1,000 samples), ramp to 100% and retire the old version
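The hash-based routing mentioned in the first step might look like the sketch below. It assumes a stable `user_id` is available per request; the function name `assign_prompt_version` and the `experiment` label are hypothetical, introduced only for illustration.

```python
import hashlib


def assign_prompt_version(user_id: str, experiment: str, control: str,
                          candidate: str, candidate_percent: float = 5.0) -> str:
    """Deterministically route a fixed share of users to the candidate prompt."""
    # Hash the experiment name together with the user ID so that bucket
    # assignments are independent from one experiment to the next.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 10000), i.e. 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return candidate if bucket < candidate_percent * 100 else control


# The same user always lands on the same version for a given experiment.
version = assign_prompt_version("user-123", "summarizer-v2.4", "v2.3.1", "v2.4.0-rc1")
```

Because assignment depends only on the hash, no per-user state has to be stored, and ramping up is a matter of raising `candidate_percent`: users already on the candidate stay on it.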
This is the same rollout discipline used for code changes. Prompts deserve identical rigor — a bad prompt change at 100% traffic is as damaging as a bad code deploy.
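One way to make "statistically better" in the rollout steps concrete is a two-proportion z-test on a rate metric such as format errors. The sketch below is one common choice rather than the only valid test; the function name is hypothetical, and it assumes the samples in each arm are independent.

```python
import math


def two_proportion_z_test(errors_a: int, n_a: int,
                          errors_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two error rates."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    # Pooled rate under the null hypothesis that both versions share one error rate.
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both arms all-success or all-failure: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Old prompt: 100 format errors in 1,000 calls; new prompt: 50 in 1,000.
z, p = two_proportion_z_test(100, 1000, 50, 1000)
```

A positive `z` with a small `p` here would mean the old version's error rate is credibly higher. With a 5% traffic split, the sample-size floor should be checked against the smaller arm, not the total.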
Prompt Observability
Log every prompt call with enough context to reproduce and debug it:
import hashlib
from datetime import datetime, timezone

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "prompt_name": "summarizer",
    "prompt_version": "v2.3.1",
    "model": "gpt-4o-mini",
    "input_hash": hashlib.sha256(user_input.encode("utf-8")).hexdigest(),  # don't log raw PII
    "input_tokens": usage["prompt_tokens"],
    "output_tokens": usage["completion_tokens"],
    "latency_ms": elapsed_ms,
    "finish_reason": finish_reason,
    "format_valid": is_valid_json(output),
    "request_id": response_id,  # for OpenAI support tickets
}
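The record above leaves `is_valid_json` and the latency measurement undefined. A minimal, self-contained sketch of both follows; `build_log_entry` and its parameter names are assumptions for illustration, and `usage` is assumed to be a plain dict with `prompt_tokens` and `completion_tokens` keys.

```python
import hashlib
import json
import time
from datetime import datetime, timezone


def is_valid_json(text) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return False


def build_log_entry(prompt_name: str, prompt_version: str, model: str,
                    user_input: str, output: str, usage: dict,
                    finish_reason: str, response_id: str, started_at: float) -> dict:
    """Assemble one structured log record; started_at comes from time.perf_counter()."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "model": model,
        # Hash instead of logging raw input, which may contain PII.
        "input_hash": hashlib.sha256(user_input.encode("utf-8")).hexdigest(),
        "input_tokens": usage["prompt_tokens"],
        "output_tokens": usage["completion_tokens"],
        "latency_ms": round((time.perf_counter() - started_at) * 1000),
        "finish_reason": finish_reason,
        "format_valid": is_valid_json(output),
        "request_id": response_id,
    }
```

A caller would record `started_at = time.perf_counter()` just before the API call, then emit `json.dumps(entry)` as one line per request so the log stays machine-parseable.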