Why Not Just Put Everything in the Prompt?
The naive solution to "LLMs don't know my private data" is to paste all relevant documents into the system prompt. This fails on three fronts:
- Context limits: Even with 128k-token windows, a typical corporate knowledge base has millions of tokens. You cannot fit it all.
- Cost: Every token in the prompt is billed at input token rates. At roughly $2 per million input tokens, injecting 50,000 tokens into every request costs about $0.10 per call, or $10,000 per 100,000 requests.
- Quality degradation: Studies of long-context models report that answer quality tends to drop as the context grows. The "lost in the middle" effect means the model attends poorly to content in the middle of a large context, preferring content at the start and end.
Retrieval-augmented generation (RAG) addresses all three: retrieve only the 3–5 chunks most relevant to the current query, then inject only those.
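To make the cost argument concrete, here is a minimal back-of-the-envelope sketch. The price of $2 per million input tokens and the figure of 5 chunks at about 400 tokens each are assumptions for illustration; substitute your provider's actual rates.

```python
def prompt_cost_usd(tokens_per_request: int, requests: int,
                    usd_per_million_tokens: float) -> float:
    """Total input-token cost for a batch of requests."""
    return tokens_per_request * requests * usd_per_million_tokens / 1_000_000

PRICE = 2.00  # assumed USD per million input tokens, not a quoted price

# Pasting 50,000 tokens of documents into every request
full_dump = prompt_cost_usd(50_000, 100_000, PRICE)

# Injecting only 5 retrieved chunks of ~400 tokens each (assumed sizes)
rag = prompt_cost_usd(5 * 400, 100_000, PRICE)

print(f"full prompt: ${full_dump:,.0f}   RAG: ${rag:,.0f}")
```

Under these assumptions the retrieval approach sends 25 times fewer context tokens per request, and the bill scales down by the same factor.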
The Two-Phase Architecture
Phase 1 — Indexing (offline, runs once and again whenever the source documents change):
- Load documents from source (files, database, API)
- Chunk into smaller pieces with overlap
- Embed each chunk into a vector
- Store vectors in a vector index
Phase 2 — Query (online, runs per request):
- Embed the user's question into a vector
- Retrieve top-k most similar chunks from the index
- Build a prompt: system instructions + retrieved chunks + user question
- Call the LLM API and return the response
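The prompt-building step of the query phase can be sketched as a small function. This is one possible layout, not the only one: the `role`/`content` message shape follows the common chat-completions convention, and the name `build_messages` and the `[n]` chunk numbering are illustrative choices.

```python
def build_messages(system_instructions: str, chunks: list[str],
                   question: str) -> list[dict]:
    """Assemble chat messages: system instructions, retrieved chunks, question."""
    # Number the chunks so the model (and you, when debugging) can refer to them.
    context = "\n\n".join(f"[{n}] {chunk}" for n, chunk in enumerate(chunks, 1))
    user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "Answer only from the provided context.",
    ["Tokens expire after one hour.", "Refresh tokens last 30 days."],
    "How long is a token valid?",
)
```

Keeping the instructions in the system message and the retrieved text in the user message makes it easy to swap the retrieved chunks per request while the instructions stay fixed.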
Chunking: The Decision That Matters Most
Poor chunking is the most common cause of RAG failures. The tradeoff is fundamental:
- Small chunks (128 tokens): Retrieval is precise — the retrieved text matches the query tightly. But the chunk may lack the surrounding context needed to actually answer the question.
- Large chunks (1024 tokens): More context per chunk, but embeddings average over more content and become less discriminating. A large chunk about "Python packaging" will match queries about virtualenvs, pip, pyproject.toml, and wheels equally.
A good starting point for most text is 256–512 tokens with a 64-token overlap; tune from there against your own queries. The overlap ensures that sentences crossing a chunk boundary are fully represented in at least one chunk.
```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 64) -> list[dict]:
    """Split text into overlapping chunks, tracking word positions.

    Sizes are counted in whitespace-separated words, a rough proxy for tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk_words = words[i : i + chunk_size]
        chunks.append({
            "text": " ".join(chunk_words),
            "start_word": i,
            "end_word": i + len(chunk_words),  # exclusive
        })
        if i + chunk_size >= len(words):
            break  # this chunk reached the end; skip a redundant overlap-only tail
        i += chunk_size - overlap  # advance by (chunk_size - overlap)
    return chunks
```

For structured documents (Markdown, HTML), split on headers and paragraphs first, then apply size limits. This preserves semantic boundaries: a section on "Authentication" won't be split mid-explanation.
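Here is a minimal sketch of header-first splitting for Markdown. The function names are hypothetical, sizes are counted in words rather than tokens, and overlap is left out for brevity. It also assumes ATX-style headers (`#` through `######`) and would misread `#` lines inside fenced code blocks as headers.

```python
import re

def split_markdown_sections(markdown: str) -> list[dict]:
    """Split Markdown on ATX headers, keeping each header with its body."""
    sections = []
    title, body = None, []

    def flush():
        # Emit the section collected so far, unless it is an empty preamble.
        if title is not None or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections

def chunk_sections(markdown: str, max_words: int = 400) -> list[dict]:
    """Chunk within sections only, so no chunk crosses a header boundary."""
    chunks = []
    for section in split_markdown_sections(markdown):
        words = section["text"].split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start : start + max_words])
            chunks.append({"title": section["title"], "text": piece})
    return chunks

doc = "# Auth\nUse tokens.\n\n## Refresh\nRotate keys daily.\n"
demo_chunks = chunk_sections(doc, max_words=2)
```

Carrying the section title alongside each chunk is a deliberate choice: it can be prepended to the chunk text before embedding so that short chunks keep some of their surrounding context.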
The Complete RAG Pipeline (No Frameworks)
```python
import httpx
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

# ── 1. Embed model (free, local, 384 dims) ──────────────────────────────
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ── 2. Build index from documents ───────────────────────────────────────
documents = load_your_documents()  # placeholder: your own loader, returns a list of strings
all_chunks = [c for doc in documents for c in chunk_text(doc)]
texts = [c["text"] for c in all_chunks]
vectors = embedder.encode(texts, batch_size=64, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(vectors)  # no ids passed, so labels follow insertion order and match `texts`
index.set_ef(50)

# ── 3. Query function ────────────────────────────────────────────────────
def rag_query(question: str, k: int = 5, api_key: str = "") -> str:
    # Retrieve (never ask for more neighbours than the index holds)
    q_vec = embedder.encode([question], normalize_embeddings=True)
    ids, _ = index.knn_query(q_vec, k=min(k, len(texts)))
    retrieved = [texts[i] for i in ids[0]]

    # Build prompt
    context = "\n\n---\n\n".join(retrieved)
    prompt = f"""Answer the question using only the provided context.
If the context doesn't contain the answer, say so explicitly.

Context:
{context}

Question: {question}"""

    # Call the API directly, no SDK
    resp = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError on "choices"
    return resp.json()["choices"][0]["message"]["content"]
```

The "Lost in the Middle" Problem
If you retrieve 10 chunks and inject them in retrieval-score order, the model will pay the most attention to chunks at positions 1 and 10 — not 5 through 8. This is documented in the research literature as the "lost in the middle" effect.
Two mitigations:
- Retrieve more, inject fewer: Retrieve top-20 by vector similarity, then use a cross-encoder reranker to rescore and keep only the top-5. Cross-encoders (e.g., ms-marco-MiniLM-L-6-v2) are much more accurate than bi-encoder similarity because they process query+document together.
- Position the most relevant chunks first and last: If you must inject many chunks, put the highest-scored ones at positions 1 and N, and fill middle positions with lower-scored context.
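The second mitigation, placing the strongest chunks at both ends, could look like the sketch below. The function name `order_for_long_context` is made up for illustration, and it assumes the input list is already sorted best-first.

```python
def order_for_long_context(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at both ends and the weakest in the middle."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_best_first):
        # Alternate: ranks 0, 2, 4, ... fill from the front; 1, 3, 5, ... from the back.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = order_for_long_context(["c1", "c2", "c3", "c4", "c5"])
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the best chunk (`c1`) lands first, the second best (`c2`) lands last, and the weakest (`c5`) sits in the middle, where the model is least likely to attend to it.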
```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, top20_chunks: list[str], keep: int = 5) -> list[str]:
    """Mitigation 1: rescore the retrieved candidates and keep only the best few."""
    # Score all 20 candidates against the query
    pairs = [(question, chunk) for chunk in top20_chunks]
    scores = reranker.predict(pairs)  # shape [20]; higher is more relevant
    top_idx = np.argsort(scores)[::-1][:keep]
    return [top20_chunks[i] for i in top_idx]
```

Hybrid Search: BM25 + Vector
Pure vector search can miss exact keyword matches. If a user asks about "RFC 7519", semantic similarity alone may not help: the retriever needs to find that exact string. BM25 (the ranking function behind Elasticsearch/Solr full-text search) excels at keyword retrieval.
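To show what BM25 rewards, here is a small self-contained sketch of the scoring function. It uses naive lowercase whitespace tokenization and the common defaults `k1=1.5`, `b=0.75`, both of which are assumptions; it also expects a non-empty document list. A production system would use a search engine or a dedicated library instead.

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates via k1; b normalizes for document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

docs = [
    "JWT is defined in RFC 7519",
    "OAuth 2.0 is defined in RFC 6749",
    "Tokens expire after one hour",
]
ranking = bm25_rank("RFC 7519", docs)
```

The document containing the literal string "7519" ranks first because that term is rare in the collection, which is exactly the signal an embedding can blur.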
Hybrid search combines both using Reciprocal Rank Fusion (RRF): get the top-20 results from BM25, get the top-20 from vector search, then merge by reciprocal rank score. Documents that appear highly ranked in both lists score highest.
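Written out, the merge scores each document by summing one term per ranked list. This is the usual form of the RRF formula; the symbols R and rank_r are notation introduced here, and k is the smoothing constant (60 by convention).

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},
\qquad R = \{\text{BM25}, \text{vector}\}
```

Ranks are 1-based, and a list that does not contain d contributes nothing. With k = 60, a document ranked 1st by BM25 and 3rd by vector search scores 1/61 + 1/63 ≈ 0.0323, while a document ranked 1st in only one list scores 1/61 ≈ 0.0164.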
```python
def rrf_merge(bm25_ids: list, vector_ids: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion. k=60 is the standard constant."""
    scores = {}
    for rank, doc_id in enumerate(bm25_ids, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    for rank, doc_id in enumerate(vector_ids, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Common RAG Failure Modes
| Symptom | Root Cause | Fix |
|---|---|---|
| Model ignores retrieved context, hallucinates | Prompt doesn't constrain model to context | Add explicit: "Answer ONLY from the context below. If unsure, say 'I don't know.'" |
| Right documents not retrieved | Chunk size too large, embeddings diluted | Reduce chunk size to 256 tokens, add chunking on structural boundaries |
| Keyword queries fail ("RFC 7519") | Pure vector search misses exact terms | Add BM25 hybrid search with RRF merging |
| Stale answers after document updates | Index not refreshed after source changes | Implement delta indexing: track doc modification timestamps |
| Context too long, cost spikes | Retrieving too many chunks (k=20) | Use cross-encoder reranker to reduce to top-3 high-quality chunks |
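The delta-indexing fix from the table can be sketched as a comparison between the modification times seen now and those recorded at the last indexing run. The function name `plan_reindex` and the dictionary layout are illustrative assumptions. How you obtain the timestamps (file `mtime`, a database `updated_at` column) depends on your source.

```python
def plan_reindex(source_mtimes: dict[str, float],
                 indexed_mtimes: dict[str, float]) -> dict[str, list[str]]:
    """Decide which documents to (re-)embed and which to drop from the index."""
    to_embed = [
        doc_id for doc_id, mtime in source_mtimes.items()
        if doc_id not in indexed_mtimes or mtime > indexed_mtimes[doc_id]
    ]
    # Documents that were indexed earlier but no longer exist at the source.
    to_delete = [doc_id for doc_id in indexed_mtimes if doc_id not in source_mtimes]
    return {"embed": sorted(to_embed), "delete": sorted(to_delete)}

plan = plan_reindex(
    source_mtimes={"a.md": 100.0, "b.md": 250.0, "d.md": 50.0},
    indexed_mtimes={"a.md": 100.0, "b.md": 200.0, "c.md": 80.0},
)
print(plan)  # {'embed': ['b.md', 'd.md'], 'delete': ['c.md']}
```

When a document is re-embedded, its old chunks also need to be removed or overwritten. Otherwise the index serves both the stale and the fresh version.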
The model used to embed documents during indexing must be the same model used to embed queries at search time. If you switch embedding models (e.g., upgrade from all-MiniLM-L6-v2 to text-embedding-3-large), you must re-embed and re-index all documents from scratch. The embedding spaces are not compatible.
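One way to guard against mixing embedding spaces is to store the model name next to the index and check it before querying. The sketch below is an illustration only: the JSON sidecar file and both function names are assumptions, not a feature of any vector index library.

```python
import json
from pathlib import Path

def save_index_metadata(path: str, model_name: str, dim: int) -> None:
    """Record which embedding model produced the vectors in the index."""
    Path(path).write_text(json.dumps({"embedding_model": model_name, "dim": dim}))

def check_index_metadata(path: str, model_name: str, dim: int) -> None:
    """Raise if the query-time model differs from the one used for indexing."""
    meta = json.loads(Path(path).read_text())
    if meta["embedding_model"] != model_name or meta["dim"] != dim:
        raise ValueError(
            f"Index was built with {meta['embedding_model']} ({meta['dim']} dims), "
            f"but queries use {model_name} ({dim} dims). Re-embed and re-index."
        )
```

Call `save_index_metadata` at the end of indexing and `check_index_metadata` when the query service starts, so a model upgrade fails fast instead of silently returning poor matches.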