RAG Without the Framework: HTTP Calls and a Vector Index

Retrieval-Augmented Generation is three operations: embed documents into a vector index, embed a query and retrieve similar chunks, inject those chunks into a prompt. Frameworks such as LangChain wrap these three operations in many layers of abstraction. Stripping that away and building RAG from scratch in roughly 80 lines of Python reveals what actually matters, and what commonly goes wrong.

Why Not Just Put Everything in the Prompt?

The naive solution to "LLMs don't know my private data" is to paste all relevant documents into the system prompt. This fails on three fronts:

Context limits: Even with 128k-token windows, a typical corporate knowledge base has millions of tokens. You cannot fit it all.
Cost: Every token in the prompt is billed at input-token rates. At roughly $2 per million input tokens, injecting 50,000 tokens into every request costs about $0.10 per call, or $10,000 per 100,000 requests.
Quality degradation: Research consistently shows that model performance drops with very long contexts. The "lost in the middle" effect means the model attends poorly to content in the middle of a large context, preferring content at the start and end.

RAG solves all three: retrieve only the 3–5 chunks most relevant to the current query, then inject only those.
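To make the cost argument concrete, here is a back-of-the-envelope comparison. The price of $2 per million input tokens is an assumption implied by the $0.10 per 50,000 tokens figure above, and the 400-token chunk size is likewise assumed; substitute your provider's real rates.

```python
def prompt_cost_usd(tokens_per_request: int, requests: int, usd_per_million_tokens: float) -> float:
    """Input-token cost for a batch of requests."""
    return tokens_per_request * requests * usd_per_million_tokens / 1_000_000

PRICE = 2.0  # assumed USD per million input tokens, not a quoted price

# Paste-everything approach: 50,000 context tokens on every request
full_context = prompt_cost_usd(50_000, 100_000, PRICE)

# RAG approach: 5 retrieved chunks of ~400 tokens each (assumed sizes)
rag_context = prompt_cost_usd(5 * 400, 100_000, PRICE)

print(f"full context: ${full_context:,.0f}  RAG: ${rag_context:,.0f}")
```

The RAG figure ignores the instruction and question tokens, so treat it as a lower bound rather than a forecast.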

The Two-Phase Architecture

Phase 1 — Indexing (offline, runs once):

Load documents from source (files, database, API)
Chunk into smaller pieces with overlap
Embed each chunk into a vector
Store vectors in a vector index

Phase 2 — Query (online, runs per request):

Embed the user's question into a vector
Retrieve top-k most similar chunks from the index
Build a prompt: system instructions + retrieved chunks + user question
Call the LLM API and return the response
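The two phases can be sketched end to end with nothing but the standard library. This is a toy under stated assumptions: toy_embed is a hypothetical hashed bag-of-words stand-in for a real embedding model, the "index" is a plain list scanned linearly, and the LLM call is left out so that only the data flow is visible.

```python
import math
import re
import zlib

DIM = 256  # toy vector size, chosen arbitrarily

def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. A stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Phase 1: indexing (offline). Embed every chunk and store (chunk, vector) pairs.
def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

# Phase 2: query (online). Embed the question, rank chunks by dot product.
def retrieve(index: list[tuple[str, list[float]]], question: str, k: int = 2) -> list[str]:
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: sum(a * b for a, b in zip(q, item[1])), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n\n".join(chunks)
    return f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}"

index = build_index([
    "Refunds are processed within five business days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available on weekdays.",
])
question = "What is the API rate limit?"
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)  # this string is what would be sent to the LLM
```

Swapping toy_embed for a real model and the list for a vector index changes the quality of retrieval, not the shape of the pipeline.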

Chunking: The Decision That Matters Most

Poor chunking is the most common cause of RAG failures. The tradeoff is fundamental:

Small chunks (128 tokens): Retrieval is precise — the retrieved text matches the query tightly. But the chunk may lack the surrounding context needed to actually answer the question.
Large chunks (1024 tokens): More context per chunk, but embeddings average over more content and become less discriminating. A large chunk about "Python packaging" will match queries about virtualenvs, pip, pyproject.toml, and wheels equally.

The sweet spot for most text is 256–512 tokens with 64-token overlap. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided the sentence is shorter than the overlap.

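def chunk_text(text: str, chunk_size: int = 400, overlap: int = 64) -> list[dict]:
    """Split text into overlapping chunks, tracking word positions.

    Sizes are counted in whitespace-separated words, a rough proxy for tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    i = 0
    while i < len(words):
        chunk_words = words[i : i + chunk_size]
        chunks.append({
            "text": " ".join(chunk_words),
            "start_word": i,
            "end_word": i + len(chunk_words),  # exclusive
        })
        if i + chunk_size >= len(words):
            break  # this window reached the end; another would only repeat the tail
        i += step
    return chunks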

For structured documents (Markdown, HTML), split on headers and paragraphs first, then apply size limits. This preserves semantic boundaries — a section on "Authentication" won't be split mid-explanation.
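A minimal sketch of that structure-first approach for Markdown, assuming ATX-style headers (#, ##, ...) and blank-line paragraph breaks. The function name chunk_markdown and the max_words parameter are illustrative, and sizes are counted in words rather than tokens.

```python
import re

def chunk_markdown(text: str, max_words: int = 400) -> list[dict]:
    """Split on Markdown headers, then pack whole paragraphs into chunks of up to max_words."""
    chunks = []
    # Split just before every header line; the lookahead keeps each header with its section.
    for section in re.split(r"(?m)^(?=#{1,6}\s)", text):
        section = section.strip()
        if not section:
            continue
        first_line = section.splitlines()[0]
        heading = first_line.lstrip("#").strip() if first_line.startswith("#") else ""
        current, count = [], 0
        for para in re.split(r"\n\s*\n", section):
            n = len(para.split())
            # Start a new chunk when adding this paragraph would exceed the limit.
            if current and count + n > max_words:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks

doc = "# Intro\n\nHello world.\n\n## Authentication\n\nUse tokens.\n\nRotate keys often.\n"
sections = chunk_markdown(doc)
```

Two limitations of this sketch: a single paragraph longer than max_words is kept whole and would need a second pass through a fixed-size splitter, and a line starting with # inside a code block would be mistaken for a header.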

The Complete RAG Pipeline (No Frameworks)

import httpx
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

# ── 1. Embed model (free, local, 384 dims) ──────────────────────────────
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ── 2. Build index from documents ───────────────────────────────────────
documents = load_your_documents()  # placeholder: supply your own loader returning a list of strings
all_chunks = [c for doc in documents for c in chunk_text(doc)]
texts = [c["text"] for c in all_chunks]

vectors = embedder.encode(texts, batch_size=64, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(texts)))  # explicit labels: label i maps to texts[i]
index.set_ef(50)

# ── 3. Query function ────────────────────────────────────────────────────
def rag_query(question: str, k: int = 5, api_key: str = "") -> str:
    # Retrieve (knn_query fails if k exceeds the number of indexed items, so clamp it)
    k = min(k, len(texts))
    q_vec = embedder.encode([question], normalize_embeddings=True)
    ids, _ = index.knn_query(q_vec, k=k)
    retrieved = [texts[i] for i in ids[0]]

    # Build prompt
    context = "\n\n---\n\n".join(retrieved)
    prompt = f"""Answer the question using only the provided context.
If the context doesn't contain the answer, say so explicitly.

Context:
{context}

Question: {question}"""

    # Call API directly — no SDK
    resp = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512
        },
        timeout=30
    )
    resp.raise_for_status()  # surface HTTP errors (bad key, rate limit) instead of a KeyError below
    return resp.json()["choices"][0]["message"]["content"]

The "Lost in the Middle" Problem

If you retrieve 10 chunks and inject them in retrieval-score order, the model will pay the most attention to chunks at positions 1 and 10 — not 5 through 8. This is documented in the research literature as the "lost in the middle" effect.

Two mitigations:

Retrieve more, inject fewer: Retrieve top-20 by vector similarity, then use a cross-encoder reranker to rescore and keep only the top-5. Cross-encoders (e.g., ms-marco-MiniLM-L-6-v2) are much more accurate than bi-encoder similarity because they process query+document together.
Position the most relevant chunks first and last: If you must inject many chunks, put the highest-scored ones at positions 1 and N, and fill middle positions with lower-scored context.

import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Rescore candidates (e.g. the top-20 from vector search) and keep the best few."""
    # Score every candidate against the query
    pairs = [(question, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)  # one score per pair; higher is more relevant
    top_idx = np.argsort(scores)[::-1][:keep]
    return [candidates[i] for i in top_idx]
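The second mitigation, placing the strongest chunks at both ends of the context, can be sketched as a simple interleave. The function name order_for_edges is illustrative; it assumes the input list is already sorted best first.

```python
def order_for_edges(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_best_first):
        # Ranks 0, 2, 4, ... fill from the front; ranks 1, 3, 5, ... fill from the back.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = order_for_edges(["c1", "c2", "c3", "c4", "c5"])
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here c1 (the best chunk) lands at position 1, c2 at position N, and the lowest-ranked chunk c5 ends up in the middle, where inattention costs the least.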

Hybrid Search: BM25 + Vector

Pure vector search can miss exact keyword matches. If a user asks about "RFC 7519", semantic similarity is unreliable: the retriever needs to match that exact string. BM25 (the algorithm behind Elasticsearch/Solr full-text search) excels at keyword retrieval.

Hybrid search combines both using Reciprocal Rank Fusion (RRF): get the top-20 results from BM25, get the top-20 from vector search, then merge by reciprocal rank score. Documents that appear highly ranked in both lists score highest.

def rrf_merge(bm25_ids: list, vector_ids: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion. k=60 is the standard constant."""
    scores = {}
    for rank, doc_id in enumerate(bm25_ids, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    for rank, doc_id in enumerate(vector_ids, 1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
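For the keyword side, a production system would normally lean on a search engine or an existing BM25 library. To show what the ranking does, here is a small pure-Python sketch of Okapi BM25. The tokenizer, the function names, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration (both values are common defaults).

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices ordered by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / max(n_docs, 1)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more hits to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

docs = [
    "JWT is defined in RFC 7519.",
    "OAuth 2.0 is defined in RFC 6749.",
    "Tokens should be rotated regularly.",
]
bm25_ids = bm25_rank("RFC 7519", docs)
```

The returned list of indices has the same shape as the bm25_ids argument of rrf_merge, as long as the vector search uses the same document ids.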

Common RAG Failure Modes

Symptom	Root Cause	Fix
Model ignores retrieved context, hallucinates	Prompt doesn't constrain model to context	Add explicit: "Answer ONLY from the context below. If unsure, say 'I don't know.'"
Right documents not retrieved	Chunk size too large, embeddings diluted	Reduce chunk size to 256 tokens, add chunking on structural boundaries
Keyword queries fail ("RFC 7519")	Pure vector search misses exact terms	Add BM25 hybrid search with RRF merging
Stale answers after document updates	Index not refreshed after source changes	Implement delta indexing: track doc modification timestamps
Context too long, cost spikes	Retrieving too many chunks (k=20)	Use cross-encoder reranker to reduce to top-3 high-quality chunks
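The delta-indexing fix in the table can be sketched as a comparison between file modification times on disk and the times recorded at the last indexing run. This assumes documents are local files and that the recorded state is a plain dict keyed by path; the function name diff_against_index is illustrative, and re-embedding the changed files or dropping old vectors is left to whichever index you use.

```python
from pathlib import Path

def diff_against_index(
    paths: list[Path], indexed_mtimes: dict[str, float]
) -> tuple[list[Path], list[str]]:
    """Compare files on disk with the mtimes recorded at the last indexing run.

    Returns (changed, removed): files to embed again, and keys whose vectors should be dropped.
    """
    # A file is stale if it was never indexed or its mtime differs from the recorded one.
    changed = [p for p in paths if indexed_mtimes.get(str(p)) != p.stat().st_mtime]
    current = {str(p) for p in paths}
    removed = [key for key in indexed_mtimes if key not in current]
    return changed, removed
```

After re-embedding the changed files, store their new st_mtime values so that the next run sees them as up to date. Modification times can be unreliable on some filesystems and sync tools; a content hash is a sturdier alternative if that matters for your sources.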

⚠️ Embedding Model Consistency

The model used to embed documents during indexing must be the same model used to embed queries at search time. If you switch embedding models (e.g., upgrade from all-MiniLM-L6-v2 to text-embedding-3-large), you must re-embed and re-index all documents from scratch. The embedding spaces are not compatible.
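One way to guard against an accidental mismatch is to store the embedding model's name and dimensionality next to the index and check them before serving queries. The sketch below assumes a small JSON sidecar file; the file layout and both function names are hypothetical.

```python
import json
from pathlib import Path

def save_index_metadata(meta_path: Path, model_name: str, dim: int) -> None:
    """Record which embedding model built the index."""
    meta_path.write_text(json.dumps({"embedding_model": model_name, "dim": dim}))

def assert_same_embedder(meta_path: Path, model_name: str, dim: int) -> None:
    """Raise if the query-time embedder differs from the one used at indexing time."""
    meta = json.loads(meta_path.read_text())
    if (meta["embedding_model"], meta["dim"]) != (model_name, dim):
        raise ValueError(
            f"index was built with {meta['embedding_model']} ({meta['dim']} dims), "
            f"but queries use {model_name} ({dim} dims); re-embed and re-index"
        )
```

Call save_index_metadata once at the end of indexing and assert_same_embedder at service start-up, so a model swap fails loudly instead of silently returning irrelevant chunks.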
