1. The Core Problem RAG Solves
An LLM's knowledge is frozen at its training cutoff. Ask GPT-4 or Claude about a company's Q1 2026 earnings and it will either hallucinate or correctly say it doesn't know. The parameters cannot be updated at inference time.
The naive fix — fine-tuning — is expensive, requires high-quality labelled data, and still doesn't solve the stale knowledge problem (the new fine-tune is also frozen at its training date). RAG takes a different approach: instead of baking knowledge into the model weights, retrieve relevant context at query time and inject it into the prompt.
RAG = Retrieve relevant context + Augment the prompt with it + Generate a grounded answer. The model's weights never change. Only the prompt changes.
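The "augment" step is mostly string assembly. Here is a minimal sketch, assuming the retrieved chunks are already available as plain strings. The function name `build_augmented_prompt`, the prompt wording, and the sample chunk are illustrative assumptions, not a fixed standard.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt; the model weights are untouched."""
    # Number each chunk so the model can point back to its sources.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunk text, used only to show the prompt shape.
prompt = build_augmented_prompt(
    "What does the refund policy say about digital goods?",
    ["Refunds for digital goods are available within 14 days."],
)
print(prompt)
```

Whatever the generator model is, only this string changes between requests.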
2. Classic RAG Pipeline
RAG has two phases: an offline indexing phase (run once or periodically) and an online query phase (run on every user request).
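The two phases can be sketched in a few lines. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, and a Python list stands in for a vector store. All function names and sample chunks are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Offline indexing phase: embed every chunk once and store (vector, text) pairs."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index: list[tuple[Counter, str]], query: str, k: int = 2) -> list[str]:
    """Online query phase: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


index = build_index([
    "Refunds for digital goods are available within 14 days.",
    "Shipping takes three to five business days.",
    "Employees accrue vacation monthly.",
])
top = retrieve(index, "What is the refund policy for digital goods?", k=1)
print(top)
```

`build_index` runs once or on a schedule, while `retrieve` runs on every request, which is why indexing can afford to be slow and retrieval cannot.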
3. Chunking: The Hidden Bottleneck
The quality of a RAG system lives and dies on chunking strategy. If your chunks are too large, the vector embedding averages out signal and retrieval becomes imprecise. If chunks are too small, retrieved context lacks coherence and the LLM can't answer from fragments.
| Strategy | When to use | Watch out for |
|---|---|---|
| Fixed-size (e.g., 512 tokens) | Homogeneous content (FAQs, logs) | Splits mid-sentence, losing context |
| Sentence / paragraph | Prose documents, legal text | Uneven chunk sizes |
| Recursive character split | General purpose (LangChain default) | Doesn't respect semantic boundaries |
| Semantic chunking | Mixed content (reports, articles) | Slower; needs embedding model at chunk time |
| Document-structure aware | Code repos, HTML/PDF with headings | Requires parser per document type |
The overlap trick: When using fixed-size chunking, add a 10–20% token overlap between chunks (e.g., the last 50 tokens of chunk N appear again at the start of chunk N+1). This prevents key context from falling at a chunk boundary.
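A minimal sketch of fixed-size chunking with overlap follows. It assumes the text is already tokenized into a list; the placeholder tokens below stand in for real tokenizer output, and the function name is illustrative.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens)
print(len(chunks), [len(c) for c in chunks])
```

With 512-token chunks and a 50-token overlap (roughly 10%), the window advances 462 tokens at a time, so the last 50 tokens of each chunk reappear at the start of the next.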
4. Agentic RAG: When Classic RAG Isn't Enough
Classic RAG works well for single-hop questions — "What does the refund policy say about digital goods?" — where one vector search surfaces the right context. It breaks down when:
- The question requires multi-hop reasoning: "Compare the onboarding process described in the HR handbook with what the 2024 audit report recommends"
- The answer requires real-time data not in the vector store (stock prices, current weather, live APIs)
- The user's query needs clarification or decomposition before retrieval can even begin
- The system needs to take actions (run SQL, call APIs, write files) not just answer questions
Agentic RAG wraps the retrieval system with an LLM-driven agent (typically a ReAct or function-calling loop). The agent decides whether to retrieve, what to retrieve, how many times, and which tool to use.
5. The ReAct Loop in Detail
ReAct (Reasoning + Acting) is the most common agentic pattern. The model alternates between thinking (internal reasoning in the prompt) and acting (calling a tool). The tool result is fed back as an "observation", and the loop continues until the model decides it has enough to answer.
Thought: The user is asking about current mortgage rates. My training data is from 2025. I should search the web.
Action: web_search("current 30-year fixed mortgage rate June 2026")
Observation: [6.8% average per Freddie Mac, June 5, 2026]
Thought: Now I have current data. I can answer confidently.
Final Answer: As of June 2026, the average 30-year fixed mortgage rate is approximately 6.8% …
Each iteration adds to the context window. For long-running agents, this means the context can fill up, which is a critical failure mode to design for. Common mitigations: summarize intermediate observations, use a sliding window, or store observations in a vector store and retrieve only what's relevant.
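The control flow of such a loop can be sketched as below. This is an assumption-laden illustration: the `model` callable, its dict-shaped return value, and the `web_search` stub are hypothetical stand-ins so the code runs without an LLM or network access. A real agent would call an LLM and parse its output.

```python
from typing import Callable


def react_loop(question: str,
               model: Callable[[list[str]], dict],
               tools: dict[str, Callable[[str], str]],
               max_iterations: int = 5) -> str:
    """Alternate model steps and tool calls until a final answer or the iteration cap."""
    transcript = [f"Question: {question}"]
    for _ in range(max_iterations):
        step = model(transcript)  # hypothetical: returns a dict describing the next step
        transcript.append(f"Thought: {step['thought']}")
        if "final_answer" in step:
            return step["final_answer"]
        observation = tools[step["tool"]](step["input"])
        transcript.append(f"Action: {step['tool']}({step['input']!r})")
        transcript.append(f"Observation: {observation}")  # every turn grows the context
    return "Stopped: iteration limit reached without a final answer."


def scripted_model(transcript: list[str]) -> dict:
    """Scripted stand-in for an LLM: search once, then answer from the observation."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"thought": "I need current data.", "tool": "web_search",
                "input": "current 30-year fixed mortgage rate"}
    return {"thought": "I have what I need.", "final_answer": f"Based on: {transcript[-1]}"}


tools = {"web_search": lambda query: "[placeholder search result]"}
answer = react_loop("What is the current mortgage rate?", scripted_model, tools)
print(answer)
```

The `transcript` list is where context growth happens, so a summarization or sliding-window step would slot in just before each `model(transcript)` call.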
6. Classic RAG vs Agentic RAG: When to Use Which
Use classic RAG when:
- Questions are single-hop and can be answered from one document chunk
- Latency is critical (agentic loops can add 2–10x latency, since each iteration is another LLM call)
- Cost matters: each tool call plus LLM response adds tokens
- You have a well-defined, bounded knowledge base
- You don't need real-time data or external APIs
Use agentic RAG when:
- Questions require combining information from multiple sources
- You need real-time data (web, APIs, databases)
- The task involves taking actions, not just answering
- Users ask complex multi-step questions in a chat interface
- You want the system to ask clarifying questions before answering
7. Evaluation: How Do You Know It's Working?
RAG systems are notoriously hard to evaluate because failure modes are subtle. A response can look correct but be unsupported by the retrieved context (the LLM hallucinated), or be correct but sourced from a poor chunk (got lucky). The standard eval framework covers three dimensions:
- Context Relevance: Are the retrieved chunks actually relevant to the query? Measured with human labels or an LLM-as-judge prompt.
- Groundedness / Faithfulness: Does the answer contain only claims that are supported by the retrieved context (that is, no hallucination)?
- Answer Relevance: Does the final answer actually address what the user asked?
Tools like RAGAS, TruLens, and DeepEval automate these metrics using LLM-as-judge approaches, making it practical to run eval suites in CI.
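To show the shape of such an eval without depending on any particular library, here is a minimal sketch of the three dimensions scored by a pluggable judge. It does not reproduce the API of RAGAS, TruLens, or DeepEval. The `Judge` type, the prompt wording, the stub judge, and the 0.7 threshold are all assumptions for illustration.

```python
from typing import Callable

# Hypothetical judge interface: takes a judging prompt, returns a score in [0, 1].
Judge = Callable[[str], float]


def evaluate_rag(query: str, chunks: list[str], answer: str, judge: Judge) -> dict[str, float]:
    """Score one RAG response on the three dimensions with an LLM-as-judge callable."""
    context = "\n".join(chunks)
    return {
        "context_relevance": judge(
            f"Rate from 0 to 1 how relevant this context is to the query.\n"
            f"Query: {query}\nContext: {context}"),
        "groundedness": judge(
            f"Rate from 0 to 1 how fully the answer is supported by the context.\n"
            f"Context: {context}\nAnswer: {answer}"),
        "answer_relevance": judge(
            f"Rate from 0 to 1 how well the answer addresses the query.\n"
            f"Query: {query}\nAnswer: {answer}"),
    }


def stub_judge(prompt: str) -> float:
    """Offline stand-in; a real judge would call an LLM and parse its score."""
    return 1.0


scores = evaluate_rag(
    "What is the refund window?",
    ["Refunds are available within 14 days."],
    "14 days.",
    stub_judge,
)
# A CI gate might fail the build when any dimension drops below a chosen threshold.
assert min(scores.values()) >= 0.7
print(scores)
```

Keeping the judge behind a plain callable makes it easy to swap in human labels for a calibration set and compare them against the LLM judge.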
8. Production Checklist
- Re-index on content change — don't let your vector store go stale
- Cache frequent queries — embedding a query is cheap; the LLM call is not
- Set a max iteration limit on agentic loops to prevent runaway costs
- Log every retrieval — debugging RAG requires knowing exactly what was retrieved
- Use hybrid search — combine vector similarity with BM25 keyword search for better recall
- Monitor chunk hit rates: if certain chunks are never retrieved, they may be chunked too large
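For the hybrid search item, one common way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). It is shown here as one option, not necessarily the method any given vector store uses. The chunk IDs and both input rankings are made-up examples, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs; each list adds 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

RRF only needs ranks, not raw scores, so it avoids having to normalize cosine similarities against BM25 scores.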