The RAG Architecture at a Glance
RAG splits into two phases. Ingestion runs offline (or on a schedule): load documents → split into chunks → embed each chunk → store in a vector database. Retrieval runs on every user query: embed the question → find the most similar chunks → inject them into the prompt sent to the chat model (Claude, OpenAI, or another LLM) → return the grounded answer.
| Phase | Components | Spring AI Class |
|---|---|---|
| Load | PDF, Word, web, text | TikaDocumentReader, WebPageDocumentReader |
| Split | Chunk by sentence/token | TokenTextSplitter |
| Embed | Convert chunks to vectors | EmbeddingModel (OpenAI, Cohere) |
| Store | Vector database | VectorStore (PGVector, Redis, Pinecone) |
| Retrieve | Similarity search | VectorStore.similaritySearch() |
| Generate | Prompt + context → LLM | QuestionAnswerAdvisor |
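To make the two phases concrete, here is a minimal in-memory sketch in plain Java, with no Spring AI involved. The "embedding" is a toy stand-in (a sparse bag-of-words count), and `ToyRag`, `ingest`, `retrieve`, and `buildPrompt` are hypothetical names chosen for illustration. A real pipeline would call an embedding model and a vector database instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToyRag {
    record Entry(String chunk, Map<String, Integer> vector) {}

    private final List<Entry> store = new ArrayList<>();

    // Toy embedding: lowercase word counts (a sparse vector keyed by word).
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                v.merge(word, 1, Integer::sum);
            }
        }
        return v;
    }

    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int count : v.values()) {
            sum += count * count;
        }
        return Math.sqrt(sum);
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a), nb = norm(b);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    // Ingestion: split (here, one chunk per sentence), embed, store.
    void ingest(String document) {
        for (String sentence : document.split("(?<=[.!?])\\s+")) {
            if (!sentence.isBlank()) {
                store.add(new Entry(sentence.trim(), embed(sentence)));
            }
        }
    }

    // Retrieval: embed the question, rank stored chunks by similarity, keep the top k.
    List<String> retrieve(String question, int topK) {
        Map<String, Integer> q = embed(question);
        return store.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(q, e.vector())).reversed())
                .limit(topK)
                .map(Entry::chunk)
                .toList();
    }

    // Generation, reduced to assembling the grounded prompt that would go to the LLM.
    String buildPrompt(String question, int topK) {
        return "Context:\n" + String.join("\n", retrieve(question, topK))
                + "\n\nQuestion: " + question;
    }

    public static void main(String[] args) {
        ToyRag rag = new ToyRag();
        rag.ingest("PGVector adds a vector type to PostgreSQL. Tika extracts text from PDF files.");
        System.out.println(rag.buildPrompt("What does Tika extract?", 1));
    }
}
```

The split, embed, store, and retrieve steps map one-to-one onto the rows of the table above; only the implementations are swapped for trivial ones.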
Dependencies
<!-- pom.xml: add to a project that imports the Spring AI BOM.
     These artifact IDs are the ones used by the Spring AI 1.0 milestone releases;
     later releases renamed the starters, so check the reference docs for your version. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    <!-- OpenAI text-embedding-3-small is a low-cost embedding model and a sensible default -->
</dependency>
PGVector: PostgreSQL as a Vector Store
PGVector extends PostgreSQL with a vector column type and ANN (approximate nearest neighbour) indices. It is a natural choice when you already run Postgres, since there is no new database to operate.
Start PGVector locally with Docker:
docker run -d --name pgvector \
  -e POSTGRES_DB=vectordb \
  -e POSTGRES_USER=app \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  pgvector/pgvector:pg16
With initialize-schema enabled, Spring AI creates the required table on startup. Configure it in application.yml:
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: app
    password: ${DB_PASSWORD}        # "secret" for the Docker container above
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true     # creates vector_store table on startup
        index-type: HNSW            # better speed/recall at query time; IVFFlat builds faster and uses less memory
        distance-type: COSINE_DISTANCE
        dimensions: 1536            # must match your embedding model output size
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small   # 1536 dimensions, $0.02/M tokens
The dimension count in pgvector.dimensions must exactly match the embedding model's output. OpenAI text-embedding-3-small outputs 1536 dimensions by default; Cohere embed-english-v3 outputs 1024. If you change embedding models, you must re-embed all your documents, because vectors produced by different models are not comparable.
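A dimension mismatch typically surfaces only as a database error when the first vector is inserted. One way to catch it earlier is a fail-fast check at startup, sketched below in plain Java. `DimensionGuard` and `requireDimensions` are hypothetical names, and the probe array stands in for whatever your embedding model returns for a short test string.

```java
final class DimensionGuard {
    // Throws if the embedding length differs from the configured column dimension.
    static void requireDimensions(float[] probeEmbedding, int configuredDimensions) {
        if (probeEmbedding.length != configuredDimensions) {
            throw new IllegalStateException(
                    "Embedding model returned " + probeEmbedding.length
                    + " dimensions but the vector store is configured for "
                    + configuredDimensions
                    + "; fix the configuration or re-embed all documents");
        }
    }

    public static void main(String[] args) {
        requireDimensions(new float[1536], 1536); // matching sizes: no exception
        try {
            requireDimensions(new float[1024], 1536); // mismatch: rejected
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

In a Spring application the probe could come from embedding a fixed string once at startup, with the configured value read from the same property as above.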
Phase 1: Document Ingestion Pipeline
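import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.FileSystemResource;
import org.springframework.stereotype.Service;
@Service
public class DocumentIngestionService {
    private static final Logger log = LoggerFactory.getLogger(DocumentIngestionService.class);
    private final VectorStore vectorStore;
    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }
    // Ingest any file type Tika supports: PDF, Word, Excel, HTML, etc.
    public void ingest(Path filePath) {
        // 1. Load: Tika extracts the text, whatever the file format
        var reader = new TikaDocumentReader(new FileSystemResource(filePath));
        List<Document> docs = reader.get();
        // 2. Enrich metadata before splitting
        docs.forEach(doc -> {
            doc.getMetadata().put("source", filePath.getFileName().toString());
            doc.getMetadata().put("ingestedAt", Instant.now().toString());
        });
        // 3. Split into chunks of roughly 512 tokens. None of these builder options
        //    sets an overlap: withMinChunkSizeChars(50) is a minimum chunk size in characters.
        var splitter = TokenTextSplitter.builder()
            .withChunkSize(512)
            .withMinChunkSizeChars(50)
            .withMinChunkLengthToEmbed(5)
            .withMaxNumChunks(10000)
            .withKeepSeparator(true)
            .build();
        List<Document> chunks = splitter.apply(docs);
        // 4. Embed + store: Spring AI calls the embedding model and stores the vectors
        vectorStore.add(chunks);
        log.info("Ingested {} chunks from {}", chunks.size(), filePath.getFileName());
    }
    // Ingest a directory of documents
    public void ingestDirectory(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile)
                .filter(f -> isSupportedFormat(f.getFileName().toString()))
                .forEach(this::ingest);
        }
    }
    // Hypothetical helper (not part of the original listing): adjust the extensions to your corpus
    private boolean isSupportedFormat(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        return name.endsWith(".pdf") || name.endsWith(".docx")
            || name.endsWith(".html") || name.endsWith(".txt");
    }
}
Phase 2: Query with QuestionAnswerAdvisor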
QuestionAnswerAdvisor is Spring AI's built-in RAG advisor. It intercepts the prompt, retrieves relevant chunks from the vector store, and injects them as context before calling the LLM:
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
@Service
public class RagService {
    private final ChatClient chatClient;
    public RagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("""
                You are a helpful assistant. Answer questions based ONLY on the
                provided context. If the context doesn't contain the answer,
                say "I don't have that information in my knowledge base."
                Do not make up information.
                """)
            .defaultAdvisors(
                new QuestionAnswerAdvisor(vectorStore,
                    SearchRequest.defaults()
                        .withTopK(5)                  // retrieve the 5 most similar chunks
                        .withSimilarityThreshold(0.7) // reject low-similarity matches
                )
            )
            .build();
    }
    public String answer(String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }
    // Filter by metadata, e.g. only search within a specific document
    public String answerFromSource(String question, String sourceFileName) {
        // The filter is built by string concatenation, so reject quotes that could alter it
        if (sourceFileName.contains("'")) {
            throw new IllegalArgumentException("Invalid source file name");
        }
        return chatClient.prompt()
            .user(question)
            .advisors(a -> a.param(
                QuestionAnswerAdvisor.FILTER_EXPRESSION,
                "source == '" + sourceFileName + "'"
            ))
            .call()
            .content();
    }
}
REST API for Ingestion and Q&A
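@RestController
@RequestMapping("/api/rag")
public class RagController {
    private final DocumentIngestionService ingestionService;
    private final RagService ragService;
    public RagController(DocumentIngestionService ingestionService, RagService ragService) {
        this.ingestionService = ingestionService;
        this.ragService = ragService;
    }
    @PostMapping("/ingest")
    public ResponseEntity<Map<String, String>> ingest(
            @RequestParam MultipartFile file) throws IOException {
        // The client-supplied name may be null or contain path separators, so sanitize it.
        // Note: the "source" metadata will record this temp file name, not the original name.
        String original = file.getOriginalFilename();
        String safeName = (original == null || original.isBlank())
            ? "upload"
            : original.replaceAll("[^A-Za-z0-9._-]", "_");
        Path temp = Files.createTempFile("upload-", "-" + safeName);
        try {
            file.transferTo(temp);
            ingestionService.ingest(temp);
        } finally {
            Files.deleteIfExists(temp); // clean up even if ingestion fails
        }
        return ResponseEntity.ok(Map.of("status", "ingested"));
    }
    @PostMapping("/ask")
    public AnswerResponse ask(@RequestBody QuestionRequest request) {
        String answer = ragService.answer(request.question());
        return new AnswerResponse(answer);
    }
    record QuestionRequest(String question) {}
    record AnswerResponse(String answer) {}
}
Manual Retrieval: Inspect What Was Found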
For debugging, or for building a custom UI that shows sources, you can call the vector store directly:
public RagResult answerWithSources(String question) {
    // Embed the question and retrieve similar chunks
    List<Document> context = vectorStore.similaritySearch(
        SearchRequest.query(question)
            .withTopK(5)
            .withSimilarityThreshold(0.65)
    );
    if (context.isEmpty()) {
        return new RagResult("No relevant information found.", List.of());
    }
    // Build context string with chunk metadata
    String contextText = context.stream()
        .map(doc -> "[Source: " + doc.getMetadata().get("source") + "]\n"
            + doc.getContent())
        .collect(Collectors.joining("\n\n---\n\n"));
    // Call the LLM with the retrieved context. If this chatClient has QuestionAnswerAdvisor
    // as a default advisor (as in RagService), retrieval runs a second time; use a
    // ChatClient built without it here.
    String answer = chatClient.prompt()
        .system("Answer based on this context:\n\n" + contextText)
        .user(question)
        .call()
        .content();
    // Return answer + source citations
    List<String> sources = context.stream()
        .map(doc -> (String) doc.getMetadata().get("source"))
        .distinct()
        .collect(Collectors.toList());
    return new RagResult(answer, sources);
}
record RagResult(String answer, List<String> sources) {}
Chunk Size Strategy: The Most Important RAG Parameter
Chunk size is often the parameter with the largest impact on retrieval quality:
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| 128–256 tokens | High precision, specific matches | Missing context around the match | FAQ lookup, factual Q&A |
| 512 tokens | Good balance (sweet spot) | – | General-purpose RAG |
| 1024+ tokens | Full context preserved | Noisy retrieval, higher cost | Long-form analysis |
Overlap: An overlap of roughly 10% between chunks (e.g., the last 50 tokens of chunk N become the first 50 tokens of chunk N+1) reduces the risk that an answer is split across a chunk boundary.
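None of the TokenTextSplitter builder options used earlier sets an overlap, so overlapping chunks need a custom splitting step. The sliding-window idea can be sketched in plain Java as follows. This sketch treats a list of strings as the token sequence (a real implementation would count tokens with the model's tokenizer), and `OverlapChunker` is a hypothetical helper, not a Spring AI class.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapChunker {
    // Splits tokens into windows of chunkSize; each window repeats the last
    // `overlap` tokens of the previous window.
    static List<List<String>> chunk(List<String> tokens, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(List.copyOf(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window reached the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        // prints [[a, b, c, d], [d, e, f, g], [g, h, i, j]]
        System.out.println(chunk(tokens, 4, 1));
    }
}
```

With a chunk size of 512 and an overlap of 50 the step is 462 tokens, so about 10% of each chunk is duplicated, and embedding and storage cost grow by roughly the same proportion.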
Hybrid Search: Combining Semantic and Keyword
Semantic search alone can miss exact matches on literal strings such as product codes and version numbers. Combining vector search with PostgreSQL full-text search gives better coverage:
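import java.util.Arrays;
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;
@Repository
public class HybridSearchRepository {
    private final JdbcTemplate jdbc;
    private final EmbeddingModel embeddingModel;
    public HybridSearchRepository(JdbcTemplate jdbc, EmbeddingModel embeddingModel) {
        this.jdbc = jdbc;
        this.embeddingModel = embeddingModel;
    }
    // Reciprocal Rank Fusion combines both result sets
    public List<Document> hybridSearch(String query, int topK) {
        float[] queryEmbedding = embeddingModel.embed(query);
        // pgvector accepts a text literal such as [0.1,0.2,0.3], cast with ?::vector
        String embeddingString = Arrays.toString(queryEmbedding).replace(" ", "");
        String sql = """
            WITH vector_results AS (
                SELECT id, content, metadata,
                       1 - (embedding <=> ?::vector) AS vector_score,
                       ROW_NUMBER() OVER (ORDER BY embedding <=> ?::vector) AS v_rank
                FROM vector_store
                ORDER BY embedding <=> ?::vector
                LIMIT ?
            ),
            text_results AS (
                -- A window function cannot reference an alias defined in its own
                -- SELECT list, so the score is computed in a subquery first.
                SELECT id, content, metadata, text_score,
                       ROW_NUMBER() OVER (ORDER BY text_score DESC) AS t_rank
                FROM (
                    SELECT id, content, metadata,
                           ts_rank(to_tsvector('english', content),
                                   plainto_tsquery('english', ?)) AS text_score
                    FROM vector_store
                    WHERE to_tsvector('english', content) @@
                          plainto_tsquery('english', ?)
                ) scored
                ORDER BY text_score DESC
                LIMIT ?
            )
            SELECT COALESCE(v.id, t.id) AS id,
                   COALESCE(v.content, t.content) AS content,
                   COALESCE(v.metadata, t.metadata) AS metadata,
                   (1.0/(60 + COALESCE(v.v_rank, 1000))
                    + 1.0/(60 + COALESCE(t.t_rank, 1000))) AS rrf_score
            FROM vector_results v
            FULL OUTER JOIN text_results t ON v.id = t.id
            ORDER BY rrf_score DESC
            LIMIT ?
            """;
        return jdbc.query(sql, (rs, rowNum) -> {
            var doc = new Document(rs.getString("content"));
            // parsing the metadata JSON column is omitted here
            return doc;
        }, embeddingString, embeddingString, embeddingString, topK,
           query, query, topK, topK);
    }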
}
- Re-ingest on document updates: delete old vectors by source metadata, then re-ingest
- Monitor retrieval quality: log the chunks retrieved for each query and review them periodically
- Similarity threshold is critical: 0.7 is a reasonable starting point; tune it for your domain
- Test with adversarial queries: what happens when the user asks something completely off-topic?
- Index selection: HNSW generally gives the better speed/recall trade-off at query time; IVFFlat builds faster and uses less memory, which can matter for very large datasets
- Embedding model consistency: all ingestion and queries must use the same embedding model
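The hybrid query shown before this checklist merges its two result sets with Reciprocal Rank Fusion, using the constant 60. The same arithmetic can be sketched in plain Java to make the scoring easier to inspect. `RankFusion` and `fuse` are hypothetical names. Unlike the SQL, which assigns a penalty rank of 1000 to a document missing from one list, this sketch simply adds nothing for a missing entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RankFusion {
    // score(id) = sum over rankings of 1 / (k + rank), where rank starts at 1.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a)); // highest score first
            return byScore != 0 ? byScore : a.compareTo(b);             // ties broken by id
        });
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("doc1", "doc2", "doc3");
        List<String> keywordHits = List.of("doc3", "doc1", "doc4");
        // prints [doc1, doc3, doc2, doc4]
        System.out.println(fuse(List.of(vectorHits, keywordHits), 60));
    }
}
```

Documents found by both searches (doc1 and doc3 here) rise above documents found by only one, which is the behaviour the hybrid query relies on.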
Scheduled Re-ingestion
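@Component
public class DocumentSyncJob {
    private static final Logger log = LoggerFactory.getLogger(DocumentSyncJob.class);
    private final DocumentIngestionService ingestionService;
    private final VectorStore vectorStore;
    public DocumentSyncJob(DocumentIngestionService ingestionService, VectorStore vectorStore) {
        this.ingestionService = ingestionService;
        this.vectorStore = vectorStore;
    }
    // Re-sync documents every night at 2am (requires @EnableScheduling on a configuration class)
    @Scheduled(cron = "0 0 2 * * *")
    public void syncDocuments() {
        var docsDir = Paths.get("/data/knowledge-base");
        // Delete vectors for stale documents. "<yesterday>" is a placeholder, and the
        // ingestion service above never writes a "lastSynced" key, so store one at
        // ingestion time (or filter on "source") before relying on this.
        // Assumption: your Spring AI version has VectorStore.delete(Filter.Expression);
        // older milestones only accept a list of document IDs.
        var stale = new FilterExpressionBuilder().eq("lastSynced", "<yesterday>").build();
        vectorStore.delete(stale);
        // Re-ingest fresh versions
        try {
            ingestionService.ingestDirectory(docsDir);
            log.info("Document sync completed");
        } catch (IOException e) {
            log.error("Document sync failed", e);
        }
    }
}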
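A property worth checking in any re-ingestion job is idempotence: running it twice should not leave duplicate chunks behind. The plain-Java sketch below shows the delete-then-insert pattern against an in-memory stand-in for the vector store. `ChunkStore`, `replaceSource`, and `countFor` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkStore {
    record Chunk(String source, String text) {}

    private final List<Chunk> chunks = new ArrayList<>();

    // Removes every chunk that came from `source`, then stores the new version.
    void replaceSource(String source, List<String> newTexts) {
        chunks.removeIf(c -> c.source().equals(source));
        for (String text : newTexts) {
            chunks.add(new Chunk(source, text));
        }
    }

    long countFor(String source) {
        return chunks.stream().filter(c -> c.source().equals(source)).count();
    }

    public static void main(String[] args) {
        ChunkStore store = new ChunkStore();
        store.replaceSource("guide.pdf", List.of("v1 chunk A", "v1 chunk B", "v1 chunk C"));
        store.replaceSource("guide.pdf", List.of("v2 chunk A", "v2 chunk B"));
        // Only the two chunks of the second version remain.
        System.out.println(store.countFor("guide.pdf"));
    }
}
```

Keying the delete on the "source" metadata that the ingestion service already writes keeps the job independent of any timestamp bookkeeping.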