The RAG Architecture at a Glance

RAG splits into two phases. Ingestion runs offline (or on a schedule): load documents → split into chunks → embed each chunk → store in a vector database. Retrieval runs on every user query: embed the question → find the most similar chunks → inject them into the Claude/OpenAI prompt → return the grounded answer.

| Phase | Components | Spring AI Class |
| --- | --- | --- |
| Load | PDF, Word, web, text | TikaDocumentReader, WebPageDocumentReader |
| Split | Chunk by sentence/token | TokenTextSplitter |
| Embed | Convert chunks to vectors | EmbeddingModel (OpenAI, Cohere) |
| Store | Vector database | VectorStore (PGVector, Redis, Pinecone) |
| Retrieve | Similarity search | VectorStore.similaritySearch() |
| Generate | Prompt + context → LLM | QuestionAnswerAdvisor |

Dependencies

<!-- pom.xml: add to a project that imports the Spring AI BOM (these are the pre-1.0 artifact names; Spring AI 1.0 renamed the starters) -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
  <!-- Provides the OpenAI chat and embedding models; text-embedding-3-small is a low-cost embedding option -->
</dependency>

PGVector: PostgreSQL as a Vector Store

PGVector extends PostgreSQL with a vector column type and ANN (approximate nearest neighbour) indices. It is a natural choice when you already run Postgres, because there is no new database to operate.

Start PGVector locally with Docker:

docker run -d --name pgvector \
  -e POSTGRES_DB=vectordb \
  -e POSTGRES_USER=app \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Spring AI creates the required table automatically when initialize-schema is true. Configure it in application.yml:

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: app
    password: ${DB_PASSWORD}
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true      # creates vector_store table on startup
        index-type: HNSW              # better query speed/recall; IVFFlat builds faster and uses less memory
        distance-type: COSINE_DISTANCE
        dimensions: 1536             # must match your embedding model output size
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small   # 1536 dimensions, $0.02/M tokens

Dimensions must match

The dimension count in pgvector.dimensions must exactly match the embedding model's output. OpenAI text-embedding-3-small outputs 1536 dimensions. Cohere embed-english-v3.0 outputs 1024. If you change embedding models, you must re-embed all your documents, because vectors produced by different models are not comparable.
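To make the dimension rule concrete, here is a minimal plain-Java sketch of cosine distance, the measure selected by COSINE_DISTANCE and computed by pgvector's `<=>` operator. The class and method names are hypothetical and this is not how pgvector or Spring AI implement it; it only shows why two vectors of different lengths cannot be compared.

```java
final class VectorMath {
    private VectorMath() {}

    /** Cosine distance = 1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors. */
    static double cosineDistance(float[] a, float[] b) {
        // Vectors from different embedding models usually fail here: lengths differ
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("Cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Even when two models happen to share a dimension count, their vectors live in different spaces, so the distance between them is meaningless. The length check only catches the obvious case.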

Phase 1: Document Ingestion Pipeline

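import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Stream;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.FileSystemResource;
import org.springframework.stereotype.Service;

@Service
public class DocumentIngestionService {

    private static final Logger log = LoggerFactory.getLogger(DocumentIngestionService.class);

    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }
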
    // Ingest any file type Tika supports: PDF, Word, Excel, HTML, etc.
    public void ingest(Path filePath) {
        // 1. Load — Tika extracts text from any format
        var reader = new TikaDocumentReader(new FileSystemResource(filePath));
        List<Document> docs = reader.get();

        // 2. Enrich metadata before splitting
        docs.forEach(doc -> {
            doc.getMetadata().put("source", filePath.getFileName().toString());
            doc.getMetadata().put("ingestedAt", Instant.now().toString());
        });

        // 3. Split: target 512 tokens per chunk. withMinChunkSizeChars(50) sets a minimum
        //    chunk size in characters; this splitter has no overlap setting
        var splitter = TokenTextSplitter.builder()
            .withChunkSize(512)
            .withMinChunkSizeChars(50)
            .withMinChunkLengthToEmbed(5)
            .withMaxNumChunks(10000)
            .withKeepSeparator(true)
            .build();

        List<Document> chunks = splitter.apply(docs);

        // 4. Embed + Store — Spring AI calls the embedding model and stores vectors
        vectorStore.add(chunks);

        log.info("Ingested {} chunks from {}", chunks.size(), filePath.getFileName());
    }

    // Ingest a directory of documents
    public void ingestDirectory(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile)
                 .filter(f -> isSupportedFormat(f.getFileName().toString()))
                 .forEach(this::ingest);
        }
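    }

    // Hypothetical helper (not shown in the original); the extension list is an assumption
    private boolean isSupportedFormat(String fileName) {
        String name = fileName.toLowerCase();
        return name.endsWith(".pdf") || name.endsWith(".docx")
            || name.endsWith(".html") || name.endsWith(".txt");
    }
}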

Phase 2: Query with QuestionAnswerAdvisor

QuestionAnswerAdvisor is Spring AI's built-in RAG advisor. It intercepts the prompt, retrieves relevant chunks from the vector store, and injects them as context before calling the LLM:

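import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;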

@Service
public class RagService {

    private final ChatClient chatClient;

    public RagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("""
                You are a helpful assistant. Answer questions based ONLY on the
                provided context. If the context doesn't contain the answer,
                say "I don't have that information in my knowledge base."
                Do not make up information.
                """)
            .defaultAdvisors(
                new QuestionAnswerAdvisor(vectorStore,
                    SearchRequest.defaults()
                        .withTopK(5)                    // retrieve 5 most similar chunks
                        .withSimilarityThreshold(0.7)   // reject low-similarity matches
                )
            )
            .build();
    }

    public String answer(String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }

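    // Filter by metadata, e.g. only search within a specific document
    public String answerFromSource(String question, String sourceFileName) {
        // The name is concatenated into the filter expression, so reject quotes
        if (sourceFileName.contains("'")) {
            throw new IllegalArgumentException("Invalid source file name");
        }
        return chatClient.prompt()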
            .user(question)
            .advisors(a -> a.param(
                QuestionAnswerAdvisor.FILTER_EXPRESSION,
                "source == '" + sourceFileName + "'"
            ))
            .call()
            .content();
    }
}
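Conceptually, the advisor's work comes down to two steps: pick the chunks worth showing to the model, then put them in front of the question. The following plain-Java sketch illustrates that flow under stated assumptions. `ScoredChunk`, `select`, and `augment` are hypothetical names, and the prompt layout is invented for illustration; Spring AI's actual template and threshold handling may differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class AdvisorSketch {
    private AdvisorSketch() {}

    /** A retrieved chunk with a similarity score in [0, 1] (hypothetical type). */
    record ScoredChunk(String text, double similarity) {}

    /** Keeps chunks at or above the threshold, best first, at most topK of them. */
    static List<ScoredChunk> select(List<ScoredChunk> candidates, int topK, double threshold) {
        return candidates.stream()
            .filter(c -> c.similarity() >= threshold)
            .sorted(Comparator.comparingDouble(ScoredChunk::similarity).reversed())
            .limit(topK)
            .toList();
    }

    /** Places the selected chunks ahead of the user's question as context. */
    static String augment(String question, List<ScoredChunk> selected) {
        String context = selected.stream()
            .map(ScoredChunk::text)
            .collect(Collectors.joining("\n---\n"));
        return "Context:\n" + context + "\n\nQuestion: " + question;
    }
}
```

Note how the two settings interact: the threshold can leave fewer than topK chunks, or none at all, which is when the system prompt's "I don't have that information" instruction matters.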

REST API for Ingestion and Q&A

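@RestController
@RequestMapping("/api/rag")
public class RagController {

    private final DocumentIngestionService ingestionService;
    private final RagService ragService;

    public RagController(DocumentIngestionService ingestionService, RagService ragService) {
        this.ingestionService = ingestionService;
        this.ragService = ragService;
    }

    @PostMapping("/ingest")
    public ResponseEntity<Map<String, String>> ingest(
            @RequestParam("file") MultipartFile file) throws IOException {
        // The client-supplied name is untrusted and may be null: keep only a plain extension
        String name = file.getOriginalFilename();
        int dot = (name == null) ? -1 : name.lastIndexOf('.');
        String ext = (dot >= 0) ? name.substring(dot + 1) : "";
        String suffix = ext.matches("[A-Za-z0-9]{1,10}") ? "." + ext : ".tmp";

        Path temp = Files.createTempFile("upload-", suffix);
        try {
            file.transferTo(temp);
            // Note: ingest() records the temp file name, not the original name, as "source"
            ingestionService.ingest(temp);
        } finally {
            Files.deleteIfExists(temp);   // clean up even if ingestion fails
        }
        return ResponseEntity.ok(Map.of("status", "ingested"));
    }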

    @PostMapping("/ask")
    public AnswerResponse ask(@RequestBody QuestionRequest request) {
        String answer = ragService.answer(request.question());
        return new AnswerResponse(answer);
    }

    record QuestionRequest(String question) {}
    record AnswerResponse(String answer) {}
}
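For a quick check of the /ask endpoint from Java, a request can be built with the JDK's own HTTP client. This is a sketch under assumptions: the base URL (for example http://localhost:8080) is not given in the original, `RagClientSketch` is a hypothetical name, and the JSON escaping covers only the common cases; a JSON library is the safer choice for real clients.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class RagClientSketch {
    private RagClientSketch() {}

    /** Minimal JSON string escaping: backslash, quote, and common whitespace escapes only. */
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\")
                       .replace("\"", "\\\"")
                       .replace("\n", "\\n")
                       .replace("\r", "\\r")
                       .replace("\t", "\\t") + "\"";
    }

    /** Builds the POST /api/rag/ask request; baseUrl is an assumption, e.g. http://localhost:8080. */
    static HttpRequest askRequest(String baseUrl, String question) {
        String body = "{\"question\": " + jsonString(question) + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/rag/ask"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }
}
```

Sending it with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` should return a JSON body with a single `answer` field, matching the AnswerResponse record.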

Manual Retrieval: Inspect What Was Found

For debugging or building custom UI that shows sources, you can call the vector store directly:

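// Assumes a service that holds vectorStore and chatClient as fields. Use a ChatClient built
// without QuestionAnswerAdvisor here, otherwise context is retrieved and injected twice.
public RagResult answerWithSources(String question) {
    // Embed the question and retrieve similar chunks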
    List<Document> context = vectorStore.similaritySearch(
        SearchRequest.query(question)
            .withTopK(5)
            .withSimilarityThreshold(0.65)
    );

    if (context.isEmpty()) {
        return new RagResult("No relevant information found.", List.of());
    }

    // Build context string with chunk metadata
    String contextText = context.stream()
        .map(doc -> "[Source: " + doc.getMetadata().get("source") + "]\n"
                  + doc.getContent())
        .collect(Collectors.joining("\n\n---\n\n"));

    // Call LLM with retrieved context
    String answer = chatClient.prompt()
        .system("Answer based on this context:\n\n" + contextText)
        .user(question)
        .call()
        .content();

    // Return answer + source citations
    List<String> sources = context.stream()
        .map(doc -> (String) doc.getMetadata().get("source"))
        .distinct()
        .collect(Collectors.toList());

    return new RagResult(answer, sources);
}

record RagResult(String answer, List<String> sources) {}

Chunk Size Strategy: The Most Important RAG Parameter

Chunk size has the single largest impact on retrieval quality:

| Chunk Size | Pros | Cons | Best For |
| --- | --- | --- | --- |
| 128–256 tokens | High precision, specific matches | Missing context around the match | FAQ lookup, factual Q&A |
| 512 tokens | Good balance (sweet spot) | | General-purpose RAG |
| 1024+ tokens | Full context preserved | Noisy retrieval, higher cost | Long-form analysis |

Overlap: A roughly 10% overlap between chunks (e.g., the last 50 tokens of chunk N become the first 50 tokens of chunk N+1) reduces the chance that an answer is split across a chunk boundary.
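The overlap idea is easy to see in a sliding-window sketch. The code below is plain Java and independent of Spring AI: `OverlapChunker` is a hypothetical name, and it treats each list element as one token, which is an assumption (real tokenizers split text differently from whitespace-separated words).

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapChunker {
    private OverlapChunker() {}

    /** Splits tokens into windows of chunkSize; each window repeats the last `overlap` tokens of the previous one. */
    static List<List<String>> chunk(List<String> tokens, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<List<String>> chunks = new ArrayList<>();
        int step = chunkSize - overlap;              // how far each window advances
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(List.copyOf(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break;                               // the last window reached the end of the input
            }
        }
        return chunks;
    }
}
```

With ten tokens, a chunk size of 4, and an overlap of 1, the windows start at positions 0, 3, and 6, so the fourth and seventh tokens each appear in two chunks. The cost of overlap is more chunks to embed and store.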

Hybrid Search: Combining Semantic and Keyword

Semantic search alone misses exact phrase matches (product codes, version numbers). Combine vector search with PostgreSQL full-text search for better coverage:

@Repository
public class HybridSearchRepository {

    private final JdbcTemplate jdbc;
    private final EmbeddingModel embeddingModel;

    public HybridSearchRepository(JdbcTemplate jdbc, EmbeddingModel embeddingModel) {
        this.jdbc = jdbc;
        this.embeddingModel = embeddingModel;
    }

    // Reciprocal Rank Fusion combines both result sets
    public List<Document> hybridSearch(String query, int topK) {
        float[] queryEmbedding = embeddingModel.embed(query);
        // pgvector parses the text form "[0.1, 0.2, ...]", which Arrays.toString produces
        String embeddingString = Arrays.toString(queryEmbedding);

        String sql = """
            WITH vector_results AS (
                SELECT id, content, metadata,
                       1 - (embedding <=> ?::vector) AS vector_score,
                       ROW_NUMBER() OVER (ORDER BY embedding <=> ?::vector) AS v_rank
                FROM vector_store
                ORDER BY embedding <=> ?::vector
                LIMIT ?
            ),
            text_results AS (
                SELECT id, content, metadata,
                       ts_rank(to_tsvector('english', content), q) AS text_score,
                       -- a window cannot reference the text_score alias, so repeat the expression
                       ROW_NUMBER() OVER (
                           ORDER BY ts_rank(to_tsvector('english', content), q) DESC) AS t_rank
                FROM vector_store, plainto_tsquery('english', ?) AS q
                WHERE to_tsvector('english', content) @@ q
                ORDER BY t_rank
                LIMIT ?
            )
            SELECT COALESCE(v.id, t.id) AS id,
                   COALESCE(v.content, t.content) AS content,
                   COALESCE(v.metadata, t.metadata) AS metadata,
                   (1.0/(60 + COALESCE(v.v_rank, 1000))
                  + 1.0/(60 + COALESCE(t.t_rank, 1000))) AS rrf_score
            FROM vector_results v
            FULL OUTER JOIN text_results t ON v.id = t.id
            ORDER BY rrf_score DESC
            LIMIT ?
            """;

        // Arguments follow the order of the ? placeholders in the SQL above
        return jdbc.query(sql, (rs, rowNum) -> {
            var doc = new Document(rs.getString("content"));
            // parse metadata from the JSON column
            return doc;
        }, embeddingString, embeddingString, embeddingString, topK,
           query, topK, topK);
    }
}
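The fusion step in the final SELECT can be easier to follow outside SQL. Here is a small plain-Java sketch of Reciprocal Rank Fusion over two ranked lists of IDs, using the same constant of 60. `ReciprocalRankFusion` is a hypothetical name, and the sketch differs slightly from the query: an ID missing from one list contributes nothing for that list, whereas the SQL assigns it a fallback rank of 1000.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReciprocalRankFusion {
    private ReciprocalRankFusion() {}

    /** Scores each ID as the sum over rankings of 1 / (k + rank), with rank starting at 1. */
    static List<String> fuse(List<List<String>> rankings, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
            // Highest fused score first; ties broken by ID so the output is deterministic
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()))
            .limit(topK)
            .map(Map.Entry::getKey)
            .toList();
    }
}
```

Because only ranks enter the formula, the vector score and the text score never need to be put on a common scale. A document that appears near the top of both lists beats one that tops only a single list.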

Production RAG Checklist

  • Re-ingest on document updates: delete old vectors by source metadata, then re-ingest
  • Monitor retrieval quality: log the chunks retrieved for each query and review them periodically
  • Tune the similarity threshold: 0.7 is a reasonable starting point; adjust it for your domain
  • Test with adversarial queries: check what happens when the user asks something completely off-topic
  • Index selection: HNSW generally gives better query speed and recall; IVFFlat builds faster and uses less memory, which can matter for very large datasets
  • Embedding model consistency: ingestion and queries must use the same embedding model

Scheduled Re-ingestion

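@Component
public class DocumentSyncJob {

    private static final Logger log = LoggerFactory.getLogger(DocumentSyncJob.class);

    private final DocumentIngestionService ingestionService;
    private final JdbcTemplate jdbc;

    public DocumentSyncJob(DocumentIngestionService ingestionService, JdbcTemplate jdbc) {
        this.ingestionService = ingestionService;
        this.jdbc = jdbc;
    }

    // Re-sync documents every night at 2am (requires @EnableScheduling on a configuration class)
    @Scheduled(cron = "0 0 2 * * *")
    public void syncDocuments() {
        var docsDir = Paths.get("/data/knowledge-base");
        String cutoff = Instant.now().toString();

        // Delete vectors for stale documents. Assumption: every stored chunk comes from docsDir
        // and carries the "ingestedAt" metadata set during ingestion; plain SQL is used because
        // deleting by metadata filter depends on the Spring AI version.
        int removed = jdbc.update("""
            DELETE FROM vector_store
            WHERE (metadata->>'ingestedAt')::timestamptz < ?::timestamptz
            """, cutoff);

        // Re-ingest fresh versions
        try {
            ingestionService.ingestDirectory(docsDir);
            log.info("Document sync completed ({} stale chunks removed)", removed);
        } catch (IOException e) {
            log.error("Document sync failed for {}", docsDir, e);
        }
    }
}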