RAG with Spring AI: PGVector, Embeddings & Smart Retrieval

Retrieval-Augmented Generation (RAG) is the most practical pattern for grounding LLM responses in your own data. Instead of fine-tuning (expensive, slow to update), RAG retrieves relevant document chunks at query time and includes them in the prompt. Spring AI provides a complete ETL pipeline: document readers, transformers, vector stores, and the QuestionAnswerAdvisor that wires it all together. This guide builds a production-ready RAG service from scratch.

The RAG Architecture at a Glance

RAG splits into two phases. Ingestion runs offline (or on a schedule): load documents → split into chunks → embed each chunk → store in a vector database. Retrieval runs on every user query: embed the question → find the most similar chunks → inject them into the LLM prompt → return the grounded answer.

Phase	Components	Spring AI Class
Load	PDF, Word, web, text	`TikaDocumentReader`, `WebPageDocumentReader`
Split	Chunk by sentence/token	`TokenTextSplitter`
Embed	Convert chunks to vectors	`EmbeddingModel` (OpenAI, Cohere)
Store	Vector database	`VectorStore` (PGVector, Redis, Pinecone)
Retrieve	Similarity search	`VectorStore.similaritySearch()`
Generate	Prompt + context → LLM	`QuestionAnswerAdvisor`
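
To make the two phases concrete, here is a minimal in-memory sketch in plain Java. It is not how Spring AI or PGVector work internally; `ToyVectorIndex`, its methods, and the hand-made three-dimensional vectors are all illustrative assumptions standing in for a real embedding model and vector store.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy in-memory stand-in for a vector store; the embeddings are supplied by the caller. */
final class ToyVectorIndex {

    record Entry(String text, float[] vector) {}
    record Hit(String text, double similarity) {}

    private final List<Entry> entries = new ArrayList<>();

    // Ingestion phase: store each chunk next to its embedding
    void add(String text, float[] vector) {
        entries.add(new Entry(text, vector));
    }

    // Retrieval phase: rank stored chunks by cosine similarity to the query vector
    List<Hit> search(float[] query, int topK, double threshold) {
        return entries.stream()
            .map(e -> new Hit(e.text(), cosine(query, e.vector())))
            .filter(h -> h.similarity() >= threshold)
            .sorted(Comparator.comparingDouble(Hit::similarity).reversed())
            .limit(topK)
            .toList();
    }

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        var index = new ToyVectorIndex();
        index.add("Refunds are processed within 5 days", new float[] {1f, 0f, 0f});
        index.add("Shipping takes 2 days", new float[] {0f, 1f, 0f});
        index.add("Refund requests need an order number", new float[] {0.8f, 0.6f, 0f});
        // Hand-made query vector pointing mostly in the "refund" direction
        index.search(new float[] {1f, 0.1f, 0f}, 2, 0.7)
             .forEach(h -> System.out.println(h.text() + " -> " + h.similarity()));
    }
}
```

The `topK` and `threshold` arguments play the same role as the top-K and similarity-threshold settings used later in this guide: the first caps how many chunks reach the prompt, the second drops weak matches.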

Dependencies

<!-- pom.xml — add to your Spring AI BOM project -->
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
  <!-- OpenAI text-embedding-3-small is the most cost-effective embedding model -->
</dependency>

PGVector: PostgreSQL as a Vector Store

PGVector extends PostgreSQL with a vector column type and ANN (approximate nearest neighbour) indices. It is a natural choice when you already run Postgres, since there is no new database to operate.

Start PGVector locally with Docker:

docker run -d --name pgvector \
  -e POSTGRES_DB=vectordb \
  -e POSTGRES_USER=app \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  pgvector/pgvector:pg16

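With initialize-schema set to true, Spring AI creates the required table on startup. Configure in application.yml (the DB_PASSWORD environment variable must match the POSTGRES_PASSWORD passed to the container above):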

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: app
    password: ${DB_PASSWORD}
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true      # creates vector_store table on startup
        index-type: HNSW              # better speed/recall at query time; IVFFlat builds faster and uses less memory
        distance-type: COSINE_DISTANCE
        dimensions: 1536             # must match your embedding model output size
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small   # 1536 dimensions, $0.02/M tokens

Dimensions must match

The dimension count in pgvector.dimensions must exactly match the embedding model's output. OpenAI text-embedding-3-small outputs 1536 dimensions. Cohere embed-english-v3 outputs 1024. If you change embedding models, you must re-embed all your documents — the old vectors are incompatible.
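
A mismatch is easier to diagnose when it fails fast. The sketch below is a plain-Java guard that could be run once at startup against a probe embedding; `DimensionGuard` and `check` are hypothetical names, not part of Spring AI.

```java
/** Fails fast when an embedding's length differs from the configured vector column size. */
final class DimensionGuard {

    static void check(float[] embedding, int configuredDimensions) {
        if (embedding.length != configuredDimensions) {
            throw new IllegalStateException(
                "Embedding has " + embedding.length + " dimensions but the vector store expects "
                + configuredDimensions + "; re-embed the documents or fix the configuration");
        }
    }

    public static void main(String[] args) {
        check(new float[1536], 1536);       // passes silently
        try {
            check(new float[1024], 1536);   // e.g. a 1024-dimension model against a 1536 column
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```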

Phase 1: Document Ingestion Pipeline

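import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Stream;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.FileSystemResource;
import org.springframework.stereotype.Service;

@Service
public class DocumentIngestionService {

    private static final Logger log = LoggerFactory.getLogger(DocumentIngestionService.class);

    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }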

    // Ingest any file type Tika supports: PDF, Word, Excel, HTML, etc.
    public void ingest(Path filePath) {
        // 1. Load — Tika extracts text from any format
        var reader = new TikaDocumentReader(new FileSystemResource(filePath));
        List<Document> docs = reader.get();

        // 2. Enrich metadata before splitting
        docs.forEach(doc -> {
            doc.getMetadata().put("source", filePath.getFileName().toString());
            doc.getMetadata().put("ingestedAt", Instant.now().toString());
        });

        // 3. Split: chunks of up to 512 tokens (minChunkSizeChars is a minimum
        //    chunk size in characters, not an overlap; no overlap is configured here)
        var splitter = TokenTextSplitter.builder()
            .withChunkSize(512)
            .withMinChunkSizeChars(50)
            .withMinChunkLengthToEmbed(5)
            .withMaxNumChunks(10000)
            .withKeepSeparator(true)
            .build();

        List<Document> chunks = splitter.apply(docs);

        // 4. Embed + Store — Spring AI calls the embedding model and stores vectors
        vectorStore.add(chunks);

        log.info("Ingested {} chunks from {}", chunks.size(), filePath.getFileName());
    }

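    // Ingest a directory of documents
    public void ingestDirectory(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile)
                 .filter(f -> isSupportedFormat(f.getFileName().toString()))
                 .forEach(this::ingest);
        }
    }

    // Illustrative extension allow-list (an assumption; adjust to the formats you ingest)
    private boolean isSupportedFormat(String fileName) {
        return fileName.toLowerCase().matches(".*\\.(pdf|docx?|xlsx?|html?|txt|md)");
    }
}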

Phase 2: Query with QuestionAnswerAdvisor

QuestionAnswerAdvisor is Spring AI's built-in RAG advisor. It intercepts the prompt, retrieves relevant chunks from the vector store, and injects them as context before calling the LLM:

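import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;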

@Service
public class RagService {

    private final ChatClient chatClient;

    public RagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("""
                You are a helpful assistant. Answer questions based ONLY on the
                provided context. If the context doesn't contain the answer,
                say "I don't have that information in my knowledge base."
                Do not make up information.
                """)
            .defaultAdvisors(
                new QuestionAnswerAdvisor(vectorStore,
                    SearchRequest.defaults()
                        .withTopK(5)                    // retrieve 5 most similar chunks
                        .withSimilarityThreshold(0.7)   // reject low-similarity matches
                )
            )
            .build();
    }

    public String answer(String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }

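    // Filter by metadata, e.g. only search within a specific document.
    // sourceFileName is concatenated into the filter expression, so validate it first
    // (for example against the list of ingested sources) if it comes from user input.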
    public String answerFromSource(String question, String sourceFileName) {
        return chatClient.prompt()
            .user(question)
            .advisors(a -> a.param(
                QuestionAnswerAdvisor.FILTER_EXPRESSION,
                "source == '" + sourceFileName + "'"
            ))
            .call()
            .content();
    }
}
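
Because answerFromSource builds its filter by string concatenation, a file name containing a quote could change the meaning of the expression. One conservative option, sketched here in plain Java, is to reject such input before building the string. `SourceFilter` and `forSource` are hypothetical helpers, and the rejection rule is an assumption rather than Spring AI's own escaping mechanism.

```java
/** Builds the source filter string used above, rejecting values that could alter the expression. */
final class SourceFilter {

    static String forSource(String sourceFileName) {
        if (sourceFileName == null || sourceFileName.isBlank()) {
            throw new IllegalArgumentException("Source file name must not be empty");
        }
        // A quote or backslash could terminate the literal early and change the filter's meaning
        if (sourceFileName.contains("'") || sourceFileName.contains("\\")) {
            throw new IllegalArgumentException("Unsupported character in source file name");
        }
        return "source == '" + sourceFileName + "'";
    }

    public static void main(String[] args) {
        System.out.println(forSource("handbook.pdf"));   // prints: source == 'handbook.pdf'
        try {
            forSource("x' || source != '");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting is stricter than escaping, but it avoids depending on how the filter parser treats escape sequences.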

REST API for Ingestion and Q&A

@RestController
@RequestMapping("/api/rag")
public class RagController {

    private final DocumentIngestionService ingestionService;
    private final RagService ragService;

    @PostMapping("/ingest")
    public ResponseEntity<Map<String, String>> ingest(
            @RequestParam MultipartFile file) throws IOException {
        // The original filename is client-supplied; sanitize it before relying on it in real use
        Path temp = Files.createTempFile("upload-", file.getOriginalFilename());
        try {
            file.transferTo(temp);
            ingestionService.ingest(temp);
        } finally {
            // Remove the temp file even when ingestion fails
            Files.deleteIfExists(temp);
        }
        return ResponseEntity.ok(Map.of("status", "ingested"));
    }

    @PostMapping("/ask")
    public AnswerResponse ask(@RequestBody QuestionRequest request) {
        String answer = ragService.answer(request.question());
        return new AnswerResponse(answer);
    }

    record QuestionRequest(String question) {}
    record AnswerResponse(String answer) {}
}

Manual Retrieval: Inspect What Was Found

For debugging or building custom UI that shows sources, you can call the vector store directly:

public RagResult answerWithSources(String question) {
    // Embed the question and retrieve similar chunks
    List<Document> context = vectorStore.similaritySearch(
        SearchRequest.query(question)
            .withTopK(5)
            .withSimilarityThreshold(0.65)
    );

    if (context.isEmpty()) {
        return new RagResult("No relevant information found.", List.of());
    }

    // Build context string with chunk metadata
    String contextText = context.stream()
        .map(doc -> "[Source: " + doc.getMetadata().get("source") + "]\n"
                  + doc.getContent())
        .collect(Collectors.joining("\n\n---\n\n"));

    // Call LLM with retrieved context
    String answer = chatClient.prompt()
        .system("Answer based on this context:\n\n" + contextText)
        .user(question)
        .call()
        .content();

    // Return answer + source citations
    List<String> sources = context.stream()
        .map(doc -> (String) doc.getMetadata().get("source"))
        .distinct()
        .collect(Collectors.toList());

    return new RagResult(answer, sources);
}

record RagResult(String answer, List<String> sources) {}

Chunk Size Strategy: The Most Important RAG Parameter

Chunk size is often the parameter with the largest impact on retrieval quality:

Chunk Size	Pros	Cons	Best For
128–256 tokens	High precision, specific matches	Missing context around the match	FAQ lookup, factual Q&A
512 tokens	Good balance (sweet spot)	–	General-purpose RAG
1024+ tokens	Full context preserved	Noisy retrieval, higher cost	Long-form analysis

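Overlap: A 10% overlap between chunks (e.g., the last 50 tokens of chunk N become the first 50 tokens of chunk N+1) reduces the chance that an answer is split across a chunk boundary. The TokenTextSplitter configuration shown earlier does not set an overlap, so overlap has to be added separately if you need it.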
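
A sliding window with overlap can be sketched in a few lines of plain Java. In this sketch whitespace-separated words stand in for model tokens, which is a simplifying assumption; `OverlapChunker` is a hypothetical name and not a Spring AI class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sliding-window chunker; whitespace-separated words stand in for model tokens. */
final class OverlapChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Need chunkSize > overlap >= 0");
        }
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        if (words.length == 1 && words[0].isEmpty()) {
            return chunks; // empty input produces no chunks
        }
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10 words, windows of 4 words with 1 word of overlap
        chunk("a b c d e f g h i j", 4, 1).forEach(System.out::println);
    }
}
```

With these inputs each chunk starts with the last word of the previous one, so a sentence that straddles a boundary appears intact in at least one chunk as long as it is shorter than the overlap.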

Hybrid Search: Combining Semantic and Keyword

Semantic search alone misses exact phrase matches (product codes, version numbers). Combine vector search with PostgreSQL full-text search for better coverage:

@Repository
public class HybridSearchRepository {

    private final JdbcTemplate jdbc;
    private final EmbeddingModel embeddingModel;

    // Reciprocal Rank Fusion combines both result sets
    // Needs java.util.stream.Collectors and java.util.stream.IntStream
    public List<Document> hybridSearch(String query, int topK) {
        float[] queryEmbedding = embeddingModel.embed(query);

        // pgvector accepts a text literal such as "[0.1,0.2,...]" cast to ::vector
        String embeddingString = IntStream.range(0, queryEmbedding.length)
            .mapToObj(i -> Float.toString(queryEmbedding[i]))
            .collect(Collectors.joining(",", "[", "]"));

        String sql = """
            WITH vector_results AS (
                SELECT id, content, metadata,
                       1 - (embedding <=> ?::vector) AS vector_score,
                       ROW_NUMBER() OVER (ORDER BY embedding <=> ?::vector) AS v_rank
                FROM vector_store
                ORDER BY embedding <=> ?::vector
                LIMIT ?
            ),
            text_results AS (
                -- A window function cannot reference a column alias from the same SELECT,
                -- so the ts_rank expression is written out inside ROW_NUMBER()
                SELECT id, content, metadata,
                       ROW_NUMBER() OVER (ORDER BY ts_rank(
                           to_tsvector('english', content),
                           plainto_tsquery('english', ?)) DESC) AS t_rank
                FROM vector_store
                WHERE to_tsvector('english', content) @@
                      plainto_tsquery('english', ?)
                ORDER BY t_rank
                LIMIT ?
            )
            SELECT COALESCE(v.id, t.id) AS id,
                   COALESCE(v.content, t.content) AS content,
                   COALESCE(v.metadata, t.metadata) AS metadata,
                   (1.0/(60 + COALESCE(v.v_rank, 1000))
                  + 1.0/(60 + COALESCE(t.t_rank, 1000))) AS rrf_score
            FROM vector_results v
            FULL OUTER JOIN text_results t ON v.id = t.id
            ORDER BY rrf_score DESC
            LIMIT ?
            """;

        // Parameter order follows the placeholders: 3x embedding, topK, 2x query, topK, topK
        return jdbc.query(sql, (rs, rowNum) -> {
            var doc = new Document(rs.getString("content"));
            // parse metadata from the metadata column
            return doc;
        }, embeddingString, embeddingString, embeddingString, topK,
           query, query, topK, topK);
    }
}
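
The fusion step is easier to reason about outside SQL. The following plain-Java sketch applies the same formula as the query above, 1/(60 + rank) summed over both lists with a rank of 1000 for a missing entry. `ReciprocalRankFusion` and the document IDs are illustrative assumptions.

```java
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Reciprocal Rank Fusion over two ranked ID lists, using the same constants as the SQL above. */
final class ReciprocalRankFusion {

    private static final int K = 60;              // smoothing constant from the query
    private static final int MISSING_RANK = 1000; // rank assumed when a list lacks the ID

    static double score(int vectorRank, int textRank) {
        return 1.0 / (K + vectorRank) + 1.0 / (K + textRank);
    }

    // Ranks are 1-based positions in each list, matching ROW_NUMBER()
    static List<String> fuse(List<String> vectorIds, List<String> textIds, int topK) {
        Set<String> all = new LinkedHashSet<>(vectorIds);
        all.addAll(textIds);
        return all.stream()
            .sorted(Comparator.comparingDouble(
                (String id) -> -score(rank(vectorIds, id), rank(textIds, id))))
            .limit(topK)
            .toList();
    }

    private static int rank(List<String> ids, String id) {
        int index = ids.indexOf(id);
        return index < 0 ? MISSING_RANK : index + 1;
    }

    public static void main(String[] args) {
        List<String> byVector = List.of("doc-a", "doc-b", "doc-c");
        List<String> byText = List.of("doc-c", "doc-a", "doc-d");
        System.out.println(fuse(byVector, byText, 3)); // prints [doc-a, doc-c, doc-b]
    }
}
```

Documents that appear in both lists (doc-a, doc-c) outrank those found by only one search, which is the behaviour hybrid search relies on for exact-match terms such as product codes.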

Production RAG Checklist

Re-ingest on document updates — delete old vectors by source metadata, then re-ingest
Monitor retrieval quality — log the chunks retrieved for each query; review periodically
Similarity threshold is critical — 0.7 is a reasonable starting point; tune based on your domain
Test with adversarial queries — what happens when the user asks something completely off-topic?
Index selection — HNSW generally gives the better speed/recall trade-off at query time; IVFFlat builds faster and uses less memory, which can matter for very large datasets
Embedding model consistency — all ingestion and queries must use the same embedding model
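
For the threshold item in the checklist above, one simple way to tune is to hand-label a sample of logged retrievals and count what each candidate threshold would keep. The sketch below does that in plain Java; `ThresholdSweep`, its records, and the scores and labels are made up for illustration.

```java
import java.util.List;

/** Counts what a similarity threshold keeps, given hand-labelled retrieval results. */
final class ThresholdSweep {

    record Labelled(double similarity, boolean relevant) {}
    record Outcome(double threshold, long relevantKept, long irrelevantKept) {}

    static Outcome evaluate(List<Labelled> results, double threshold) {
        long relevantKept = results.stream()
            .filter(r -> r.similarity() >= threshold && r.relevant()).count();
        long irrelevantKept = results.stream()
            .filter(r -> r.similarity() >= threshold && !r.relevant()).count();
        return new Outcome(threshold, relevantKept, irrelevantKept);
    }

    public static void main(String[] args) {
        // Made-up scores and labels, purely for illustration
        List<Labelled> logged = List.of(
            new Labelled(0.91, true), new Labelled(0.78, true),
            new Labelled(0.74, false), new Labelled(0.66, true),
            new Labelled(0.58, false));
        for (double t : new double[] {0.6, 0.7, 0.8}) {
            System.out.println(evaluate(logged, t));
        }
    }
}
```

Raising the threshold drops irrelevant chunks but eventually drops relevant ones too, so the right value depends on which error costs more in your domain.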

Scheduled Re-ingestion

@Component
public class DocumentSyncJob {

    private final DocumentIngestionService ingestionService;
    private final VectorStore vectorStore;

    // Re-sync documents every night at 2am
    @Scheduled(cron = "0 0 2 * * *")
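    public void syncDocuments() {
        var docsDir = Paths.get("/data/knowledge-base");

        // Delete vectors for stale documents. "<yesterday>" is a placeholder for the
        // cutoff value, and "lastSynced" must be written to chunk metadata at ingestion.
        // Assumes a Spring AI version whose VectorStore offers delete(Filter.Expression);
        // older versions only accept a list of document IDs.
        vectorStore.delete(
            new FilterExpressionBuilder().eq("lastSynced", "<yesterday>").build()
        );

        try {
            // Re-ingest fresh versions
            ingestionService.ingestDirectory(docsDir);
            log.info("Document sync completed");
        } catch (IOException e) {
            log.error("Document sync failed for {}", docsDir, e);
        }
    }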
}
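
The staleness cutoff in the job above is left as a placeholder. One way to make it concrete, assuming chunks carry the ISO-8601 `ingestedAt` value written during ingestion, is a small age check like the sketch below. `StaleCheck` and `isStale` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

/** Decides whether a chunk's "ingestedAt" metadata value is older than a maximum age. */
final class StaleCheck {

    static boolean isStale(String ingestedAt, Instant now, Duration maxAge) {
        // Instant.parse reads the ISO-8601 text produced by Instant.toString()
        Instant ingested = Instant.parse(ingestedAt);
        return ingested.isBefore(now.minus(maxAge));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-01-02T02:00:00Z");
        System.out.println(isStale("2024-12-31T02:00:00Z", now, Duration.ofDays(1))); // true
        System.out.println(isStale("2025-01-01T12:00:00Z", now, Duration.ofDays(1))); // false
    }
}
```

Passing `now` as a parameter keeps the check deterministic and easy to test; the scheduled job would supply `Instant.now()`.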
