The AI Microservice Problem

LLM calls break several microservices assumptions:

Normal HTTP call      | LLM API call
10–100 ms latency     | 1–15 seconds latency
Fails fast on error   | Retries with backoff (adds seconds)
Stateless             | Context window must be managed
Predictable cost      | Cost varies with input/output length
Scales horizontally   | Rate-limited by provider

The patterns in this article address all five differences.
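
To make the "cost varies with input/output length" row concrete, here is a minimal sketch of a per-call cost estimate. The `TokenPricing` and `CostEstimator` names and the prices in `main` are illustrative assumptions, not real provider rates; the only idea taken from the table is that cost scales with both input and output size.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical pricing record: USD per one million tokens, split by direction.
record TokenPricing(BigDecimal inputPerMillion, BigDecimal outputPerMillion) {}

final class CostEstimator {
    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    // cost = (inputTokens * inputPrice + outputTokens * outputPrice) / 1,000,000
    static BigDecimal estimateUsd(TokenPricing pricing, long inputTokens, long outputTokens) {
        BigDecimal input = pricing.inputPerMillion().multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal output = pricing.outputPerMillion().multiply(BigDecimal.valueOf(outputTokens));
        return input.add(output).divide(ONE_MILLION, 6, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Placeholder prices for illustration only; check the provider's current price list.
        TokenPricing pricing = new TokenPricing(new BigDecimal("3.00"), new BigDecimal("15.00"));
        System.out.println(estimateUsd(pricing, 2_000, 500)); // prints 0.013500
    }
}
```

Under these assumed prices, the 500 output tokens cost more than the 2,000 input tokens, which is why response length matters as much as prompt length.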

Service Decomposition: Where Does AI Live?

The most common antipattern is putting the LLM call inside every service that needs AI. With N such services you end up with N copies of API key management, N separate retry and circuit-breaker configurations, and N separate cost-tracking implementations.

Instead, extract AI into a dedicated service:

// Recommended: dedicated AI service that other services call
┌─────────────────┐     HTTP/gRPC    ┌──────────────────────┐
│ order-service   │ ───────────────→ │  ai-service           │
│ product-service │ ───────────────→ │  - Spring AI          │
│ support-service │ ───────────────→ │  - Rate limiting      │
└─────────────────┘                  │  - Cost tracking      │
                                     │  - Circuit breaker    │
                                     │  - Prompt management  │
                                     └──────────────────────┘
                                              │
                                        Claude / OpenAI

The dedicated AI service owns the API keys, retry logic, cost monitoring, and prompt versioning. Other services call it via a typed client with their business context.

The AI Service: Internal API Design

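import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/v1/ai")
public class AiServiceController {

    // One service per AI feature; these service types are assumed to be Spring beans
    private final SummarizationService summarizationService;
    private final ClassificationService classificationService;
    private final ExtractionService extractionService;
    private final ChatService chatService;

    public AiServiceController(SummarizationService summarizationService,
                               ClassificationService classificationService,
                               ExtractionService extractionService,
                               ChatService chatService) {
        this.summarizationService = summarizationService;
        this.classificationService = classificationService;
        this.extractionService = extractionService;
        this.chatService = chatService;
    }

    // Typed endpoint for each AI feature, not a generic "send prompt" API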

    @PostMapping("/summarize")
    public SummaryResponse summarize(@RequestBody SummarizeRequest req) {
        return summarizationService.summarize(req);
    }

    @PostMapping("/classify")
    public ClassificationResponse classify(@RequestBody ClassifyRequest req) {
        return classificationService.classify(req);
    }

    @PostMapping("/extract")
    public ExtractionResponse extract(@RequestBody ExtractRequest req) {
        return extractionService.extract(req);
    }

    @PostMapping("/chat")
    public ChatResponse chat(@RequestBody ChatRequest req) {
        return chatService.respond(req);
    }
}
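
The controller refers to request and response types that the article does not show. A minimal sketch of what two of them might look like follows, assuming plain Java records. The field names `text` and `categories` are assumptions inferred from how `ClassifyRequest` is constructed by callers, and the validation rules are illustrative.

```java
import java.util.Objects;

// Hypothetical request shape: free text plus a comma-separated list of allowed labels.
record ClassifyRequest(String text, String categories) {
    ClassifyRequest {
        Objects.requireNonNull(text, "text");
        Objects.requireNonNull(categories, "categories");
        if (text.isBlank()) {
            throw new IllegalArgumentException("text must not be blank");
        }
    }
}

// Hypothetical response shape; label() is the accessor callers read.
record ClassificationResponse(String label) {}

final class RequestRecordsDemo {
    public static void main(String[] args) {
        ClassifyRequest request = new ClassifyRequest(
            "Wireless noise-cancelling headphones",
            "electronics, clothing, food, sports, home, other");
        System.out.println(request.categories());
    }
}
```

Rejecting obviously invalid input in the record constructor keeps empty prompts from reaching the provider, where they would still cost a round trip.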

Calling the AI Service: Spring Cloud OpenFeign

// In product-service: typed Feign client for the AI service
@FeignClient(name = "ai-service", url = "${services.ai.url}")
public interface AiServiceClient {

    @PostMapping("/api/v1/ai/classify")
    ClassificationResponse classifyProduct(@RequestBody ClassifyRequest request);

    @PostMapping("/api/v1/ai/extract")
    ExtractionResponse extractAttributes(@RequestBody ExtractRequest request);
}

// Usage in ProductService:
@Service
public class ProductEnrichmentService {

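    private final AiServiceClient aiClient;
    private final ProductRepository productRepo;  // assumed Spring Data repository for Product

    public ProductEnrichmentService(AiServiceClient aiClient, ProductRepository productRepo) {
        this.aiClient = aiClient;
        this.productRepo = productRepo;
    }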

    public void enrichProduct(Product product) {
        var categoryResult = aiClient.classifyProduct(
            new ClassifyRequest(
                product.getTitle() + "\n" + product.getDescription(),
                "electronics, clothing, food, sports, home, other"
            )
        );
        product.setCategory(categoryResult.label());
        productRepo.save(product);
    }
}
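
A model can return a label outside the list it was given, or the same label with different casing or whitespace. One way to guard against that before saving is sketched below. `CategoryNormalizer` is a hypothetical helper, not part of Spring or Feign, and falling back to a default category is an assumed policy.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: maps a model-produced label onto a known category set.
final class CategoryNormalizer {
    private final Set<String> allowed;
    private final String fallback;

    CategoryNormalizer(String commaSeparatedCategories, String fallback) {
        this.allowed = Arrays.stream(commaSeparatedCategories.split(","))
            .map(s -> s.trim().toLowerCase(Locale.ROOT))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toUnmodifiableSet());
        this.fallback = fallback;
    }

    // Returns the matching allowed category, or the fallback for unknown or missing labels.
    String normalize(String modelLabel) {
        if (modelLabel == null) {
            return fallback;
        }
        String candidate = modelLabel.trim().toLowerCase(Locale.ROOT);
        return allowed.contains(candidate) ? candidate : fallback;
    }

    public static void main(String[] args) {
        CategoryNormalizer normalizer = new CategoryNormalizer(
            "electronics, clothing, food, sports, home, other", "other");
        System.out.println(normalizer.normalize(" Electronics "));
        System.out.println(normalizer.normalize("gadgets"));
    }
}
```

In `enrichProduct`, the result of `categoryResult.label()` could be passed through `normalize` before `setCategory`, so an unexpected label never reaches the database.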

Resilience4j: Circuit Breaker for LLM Calls

LLM providers have incidents. Without a circuit breaker, a provider outage cascades: all threads waiting for the LLM time out, thread pools exhaust, your service falls over. Resilience4j prevents this:

# application.yml — in the ai-service
resilience4j:
  circuitbreaker:
    instances:
      anthropic:
        registerHealthIndicator: true
        slidingWindowSize: 10           # evaluate last 10 calls
        failureRateThreshold: 50         # open at 50% failure rate
        waitDurationInOpenState: 30s     # try again after 30s
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 10s   # treat calls >10s as slow
        slowCallRateThreshold: 80         # open if 80% are slow
      openai:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s

@Service
public class ResilientAiService {

    private static final Logger log = LoggerFactory.getLogger(ResilientAiService.class);  // SLF4J

    private final ChatClient anthropicClient;
    private final ChatClient openAiClient;

    @CircuitBreaker(name = "anthropic", fallbackMethod = "fallbackToOpenAi")
    public String complete(String prompt) {
        return anthropicClient.prompt()
            .user(prompt)
            .call()
            .content();
    }

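    // Resilience4j invokes this fallback whenever complete() throws, including the
    // CallNotPermittedException raised while the circuit is open.
    // Note: this call is not guarded by the "openai" circuit breaker configured above.
    private String fallbackToOpenAi(String prompt, Throwable t) {
        log.warn("Anthropic call failed or circuit open ({}), falling back to OpenAI", t.getMessage());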
        return openAiClient.prompt()
            .user(prompt)
            .call()
            .content();
    }
}
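
To make the closed, open, and half-open states concrete, here is a deliberately simplified, standard-library-only sketch of a count-based breaker. It is not Resilience4j's implementation: the class name, the rule that a single trial call decides the half-open outcome, and the injected time source are all assumptions made to keep the example short.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold;   // percent, e.g. 50.0
    private final Duration waitInOpen;
    private final Supplier<Instant> now;         // injectable clock for testing
    private final Deque<Boolean> window = new ArrayDeque<>();  // true = failed call
    private State state = State.CLOSED;
    private Instant openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         Duration waitInOpen, Supplier<Instant> now) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpen = waitInOpen;
        this.now = now;
    }

    synchronized State state() { return state; }

    // Runs primary unless the circuit is open; uses fallback on rejection or failure.
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!tryAcquirePermission()) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquirePermission() {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, now.get()).compareTo(waitInOpen) < 0) {
                return false;                    // still cooling down: reject immediately
            }
            state = State.HALF_OPEN;             // wait elapsed: allow a trial call
        }
        return true;
    }

    private synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return;                              // late result from an in-flight call
        }
        if (state == State.HALF_OPEN) {
            if (failed) { open(); } else { state = State.CLOSED; window.clear(); }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if (failures * 100.0 / windowSize >= failureRateThreshold) {
                open();
            }
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = now.get();
        window.clear();
    }
}
```

The point for LLM calls is the rejection branch: while the circuit is open, callers get the fallback immediately instead of holding a thread for a multi-second timeout.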

Async AI Processing with Kafka

For non-real-time AI tasks (document analysis, batch classification, email summarization), a Kafka-based async pipeline is far better than synchronous HTTP. The calling service publishes a request and gets on with other work; the AI service processes at its own rate.

// Producer: product-service sends enrichment requests
@Service
public class ProductPublisher {

    private final KafkaTemplate<String, ProductEnrichmentRequest> kafkaTemplate;

    public void requestEnrichment(Product product) {
        kafkaTemplate.send("ai.product.enrich",
            product.getId(),
            new ProductEnrichmentRequest(
                product.getId(),
                product.getTitle(),
                product.getDescription()
            )
        );
    }
}

// Consumer: ai-service processes enrichment requests
@Service
public class ProductEnrichmentConsumer {

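    private static final Logger log = LoggerFactory.getLogger(ProductEnrichmentConsumer.class);  // SLF4J

    private final ChatClient chatClient;
    private final KafkaTemplate<String, ProductEnrichmentResult> resultTemplate;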

    @KafkaListener(topics = "ai.product.enrich", groupId = "ai-enrichment")
    public void enrich(ProductEnrichmentRequest request) {
        try {
            ClassificationResult result = chatClient.prompt()
                .system("Classify the product into exactly one category: "
                      + "electronics, clothing, food, sports, home, beauty, or other. "
                      + "Respond with JSON: {\"category\": \"...\", \"confidence\": 0.0-1.0}")
                .user(request.title() + "\n" + request.description())
                .call()
                .entity(ClassificationResult.class);

            resultTemplate.send("ai.product.enriched",
                request.productId(),
                new ProductEnrichmentResult(
                    request.productId(),
                    result.category(),
                    result.confidence()
                )
            );
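        } catch (RuntimeException e) {
            log.error("Failed to enrich product {}: {}",
                request.productId(), e.getMessage());
            // Rethrow so the listener container's error handler can retry and, if a
            // dead-letter recoverer is configured, send the record to
            // ai.product.enrich.DLT after max retries. Swallowing the exception here
            // would mark the record as successfully processed.
            throw e;
        }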
    }
}
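
The consumer publishes a category together with a confidence score, so whoever reads the result topic has to decide what to do with low-confidence answers. The sketch below shows one possible policy in plain Java. The `ProductEnrichmentResult` record only mirrors the three fields sent above, while `EnrichmentResultPolicy` and the 0.7 threshold are assumptions for illustration.

```java
import java.util.Optional;

// Mirrors the fields published to the result topic: product id, category, confidence.
record ProductEnrichmentResult(String productId, String category, double confidence) {}

// Hypothetical policy: only apply a category when the model was confident enough.
final class EnrichmentResultPolicy {
    private final double minConfidence;

    EnrichmentResultPolicy(double minConfidence) {
        if (minConfidence < 0.0 || minConfidence > 1.0) {
            throw new IllegalArgumentException("minConfidence must be within [0, 1]");
        }
        this.minConfidence = minConfidence;
    }

    // Returns the category to store, or empty when the result should be left for review.
    Optional<String> categoryToApply(ProductEnrichmentResult result) {
        if (result.category() == null || result.category().isBlank()) {
            return Optional.empty();
        }
        return result.confidence() >= minConfidence
            ? Optional.of(result.category())
            : Optional.empty();
    }

    public static void main(String[] args) {
        EnrichmentResultPolicy policy = new EnrichmentResultPolicy(0.7);  // assumed threshold
        System.out.println(policy.categoryToApply(
            new ProductEnrichmentResult("p-1", "electronics", 0.92)));
    }
}
```

A listener on `ai.product.enriched` in product-service could call `categoryToApply` and update the product only when a value is present.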

Rate Limit Coordination Across Service Instances

When multiple instances of your AI service are running, they share the same provider rate limit. Without coordination, each instance assumes it can use the full limit, so together they exceed it and the provider responds with HTTP 429 (Too Many Requests). Use Redis to share rate limit state:

@Service
public class DistributedRateLimiter {

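    // StringRedisTemplate stores plain string values, which is what Redis INCR operates on
    private final StringRedisTemplate redis;
    private static final int MAX_REQUESTS_PER_MINUTE = 500;

    public boolean tryAcquire(String provider) {
        // Fixed one-minute window: every instance increments the same per-minute key
        String key = "rate:" + provider + ":"
            + Instant.now().truncatedTo(ChronoUnit.MINUTES).getEpochSecond();

        Long count = redis.opsForValue().increment(key);
        if (count == null) {
            return false;  // no reply (e.g. inside a pipeline or transaction): fail closed
        }
        if (count == 1) {
            redis.expire(key, Duration.ofMinutes(2));  // TTL safety net
        }
        return count <= MAX_REQUESTS_PER_MINUTE;
    }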
    }
}
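
The limiter above is a fixed-window counter keyed by provider and minute. The same logic can be exercised without Redis using the in-memory sketch below, which is meant only to illustrate the windowing behaviour. `InMemoryFixedWindowLimiter` and its injected time source are assumptions, and unlike the Redis version it does not share state across instances or expire old keys.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Single-process stand-in for the Redis counter, for illustration and tests only.
final class InMemoryFixedWindowLimiter {
    private final int maxPerMinute;
    private final Supplier<Instant> now;  // injectable clock
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    InMemoryFixedWindowLimiter(int maxPerMinute, Supplier<Instant> now) {
        this.maxPerMinute = maxPerMinute;
        this.now = now;
    }

    boolean tryAcquire(String provider) {
        // Same key scheme as the Redis version: provider plus start of the current minute.
        String key = "rate:" + provider + ":"
            + now.get().truncatedTo(ChronoUnit.MINUTES).getEpochSecond();
        long count = counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        return count <= maxPerMinute;
        // Old keys are never removed here; in Redis the TTL takes care of that.
    }

    public static void main(String[] args) {
        InMemoryFixedWindowLimiter limiter = new InMemoryFixedWindowLimiter(2, Instant::now);
        System.out.println(limiter.tryAcquire("anthropic"));
    }
}
```

One property of fixed windows worth knowing: a burst at the end of one minute followed by a burst at the start of the next can briefly reach close to twice the per-minute limit, so it may be prudent to set the limit somewhat below the provider's quota.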

Architecture Summary

The pattern that works at scale: (1) Dedicated AI service owns all LLM calls, API keys, and rate limits. (2) Synchronous calls via Feign for real-time features (chatbots, instant classification). (3) Async Kafka pipeline for batch or non-blocking features (document processing, background enrichment). (4) Resilience4j circuit breaker with provider fallback. (5) Redis for distributed rate limit coordination across service instances. This gives you resilience against provider outages, predictable costs, and horizontal scalability.