The AI Microservice Problem
LLM calls break several microservices assumptions:
| Normal HTTP call | LLM API call |
|---|---|
| 10–100ms latency | 1–15 seconds latency |
| Fails fast on error | Retries with backoff (adds seconds) |
| Stateless | Context window must be managed |
| Predictable cost | Cost varies with input/output length |
| Scales horizontally | Rate-limited by provider |
The patterns in this article address all five differences.
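Two rows of the table, retry backoff and length-dependent cost, can be made concrete with simple arithmetic. The following is a minimal sketch: the class name `LlmCallMath` and both methods are hypothetical, and the prices are passed in as parameters because they differ per provider and model.

```java
import java.time.Duration;

/** Back-of-the-envelope arithmetic for LLM calls (hypothetical helper, not from any library). */
final class LlmCallMath {

    private LlmCallMath() {}

    /**
     * Total time spent waiting between attempts with exponential backoff:
     * baseDelay, then baseDelay * 2, then baseDelay * 4, and so on.
     */
    static Duration totalBackoff(Duration baseDelay, int retries) {
        Duration total = Duration.ZERO;
        Duration delay = baseDelay;
        for (int i = 0; i < retries; i++) {
            total = total.plus(delay);
            delay = delay.multipliedBy(2);
        }
        return total;
    }

    /** Cost of one call in USD, given token counts and per-million-token prices. */
    static double costUsd(long inputTokens, long outputTokens,
                          double inputPricePerMillion, double outputPricePerMillion) {
        return inputTokens / 1_000_000.0 * inputPricePerMillion
             + outputTokens / 1_000_000.0 * outputPricePerMillion;
    }
}
```

With a 1-second base delay, three retries add 7 seconds of waiting on top of the LLM latency itself. With placeholder prices of $3 and $15 per million input and output tokens (illustrative values, not any provider's price list), a call with 2,000 input tokens and 500 output tokens costs about $0.0135.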
Service Decomposition: Where Does AI Live?
The most common antipattern is putting the LLM call inside every service that needs AI. With N such services, you end up with N copies of API key management, N separate retry and circuit-breaker configurations, and N separate cost-tracking implementations.
Instead, extract AI into a dedicated service:
// Recommended: dedicated AI service that other services call
┌─────────────────┐    HTTP/gRPC     ┌──────────────────────┐
│ order-service   │ ───────────────→ │ ai-service           │
│ product-service │ ───────────────→ │ - Spring AI          │
│ support-service │ ───────────────→ │ - Rate limiting      │
└─────────────────┘                  │ - Cost tracking      │
                                     │ - Circuit breaker    │
                                     │ - Prompt management  │
                                     └──────────────────────┘
                                                │
                                         Claude / OpenAI

The dedicated AI service owns the API keys, retry logic, cost monitoring, and prompt versioning. Other services call it via a typed client with their business context.
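The typed contract between callers and the AI service could be as small as a pair of records. This is a sketch under assumptions: the article only shows that `ClassifyRequest` takes a text and a comma-separated label list and that the response exposes `label()`; the component names `text` and `allowedLabels`, the `confidence` field, and the validation rules are hypothetical.

```java
/** Request for the classify endpoint: the text plus the allowed labels as a comma-separated string. */
record ClassifyRequest(String text, String allowedLabels) {
    ClassifyRequest {
        // Reject obviously unusable input before it costs an LLM call
        if (text == null || text.isBlank()) {
            throw new IllegalArgumentException("text must not be blank");
        }
        if (allowedLabels == null || allowedLabels.isBlank()) {
            throw new IllegalArgumentException("allowedLabels must not be blank");
        }
    }
}

/** Response from the classify endpoint; the confidence field is an assumption. */
record ClassificationResponse(String label, double confidence) {
    ClassificationResponse {
        if (confidence < 0.0 || confidence > 1.0) {
            throw new IllegalArgumentException("confidence must be between 0 and 1");
        }
    }
}
```

Because callers only see these types, prompts and provider details can change inside the AI service without touching the calling services.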
The AI Service: Internal API Design
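@RestController
@RequestMapping("/api/v1/ai")
public class AiServiceController {
    // Collaborators injected via the constructor (type names assumed from the field names)
    private final SummarizationService summarizationService;
    private final ClassificationService classificationService;
    private final ExtractionService extractionService;
    private final ChatService chatService;

    public AiServiceController(SummarizationService summarizationService,
                               ClassificationService classificationService,
                               ExtractionService extractionService,
                               ChatService chatService) {
        this.summarizationService = summarizationService;
        this.classificationService = classificationService;
        this.extractionService = extractionService;
        this.chatService = chatService;
    }

    // Typed endpoint for each AI feature, not a generic "send prompt" API
    @PostMapping("/summarize")
    public SummaryResponse summarize(@RequestBody SummarizeRequest req) {
        return summarizationService.summarize(req);
    }

    @PostMapping("/classify")
    public ClassificationResponse classify(@RequestBody ClassifyRequest req) {
        return classificationService.classify(req);
    }

    @PostMapping("/extract")
    public ExtractionResponse extract(@RequestBody ExtractRequest req) {
        return extractionService.extract(req);
    }

    @PostMapping("/chat")
    public ChatResponse chat(@RequestBody ChatRequest req) {
        return chatService.respond(req);
    }
}

Calling the AI Service: Spring Cloud OpenFeign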
// In product-service: typed Feign client for the AI service
@FeignClient(name = "ai-service", url = "${services.ai.url}")
public interface AiServiceClient {

    @PostMapping("/api/v1/ai/classify")
    ClassificationResponse classifyProduct(@RequestBody ClassifyRequest request);

    @PostMapping("/api/v1/ai/extract")
    ExtractionResponse extractAttributes(@RequestBody ExtractRequest request);
}

// Usage in product-service:
@Service
public class ProductEnrichmentService {
    private final AiServiceClient aiClient;
    private final ProductRepository productRepo; // repository type name is assumed

    public ProductEnrichmentService(AiServiceClient aiClient, ProductRepository productRepo) {
        this.aiClient = aiClient;
        this.productRepo = productRepo;
    }

    public void enrichProduct(Product product) {
        // Send title and description plus the allowed labels; the AI service returns one label
        var categoryResult = aiClient.classifyProduct(
            new ClassifyRequest(
                product.getTitle() + "\n" + product.getDescription(),
                "electronics, clothing, food, sports, home, other"
            )
        );
        product.setCategory(categoryResult.label());
        productRepo.save(product);
    }
}

Resilience4j: Circuit Breaker for LLM Calls
LLM providers have incidents. Without a circuit breaker, a provider outage cascades: request threads block until the LLM call times out, the thread pool is exhausted, and your service stops responding. A Resilience4j circuit breaker prevents this by failing fast once the provider looks unhealthy.
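To make the mechanism concrete, here is a minimal count-based circuit breaker in plain Java. It is an illustration of the closed, open, and half-open states, not Resilience4j's implementation: the class `SimpleCircuitBreaker` is hypothetical, time is passed in explicitly to keep the example deterministic, and the half-open state is simplified to a single trial call.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal count-based circuit breaker (illustrative sketch only). */
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // number of recent calls to evaluate
    private final double failureRateThreshold; // e.g. 0.5 for 50%
    private final long openMillis;             // how long to stay open before a trial call
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold, long openMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
    }

    /** Returns true if a call may proceed at the given time. */
    synchronized boolean allowCall(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait time elapsed: let a trial call through
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed to proceed. */
    synchronized void record(boolean failed, long nowMillis) {
        if (state == State.OPEN) {
            return; // late result from a call started before the circuit opened
        }
        if (state == State.HALF_OPEN) {
            // The trial call decides: failure re-opens, success closes
            if (failed) {
                open(nowMillis);
            } else {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        // Only evaluate once the window is full
        if (window.size() == windowSize && failureRate() >= failureRateThreshold) {
            open(nowMillis);
        }
    }

    synchronized State state() {
        return state;
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
        window.clear();
    }

    private double failureRate() {
        long failures = window.stream().filter(Boolean::booleanValue).count();
        return (double) failures / window.size();
    }
}
```

With a window of 10, a threshold of 0.5, and a 30-second open period, five failures among the last ten calls open the circuit, and callers are rejected immediately until 30 seconds have passed. Resilience4j provides the same state machine through configuration and an annotation: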
# application.yml (in the ai-service)
resilience4j:
  circuitbreaker:
    instances:
      anthropic:
        registerHealthIndicator: true
        slidingWindowSize: 10                      # evaluate last 10 calls
        failureRateThreshold: 50                   # open at 50% failure rate
        waitDurationInOpenState: 30s               # try again after 30s
        permittedNumberOfCallsInHalfOpenState: 3   # trial calls allowed while half-open
        slowCallDurationThreshold: 10s             # treat calls >10s as slow
        slowCallRateThreshold: 80                  # open if 80% are slow
      openai:
        registerHealthIndicator: true
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s

@Service
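public class ResilientAiService {
    private static final Logger log = LoggerFactory.getLogger(ResilientAiService.class);

    private final ChatClient anthropicClient;
    private final ChatClient openAiClient;

    // Two ChatClient beans exist, so each is selected by qualifier (bean names are assumed)
    public ResilientAiService(@Qualifier("anthropicChatClient") ChatClient anthropicClient,
                              @Qualifier("openAiChatClient") ChatClient openAiClient) {
        this.anthropicClient = anthropicClient;
        this.openAiClient = openAiClient;
    }

    @CircuitBreaker(name = "anthropic", fallbackMethod = "fallbackToOpenAi")
    public String complete(String prompt) {
        return anthropicClient.prompt()
            .user(prompt)
            .call()
            .content();
    }

    // Resilience4j invokes this whenever complete() throws, including when the
    // circuit is open (the Throwable is then a CallNotPermittedException)
    private String fallbackToOpenAi(String prompt, Throwable t) {
        log.warn("Anthropic call failed or circuit open ({}), falling back to OpenAI", t.getMessage());
        return openAiClient.prompt()
            .user(prompt)
            .call()
            .content();
    }
}

Async AI Processing with Kafka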
For non-real-time AI tasks (document analysis, batch classification, email summarization), a Kafka-based async pipeline is usually a better fit than synchronous HTTP. The calling service publishes a request and gets on with other work; the AI service processes requests at its own rate.
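The decoupling can be seen in isolation with an in-memory stand-in for the two topics. This is only a sketch: `InMemoryEnrichmentPipeline` is a hypothetical class, the queues replace Kafka for illustration, and the record field types (for example `double confidence`) are assumptions inferred from how the messages are used in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Message published by the calling service. */
record ProductEnrichmentRequest(String productId, String title, String description) {}

/** Message published by the AI service once classification is done. */
record ProductEnrichmentResult(String productId, String category, double confidence) {}

/** In-memory stand-in for the request and result topics (illustration only). */
final class InMemoryEnrichmentPipeline {
    private final BlockingQueue<ProductEnrichmentRequest> requests = new LinkedBlockingQueue<>();
    private final BlockingQueue<ProductEnrichmentResult> results = new LinkedBlockingQueue<>();

    /** Producer side: enqueue and return immediately, without waiting for the AI work. */
    void publish(ProductEnrichmentRequest request) {
        requests.add(request);
    }

    /** Consumer side: process whatever is pending, at the consumer's own pace. */
    int drain(Function<ProductEnrichmentRequest, ProductEnrichmentResult> classifier) {
        List<ProductEnrichmentRequest> batch = new ArrayList<>();
        requests.drainTo(batch);
        for (ProductEnrichmentRequest request : batch) {
            results.add(classifier.apply(request));
        }
        return batch.size();
    }

    /** Returns the next available result, or null if none is ready yet. */
    ProductEnrichmentResult pollResult() {
        return results.poll();
    }

    int pendingRequests() {
        return requests.size();
    }
}
```

The producer never blocks on the classifier, and a slow or rate-limited consumer only makes the backlog grow instead of tying up request threads. With Kafka, the same two roles look like this: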
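// Producer: product-service sends enrichment requests
@Service
public class ProductPublisher {
    private final KafkaTemplate<String, ProductEnrichmentRequest> kafkaTemplate;

    public ProductPublisher(KafkaTemplate<String, ProductEnrichmentRequest> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void requestEnrichment(Product product) {
        // Key by product ID so all messages for one product land on the same partition
        kafkaTemplate.send("ai.product.enrich",
            product.getId(),
            new ProductEnrichmentRequest(
                product.getId(),
                product.getTitle(),
                product.getDescription()
            )
        );
    }
}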
// Consumer: ai-service processes enrichment requests
@Service
public class ProductEnrichmentConsumer {
    private static final Logger log = LoggerFactory.getLogger(ProductEnrichmentConsumer.class);

    private final ChatClient chatClient;
    private final KafkaTemplate<String, ProductEnrichmentResult> resultTemplate;

    public ProductEnrichmentConsumer(ChatClient chatClient,
                                     KafkaTemplate<String, ProductEnrichmentResult> resultTemplate) {
        this.chatClient = chatClient;
        this.resultTemplate = resultTemplate;
    }

    @KafkaListener(topics = "ai.product.enrich", groupId = "ai-enrichment")
    public void enrich(ProductEnrichmentRequest request) {
        try {
            ClassificationResult result = chatClient.prompt()
                .system("Classify the product into exactly one category: "
                    + "electronics, clothing, food, sports, home, beauty, or other. "
                    + "Respond with JSON: {\"category\": \"...\", \"confidence\": 0.0-1.0}")
                .user(request.title() + "\n" + request.description())
                .call()
                .entity(ClassificationResult.class);
            resultTemplate.send("ai.product.enriched",
                request.productId(),
                new ProductEnrichmentResult(
                    request.productId(),
                    result.category(),
                    result.confidence()
                )
            );
        } catch (RuntimeException e) {
            log.error("Failed to enrich product {}: {}",
                request.productId(), e.getMessage());
            // Rethrow so the listener container's error handler can retry. With a
            // DeadLetterPublishingRecoverer configured, the record goes to a dead-letter
            // topic such as ai.product.enrich.DLT once the retries are exhausted.
            // Swallowing the exception here would mark the message as processed.
            throw e;
        }
    }
}

Rate Limit Coordination Across Service Instances
When multiple instances of your AI service are running, they share the same provider rate limit. Without coordination, each instance assumes it has the full quota to itself, so together they exceed it and the provider answers with HTTP 429 responses. Sharing the rate limit state in Redis avoids this.
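The counting scheme is a fixed window: one counter per provider per calendar minute. Before looking at Redis, the window arithmetic can be sketched in memory. The class `FixedWindowLimiter` is hypothetical, the map stands in for the shared store, and the clock is passed in explicitly so the behavior is reproducible.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** In-memory fixed-window limiter; the map stands in for a shared store such as Redis. */
final class FixedWindowLimiter {
    private final int maxPerMinute;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    FixedWindowLimiter(int maxPerMinute) {
        this.maxPerMinute = maxPerMinute;
    }

    /** One key per provider per calendar minute, e.g. "rate:anthropic:1700000040". */
    static String windowKey(String provider, Instant now) {
        return "rate:" + provider + ":" + now.truncatedTo(ChronoUnit.MINUTES).getEpochSecond();
    }

    /** Counts the request in the current window and reports whether it is within the limit. */
    boolean tryAcquire(String provider, Instant now) {
        long count = counters
            .computeIfAbsent(windowKey(provider, now), k -> new AtomicLong())
            .incrementAndGet();
        return count <= maxPerMinute;
    }
}
```

Two properties of this scheme are worth knowing. A burst at the end of one minute followed by a burst at the start of the next can briefly let through up to twice the limit, and old counters must be cleaned up (this sketch never evicts them, whereas a key TTL handles that in Redis). The Redis-backed version shares one such counter across all instances: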
@Service
public class DistributedRateLimiter {
    private static final int MAX_REQUESTS_PER_MINUTE = 500;

    // StringRedisTemplate stores values as plain strings, which Redis INCR requires
    private final StringRedisTemplate redis;

    public DistributedRateLimiter(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public boolean tryAcquire(String provider) {
        // One counter per provider per calendar minute (fixed window)
        String key = "rate:" + provider + ":"
            + Instant.now().truncatedTo(ChronoUnit.MINUTES).getEpochSecond();
        Long count = redis.opsForValue().increment(key);
        if (count == null) {
            return false; // no reply (e.g. inside a pipeline or transaction): fail closed
        }
        if (count == 1) {
            redis.expire(key, Duration.ofMinutes(2)); // TTL safety net
        }
        return count <= MAX_REQUESTS_PER_MINUTE;
    }
}

The pattern that works at scale: (1) Dedicated AI service owns all LLM calls, API keys, and rate limits. (2) Synchronous calls via Feign for real-time features (chatbots, instant classification). (3) Async Kafka pipeline for batch or non-blocking features (document processing, background enrichment). (4) Resilience4j circuit breaker with provider fallback. (5) Redis for distributed rate limit coordination across service instances. This gives you resilience against provider outages, predictable costs, and horizontal scalability.