Spring Cloud Gateway as AI Router: Multi-Provider Load Balancing

Your AI microservice shouldn't be tightly coupled to a single LLM provider. Spring Cloud Gateway sits between your services and the LLM providers and acts as a router: it can select a provider based on request headers, fall back to a second provider when the first is down, enforce per-tenant rate limits, and inject API keys so that calling services never hold them. This article walks through the building blocks of such an AI router.

Why a Dedicated AI Router?

Without a router, provider switching requires code changes and redeployment. With Spring Cloud Gateway as an AI router, you get:

Provider abstraction: Calling services send requests to /ai/anthropic or /ai/openai, and the gateway proxies them to the actual providers
API key isolation: API keys live only in the gateway; microservices never touch them
Centralized rate limiting: One Redis-backed rate limiter enforces limits across all calling services
Automatic failover: If Anthropic returns 5xx or times out, a circuit breaker at the gateway forwards the request to an OpenAI fallback
Cost routing: Use cheap models for low-priority traffic, premium models for paying customers

Dependencies

<!-- Spring Cloud Gateway with reactive stack -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-gateway</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-circuitbreaker-reactor-resilience4j</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-data-redis-reactive</artifactId>
</dependency>

Basic Route Configuration

# application.yml — gateway routes
spring:
  cloud:
    gateway:
      routes:
        # Route: Claude API proxy
        - id: claude-proxy
          uri: https://api.anthropic.com
          predicates:
            - Path=/ai/anthropic/**
          filters:
            - StripPrefix=2              # strip /ai/anthropic from path
            - AddRequestHeader=x-api-key,${ANTHROPIC_API_KEY}
            - AddRequestHeader=anthropic-version,2023-06-01
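            - name: CircuitBreaker
              args:
                name: anthropic
                # a handler or route for /ai/openai-fallback must exist (not shown)
                fallbackUri: forward:/ai/openai-fallback
                # count these responses as failures, not only exceptions and timeouts
                statusCodes: [500, 502, 503]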

        # Route: OpenAI API proxy
        - id: openai-proxy
          uri: https://api.openai.com
          predicates:
            - Path=/ai/openai/**
          filters:
            - StripPrefix=2
            - AddRequestHeader=Authorization,Bearer ${OPENAI_API_KEY}
            - name: CircuitBreaker
              args:
                name: openai
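
The CircuitBreaker filter with a fallbackUri, as configured on the anthropic route above, behaves roughly like the plain-Java sketch below. This is a simplified consecutive-failure breaker, not Resilience4j's sliding-window implementation; the class name SimpleBreaker, its threshold, and its wait time are illustrative assumptions rather than gateway settings.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker with a fallback. The clock is passed in for testability. */
final class SimpleBreaker {
    private final int failureThreshold;  // consecutive failures before the breaker opens
    private final long openMillis;       // how long the breaker stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1;          // -1 means the breaker is closed

    SimpleBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen(long nowMillis) {
        return openedAt >= 0 && nowMillis - openedAt < openMillis;
    }

    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback.get();           // open: do not touch the primary at all
        }
        try {
            T result = primary.get();        // closed, or a trial call after the wait time
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis;        // too many failures in a row: open the breaker
            }
            return fallback.get();           // the failed call is answered by the fallback
        }
    }
}
```

While the breaker is open, the primary provider is not called at all, which is what protects a struggling provider from extra load. Note that a fallback to a different provider only works if something translates the request into that provider's format.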

Custom Filter: Provider Selection by Header

Instead of hard-coded routes, allow calling services to request a provider via a header, with a tier-based default when no valid provider is requested:

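import java.net.URI;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.cloud.gateway.filter.RouteToRequestUrlFilter;
import org.springframework.cloud.gateway.support.ServerWebExchangeUtils;
import org.springframework.core.Ordered;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class ProviderRoutingFilter implements GlobalFilter, Ordered {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        ServerHttpRequest request = exchange.getRequest();
        String requestedProvider = request.getHeaders()
            .getFirst("X-AI-Provider");  // "anthropic", "openai", or null

        String tier = request.getHeaders().getFirst("X-User-Tier");  // "free", "pro"

        // Cost routing: free tier → cheapest model; pro → best model
        String provider = selectProvider(requestedProvider, tier);

        String targetUri = provider.equals("anthropic")
            ? "https://api.anthropic.com/v1/messages"
            : "https://api.openai.com/v1/chat/completions";

        // The gateway routes to the URL stored in this attribute; mutating the
        // request URI alone does not change where the request is sent.
        exchange.getAttributes().put(
            ServerWebExchangeUtils.GATEWAY_REQUEST_URL_ATTR, URI.create(targetUri));

        // Filters of the matched route still apply, so the injected API key
        // has to belong to the provider resolved here.
        ServerHttpRequest modified = request.mutate()
            .header("X-Resolved-Provider", provider)
            .build();

        return chain.filter(exchange.mutate().request(modified).build());
    }

    private String selectProvider(String requested, String tier) {
        if ("free".equals(tier)) return "openai";    // cheapest for free tier
        // Honor an explicit request only for providers the gateway knows
        if ("anthropic".equals(requested) || "openai".equals(requested)) return requested;
        return "anthropic";                           // default for pro
    }

    @Override
    public int getOrder() {
        // Run after RouteToRequestUrlFilter, which would otherwise overwrite the URL attribute
        return RouteToRequestUrlFilter.ROUTE_TO_URL_FILTER_ORDER + 1;
    }
}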
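
The selection rules are easier to test when they live outside the filter. The sketch below restates them as a pure function with an allow-list, so an unexpected X-AI-Provider value falls back to the default instead of being passed through. ProviderSelector and its method names are hypothetical; the target URLs are the ones used in the filter above.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: provider selection without any Spring types. */
final class ProviderSelector {

    // Known providers and the endpoints the filter routes to
    static final Map<String, String> TARGETS = Map.of(
        "anthropic", "https://api.anthropic.com/v1/messages",
        "openai", "https://api.openai.com/v1/chat/completions");

    /** Precedence: tier rule first, then a valid explicit request, then the default. */
    static String select(String requested, String tier) {
        if ("free".equals(tier)) {
            return "openai";                       // free tier always gets the cheaper provider
        }
        if (requested != null) {
            String normalized = requested.toLowerCase(Locale.ROOT);
            if (TARGETS.containsKey(normalized)) {
                return normalized;                 // honor only providers on the allow-list
            }
        }
        return "anthropic";                        // default for everyone else
    }

    static String targetUri(String requested, String tier) {
        return TARGETS.get(select(requested, tier));
    }
}
```

Because the function takes plain strings, the precedence between tier and explicit request can be covered by ordinary unit tests without starting the gateway.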

Redis Rate Limiting Per Tenant

Spring Cloud Gateway's built-in Redis rate limiter uses the token bucket algorithm. Configure a base limit on the route, keyed per tenant, then layer tier-based limits on top:

# application.yml
spring:
  cloud:
    gateway:
      routes:
        - id: ai-gateway
          uri: lb://ai-service
          predicates:
            - Path=/api/ai/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10     # tokens added per second
                redis-rate-limiter.burstCapacity: 20     # max burst
                redis-rate-limiter.requestedTokens: 1    # tokens per request
                keyResolver: "#{@tenantKeyResolver}"

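// Key resolver: rate limit per tenant (tenant ID read from the X-Tenant-ID header)
@Bean
public KeyResolver tenantKeyResolver() {
    return exchange -> {
        String tenantId = exchange.getRequest().getHeaders()
            .getFirst("X-Tenant-ID");
        return Mono.just(
            tenantId != null ? "tenant:" + tenantId : "anonymous"
        );
    };
}

// Programmatic rate limiter with tier-based limits:
@Bean
public RedisRateLimiter tieredRateLimiter() {
    return new RedisRateLimiter(10, 20);  // default config; per-tier configs are registered below
}

@Component
public class TieredRateLimiterFilter implements GlobalFilter {

    private static final Map<String, int[]> TIER_LIMITS = Map.of(
        "free",  new int[]{2, 5},   // 2 req/s, burst 5
        "basic", new int[]{10, 20}, // 10 req/s, burst 20
        "pro",   new int[]{50, 100} // 50 req/s, burst 100
    );

    private final RedisRateLimiter limiter;

    public TieredRateLimiterFilter(RedisRateLimiter limiter) {
        this.limiter = limiter;
        // Register one config per tier under the id "tier:<name>"
        TIER_LIMITS.forEach((tier, l) -> limiter.getConfig().put("tier:" + tier,
            new RedisRateLimiter.Config().setReplenishRate(l[0]).setBurstCapacity(l[1])));
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        HttpHeaders headers = exchange.getRequest().getHeaders();
        String tier = headers.getFirst("X-User-Tier");
        if (tier == null || !TIER_LIMITS.containsKey(tier)) tier = "free";  // unknown tier: strictest limit
        String tenantId = headers.getFirst("X-Tenant-ID");
        String key = tenantId != null ? "tenant:" + tenantId : "anonymous";
        return limiter.isAllowed("tier:" + tier, key).flatMap(result -> {
            if (result.isAllowed()) return chain.filter(exchange);
            exchange.getResponse().setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
            return exchange.getResponse().setComplete();
        });
    }
}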
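
To make replenishRate, burstCapacity, and requestedTokens concrete, here is a minimal in-memory token bucket. It is a sketch of the algorithm only: the gateway's limiter keeps its state in Redis, while the TokenBucket class below is a hypothetical single-process illustration with the clock passed in as a parameter.

```java
/** Hypothetical in-memory token bucket; time is supplied by the caller in whole seconds. */
final class TokenBucket {
    private final int replenishRate;   // tokens added per second
    private final int burstCapacity;   // maximum number of tokens the bucket can hold
    private long tokens;
    private long lastRefillSeconds;

    TokenBucket(int replenishRate, int burstCapacity, long nowSeconds) {
        this.replenishRate = replenishRate;
        this.burstCapacity = burstCapacity;
        this.tokens = burstCapacity;   // start full so an initial burst is allowed
        this.lastRefillSeconds = nowSeconds;
    }

    /** Refills for the elapsed time, then deducts the tokens if enough are available. */
    synchronized boolean tryConsume(int requestedTokens, long nowSeconds) {
        long elapsed = Math.max(0, nowSeconds - lastRefillSeconds);
        tokens = Math.min(burstCapacity, tokens + elapsed * replenishRate);
        lastRefillSeconds = Math.max(lastRefillSeconds, nowSeconds);
        if (tokens < requestedTokens) {
            return false;              // not enough tokens: the request would get a 429
        }
        tokens -= requestedTokens;
        return true;
    }
}
```

With the free-tier numbers (2 tokens per second, burst 5), a tenant can send five requests at once, is rejected on the sixth, and regains two requests for every second that passes. Setting requestedTokens above 1 makes an expensive call drain the bucket faster.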

Response Caching for Identical Prompts

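At temperature=0, LLM output is close to deterministic, although providers do not guarantee byte-identical completions. If serving a stored answer is acceptable, cache responses for identical prompt+model combinations to avoid repeated API calls: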

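@Component
public class AiResponseCacheFilter implements GlobalFilter, Ordered {

    private final ReactiveRedisTemplate<String, String> redis;

    public AiResponseCacheFilter(ReactiveRedisTemplate<String, String> redis) {
        this.redis = redis;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        // Only cache when the caller declares temperature=0 via this header
        String temperature = exchange.getRequest().getHeaders().getFirst("X-AI-Temperature");
        if (!"0".equals(temperature)) {
            return chain.filter(exchange);
        }

        // Buffer the body once so it can be hashed and still forwarded on a miss
        return DataBufferUtils.join(exchange.getRequest().getBody())
            .map(buffer -> {
                byte[] bytes = new byte[buffer.readableByteCount()];
                buffer.read(bytes);
                DataBufferUtils.release(buffer);
                return bytes;
            })
            .defaultIfEmpty(new byte[0])
            .flatMap(body -> {
                String cacheKey = "ai:cache:" + hashBody(body);
                ServerHttpRequest replay = new ServerHttpRequestDecorator(exchange.getRequest()) {
                    @Override
                    public Flux<DataBuffer> getBody() {
                        return Flux.defer(() -> Flux.just(exchange.getResponse().bufferFactory().wrap(body)));
                    }
                };
                ServerWebExchange forwarded = exchange.mutate().request(replay).build();
                // A Mono<Void> never emits, so switchIfEmpty would also run after a cache hit
                return redis.opsForValue().get(cacheKey)
                    .map(Optional::of)
                    .defaultIfEmpty(Optional.empty())
                    .flatMap(hit -> hit.isPresent()
                        ? writeCachedResponse(exchange, hit.get())
                        // cacheResponse (helper, not shown) is assumed to return the exchange with a response decorator that stores the body
                        : chain.filter(cacheResponse(forwarded, cacheKey)));
            });
    }

    // Run before NettyWriteResponseFilter so a response decorator takes effect
    @Override
    public int getOrder() { return NettyWriteResponseFilter.WRITE_RESPONSE_FILTER_ORDER - 1; }
}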
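
The filter relies on a hashBody helper that the listing does not show. One plausible way to write it, assuming a SHA-256 digest of the raw request bytes is an acceptable cache key, is sketched below; the CacheKeys class and the cacheKey convenience method are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that turns a request body into a stable Redis key. */
final class CacheKeys {

    /** Returns the SHA-256 digest of the body as 64 lowercase hex characters. */
    static String hashBody(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(body);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Same "ai:cache:" prefix as the filter uses. */
    static String cacheKey(String body) {
        return "ai:cache:" + hashBody(body.getBytes(StandardCharsets.UTF_8));
    }
}
```

Hashing raw bytes means two JSON bodies that differ only in whitespace or key order produce different keys. That is safe but lowers the hit rate; normalizing the JSON before hashing is an option if that matters. Since the model name is part of the request body, it is covered by the hash.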

Observability: Log Every Routed Request

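import java.net.URI;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.cloud.gateway.support.ServerWebExchangeUtils;
import org.springframework.core.Ordered;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class AiAuditFilter implements GlobalFilter, Ordered {

    private static final Logger log = LoggerFactory.getLogger(AiAuditFilter.class);

    private final MeterRegistry meterRegistry;

    public AiAuditFilter(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        long start = System.currentTimeMillis();
        String requestId = UUID.randomUUID().toString();

        // beforeCommit fires just before the first response byte is written, so for
        // streamed completions the duration is time to first byte
        exchange.getResponse().beforeCommit(() -> {
            long duration = System.currentTimeMillis() - start;
            // Headers added by later filters sit on a mutated request that this exchange
            // does not see; attributes are shared, so read the routed host instead
            URI routed = exchange.getAttribute(ServerWebExchangeUtils.GATEWAY_REQUEST_URL_ATTR);
            String provider = routed != null && routed.getHost() != null
                ? routed.getHost() : "unknown";
            String status = String.valueOf(exchange.getResponse().getStatusCode());
            log.info("AI Gateway | reqId={} provider={} status={} duration={}ms",
                requestId, provider, status, duration);
            // Emit metric for Grafana/CloudWatch dashboards (tag values must be non-null)
            meterRegistry.timer("ai.gateway.request", "provider", provider, "status", status)
                .record(duration, TimeUnit.MILLISECONDS);
            return Mono.empty();
        });

        return chain.filter(exchange);
    }

    @Override
    public int getOrder() { return Ordered.HIGHEST_PRECEDENCE; }
}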

Gateway Architecture Summary

The AI gateway pattern: (1) Calling services talk to a single internal endpoint, not to provider URLs directly. (2) The gateway injects API keys, selects providers, and enforces rate limits, all centrally. (3) Circuit breakers at the gateway level provide automatic failover. (4) Redis caches responses for temperature=0 calls. (5) Every request is logged with provider, latency, and status for cost and reliability dashboards. With this architecture, switching providers becomes a gateway configuration change rather than a code change in every calling service.
