Why a Dedicated AI Router?

Without a router, provider switching requires code changes and redeployment. With Spring Cloud Gateway as an AI router, you get:

  • Provider abstraction: Calling services send requests to /ai/anthropic or /ai/openai, and the gateway proxies them to the actual providers
  • API key isolation: API keys live only in the gateway; microservices never touch them
  • Centralized rate limiting: One Redis-backed rate limiter enforces limits across all calling services
  • Automatic failover: If Anthropic returns a 5xx or times out, the circuit breaker forwards the request to an OpenAI fallback
  • Cost routing: Use cheap models for low-priority traffic, premium models for paying customers

Dependencies

<!-- Spring Cloud Gateway with reactive stack -->
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-gateway</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-circuitbreaker-reactor-resilience4j</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-data-redis-reactive</artifactId>
</dependency>

Basic Route Configuration

# application.yml — gateway routes
spring:
  cloud:
    gateway:
      routes:
        # Route: Claude API proxy
        - id: claude-proxy
          uri: https://api.anthropic.com
          predicates:
            - Path=/ai/anthropic/**
          filters:
            - StripPrefix=2              # strip /ai/anthropic from path
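            - SetRequestHeader=x-api-key,${ANTHROPIC_API_KEY}   # Set (not Add) replaces any caller-supplied value
            - AddRequestHeader=anthropic-version,2023-06-01
            - name: CircuitBreaker
              args:
                name: anthropic
                fallbackUri: forward:/ai/openai-fallback   # this path needs its own route or handler
                statusCodes:      # trip on 5xx responses too, not only on exceptions and timeouts
                  - 500
                  - 502
                  - 503
                  - 504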

        # Route: OpenAI API proxy
        - id: openai-proxy
          uri: https://api.openai.com
          predicates:
            - Path=/ai/openai/**
          filters:
            - StripPrefix=2
            - SetRequestHeader=Authorization,Bearer ${OPENAI_API_KEY}   # replaces the caller's own Authorization header
            - name: CircuitBreaker
              args:
                name: openai
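
To make the failover behaviour concrete, here is a minimal plain-Java sketch of what a circuit breaker with a fallback does conceptually. It is not the Resilience4j implementation behind the gateway's CircuitBreaker filter: FailoverClient, ProviderResponse, the consecutive-failure threshold, and the injected clock are all hypothetical simplifications (Resilience4j uses a sliding-window failure rate), and the sketch is not thread-safe.

```java
import java.util.function.Function;
import java.util.function.LongSupplier;

/** HTTP-like result of a provider call (hypothetical type for this sketch). */
record ProviderResponse(int status, String body) {}

/** Sends to the primary provider and fails over to the fallback on 5xx. */
final class FailoverClient {
    private final Function<String, ProviderResponse> primary;
    private final Function<String, ProviderResponse> fallback;
    private final int failureThreshold;   // consecutive 5xx responses before opening
    private final long openMillis;        // how long the circuit stays open
    private final LongSupplier clock;     // current time in milliseconds
    private int consecutiveFailures;
    private long openedAt;

    FailoverClient(Function<String, ProviderResponse> primary,
                   Function<String, ProviderResponse> fallback,
                   int failureThreshold, long openMillis, LongSupplier clock) {
        this.primary = primary;
        this.fallback = fallback;
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Open means: too many recent failures, and the wait period has not elapsed. */
    boolean isOpen() {
        return consecutiveFailures >= failureThreshold
            && clock.getAsLong() - openedAt < openMillis;
    }

    ProviderResponse send(String prompt) {
        if (isOpen()) {
            return fallback.apply(prompt);          // skip the primary entirely
        }
        ProviderResponse response = primary.apply(prompt);
        if (response.status() >= 500) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong();       // (re)open the circuit
            }
            return fallback.apply(prompt);          // fail over this request
        }
        consecutiveFailures = 0;                    // success closes the circuit
        return response;
    }
}
```

Once the wait period has elapsed, the next request acts as a trial call to the primary: a success closes the circuit, and another 5xx reopens it. Note that a real fallback also has to translate the request body, because the Anthropic and OpenAI APIs expect different JSON shapes.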

Custom Filter: Provider Selection by Header

Instead of hard-coding one route per provider, let calling services request a provider via the X-AI-Provider header, with the user tier deciding the default:

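import java.net.URI;
import java.util.Set;

import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.cloud.gateway.filter.RouteToRequestUrlFilter;
import org.springframework.cloud.gateway.support.ServerWebExchangeUtils;
import org.springframework.core.Ordered;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class ProviderRoutingFilter implements GlobalFilter, Ordered {

    private static final Set<String> KNOWN_PROVIDERS = Set.of("anthropic", "openai");

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        ServerHttpRequest request = exchange.getRequest();
        String requestedProvider = request.getHeaders()
            .getFirst("X-AI-Provider");  // "anthropic", "openai", or null

        String tier = request.getHeaders().getFirst("X-User-Tier");  // "free", "pro"

        // Cost routing: free tier → cheapest model; pro → best model
        String provider = selectProvider(requestedProvider, tier);

        URI targetUri = URI.create(provider.equals("anthropic")
            ? "https://api.anthropic.com/v1/messages"
            : "https://api.openai.com/v1/chat/completions");

        // The routing filter reads its target from this attribute; mutating the
        // request URI alone does not change where the gateway sends the call.
        exchange.getAttributes().put(ServerWebExchangeUtils.GATEWAY_REQUEST_URL_ATTR, targetUri);
        // Attributes are shared with earlier filters (e.g. auditing); request headers are not.
        exchange.getAttributes().put("resolvedProvider", provider);

        ServerHttpRequest modified = request.mutate()
            .header("X-Resolved-Provider", provider)
            .build();

        return chain.filter(exchange.mutate().request(modified).build());
    }

    private String selectProvider(String requested, String tier) {
        if ("free".equals(tier)) return "openai";    // cheapest for free tier
        if (requested != null && KNOWN_PROVIDERS.contains(requested)) {
            return requested;                         // honor a valid explicit request
        }
        return "anthropic";                           // default for pro
    }

    @Override
    public int getOrder() {
        // Must run after RouteToRequestUrlFilter, which would otherwise overwrite the target URL
        return RouteToRequestUrlFilter.ROUTE_TO_URL_FILTER_ORDER + 1;
    }
}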
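
Choosing the target URI is only half of header-based routing, since each provider also expects its key in a different header (x-api-key for Anthropic, Authorization: Bearer for OpenAI, as in the route configuration earlier). One way to keep the two in step is a small lookup table. The sketch below is plain Java with hypothetical names (ProviderTarget, ProviderTargets); it is an illustration under those assumptions, not part of Spring Cloud Gateway.

```java
import java.net.URI;
import java.util.Map;

/** Where a provider lives and how it expects its API key (hypothetical type). */
record ProviderTarget(URI uri, String authHeader, String authPrefix) {
    /** Header value to send for the given API key. */
    String authValue(String apiKey) {
        return authPrefix + apiKey;
    }
}

final class ProviderTargets {
    private static final Map<String, ProviderTarget> TARGETS = Map.of(
        "anthropic", new ProviderTarget(
            URI.create("https://api.anthropic.com/v1/messages"), "x-api-key", ""),
        "openai", new ProviderTarget(
            URI.create("https://api.openai.com/v1/chat/completions"), "Authorization", "Bearer "));

    private ProviderTargets() {}

    /** Looks up a provider by name; unknown or null names are rejected rather than defaulted. */
    static ProviderTarget forProvider(String provider) {
        // Map.of(...) throws on null lookups, so check for null first
        ProviderTarget target = provider == null ? null : TARGETS.get(provider);
        if (target == null) {
            throw new IllegalArgumentException("Unknown provider: " + provider);
        }
        return target;
    }
}
```

A routing filter could then call ProviderTargets.forProvider(provider) once and use the result for both the target URL and the auth header, so adding a provider means adding one table entry.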

Redis Rate Limiting Per Tenant

Spring Cloud Gateway's built-in Redis rate limiter uses the token bucket algorithm. The route below applies a single limit per tenant; the filter that follows varies the limit by tier:

# application.yml
spring:
  cloud:
    gateway:
      routes:
        - id: ai-gateway
          uri: lb://ai-service
          predicates:
            - Path=/api/ai/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10     # tokens added per second
                redis-rate-limiter.burstCapacity: 20     # max burst
                redis-rate-limiter.requestedTokens: 1    # tokens per request
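                keyResolver: "#{@tenantKeyResolver}"

// Key resolver: rate limit per tenant (read from the X-Tenant-ID header;
// derive it from a validated JWT instead if callers are not trusted)
@Bean
public KeyResolver tenantKeyResolver() {
    return exchange -> {
        String tenantId = exchange.getRequest().getHeaders()
            .getFirst("X-Tenant-ID");
        return Mono.just(
            tenantId != null ? "tenant:" + tenantId : "anonymous"
        );
    };
}

// Programmatic rate limiter with tier-based limits:
@Bean
public RedisRateLimiter tieredRateLimiter() {
    return new RedisRateLimiter(10, 20);  // default config; tier configs are registered below
}

@Component
public class TieredRateLimiterFilter implements GlobalFilter {

    private static final Map<String, int[]> TIER_LIMITS = Map.of(
        "free",  new int[]{2, 5},   // 2 req/s, burst 5
        "basic", new int[]{10, 20}, // 10 req/s, burst 20
        "pro",   new int[]{50, 100} // 50 req/s, burst 100
    );

    private final RedisRateLimiter limiter;

    public TieredRateLimiterFilter(RedisRateLimiter limiter) {
        this.limiter = limiter;
        // Register one named config per tier; isAllowed() looks configs up by this id
        TIER_LIMITS.forEach((tier, limits) -> limiter.getConfig().put(tier,
            new RedisRateLimiter.Config().setReplenishRate(limits[0]).setBurstCapacity(limits[1])));
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        HttpHeaders headers = exchange.getRequest().getHeaders();
        String header = headers.getFirst("X-User-Tier");
        String tier = header != null && TIER_LIMITS.containsKey(header) ? header : "free";
        String tenant = headers.getFirst("X-Tenant-ID");
        String key = tier + ":" + (tenant != null ? tenant : "anonymous");

        return limiter.isAllowed(tier, key).flatMap(result -> {
            if (result.isAllowed()) {
                return chain.filter(exchange);
            }
            exchange.getResponse().setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
            return exchange.getResponse().setComplete();
        });
    }
}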
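
The token bucket behaviour behind replenishRate, burstCapacity, and requestedTokens can be sketched in a few lines of plain Java. This is an in-memory illustration only, with a hypothetical TokenBucket class and an injected clock for testability; the gateway's actual limiter keeps this state in Redis so that all gateway instances share it.

```java
import java.util.function.LongSupplier;

/** In-memory token bucket using the same three parameters as the route config. */
final class TokenBucket {
    private final int replenishRate;          // tokens added per second
    private final int burstCapacity;          // maximum tokens the bucket can hold
    private final LongSupplier clockSeconds;  // current time in whole seconds
    private long tokens;
    private long lastRefill;

    TokenBucket(int replenishRate, int burstCapacity, LongSupplier clockSeconds) {
        this.replenishRate = replenishRate;
        this.burstCapacity = burstCapacity;
        this.clockSeconds = clockSeconds;
        this.tokens = burstCapacity;          // the bucket starts full
        this.lastRefill = clockSeconds.getAsLong();
    }

    /** Returns true and deducts the tokens if the request fits, false otherwise. */
    synchronized boolean tryConsume(int requestedTokens) {
        long now = clockSeconds.getAsLong();
        long elapsed = Math.max(0, now - lastRefill);
        // Refill for the elapsed time, but never beyond the burst capacity
        tokens = Math.min(burstCapacity, tokens + elapsed * replenishRate);
        lastRefill = now;
        if (tokens < requestedTokens) {
            return false;                     // caller would answer 429 Too Many Requests
        }
        tokens -= requestedTokens;
        return true;
    }
}
```

With replenishRate 10 and burstCapacity 20, a tenant can send 20 requests at once after an idle period, then sustain 10 per second. Raising requestedTokens for expensive endpoints makes each call drain the bucket faster without changing the refill rate.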

Response Caching for Identical Prompts

LLM output is largely repeatable at temperature=0, although providers do not guarantee identical responses on every call. Cache responses for identical prompt+model combinations to avoid repeated API calls:

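import java.util.Optional;

import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.cloud.gateway.support.ServerWebExchangeUtils;
import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.data.redis.core.ReactiveRedisTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class AiResponseCacheFilter implements GlobalFilter {

    private final ReactiveRedisTemplate<String, String> redis;

    public AiResponseCacheFilter(ReactiveRedisTemplate<String, String> redis) {
        this.redis = redis;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        // Only cache when the caller declares temperature=0 via this header
        String temperature = exchange.getRequest().getHeaders()
            .getFirst("X-AI-Temperature");
        if (!"0".equals(temperature)) {
            return chain.filter(exchange);
        }

        // Reading the body consumes it, so buffer it first: cacheRequestBody keeps a
        // copy that can be hashed here and still be forwarded to the provider.
        return ServerWebExchangeUtils.cacheRequestBody(exchange, cachedRequest -> {
            ServerWebExchange cachedExchange = exchange.mutate().request(cachedRequest).build();
            DataBuffer body = exchange.getAttribute(ServerWebExchangeUtils.CACHED_REQUEST_BODY_ATTR);
            String cacheKey = "ai:cache:" + hashBody(body);  // body is null for empty requests
            return redis.opsForValue().get(cacheKey)
                .map(Optional::of)
                .defaultIfEmpty(Optional.empty())   // distinguish a miss from a served hit
                .flatMap(hit -> hit.isPresent()
                    ? writeCachedResponse(exchange, hit.get())
                    : chain.filter(cachedExchange)
                        .doOnSuccess(v -> cacheResponse(exchange, cacheKey)));
        });
    }

    // hashBody, writeCachedResponse, and cacheResponse are helper methods not shown here.
}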
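
The filter above leaves hashBody undefined. One possible shape for it, as a sketch: hash the raw request bytes with SHA-256 and prefix the result. CacheKeys and both method names are hypothetical, and including the provider in the key is an added assumption so that the same prompt sent to different providers never shares an entry.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class CacheKeys {
    private CacheKeys() {}

    /** Lower-case hex SHA-256 digest, always 64 characters long. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            // BigInteger(1, ...) treats the digest as unsigned; %064x keeps leading zeros
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable on this JVM", e);
        }
    }

    /** Redis key for a request: same provider and same body bytes give the same key. */
    static String cacheKey(String provider, String requestBody) {
        byte[] bytes = requestBody.getBytes(StandardCharsets.UTF_8);
        return "ai:cache:" + provider + ":" + sha256Hex(bytes);
    }
}
```

Because the hash covers raw bytes, two JSON bodies that differ only in whitespace or key order produce different keys. That is safe (never a false hit) but lowers the hit rate, so callers benefit from serializing requests consistently. Since the model name is part of the request body, it is covered by the hash as well.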

Observability: Log Every Routed Request

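import java.util.UUID;
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.core.Ordered;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class AiAuditFilter implements GlobalFilter, Ordered {

    private static final Logger log = LoggerFactory.getLogger(AiAuditFilter.class);

    private final MeterRegistry meterRegistry;

    public AiAuditFilter(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        long start = System.currentTimeMillis();
        String requestId = UUID.randomUUID().toString();

        exchange.getResponse().beforeCommit(() -> {
            long duration = System.currentTimeMillis() - start;
            String provider = resolvedProvider(exchange);
            String status = String.valueOf(exchange.getResponse().getStatusCode());
            log.info("AI Gateway | reqId={} provider={} status={} duration={}ms",
                requestId, provider, status, duration);
            // Emit metric for Grafana/CloudWatch dashboards
            meterRegistry.timer("ai.gateway.request", "provider", provider, "status", status)
                .record(duration, TimeUnit.MILLISECONDS);
            return Mono.empty();
        });

        return chain.filter(exchange);
    }

    // Later filters work on a mutated copy of the exchange, so request headers they add
    // are usually not visible here. Exchange attributes are shared, so check those first
    // (this assumes the routing filter stores its choice under "resolvedProvider").
    private static String resolvedProvider(ServerWebExchange exchange) {
        String provider = exchange.getAttribute("resolvedProvider");
        if (provider == null) {
            provider = exchange.getRequest().getHeaders().getFirst("X-Resolved-Provider");
        }
        return provider != null ? provider : "unknown";  // Micrometer rejects null tag values
    }

    @Override
    public int getOrder() { return Ordered.HIGHEST_PRECEDENCE; }
}

Gateway Architecture Summary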

The AI gateway pattern: (1) Calling services use a single internal endpoint, not provider URLs directly. (2) The gateway injects API keys, selects providers, and enforces rate limits, all centrally. (3) Circuit breakers at the gateway level fail over to a fallback provider when the primary errors or times out. (4) Redis caches responses for temperature=0 calls. (5) Every request is logged with provider, latency, and status for cost and reliability dashboards. With this architecture, a provider migration becomes a gateway configuration change rather than a code change in every service.