Deployment Option Comparison

| Option | Best For | Key Consideration |
|---|---|---|
| ECS Fargate | Long-running services, streaming responses, websockets | Always-on cost; no cold start; supports requests longer than 15 min |
| Lambda + SnapStart | Bursty, infrequent AI calls; async processing | 15 min max timeout; SnapStart sharply reduces Java cold starts |
| AWS Bedrock | Replacing direct Anthropic/OpenAI calls with AWS-managed models | IAM auth (no API keys); stays within AWS; data residency |
| EC2 + Auto Scaling | GPU inference or extremely high throughput | Overkill for API-based LLM calls; use only for self-hosted models |

ECS Fargate: Container Sizing for Java AI

Java AI services need more memory than typical Spring Boot apps. The JVM baseline, Spring AI libraries, embedded vector client, and concurrent LLM responses in flight add up quickly:

| Component | Memory |
|---|---|
| JVM base (Java 21, G1GC) | ~200 MB |
| Spring Boot + Spring AI + dependencies | ~350 MB |
| PGVector client + connection pool | ~50 MB |
| Concurrent LLM responses in flight (10 × 8k tokens) | ~400 MB |
| Headroom (GC, metaspace, thread stacks) | ~200 MB |
| Recommended minimum | 2 GB |

Start with 1 vCPU / 2 GB for development. Scale to 2 vCPU / 4 GB for production handling 10+ concurrent AI requests.
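
The sizing arithmetic can be checked with a few lines of Java. This is a minimal sketch: the class name MemoryBudget and its methods are made up for illustration, and the component figures are the estimates from the table, not measurements. It sums the estimates and computes the heap ceiling that -XX:MaxRAMPercentage would produce for a given container size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper: sums the component estimates and derives the heap cap. */
final class MemoryBudget {

    /** Heap ceiling in MB that -XX:MaxRAMPercentage yields for a container of the given size. */
    static long maxHeapMb(long containerMb, double maxRamPercentage) {
        return Math.round(containerMb * maxRamPercentage / 100.0);
    }

    /** Total of the per-component estimates, in MB. */
    static long totalMb(Map<String, Long> componentsMb) {
        return componentsMb.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Estimates copied from the sizing table
        Map<String, Long> components = new LinkedHashMap<>();
        components.put("JVM base", 200L);
        components.put("Spring Boot + Spring AI", 350L);
        components.put("PGVector client + pool", 50L);
        components.put("LLM responses in flight", 400L);
        components.put("Headroom", 200L);

        System.out.println("Estimated footprint: " + totalMb(components) + " MB");
        System.out.println("Heap cap at 75% of 2 GB: " + maxHeapMb(2048, 75.0) + " MB");
        System.out.println("Heap cap at 75% of 4 GB: " + maxHeapMb(4096, 75.0) + " MB");
    }
}
```

The estimates total about 1.2 GB, which is why 2 GB is the smallest Fargate memory setting that leaves a margin. The table mixes heap and non-heap memory, so treat the comparison with the heap cap as a rough guide only.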

ECS Task Definition (CloudFormation)

# cloudformation/ai-service.yaml
AiServiceTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: ai-service
    NetworkMode: awsvpc
    RequiresCompatibilities: [FARGATE]
    Cpu: '2048'         # 2 vCPU
    Memory: '4096'       # 4 GB
    ExecutionRoleArn: !GetAtt EcsExecutionRole.Arn
    TaskRoleArn: !GetAtt AiServiceTaskRole.Arn
    ContainerDefinitions:
      - Name: ai-service
        Image: !Sub '${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/ai-service:${ImageTag}'
        PortMappings:
          - ContainerPort: 8080
        Environment:
          - Name: ENVIRONMENT
            Value: !Ref Environment
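          - Name: JAVA_OPTS
            # MaxRAMPercentage=75 caps the heap at about 3 GB in a 4 GB container.
            # JAVA_OPTS only takes effect if the image entrypoint passes it to java;
            # the JVM itself reads JAVA_TOOL_OPTIONS.
            Value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC -XX:+UseStringDeduplication"
        Secrets:
          # Fetched from Secrets Manager by the execution role at task start and
          # injected as container environment variables; never stored in the template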
          - Name: ANTHROPIC_API_KEY
            ValueFrom: !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:ai-service/anthropic-api-key'
          - Name: OPENAI_API_KEY
            ValueFrom: !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:ai-service/openai-api-key'
        HealthCheck:
          Command: [CMD, curl, -f, "http://localhost:8080/actuator/health"]
          Interval: 30
          Timeout: 5
          Retries: 3
          StartPeriod: 60   # Java needs time to start; allow 60s before health checks
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref AiServiceLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: ai-service

Secrets Manager: Zero Plaintext API Keys

# Store the API key — run once during provisioning
aws secretsmanager create-secret \
  --name ai-service/anthropic-api-key \
  --secret-string "sk-ant-..." \
  --region us-east-1

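# IAM policy for the ECS task execution role (least privilege). ECS resolves
# task-definition Secrets with the execution role, not the task role.
# Attach AmazonECSTaskExecutionRolePolicy as well for ECR pulls and CloudWatch Logs.
EcsExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal: { Service: ecs-tasks.amazonaws.com }
          Action: sts:AssumeRole
    Policies:
      - PolicyName: SecretsAccess
        PolicyDocument:
          Statement:
            - Effect: Allow
              Action: [secretsmanager:GetSecretValue]
              Resource:
                - !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:ai-service/*'

Why Secrets Manager, not SSM Parameter Store?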

Both work, and both encrypt values with KMS and record access in CloudTrail. Secrets Manager adds built-in rotation scheduling and staging labels that keep the previous version available during a rotation. Rotation is driven by a Lambda function, which you have to write yourself for third-party keys such as Anthropic's or OpenAI's. For API keys that cost money if leaked, that rotation support is worth the small additional cost ($0.40/secret/month, versus free for SSM Standard parameters).
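
Because ECS injects each Secrets entry into the container as an environment variable, the application can check at startup that every key actually arrived. The following is a minimal sketch of such a fail-fast check. The class RequiredSecrets and its methods are hypothetical names, not part of Spring or the AWS SDK, and the environment is passed in as a map so the logic can be exercised without real secrets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative startup check for secrets injected as environment variables. */
final class RequiredSecrets {

    /** Returns the names that are absent or blank in the given environment. */
    static List<String> missing(Map<String, String> env, List<String> names) {
        List<String> absent = new ArrayList<>();
        for (String name : names) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                absent.add(name);
            }
        }
        return absent;
    }

    /** Throws if any required secret is missing, so the task fails before serving traffic. */
    static void requireAll(Map<String, String> env, List<String> names) {
        List<String> absent = missing(env, names);
        if (!absent.isEmpty()) {
            // Report names only; secret values must never reach the logs
            throw new IllegalStateException("Missing secrets: " + String.join(", ", absent));
        }
    }

    public static void main(String[] args) {
        List<String> absent = missing(System.getenv(),
                List.of("ANTHROPIC_API_KEY", "OPENAI_API_KEY"));
        System.out.println(absent.isEmpty() ? "All required secrets present" : "Missing: " + absent);
    }
}
```

In a Spring Boot service, a call such as requireAll(System.getenv(), ...) early in startup would make a misconfigured task exit quickly, instead of failing later on the first LLM call.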

Lambda + SnapStart for Infrequent AI Calls

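Java cold starts on Lambda (5–15 seconds for a Spring application) are sharply reduced by SnapStart, which snapshots the initialized execution environment when a version is published and restores from that snapshot instead of initializing from scratch: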

# Lambda function with SnapStart (CloudFormation)
AiLambdaFunction:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: ai-classifier
    Runtime: java21
    Handler: com.example.ai.LambdaHandler::handleRequest
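    # Role and Code are required properties; omitted here for brevity
    MemorySize: 2048      # 2 GB for Java AI Lambda
    Timeout: 900          # 15 min max; LLM calls can be slow
    SnapStart:
      # Snapshots are taken per published version; invoke a version or alias, not $LATEST
      ApplyOn: PublishedVersions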
    Environment:
      Variables:
        ENVIRONMENT: !Ref Environment
    VpcConfig:                   # if accessing RDS/ElastiCache
      SecurityGroupIds: [!Ref LambdaSecurityGroup]
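      SubnetIds: !Ref PrivateSubnets

// Lambda handler: registers CRaC hooks for SnapStart compatibility
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;

@SpringBootApplication
public class LambdaHandler implements
        RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent>,
        Resource {

    // Started once during init, so the Spring context is part of the snapshot
    private static final ConfigurableApplicationContext context =
            SpringApplication.run(LambdaHandler.class);

    public LambdaHandler() {
        // Register the instance Lambda keeps alive; the CRaC context holds resources
        // weakly, so a throwaway instance could be collected before the checkpoint
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> ctx) {
        // Evict pooled connections so no open sockets are captured in the snapshot
        DataSource ds = context.getBean(DataSource.class);
        ((HikariDataSource) ds).getHikariPoolMXBean().softEvictConnections();
    }

    @Override
    public void afterRestore(Context<? extends Resource> ctx) {
        // Nothing to do: the pool opens fresh connections on first use after restore
    }

    @Override
    public APIGatewayProxyResponseEvent handleRequest(
            APIGatewayProxyRequestEvent event,
            // Fully qualified to avoid clashing with org.crac.Context
            com.amazonaws.services.lambda.runtime.Context lambdaContext) {
        var handler = context.getBean(AiRequestHandler.class);
        return handler.handle(event);
    }
}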

AWS Bedrock: Managed Models Without Third-Party API Keys

AWS Bedrock provides Claude (Anthropic), Llama (Meta), Titan (Amazon), and others via IAM authentication — no Anthropic API key required. All traffic stays within AWS, which matters for data residency and compliance:

# application.yml — Spring AI Bedrock config
spring:
  ai:
    bedrock:
      aws:
        region: us-east-1
        # No API key — uses IAM role of ECS task / Lambda execution role
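      converse:
        chat:
          options:
            # Property prefix used by the Converse starter; confirm the exact
            # model ID for your region in the Bedrock console
            model: anthropic.claude-sonnet-4-6-20251001-v2:0
            max-tokens: 2048
            temperature: 0.7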

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-bedrock-converse-spring-boot-starter</artifactId>
</dependency>

# IAM policy for ECS task: grants Bedrock model invocation
BedrockPolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    PolicyDocument:
      Statement:
        - Effect: Allow
          Action:
            - bedrock:InvokeModel
            - bedrock:InvokeModelWithResponseStream
          Resource:
            - "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
            - "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-*"

Auto-Scaling: AI-Aware Scaling Policy

Standard CPU-based auto-scaling doesn't reflect AI workload well — threads spend most time waiting for the LLM API, keeping CPU low. Scale on custom metrics instead:

# Scale on pending AI requests in the queue (CloudWatch custom metric)
AiServiceScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: AiQueueDepthScaling
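    PolicyType: TargetTrackingScaling
    # Required link to the scalable target; AiServiceScalableTarget is an assumed
    # AWS::ApplicationAutoScaling::ScalableTarget for the ECS service, defined elsewhere
    ScalingTargetId: !Ref AiServiceScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      CustomizedMetricSpecification:
        Namespace: AiService/production
        MetricName: ai.requests.pending
        Statistic: Average   # publish per task so the average falls as tasks are added
      TargetValue: 10.0    # scale out when avg pending requests > 10
      ScaleInCooldown: 300  # 5 min before scaling in (LLM calls are slow)
      ScaleOutCooldown: 60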
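
The policy above assumes the service publishes ai.requests.pending itself. A minimal sketch of the counting side, using only the JDK: the class PendingRequestGauge is a hypothetical name, and publishing the value to CloudWatch (for example through a metrics library) is deliberately left out. The helper estimatedDesiredTasks is a rough proportional approximation of how target tracking reacts, not the exact algorithm Application Auto Scaling uses.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative in-process counter for LLM calls that have started but not finished. */
final class PendingRequestGauge {

    private final AtomicInteger pending = new AtomicInteger();

    /** Runs the call while counting it as pending, even if it throws. */
    <T> T track(Supplier<T> llmCall) {
        pending.incrementAndGet();
        try {
            return llmCall.get();
        } finally {
            pending.decrementAndGet();
        }
    }

    /** Value to report as this task's ai.requests.pending data point. */
    int current() {
        return pending.get();
    }

    /** Rough estimate of the task count that brings the per-task average back to target. */
    static int estimatedDesiredTasks(int runningTasks, double avgPendingPerTask, double target) {
        return Math.max(1, (int) Math.ceil(runningTasks * avgPendingPerTask / target));
    }

    public static void main(String[] args) {
        PendingRequestGauge gauge = new PendingRequestGauge();
        // Stub standing in for a real LLM call
        String reply = gauge.track(() -> "pending during call: " + gauge.current());
        System.out.println(reply);
        System.out.println("pending after call: " + gauge.current());
        System.out.println("estimated tasks: " + estimatedDesiredTasks(4, 25.0, 10.0));
    }
}
```

With 4 tasks averaging 25 pending requests against a target of 10, the estimate comes to 10 tasks, which illustrates why a queue-depth metric reacts to load that CPU utilization would miss.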

VPC Configuration for AI Services

# ECS tasks in private subnets; NAT Gateway for outbound LLM API calls
AiServiceSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: AI Service
    VpcId: !Ref VpcId
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 8080
        ToPort: 8080
        SourceSecurityGroupId: !Ref AlbSecurityGroupId
    SecurityGroupEgress:
      - IpProtocol: tcp    # HTTPS to Anthropic/OpenAI APIs
        FromPort: 443
        ToPort: 443
        CidrIp: 0.0.0.0/0
      - IpProtocol: tcp    # PostgreSQL
        FromPort: 5432
        ToPort: 5432
        DestinationSecurityGroupId: !Ref DatabaseSecurityGroupId
      - IpProtocol: tcp    # Redis
        FromPort: 6379
        ToPort: 6379
        DestinationSecurityGroupId: !Ref RedisSecurityGroupId

Production Deployment Checklist

  • Container memory: min 2 GB for Java AI; 4 GB for production with concurrent calls
  • JVM flags: -XX:MaxRAMPercentage=75.0 keeps the heap within the container limit; -XX:+UseG1GC for balanced pauses
  • Health check start period: set StartPeriod: 60; Java takes 30–45s to start
  • Task timeout: ALB idle timeout must exceed max LLM latency (set to 300s)
  • Secrets Manager: API keys as Secrets; never as plaintext Environment values in the task definition
  • SnapStart: strongly recommended for Java Lambda; removes most of the 5–15s cold start
  • Bedrock consideration: if data residency or compliance matters, Bedrock keeps everything in AWS
  • Scale on custom metrics: CPU is a poor AI scaling signal; use pending request count
  • NAT Gateway: private subnets need NAT for outbound calls to Anthropic/OpenAI
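
The timeout item in the checklist implies an ordering: the service's outbound LLM call should give up before the ALB closes the idle inbound connection. A minimal sketch with the JDK's java.net.http client follows. The class LlmTimeouts, the 20-second margin, the 10-second connect timeout, and the endpoint URL are all assumptions chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Illustrative timeout settings that stay below the ALB idle timeout. */
final class LlmTimeouts {

    // Mirrors the 300 s ALB idle timeout from the checklist
    static final Duration ALB_IDLE_TIMEOUT = Duration.ofSeconds(300);
    // Assumed margin so the client times out before the load balancer does
    static final Duration REQUEST_TIMEOUT = ALB_IDLE_TIMEOUT.minusSeconds(20);
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(10);

    static HttpClient newClient() {
        return HttpClient.newBuilder().connectTimeout(CONNECT_TIMEOUT).build();
    }

    /** Builds a POST request whose overall timeout is shorter than the ALB idle timeout. */
    static HttpRequest newRequest(URI endpoint, String jsonBody) {
        return HttpRequest.newBuilder(endpoint)
                .timeout(REQUEST_TIMEOUT)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder endpoint; nothing is sent in this sketch
        HttpRequest request = newRequest(URI.create("https://llm.example.com/v1/chat"), "{}");
        System.out.println("Request timeout: "
                + request.timeout().orElseThrow().toSeconds() + " s");
    }
}
```

Keeping the client timeout below the load balancer's lets the service return its own error response, so the caller does not just see the connection dropped.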

Series Complete: Your Java AI Stack

You've now covered the complete Java AI stack from first call to production deployment:

  1. Spring AI — ChatClient, PromptTemplate, streaming, conversation memory
  2. Anthropic SDK — direct API access, streaming, tool use, vision, prompt caching
  3. RAG — PGVector, document ingestion, hybrid search, QuestionAnswerAdvisor
  4. Function Calling — @Tool, FunctionCallback, role-based tool authorization
  5. Microservices — service decomposition, Kafka async, Resilience4j, distributed rate limiting
  6. AI Router — Spring Cloud Gateway, provider selection, Redis rate limiting, response caching
  7. Security — OWASP LLM Top 10, prompt injection, PII scrubbing, token budgets
  8. CI/CD — GitHub Actions, WireMock testing, Docker multi-stage, ECS blue/green
  9. Observability — Micrometer metrics, cost tracking, circuit breaker health, Grafana dashboards
  10. AWS Deployment — ECS Fargate, Lambda SnapStart, Bedrock, Secrets Manager, CloudFormation