Your AI model works perfectly in development. You deploy it to production. Three weeks later, your AWS bill arrives and SageMaker costs are 4x what you budgeted. The model hasn’t changed. Traffic is exactly what you estimated. But you picked the wrong inference option.
SageMaker offers four inference modes: real-time endpoints, serverless inference, asynchronous inference, and batch transform. Each has a different pricing model. Each optimizes for different traffic patterns. Most teams start with real-time endpoints because they seem straightforward, then discover they’re paying for capacity they don’t use.
The decision isn’t obvious from AWS pricing pages. You need to understand both the billing mechanics and your actual usage patterns. A real-time endpoint that costs $1,000/month for predictable traffic might cost $200/month as serverless inference with the same total request volume. Or it might cost $2,000/month if your traffic pattern is wrong for serverless.
Real-Time Endpoints: Pay for Time, Not Usage
Real-time endpoints run continuously. You provision one or more instances, deploy your model, and pay hourly regardless of whether requests arrive. The instance runs 24/7 until you delete the endpoint.
A single ml.m5.xlarge instance costs $0.269/hour, or about $196/month. An ml.g5.xlarge GPU instance costs about $1.41/hour (check your region), or roughly $1,030/month. These prices are for us-east-1 as of October 2025 and vary by region.
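If you deploy with the SageMaker Python SDK, a real-time endpoint looks roughly like the sketch below. The container image, model artifact path, role ARN, and endpoint name are placeholders, not values from this article.

```python
# Minimal real-time deployment sketch with the SageMaker Python SDK.
# Image URI, model artifact, role, and endpoint name are hypothetical.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

model = Model(
    image_uri="<your-inference-container-image>",
    model_data="s3://your-bucket/models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/YourSageMakerRole",
    sagemaker_session=session,
)

# Billing starts once the instance is provisioned and continues until you
# call predictor.delete_endpoint(), whether or not any requests arrive.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="churn-model-realtime",
)
```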
The math is simple but unforgiving. If you provision two ml.g5.xlarge instances for high availability across availability zones, you pay $2,060/month even if your model only serves 100 requests per day. The cost is entirely time-based, not usage-based.
Real-time endpoints make economic sense when utilization is high. If your model serves requests continuously throughout the day, you get predictable low-latency inference without cold starts. Autoscaling lets you add capacity during peak hours and reduce it during off-hours, but you’re always paying for at least your minimum instance count.
The break-even point depends on your specific requirements. For a model requiring GPU inference, real-time endpoints typically become cost-effective when you’re serving requests more than 30% of the time. Below that threshold, you’re paying for idle capacity.
Instance selection matters significantly. SageMaker offers dozens of instance types. An ml.m5.large costs $0.134/hour while an ml.m5.24xlarge costs $6.461/hour. GPU instances range from ml.g4dn.xlarge at $0.736/hour to ml.p4d.24xlarge at $37.688/hour.
Right-sizing is critical but not obvious. A model that runs fine on ml.m5.xlarge in development might need ml.m5.4xlarge in production when handling concurrent requests. SageMaker Inference Recommender helps identify optimal instance types, but many teams skip this step and over-provision.
Data transfer adds to costs. Cross-AZ data transfer costs $0.01/GB in each direction. If your application and SageMaker endpoint are in different availability zones and you’re moving large payloads (image data, long text), this adds up quickly. Co-locate when possible.
Serverless Inference: Pay per Request with Cold Starts
Serverless inference scales capacity automatically based on traffic. You pay for compute time and the number of requests. When idle, you pay nothing. This sounds perfect, but the pricing structure requires careful analysis.
Serverless inference charges two components: compute time and request count. Compute time is measured in milliseconds and scales with the memory configuration you choose, which ranges from 1GB to 6GB. AWS lists Serverless Inference compute pricing per GB-second (roughly $0.00002 per second at 1GB in public guidance), so you multiply by the memory you choose: a 4GB configuration costs approximately $0.00008 per inference-second. Always verify the latest rates for your region.
Request pricing adds $0.20 per 1,000 requests regardless of memory configuration or inference time. This flat request fee matters significantly at high volumes.
Cold start behavior is the key tradeoff. When your endpoint hasn’t received requests recently, the first request after an idle period takes 1-3 seconds extra while SageMaker provisions capacity. Subsequent requests process normally. If requests continue, capacity stays warm. If requests stop, capacity scales to zero.
The economics work differently than real-time endpoints. Consider a model that takes 500ms per inference using 4GB memory. Each inference costs $0.00004 in compute (0.5 seconds * $0.00008/second) plus $0.0002 in request fees, totaling $0.00024 per request.
At 10,000 requests per month, serverless costs $2.40. At 100,000 requests per month, serverless costs $24.00. A real-time ml.m5.xlarge endpoint costs $196/month regardless of request count. Serverless is cheaper up to about 800,000 requests per month at this inference time and memory configuration.
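A quick script makes the break-even arithmetic explicit. The rates below are the illustrative figures from this section, not authoritative pricing; plug in current numbers for your region.

```python
# Back-of-the-envelope comparison between a fixed real-time endpoint and
# serverless inference. All rates are illustrative, not official pricing.

REALTIME_MONTHLY = 0.269 * 730      # ml.m5.xlarge at ~730 hours per month
SERVERLESS_PER_SEC_4GB = 0.00008    # compute rate at 4GB memory
REQUEST_FEE = 0.20 / 1000           # flat fee per request
INFERENCE_SECONDS = 0.5             # 500ms per inference

cost_per_request = INFERENCE_SECONDS * SERVERLESS_PER_SEC_4GB + REQUEST_FEE
break_even = REALTIME_MONTHLY / cost_per_request

print(f"Serverless cost per request: ${cost_per_request:.5f}")
print(f"Break-even volume: ~{break_even:,.0f} requests/month")
# Prints a break-even around 818,000 requests/month, consistent with
# the ~800,000 figure above.
```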
But traffic patterns matter critically. If your requests are evenly distributed (one request every few minutes), you hit cold starts frequently. Users experience inconsistent latency. If requests arrive in bursts followed by idle periods, serverless works well – capacity scales up during bursts and down during idle time.
Serverless inference currently supports CPU only. If your model requires GPU, you need real-time or asynchronous endpoints. This limitation eliminates serverless for many use cases including large language models and computer vision models that need GPU acceleration.
The memory configuration affects both cost and performance. Higher memory allocates more CPU proportionally. A model might run in 500ms with 4GB but 200ms with 6GB. The 6GB configuration costs 50% more per second ($0.00012 vs $0.00008) but finishes faster, potentially resulting in lower total cost per inference.
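With the SageMaker Python SDK, memory is set through a serverless inference config at deploy time. This sketch reuses the model object from the real-time example above; the memory and concurrency values are illustrative.

```python
# Serverless deployment sketch; memory and concurrency values are examples.
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,   # 1GB-6GB range; more memory also means more CPU
    max_concurrency=10,       # concurrent invocations before throttling
)

# Reusing the `model` object defined in the real-time sketch earlier.
predictor = model.deploy(serverless_inference_config=serverless_config)
```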
Serverless inference supports request payloads up to 4MB with approximately 60-second processing time limits. Real-time endpoints support payloads up to 25MB (60-second regular responses; 8 minutes with streaming). If your use case requires larger payloads, you need asynchronous inference.
Asynchronous Inference: Queue-Based Processing
Asynchronous inference handles requests through a queue. Clients upload the input payload to S3 and invoke the endpoint with its location, SageMaker processes requests when capacity is available and writes results back to S3, and clients poll for results or get notified via SNS.
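In code, the submit-and-poll flow looks roughly like this with boto3. The endpoint name, bucket, and object key are placeholders.

```python
# Asynchronous invocation sketch; endpoint name and S3 paths are hypothetical.
import boto3

runtime = boto3.client("sagemaker-runtime")

# The input payload lives in S3 rather than in the request body.
response = runtime.invoke_endpoint_async(
    EndpointName="doc-classifier-async",
    InputLocation="s3://your-bucket/async-inputs/doc-001.json",
    ContentType="application/json",
)

# SageMaker returns immediately with the S3 location where the result will
# appear once the queued request has been processed.
print(response["OutputLocation"])
print(response["InferenceId"])
```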
The pricing model resembles real-time endpoints – you pay hourly for instances. An ml.m5.xlarge costs the same $0.269/hour whether used for real-time or asynchronous inference. The difference is in scaling behavior and request handling.
Asynchronous endpoints can scale to zero when the request queue is empty. You stop paying for instances during idle periods. When requests arrive, SageMaker provisions instances automatically, though initial provisioning takes several minutes. This makes asynchronous inference economically similar to serverless but with different operational characteristics.
Cold start time for asynchronous endpoints is typically 3-5 minutes on first spin-up. This matters less because clients expect delayed results anyway. If you’re processing batch jobs overnight or handling background tasks, the cold start is irrelevant.
Asynchronous inference supports request payloads up to 1GB – significantly larger than real-time (25MB) or serverless (4MB) endpoints. This makes async the only viable option for processing large images, videos, or documents.
Asynchronous inference makes sense for specific use cases: batch processing large datasets, long-running inference (minutes per request), processing workflows where immediate results aren’t needed, and traffic with highly variable volume.
The queue provides natural buffering. If traffic spikes, requests queue up rather than overwhelming instances. SageMaker scales capacity to process the queue. This prevents overload and provides cost efficiency – you only pay for instances when actively processing.
Batch Transform: Offline Processing
Batch transform processes entire datasets offline. You provide input data in S3, specify an instance type and count, and SageMaker spins up instances, processes all data, writes results to S3, and terminates instances.
Pricing is purely usage-based. You pay for instance hours actually used. If batch transform takes 2 hours with 4 ml.m5.xlarge instances, you pay for 8 instance-hours ($2.15).
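With the SageMaker Python SDK, that job looks roughly like the sketch below, reusing the model object from earlier. The S3 paths and content type are placeholders matching the 4 x ml.m5.xlarge example.

```python
# Batch transform sketch; S3 paths and content type are hypothetical.
transformer = model.transformer(
    instance_count=4,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/batch-output/",
)

transformer.transform(
    data="s3://your-bucket/batch-input/",  # input dataset in S3
    content_type="text/csv",
    split_type="Line",                     # one record per line
)
transformer.wait()
# Instances terminate automatically when the job finishes; billing stops then.
```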
Batch transform makes economic sense when you need to process large datasets that don’t require real-time results. Examples include scoring all customers for churn risk overnight, processing archived images for classification, or running model inference on historical data.
The cost advantage comes from no idle time. Unlike real-time endpoints running 24/7, batch transform instances exist only during the actual job. You provision exactly the capacity needed, process data, and release resources.
You control parallelism explicitly. Need to process 1 million images? Provision 10 instances and batch transform distributes the workload. More instances finish faster but cost more per hour. Fewer instances take longer but spread cost over time.
Batch transform supports data parallelism automatically. It splits your input data across instances. Each instance processes its shard independently. Results combine in S3 when all instances complete.
The limitations are straightforward. Batch transform doesn’t provide an API endpoint. You can’t send individual requests. It’s designed for offline batch processing, not production serving. If you need online inference, batch transform isn’t an option.
Data Transfer and Storage Costs
SageMaker inference billing includes more than compute. Data transfer and storage add costs that surprise teams who focus only on instance pricing.
Input and output data stored in S3 incurs standard S3 costs. If you’re using asynchronous inference or batch transform with large datasets, S3 costs can exceed compute costs. S3 Standard storage costs $0.023/GB per month for the first 50TB.
Data transfer between S3 and SageMaker in the same region is free. Data transfer out of AWS to the internet costs $0.09/GB for the first 10TB per month. If clients download large model outputs, this matters.
Model artifacts stored in S3 cost money. A 10GB model costs $0.23/month to store. Models don’t change frequently, so storage costs are typically negligible compared to compute. But if you’re version-controlling dozens of model variants, storage adds up.
CloudWatch logs and metrics generate costs. SageMaker logs all requests by default. Log ingestion costs $0.50/GB, and log storage costs $0.03/GB per month. For high-traffic endpoints, logging costs can reach hundreds of dollars monthly. Consider sampling logs rather than capturing every request.
If you use Interface VPC Endpoints (PrivateLink) to reach SageMaker or other AWS services, expect about $0.01/hour per availability zone ($7.30/month per AZ) plus $0.01/GB data processing through the endpoint. Also remember cross-AZ traffic effectively costs $0.02/GB (billed $0.01/GB in and $0.01/GB out). If you’re running SageMaker in a VPC for security, these costs apply regardless of usage.
Multi-Model Endpoints: Amortizing Fixed Costs
Multi-model endpoints let you host multiple models on a single endpoint. SageMaker loads models dynamically based on which model the request targets. This amortizes the fixed cost of running instances across many models.
The economics work when you have many small models rather than one large model. Instead of running 20 separate real-time endpoints (20 * $196/month = $3,920/month for ml.m5.xlarge), you run one multi-model endpoint serving all 20 models ($196/month).
Multi-model endpoints store models in S3 and load them on demand. The first request for a model experiences higher latency while SageMaker loads it into memory. Subsequent requests hit the cached model and process normally. If memory fills, SageMaker evicts least-recently-used models.
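On the invocation side, the caller names the model per request. A minimal boto3 sketch, with placeholder endpoint, archive, and payload values:

```python
# Multi-model endpoint invocation sketch; names and payload are hypothetical.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="per-customer-models",
    TargetModel="customer-42/model.tar.gz",  # S3 key relative to the model prefix
    ContentType="application/json",
    Body=b'{"features": [1.2, 3.4, 5.6]}',
)
print(response["Body"].read())
# The first call for customer-42 pays the loading latency; later calls hit the
# cached copy until it is evicted.
```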
This pattern fits specific use cases well: per-customer models where each customer has a customized model, A/B testing many model variants, or microservice architectures where different services use different models.
The limitation is memory constraints. The models actively being served must fit in instance memory at the same time. If models are large or traffic spreads across many models, you’ll experience frequent evictions and loading latency.
Multi-model endpoints work with real-time inference only. You can’t use this pattern with serverless, asynchronous, or batch transform.
Shadow Traffic and Cost Implications
Production testing new models typically requires shadow traffic – sending copies of production requests to new model versions to compare performance without impacting users. This doubles your inference costs during the testing period.
SageMaker offers native shadow variants for real-time endpoints, and you can also implement shadowing at the application layer by sending each request to both the production and challenger endpoints. Either way, both models process every shadowed request, and you pay for both inferences.
For real-time endpoints, shadow testing means running a second set of instances (a separate endpoint or a shadow variant) alongside production, so your instance costs roughly double for the duration of the test.
For serverless inference, shadow testing exactly doubles costs because you pay per request. Every production request generates two inferences – one to prod, one to canary.
Budget for this explicitly. If your baseline inference cost is $1,000/month and you run shadow traffic for two weeks monthly while testing new models, your effective cost is $1,500/month.
Alternatives exist. Sample production traffic (10% instead of 100%) for canary testing. This reduces shadow costs by 90% while still providing model performance data. Or use batch transform offline on recorded production traffic rather than real-time shadow testing.
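A minimal application-layer sketch of sampled shadowing might look like the following. The endpoint names and the 10% rate are assumptions, and a production version would make the shadow call off the request path so it doesn't add user-facing latency.

```python
# Sampled shadow traffic sketch; endpoint names and rate are hypothetical.
import random
import boto3

runtime = boto3.client("sagemaker-runtime")
SHADOW_SAMPLE_RATE = 0.10  # send ~10% of production traffic to the challenger

def predict(payload: bytes) -> bytes:
    result = runtime.invoke_endpoint(
        EndpointName="model-prod",
        ContentType="application/json",
        Body=payload,
    )["Body"].read()

    # Shadow call: the response is only logged for comparison, never returned
    # to users. In production, run this asynchronously off the request path.
    if random.random() < SHADOW_SAMPLE_RATE:
        runtime.invoke_endpoint(
            EndpointName="model-canary",
            ContentType="application/json",
            Body=payload,
        )

    return result
```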
Cost Optimization Strategies
Right-size instances using SageMaker Inference Recommender before deploying to production. Testing instance types manually is time-consuming and error-prone. Inference Recommender runs automated load tests and recommends optimal configurations.
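Starting a default recommendation job is a small boto3 call. The job name, role, and model package ARN below are placeholders, and the exact input options depend on how your model is registered, so treat this as a sketch rather than a template.

```python
# Inference Recommender sketch; names and ARNs are hypothetical.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="churn-model-recommendation",
    JobType="Default",  # automated load tests across candidate instance types
    RoleArn="arn:aws:iam::123456789012:role/YourSageMakerRole",
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:model-package/churn/1"
        ),
    },
)

# Poll for results; the finished job lists candidate instance types with
# latency, throughput, and cost estimates.
job = sm.describe_inference_recommendations_job(JobName="churn-model-recommendation")
print(job["Status"])
```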
Use Savings Plans if running real-time endpoints continuously. A one-year commitment provides up to 27% discount on compute costs. Three-year commitments save up to 50%. This only makes sense for stable, long-running production workloads.
Implement request batching when possible. Many models process batches more efficiently than individual requests. If clients can tolerate 50-100ms extra latency, batch requests before sending to SageMaker. This reduces request count fees for serverless and improves throughput for real-time.
Monitor actual utilization. CloudWatch provides metrics on CPU, memory, and disk usage. Many teams over-provision because they guess capacity needs rather than measuring. If CPU utilization averages 20%, you can likely downgrade instance size.
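Endpoint instance metrics land in the /aws/sagemaker/Endpoints CloudWatch namespace. Here is a sketch for pulling a week of average CPU utilization, with placeholder endpoint and variant names:

```python
# CloudWatch utilization sketch; endpoint and variant names are hypothetical.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

stats = cw.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "churn-model-realtime"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)

# CPUUtilization is reported per core, so multi-core instances can exceed 100%.
# Consistently low averages are a signal you can downsize the instance.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```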
Use spot instances for batch transform and asynchronous inference. Spot instances cost 70% less than on-demand but can be interrupted. For workloads that tolerate interruption (batch jobs, async queue processing), spot dramatically reduces costs.
Set up auto-scaling for real-time endpoints carefully. The default scaling policies often react too slowly (scaling up after latency has already degraded) or too aggressively (constantly adding and removing instances). Custom policies based on request count or queue depth work better than CPU-based scaling.
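A target-tracking policy keyed on invocations per instance is a common alternative to CPU-based scaling. The sketch below uses placeholder endpoint and variant names and an example target value.

```python
# Target-tracking autoscaling sketch; names and thresholds are hypothetical.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/churn-model-realtime/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # invocations per instance per minute (example)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,   # avoid thrashing when traffic dips briefly
        "ScaleOutCooldown": 60,   # react quickly when traffic climbs
    },
)
```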
Hidden Costs and Budget Surprises
VPC data processing charges catch teams off guard. If your SageMaker endpoint is in a VPC, data processed through the VPC endpoint costs $0.01/GB. This applies to both requests and responses. At high volumes, this adds hundreds of dollars monthly.
Endpoint warm-up costs money even before production traffic starts. When you create an endpoint, SageMaker provisions instances immediately. Testing and validation before sending real traffic means you’re paying for idle instances. Budget for this setup period.
Failed requests still incur costs. If a request times out or errors, you still pay for the compute time used before failure. High error rates don’t just impact users – they waste money processing requests that don’t produce value.
Model artifact downloads happen on endpoint creation and updates. SageMaker downloads your model from S3 to each instance. For large models (10GB+), this download takes time during which you’re already paying for instances that aren’t yet serving traffic. Frequent model updates multiply this cost.
Development and testing environments often run 24/7. Teams spin up SageMaker endpoints for testing and forget to delete them. These idle endpoints cost the same as production endpoints. Implement automatic cleanup policies for non-production resources.
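A small scheduled script can enforce cleanup. The sketch below assumes a naming convention (a dev- prefix) that you would swap for your own tags or rules.

```python
# Non-production endpoint cleanup sketch; the "dev-" convention is an assumption.
import boto3

sm = boto3.client("sagemaker")

paginator = sm.get_paginator("list_endpoints")
for page in paginator.paginate():
    for endpoint in page["Endpoints"]:
        name = endpoint["EndpointName"]
        if name.startswith("dev-"):
            print(f"Deleting non-production endpoint: {name}")
            sm.delete_endpoint(EndpointName=name)
            # Endpoint configs and model objects linger after deletion, so
            # remove those too if they are no longer needed.
```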
When to Choose Each Option
Real-time endpoints fit when you have predictable, steady traffic, need consistent low latency (under 100ms), require GPU inference, or serve high request volumes (millions per day). The fixed cost becomes economical at high utilization.
Serverless inference works for variable traffic with idle periods, CPU-based models, unpredictable scaling needs, or development and staging environments. You trade occasional cold starts for paying only for actual usage.
Asynchronous inference suits batch processing, long-running inference (minutes per request), highly variable traffic, or scenarios where immediate results aren’t needed. The queue buffers traffic spikes and scales to zero during idle periods.
Batch transform applies to offline processing, one-time or scheduled jobs, processing existing datasets, or scenarios with no latency requirements. You pay only for the time actually processing data.
Most production deployments use combinations. A common pattern: real-time endpoints for primary traffic, asynchronous inference for batch user-submitted jobs, batch transform for overnight processing of analytics data.
The right choice depends on your specific traffic patterns, latency requirements, and cost constraints. Measure actual usage before committing to infrastructure. Start with serverless for variable workloads or batch transform for offline processing. Move to real-time endpoints only when economics clearly justify the fixed cost.
Need help optimizing your AWS AI infrastructure costs? At ZirconTech, we’ve helped enterprises reduce SageMaker inference costs by 40-60% through proper service selection and configuration. We analyze your traffic patterns, model requirements, and business constraints to recommend the most cost-effective deployment strategy. Get in touch to discuss your specific situation.