A team I worked with ran their entire product on Claude 3.5 Sonnet. Every request – from simple classification to complex document analysis – hit the same model at $6 per million input tokens and $30 per million output tokens. When they implemented Bedrock’s Intelligent Prompt Routing to split traffic between Haiku and Sonnet based on query complexity, their inference costs dropped roughly in line with the up-to-30% savings AWS cites, with no measurable quality degradation on their internal evaluation suite. The model didn’t change. The architecture did.
Amazon Bedrock now lists over 100 foundation models from multiple providers: six Claude variants from Anthropic, six Amazon Nova models, multiple Llama 4 and 3.x versions from Meta, Mistral Large 3 at 675 billion parameters, DeepSeek-R1 and V3.1, the Qwen3 family including Coder at 480 billion parameters, and even OpenAI GPT OSS models. The selection keeps growing. But the most impactful architectural decisions aren’t about which model to pick – they’re about how to route requests, structure retrieval, and optimize inference across a system where different components have fundamentally different performance and cost requirements.
Intelligent Prompt Routing: Cost Optimization Without Quality Loss
Bedrock Intelligent Prompt Routing automatically routes prompts between two models from the same family based on query complexity. The router uses advanced prompt matching to predict which model delivers the best response at the lowest cost, with full traceability showing which model handled each query.
Supported routing pairs include Anthropic Claude (Haiku and Sonnet variants), Meta Llama (3.1, 3.2, 3.3), and Amazon Nova (Lite and Pro). The constraint that both models must be from the same family is deliberate – it ensures consistent response format and capability boundaries so downstream code doesn’t need to handle model-specific output differences.
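From application code, a prompt router looks like any other model: you pass its ARN where a model ID would go, and the response trace reports which family member actually served the request. The sketch below assumes this shape of the Converse API; the router ARN and region are placeholders, and the trace field names should be verified against current Bedrock documentation.

```python
# Sketch: calling Bedrock's Converse API through an Intelligent Prompt Router.
# The router ARN is a placeholder -- create a router in the console (or via
# the Bedrock control-plane API) and substitute your own.

ROUTER_ARN = (
    "arn:aws:bedrock:us-east-1:123456789012:"
    "default-prompt-router/anthropic.claude:1"
)

def build_converse_request(prompt: str) -> dict:
    """Build Converse kwargs; the router ARN stands in for a model ID."""
    return {
        "modelId": ROUTER_ARN,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
    }

def invoked_model(response: dict) -> str:
    """Pull the routed-to model out of the response trace, if present."""
    return (
        response.get("trace", {})
        .get("promptRouter", {})
        .get("invokedModelId", "unknown")
    )

def run() -> None:
    """Live call -- requires boto3, AWS credentials, and a provisioned router."""
    import boto3

    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    resp = client.converse(**build_converse_request("Summarize this ticket."))
    print(invoked_model(resp), resp["output"]["message"]["content"][0]["text"])
```

Logging the `invoked_model` value per request is how you verify the router's actual traffic split against the cost savings you expected.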
The 30% cost reduction AWS claims comes from routing simpler queries to smaller, cheaper models. In practice, the savings depend on your query distribution. If 80% of your traffic is straightforward classification or extraction and 20% requires complex reasoning, routing delivers close to the theoretical maximum. If most queries genuinely require the larger model’s capabilities, routing saves less but still avoids paying the premium price for the minority of simple requests.
When to Build Custom Routing
Intelligent Prompt Routing works within model families. If your architecture spans families – using Claude for reasoning, Nova for multimodal processing, and a Cohere model for embeddings – you need custom routing logic. The pattern that works is a lightweight classifier (often a smaller model or a rules engine) that categorizes incoming requests and dispatches them to the appropriate model pipeline.
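A minimal version of that pattern is a rules-first classifier in front of a dispatch table. The category names and keyword rules below are illustrative assumptions, not a recommendation; the model IDs are current Bedrock identifiers at the time of writing and should be checked against the model catalog.

```python
# Sketch: rules-first cross-family routing. A small model can replace
# classify() for fuzzier cases; the rules engine handles the cheap, obvious ones.

ROUTES = {
    "multimodal": "amazon.nova-pro-v1:0",                        # images, figures
    "embedding":  "cohere.embed-english-v3",                     # vectorization
    "reasoning":  "anthropic.claude-3-5-sonnet-20241022-v2:0",   # complex analysis
}

def classify(request: dict) -> str:
    """Categorize a request; illustrative rules only."""
    if request.get("has_image"):
        return "multimodal"
    if request.get("task") == "embed":
        return "embedding"
    return "reasoning"

def route(request: dict) -> str:
    """Return the model ID for the pipeline that should handle this request."""
    return ROUTES[classify(request)]
```

Keeping the classifier separate from the dispatch table means you can A/B test routing strategies by swapping `classify` implementations without touching the model pipelines.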
Step Functions provides the orchestration layer for these multi-model workflows, with native Bedrock integration and human-in-the-loop controls. Bedrock Flows offers a visual builder for the same patterns, combining prompts, agents, knowledge bases, guardrails, and Lambda functions in a drag-and-drop interface with built-in versioning for A/B testing different routing strategies.
The routing decision should consider three factors beyond raw capability: latency requirements (Nova Micro and Claude Haiku respond faster than their larger siblings), cost per token (which varies by an order of magnitude across model families), and throughput limits (each model has different tokens-per-minute quotas). A well-designed routing layer optimizes across all three simultaneously.
RAG Architecture on AWS: Beyond Simple Vector Search
Amazon Bedrock Knowledge Bases provides the managed RAG pipeline. Data sources include S3, Confluence, Salesforce, SharePoint, a web crawler (preview), and programmatic document ingestion for streaming data. The pipeline handles chunking, embedding, vector storage, and retrieval as a managed service.
The chunking strategy significantly impacts retrieval quality. Knowledge Bases supports semantic chunking (splits on meaning boundaries), hierarchical chunking (parent-child relationships between chunks), fixed-size chunking (predictable token counts), and custom Lambda-based chunking for domain-specific splitting logic. Integration with LangChain and LlamaIndex chunking implementations is also available, which matters for teams that already have tuned chunking pipelines they don’t want to rebuild.
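In the bedrock-agent API, the chunking strategy is a configuration block passed when creating a data source. The builders below sketch two of those blocks; the field names follow the `chunkingConfiguration` structure as I understand it, so verify them against the current API reference, and the default values are illustrative starting points rather than tuned recommendations.

```python
# Sketch: chunkingConfiguration fragments for a Knowledge Bases data source,
# nested under vectorIngestionConfiguration in a create_data_source call.

def semantic_chunking(max_tokens: int = 300, threshold: int = 95) -> dict:
    """Split on meaning boundaries detected via embedding distance."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": max_tokens,
                "bufferSize": 1,  # sentences of context around each breakpoint
                "breakpointPercentileThreshold": threshold,
            },
        }
    }

def fixed_chunking(max_tokens: int = 512, overlap_pct: int = 20) -> dict:
    """Split into predictable token counts with overlap between chunks."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_pct,
            },
        }
    }
```

Because the strategy is fixed per data source, comparing strategies means ingesting the same corpus into parallel knowledge bases and evaluating retrieval quality on each.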
Vector Store Selection
Amazon OpenSearch Serverless is AWS’s recommended vector database for Bedrock, and the recommendation has merit. It combines vector embeddings with text-based keyword queries in a single search request – hybrid search that often outperforms pure vector similarity for real-world retrieval tasks. Support for both HNSW and IVF approximate nearest neighbor algorithms plus exact k-NN gives flexibility to optimize for latency versus accuracy. AWS also documents large-scale vector indexing for high-volume corpus ingestion, and OpenSearch Serverless offers integrations with DynamoDB and DocumentDB for pulling operational data into vector search indexes.
But OpenSearch Serverless isn’t the only option. Knowledge Bases also supports Aurora PostgreSQL (good for teams already running Aurora who want to avoid a new service), Neptune Analytics for graph-based retrieval, MongoDB, Pinecone, Redis Enterprise Cloud, and Amazon Kendra for hybrid search. The choice depends on existing infrastructure, query patterns, and operational preferences rather than a universal “best” option.
GraphRAG and Structured Data
Two extensions to basic RAG address common failure modes. GraphRAG, supported through Neptune Analytics, represents relationships between entities explicitly rather than relying on embedding similarity alone. For domains where the connections between concepts matter as much as the concepts themselves – organizational structures, regulatory dependencies, technical architectures – graph-based retrieval finds relevant information that pure vector search misses because the relevant context is structurally related rather than semantically similar.
Knowledge Bases also supports natural language to SQL conversion for querying data warehouses and data lakes without migration. This means a single agent can retrieve context from both unstructured documents (via vector search) and structured data (via SQL generation) depending on the query type, eliminating the common pattern of building separate pipelines for structured and unstructured data.
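The unstructured half of that unified path is the `retrieve` call in bedrock-agent-runtime, which returns the raw chunks without generation so you can inspect or rerank them yourself. The knowledge base ID below is a placeholder, and the request shape should be checked against the current API reference.

```python
# Sketch: direct vector retrieval against a knowledge base, separate from
# generation. Useful for inspecting what the retrieval step actually returns.

KB_ID = "EXAMPLEKBID"  # placeholder -- substitute your knowledge base ID

def build_retrieve_request(query: str, top_k: int = 5) -> dict:
    """Build kwargs for bedrock-agent-runtime's retrieve API."""
    return {
        "knowledgeBaseId": KB_ID,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

def run() -> None:
    """Live call -- requires boto3, credentials, and a provisioned knowledge base."""
    import boto3

    agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    resp = agent_rt.retrieve(**build_retrieve_request("Q3 revenue by region?"))
    for result in resp["retrievalResults"]:
        print(result["content"]["text"][:80])
```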
Multimodal Parsing and Reranking
Knowledge Bases handles tables, figures, charts, and diagrams through Bedrock Data Automation or foundation models, extracting structured content from visual elements that basic text extraction misses. Reranker models enhance retrieval relevance across multimedia content. Source attribution provides visual citations to minimize hallucinations – the system traces each claim in the generated response back to specific source chunks, giving users verifiable references rather than trusting the model’s synthesis.
Inference Optimization: Choosing Your Compute Backend
The inference optimization decision has more variables than it did a year ago. Bedrock on-demand pricing, Bedrock batch inference, SageMaker endpoints, Inferentia2 instances, and Trainium2 instances each serve different workload profiles.
Bedrock Pricing Tiers
Bedrock offers four pricing tiers: Standard (on-demand, pay per token), Flex (lower priority with cost savings), Priority (guaranteed throughput), and Reserved (committed capacity at the lowest per-token rate). Batch inference runs at a 50% discount relative to on-demand pricing for workloads that tolerate asynchronous processing. Prompt caching reduces costs for applications that repeatedly send similar context (RAG systems, multi-turn conversations with shared system prompts).
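In the Converse API, prompt caching works by inserting a cache-point marker after the content you want reused across requests – typically the shared system prompt. The sketch below assumes the `cachePoint` content-block shape; model support for caching varies, so confirm your target model before relying on it.

```python
# Sketch: marking a shared system prompt as a cacheable prefix. Everything
# above the cachePoint marker is processed once and reused on later requests.

SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from the provided policy text "
    "and cite the section you used."
)

def build_cached_request(model_id: str, user_text: str) -> dict:
    """Build Converse kwargs with the system prompt marked as cacheable."""
    return {
        "modelId": model_id,
        "system": [
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache the prefix above this marker
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }
```

Cache savings scale with how much stable context precedes the marker, which is why RAG systems with long shared instructions benefit most.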
For most applications, the combination of Intelligent Prompt Routing (to match queries with appropriately sized models) and prompt caching (to avoid reprocessing repeated context) delivers the largest cost improvement with the least engineering effort. Reserved throughput adds further savings for predictable workloads but requires traffic forecasting that many teams struggle to do accurately for AI workloads.
Model Distillation
Bedrock Model Distillation creates smaller “student” models from larger “teacher” models through automated knowledge transfer. AWS claims up to 500% faster inference and up to 75% lower cost with less than 2% accuracy loss. Supported teacher models include Claude 3.5 Sonnet v2, Nova Premier, and Llama 3.3 70B.
Distillation works best for high-volume, narrow tasks where a large model’s general capabilities are overkill. A classification task that Claude Sonnet handles at $30 per million output tokens might perform equally well on a distilled model at a fraction of the cost. The key constraint is that the student model inherits the teacher’s behavior on your specific training data, so the distilled model’s quality directly reflects the quality and representativeness of your training examples.
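The arithmetic is simple enough to sanity-check directly, using the article's $30-per-million figure for the teacher and AWS's "up to 75% lower cost" claim for the student; the 100M-token monthly volume is an illustrative assumption.

```python
# Back-of-envelope distillation savings at an assumed 100M output tokens/month.

def monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Monthly spend for a given output-token volume and per-million price."""
    return tokens_millions * price_per_million

teacher = monthly_cost(100, 30.0)          # teacher at $30/M output tokens
student = monthly_cost(100, 30.0 * 0.25)   # student at "up to 75% lower cost"

print(f"teacher: ${teacher:,.0f}  student: ${student:,.0f}  saved: ${teacher - student:,.0f}")
# -> teacher: $3,000  student: $750  saved: $2,250
```

Whether the real savings approach that ceiling depends on the "up to" qualifier holding for your task – which is exactly why the training-data quality constraint above matters.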
Custom Silicon: Inferentia2 and Trainium2
For teams that need to self-host models (data residency requirements, custom fine-tunes, or cost optimization at extreme scale), AWS’s custom silicon provides the inference backbone.
Inferentia2 (Inf2 instances) delivers up to 190 TFLOPS of FP16 performance per chip with 32 GB HBM – 4x more memory and 10x more bandwidth than the first generation. Up to 12 Inferentia2 chips per instance support large model inference. The 50% better performance per watt over comparable GPU instances makes Inf2 the cost-optimized choice for inference workloads that don’t require training.
Trainium2 (Trn2 instances) offers what AWS describes as 4x performance over Trainium1 with 30-40% better price performance than GPU-based P5e and P5en instances. Trn2 UltraServers pack up to 64 Trainium2 chips with NeuronLink interconnect for models up to one trillion parameters. AWS has announced Trainium3 as the next-generation chip on its roadmap, though detailed specifications and availability dates have not been finalized publicly.
The decision between Bedrock managed inference and self-hosted on custom silicon depends on scale and customization needs. Below approximately 10 million tokens per day, Bedrock’s on-demand or reserved pricing is almost always cheaper than maintaining dedicated instances. Above that threshold, the economics shift, especially for fine-tuned or distilled models that you can’t run through Bedrock’s managed API.
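That threshold can be estimated for your own numbers by comparing per-token spend against a fixed instance cost. The hourly rate and blended token price below are placeholder assumptions – plug in your actual Inf2 pricing and measured throughput before trusting the output.

```python
# Rough break-even sketch for the managed-vs-self-hosted decision.

HOURS_PER_DAY = 24

def bedrock_daily_cost(tokens_per_day: float, price_per_million: float) -> float:
    """Pay-per-token spend: scales linearly with volume."""
    return tokens_per_day / 1e6 * price_per_million

def selfhost_daily_cost(instance_hourly: float) -> float:
    """Dedicated-instance spend: fixed regardless of volume."""
    return instance_hourly * HOURS_PER_DAY

def break_even_tokens(price_per_million: float, instance_hourly: float) -> float:
    """Daily token volume above which self-hosting starts to win."""
    return selfhost_daily_cost(instance_hourly) / price_per_million * 1e6

# e.g. a blended $5/M token price vs an assumed $8/hour inference instance
print(f"{break_even_tokens(5.0, 8.0):,.0f} tokens/day")  # -> 38,400,000 tokens/day
```

The model ignores engineering and operations cost on the self-hosted side, which pushes the real break-even point higher than the raw arithmetic suggests.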
SageMaker HyperPod
SageMaker HyperPod manages clusters of thousands of AI accelerators for training, fine-tuning, and inference with automatic fault detection and recovery. Checkpointless training enables continuous forward progress without manual checkpoint management. AWS claims up to 40% reduction in model development costs and 40% training time savings. HyperPod supports deployment recipes for Amazon Nova models and integrates with SageMaker JumpStart for open-weights model acceleration.
HyperPod targets teams that train or fine-tune foundation models rather than teams that consume them through APIs. If your AI workload is primarily inference against existing models, Bedrock is the right starting point. If you’re training custom models, running large-scale fine-tuning, or deploying open-weights models that Bedrock doesn’t support, HyperPod provides the managed infrastructure.
Observability for Multi-Model Systems
Monitoring a single-model system requires tracking latency, error rates, and cost. Monitoring a multi-model system requires all of that plus routing decisions, per-model quality metrics, retrieval relevance, and end-to-end pipeline performance.
Bedrock AgentCore provides a CloudWatch-integrated observability layer. AWS’s documentation references quality evaluation capabilities and OpenTelemetry integration for forwarding traces to third-party platforms like Datadog and Splunk, though the specific evaluation metrics available may vary as some features remain in preview. Bedrock Guardrails adds safety-specific monitoring – content moderation events, prompt attack detections, PII detection hits, and contextual grounding check results.
For RAG systems, retrieval quality metrics are often more diagnostic than generation quality metrics. If the retrieval step returns irrelevant chunks, no model will produce a good answer. Tracking retrieval precision, recall, and the reranker’s impact on result quality provides earlier signal about system degradation than monitoring final output quality alone.
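Computing those retrieval metrics requires only a labeled evaluation set mapping queries to their relevant chunk IDs – a minimal sketch:

```python
# Minimal retrieval-quality metrics: compare retrieved chunk IDs against a
# hand-labeled relevant set for each evaluation query.

def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"})
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant were retrieved
```

Running the same metric before and after the reranker isolates its contribution, which is the earlier degradation signal the paragraph above describes.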
Bedrock Flows includes built-in traceability of inputs and outputs at every workflow node, which is essential for debugging multi-step pipelines. When a customer complaint reveals a wrong answer, you need to trace from the final response back through the generation step, the retrieval step, the routing decision, and the original query to identify where the pipeline failed.
Putting the Architecture Together
A production multi-model AI system on AWS typically combines these components in layers:
The routing layer uses Intelligent Prompt Routing for within-family optimization and custom logic (Step Functions or Bedrock Flows) for cross-family dispatching. Guardrails apply at this layer to filter inputs before they reach any model.
The retrieval layer uses Knowledge Bases with the appropriate vector store and chunking strategy for the domain. GraphRAG through Neptune Analytics handles relationship-dependent queries. SQL generation handles structured data queries.
The inference layer routes to Bedrock managed models for standard workloads and SageMaker or custom silicon for specialized or high-volume workloads. Model Distillation creates optimized models for high-volume narrow tasks.
The observability layer spans all three previous layers, tracking routing decisions, retrieval quality, inference performance, safety events, and end-to-end latency through CloudWatch and OpenTelemetry.
Each layer involves trade-offs that depend on workload characteristics, cost constraints, and operational maturity. The teams that get the best results start with the simplest viable architecture – often just Bedrock with a single model and basic RAG – and add complexity (routing, multiple models, custom inference) only when specific bottlenecks or cost pressures justify it. The most expensive AI architecture is one that’s more sophisticated than the problem requires.