The Complete Guide to RAG on AWS — Architecture, Deep Dives & Evaluation
⚠️ Disclaimer: The code examples, architecture patterns, and configurations in this article are illustrative and intended for educational purposes only. Always review, test, and adapt them to your specific use case, security requirements, and AWS account configuration before deploying to production.
1. Why RAG Still Wins
Every few months, someone declares RAG dead. A new model with a million-token context window launches, and the argument goes: “Just stuff everything in the prompt — who needs retrieval?” And every few months, practitioners building real systems quietly disagree.
The empirical evidence is clear. A 2024 study by Leng et al. compared RAG against long-context approaches on multi-document question answering tasks. The result: RAG consistently outperformed long-context stuffing for corpora exceeding 50 documents, with the gap widening as corpus size increased. Long-context models showed significant degradation in faithfulness as the input exceeded 100K tokens — a phenomenon researchers call “lost in the middle,” where models attend heavily to the beginning and end of context while neglecting the middle.
Source: Liu, N. et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
Here’s why RAG remains the dominant architecture for enterprise AI applications:
Data freshness without retraining. Fine-tuning bakes knowledge into model weights. When your product documentation changes weekly, your compliance policies update quarterly, and your knowledge base grows daily, fine-tuning becomes an expensive treadmill. RAG retrieves from live data sources — update the document, and the next query reflects the change. This is not a theoretical advantage — it is the reason most enterprise knowledge assistants use RAG. A compliance team that needs answers grounded in last week’s regulatory update cannot wait for a fine-tuning cycle.
Access control and auditability. When a financial analyst asks about Q3 earnings, they should only see documents they’re authorized to access. RAG naturally supports document-level permissions because retrieval happens at query time against a permissioned index. You tag chunks with access control metadata during ingestion and filter at retrieval time — a pattern well-supported by OpenSearch Serverless and Bedrock Knowledge Bases. With long-context approaches, you’d need to pre-filter and reconstruct prompts per user — operationally painful and security-risky.
Cost at scale. Feeding 500,000 tokens into every API call is expensive. RAG retrieves the 5-10 most relevant chunks (typically 2,000-5,000 tokens) and sends only those. At thousands of queries per day, the cost difference is orders of magnitude. Consider a concrete comparison: a 10,000-document knowledge base where each document averages 5,000 tokens. Stuffing even 100 documents into a single prompt costs ~$1.50 per query with a frontier model at $3/MTok input. RAG retrieves 5-10 relevant chunks (~3,000 tokens total) for ~$0.009 per query. At 5,000 queries per day, that’s $7,500/day versus $45/day — a 166× difference.
Grounded, verifiable answers. RAG provides source attribution by design. The model’s answer can point to specific documents, paragraphs, and pages. This isn’t just nice to have — in regulated industries (healthcare, finance, legal), it’s a requirement. When an auditor asks “where did this answer come from?”, a RAG system can point to the exact chunk and document. A long-context system can only gesture at the 500K-token prompt.
Composability and modularity. A RAG pipeline is modular — you can swap the embedding model without changing retrieval logic, upgrade the vector store without touching the generation layer, or add a reranker without modifying anything else. This composability matters at enterprise scale, where different teams own different components and upgrades must be incremental. Long-context approaches couple everything into a single monolithic prompt, making iterative improvement difficult.
RAG + long context is not either/or. The most effective production systems combine both. Use RAG to retrieve the 10-20 most relevant chunks, then leverage a long-context model to reason over those chunks with full attention. This “retrieve then reason” pattern gets you the precision of RAG with the synthesis capabilities of large context windows. Amazon Bedrock Knowledge Bases supports this pattern natively through the RetrieveAndGenerate API with configurable context windows.
When RAG is NOT the answer: For tasks requiring deep reasoning over a small, stable corpus (e.g., analyzing a single contract), long-context approaches work well. For teaching a model new behaviors or styles, fine-tuning is appropriate. For simple classification or extraction from short texts, neither RAG nor fine-tuning is needed — a well-crafted prompt suffices. RAG excels when you need accurate, sourced answers over large, dynamic, access-controlled knowledge bases — which describes most enterprise use cases.
Source: Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020, https://arxiv.org/abs/2005.11401
Source: Gao, Y. et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” 2024, https://arxiv.org/abs/2312.10997
2. RAG Architecture Overview
A production RAG system has two pipelines: ingestion (offline) and query (online). Understanding both — and the design decisions at each stage — is essential for building systems that scale beyond a prototype.
Ingestion Pipeline (Offline)
The ingestion pipeline converts raw documents into searchable, retrievable units. It runs asynchronously — typically triggered when new documents are added or existing ones are updated.
INGESTION PIPELINE (Offline)

Data Sources (S3, Web, Confluence, SharePoint, Databases)
  → Parsing & Extraction (Textract, Unstructured, Tika)
  → Chunking Strategy (Fixed, Semantic, Hierarchical, Structure-aware)
  → Embedding Model
  → Vector Store + Metadata Index (OpenSearch, pgvector, etc.)
Each stage has meaningful design decisions:
- Data source connectors. Where do your documents live? S3 is the most common starting point, but production systems often pull from multiple sources — Confluence wikis, SharePoint sites, databases, and web crawlers. Bedrock Knowledge Bases supports S3, Confluence, SharePoint, Salesforce, and web crawlers as native data sources.
- Parsing and extraction. Raw documents must be converted to clean text. PDFs require layout-aware parsing (Amazon Textract for tables and forms, or third-party parsers like Unstructured.io). HTML requires boilerplate removal. Structured data (JSON, CSV) requires schema-aware extraction. This stage is often underinvested — poor parsing propagates errors through the entire pipeline.
- Chunking. The most impactful decision in the ingestion pipeline. How you split documents into retrieval units determines the upper bound of your system’s quality. Section 3 covers this in depth.
- Embedding. Each chunk is converted to a dense vector representation using an embedding model. The choice of model, dimensionality, and normalization directly affects retrieval quality. Section 4 covers this.
- Indexing and storage. Embeddings are stored in a vector store with metadata for filtering. The index structure (HNSW, IVF, flat) affects the recall-latency trade-off at query time.
Query Pipeline (Online)
The query pipeline handles real-time user requests. Every millisecond of latency is felt by the user, so efficiency matters.
QUERY PIPELINE (Online)

User Query
  → Query Understanding & Enhancement (Rewrite, Expand, Decompose, Route)
  → Retrieval (Hybrid: Dense + Sparse, + Metadata Filter)
  → Reranking
  → Context Assembly
  → LLM Generation
  → Guardrails (PII, grounding check)
  → Response + Citations
The query pipeline stages in detail:
- Query understanding and enhancement (Section 5). The user’s raw query is rarely optimal for retrieval. This stage rewrites, expands, decomposes, or routes the query based on intent classification. It is one of the highest-ROI investments in a production RAG system.
- Retrieval. The enhanced query is used to search the vector store. Production systems almost always use hybrid search — combining dense (semantic) and sparse (keyword/BM25) retrieval — to cover both semantic and lexical matches. Metadata filters narrow the search space before vector similarity is computed.
- Reranking. The initial retrieval returns a candidate set (typically 20-50 chunks). A cross-encoder reranker re-scores these candidates with a model that sees the query and each candidate together, producing a much more accurate relevance ranking. This step adds 50-150ms but typically improves answer quality by 15-25%.
- Context assembly. The top-ranked chunks are assembled into a prompt context. This involves deduplication (if multiple retrieval paths returned the same chunk), ordering (most relevant first), and truncation (ensuring the total context fits within the LLM’s effective window).
- LLM generation. The assembled context plus the user’s query are sent to the LLM with a system prompt that instructs grounded generation with source citation.
- Guardrails and post-processing. The generated response is validated for content safety, PII, grounding (is every claim supported by the context?), and formatting. Bedrock Guardrails supports all of these checks natively.
AWS Reference Architecture
On AWS, you have two primary approaches:
Option A: Fully Managed (Bedrock Knowledge Bases)
S3 / Confluence / SharePoint / Web Crawler
→ Bedrock Knowledge Base (managed parsing, chunking, embedding)
→ OpenSearch Serverless / Aurora pgvector / Pinecone / MongoDB Atlas / Neptune Analytics
→ Bedrock RetrieveAndGenerate API (managed retrieval + generation)
→ Bedrock Guardrails (content safety + grounding check)
→ Response with source attributions
This approach minimizes infrastructure management. Bedrock handles chunking, embedding, indexing, retrieval, and generation through a single API. You configure the chunking strategy, select an embedding model, and choose a vector store — Bedrock manages the rest. Best for teams that want to ship quickly and optimize later.
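As a sketch of how little code Option A requires, a single RetrieveAndGenerate call handles retrieval, prompt assembly, and generation (the knowledge base ID is a placeholder; the model ARN is one illustrative choice):
import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How do I enable cross-region replication for an S3 bucket?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])                    # generated answer
for citation in response.get("citations", []):       # source attributions
    for ref in citation["retrievedReferences"]:
        print(ref["location"])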
Option B: Custom Pipeline (Maximum Control)
S3 → Lambda (custom parsing) → Lambda (custom chunking)
→ Bedrock Embedding API (Titan V2 / Cohere Embed v3)
→ OpenSearch Serverless (self-managed index)
→ Lambda / ECS (custom query pipeline with enhancement, retrieval, reranking)
→ Bedrock InvokeModel API (generation)
→ Bedrock Guardrails
→ API Gateway → Client
This approach gives you full control over every stage. Use it when Bedrock Knowledge Bases’ built-in options don’t meet your requirements — for example, if you need custom document parsers (Amazon Textract for complex PDFs), non-standard chunking strategies, or specialized query routing logic. The trade-off is more infrastructure to manage and more code to maintain.
Option C: Hybrid (Common in Practice)
Most production teams land here: use Bedrock Knowledge Bases for ingestion and indexing (the offline pipeline), but build a custom query pipeline (the online pipeline) using Lambda or ECS. This gives you managed ingestion with custom query enhancement, reranking, and generation logic.
Ingestion: Bedrock KB (managed)
Query: API Gateway → Lambda (query enhancement)
→ Bedrock Retrieve API (search the KB's index)
→ Cohere Rerank on Bedrock (rerank candidates)
→ Bedrock InvokeModel (generation with custom prompt)
→ Bedrock Guardrails → Response
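A minimal sketch of that query path inside a Lambda handler, under assumptions: the knowledge base ID and model ID are placeholders, and the rerank step is a pass-through stub you would replace with a cross-encoder or the Bedrock Rerank API:
import boto3, json
agent_rt = boto3.client("bedrock-agent-runtime")
bedrock_rt = boto3.client("bedrock-runtime")
def handler(event, context):
    query = event["query"]
    # 1. Retrieve candidate chunks from the managed KB index
    retrieved = agent_rt.retrieve(
        knowledgeBaseId="YOUR_KB_ID",  # placeholder
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 25}},
    )["retrievalResults"]
    # 2. Rerank candidates (stubbed below)
    top_chunks = rerank(query, retrieved)[:5]
    # 3. Generate with a custom grounded-answer prompt
    context_block = "\n\n".join(c["content"]["text"] for c in top_chunks)
    prompt = f"Answer using only the context below. Cite your sources.\n\n{context_block}\n\nQuestion: {query}"
    resp = bedrock_rt.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return {"answer": json.loads(resp["body"].read())["content"][0]["text"]}
def rerank(query, candidates):
    # Placeholder pass-through — swap in a real reranker (Section 2, Option C)
    return candidates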
The choice between options depends on your team’s operational maturity, latency requirements, and customization needs. Start with Option A, identify the bottlenecks through evaluation (Section 9), and selectively move components to Option C as needed.
Source: AWS, “Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
Source: AWS, “Build a RAG-based generative AI application using Amazon Bedrock Knowledge Bases,” https://aws.amazon.com/blogs/machine-learning/build-a-rag-based-generative-ai-application-using-amazon-bedrock-knowledge-bases/
3. Data Ingestion & Chunking — The Foundation
If you get chunking wrong, nothing downstream can save you. The best embedding model and the most sophisticated retrieval strategy cannot compensate for poorly chunked documents. Chunking is where most RAG systems silently fail — and it’s where the highest-ROI optimizations live.
Why Chunking Matters
Chunking determines the granularity of your retrieval units. Too large, and retrieved chunks contain noise that dilutes the answer. Too small, and you lose context — the model sees fragments without enough information to generate a complete response.
The challenge is that there is no universal optimal chunk size. It depends on your document types, query patterns, embedding model’s sweet spot, and the nature of the questions your users ask. A system answering factoid questions about product specifications needs different chunking than one synthesizing answers from legal contracts.
Consider a concrete example: a 50-page AWS user guide. If you chunk it into 128-token pieces, a question like “How do I configure cross-region replication for S3?” might retrieve a chunk containing the command syntax but missing the prerequisite IAM permissions described two paragraphs earlier. If you chunk it into 2048-token pieces, that same query might retrieve a chunk covering three unrelated S3 features, burying the relevant content in noise. The art of chunking is finding the right granularity for your specific use case.
Core Chunking Strategies
Fixed-Size Chunking
The simplest approach: split text into chunks of N tokens with M tokens of overlap.
# Typical fixed-size chunking
chunk_size = 512 # tokens
chunk_overlap = 50 # tokens (~10% overlap)
When it works: Homogeneous documents with consistent structure — news articles, blog posts, transcripts, and any corpus where content density is relatively uniform. A news corpus of 10,000 articles, each roughly the same length and style, chunks well with fixed-size because the structure is inherently flat.
When it fails: Documents with hierarchical structure (technical manuals, legal contracts) where a fixed window arbitrarily splits a section mid-paragraph or separates a heading from its content. Consider a legal contract: a fixed-size chunk might start mid-sentence in one clause and end mid-sentence in another, rendering both fragments useless for answering questions about either clause.
Real-world example: A customer support FAQ database where each FAQ entry is 100-400 tokens. Fixed-size chunking at 512 tokens works well because most entries fit in a single chunk, and the few longer ones get split at natural points with overlap preserving continuity.
Overlap tuning: The overlap parameter is often under-considered. Too little overlap (0-5%), and you lose cross-boundary context. Too much (>25%), and you waste storage and create near-duplicate chunks that confuse retrieval. The 10-20% range is a reliable starting point, but for documents with long sentences (academic papers, legal text), 15-20% overlap prevents mid-sentence splits.
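A minimal sketch of fixed-size chunking with overlap — whitespace tokens stand in for a real tokenizer such as tiktoken, so counts will differ slightly in practice:
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size-token chunks with `overlap` tokens shared between neighbors."""
    tokens = text.split()  # stand-in for a real tokenizer
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks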
Recursive / Character Splitting
LangChain popularized this approach: split by a hierarchy of separators (\n\n → \n → ". " → " "), recursively subdividing until chunks are under the size limit.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
Advantage over fixed-size: Respects natural text boundaries (paragraphs, sentences). When the splitter encounters a paragraph break, it prefers to split there rather than mid-sentence.
Limitation: Still fundamentally size-driven — it doesn’t understand whether two paragraphs are semantically related. Two paragraphs discussing the same concept but separated by \n\n will be split into different chunks if the combined size exceeds the limit.
Real-world example: Technical documentation with mixed content — some sections are 200 words, others are 2,000. Recursive splitting adapts to this variance better than fixed-size because it preserves short paragraphs intact while subdividing long sections at natural boundaries.
Source: LangChain, “Text Splitters,” https://python.langchain.com/docs/how_to/#text-splitters
Semantic Chunking
Instead of splitting by size or structure, semantic chunking uses embedding similarity to determine split points. Adjacent sentences are embedded, and when the cosine similarity between consecutive sentence embeddings drops below a threshold, a chunk boundary is inserted.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_aws import BedrockEmbeddings
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
chunker = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile", # or "standard_deviation", "interquartile"
breakpoint_threshold_amount=75
)
chunks = chunker.create_documents([document_text])
How it works in detail: The algorithm embeds each sentence, then computes pairwise cosine similarity between consecutive sentences. It applies a breakpoint detection method — percentile-based (split when similarity drops below the Nth percentile), standard deviation-based (split when the drop exceeds N standard deviations from the mean), or interquartile range (split at statistical outliers). The result is chunks where every sentence within a chunk is semantically related.
Advantage: Chunks are semantically coherent — each chunk contains one “idea” or topic. This directly improves retrieval precision because a query about a specific topic is more likely to match a chunk that’s purely about that topic.
Trade-offs: Requires an embedding pass during ingestion (adds cost and latency). Chunk sizes vary significantly — you might get chunks ranging from 50 to 1,500 tokens. This inconsistency can affect retrieval: very small chunks may lack context, and very large ones may introduce noise. Consider adding min/max size constraints.
Real-world example: A research paper where the introduction smoothly transitions between motivation, related work, and contribution overview. Fixed-size chunking would arbitrarily split these transitions. Semantic chunking detects the topic shifts (e.g., from “related work” to “our approach”) and places boundaries there, producing chunks that each represent a coherent idea.
Source: Kamradt, “Semantic Chunking,” 2024, https://github.com/FullStackRetrieval-com/RetrievalTutorials
Document-Structure-Aware Chunking
For structured documents (HTML, Markdown, PDFs with headings), parse the document tree and chunk by structural units: sections, subsections, or logical blocks.
For Markdown/HTML: Split by heading hierarchy (H1 → H2 → H3), keeping each section as a chunk. If a section exceeds the size limit, subdivide by paragraphs within it. Preserve heading paths as metadata (e.g., “User Guide > Authentication > OAuth2 Configuration”).
For PDFs: Use layout-aware parsers (Amazon Textract, Unstructured.io, LlamaParse) that detect headings, tables, and figure captions rather than treating the PDF as flat text. A PDF rendered from a two-column layout will produce garbled text with naive extraction — layout-aware parsers reconstruct the reading order.
This is often the highest-ROI approach for enterprise documents. Most corporate knowledge bases have clear structure — user guides with sections, policies with numbered clauses, API docs with endpoints. Respecting that structure during chunking preserves the author’s intent.
Real-world example: An API reference with 200 endpoints, each documented with a description, request/response schema, parameters table, and code examples. Structure-aware chunking keeps each endpoint as a single chunk (or parent chunk), so a query about “PUT /users/{id} request body” retrieves the complete endpoint documentation rather than a fragment that contains the URL but not the parameters.
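For Markdown sources, LangChain's MarkdownHeaderTextSplitter is one way to implement the heading-path pattern — each chunk carries its heading hierarchy as metadata. A sketch (markdown_text is an assumed variable; combine with a size-based splitter for oversized sections):
from langchain.text_splitter import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(markdown_text)
# Each section's metadata holds the heading path,
# e.g. {"h1": "User Guide", "h2": "Authentication", "h3": "OAuth2 Configuration"}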
Source: Unstructured.io, “Document Parsing for LLMs,” https://unstructured.io/
Source: LlamaParse, “Document Parsing for LLM Applications,” https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/
Hierarchical Chunking (Parent-Child)
One of the most powerful techniques for production RAG. The idea: index small chunks for precise retrieval, but return the parent chunk (larger context) to the LLM.
Document
├── Section (parent chunk — 2000 tokens)
│   ├── Paragraph 1 (child chunk — 300 tokens) ← Retrieved by vector search
│   ├── Paragraph 2 (child chunk — 250 tokens)
│   └── Paragraph 3 (child chunk — 350 tokens)
When the retriever matches Paragraph 1, the system returns the entire Section to the LLM. This gives the model enough context to generate a complete answer while maintaining retrieval precision.
Implementation: Store both parent and child chunks with a parent-child relationship. Search against child chunks, then look up and return parent chunks. Deduplicate when multiple child chunks from the same parent are retrieved.
# LlamaIndex auto-merging retriever (index leaf chunks, merge into parents at query time)
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, child, grandchild
)
nodes = node_parser.get_nodes_from_documents(documents)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)  # store all levels for parent lookup
index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)  # index leaves only
retriever = AutoMergingRetriever(index.as_retriever(similarity_top_k=12), storage_context)
Real-world example: A 100-page employee handbook. Child chunks (individual policy clauses, 200-400 tokens) provide precise retrieval — “What’s the parental leave policy?” hits the exact clause. But the parent chunk (the full “Leave Policies” section, 2,000 tokens) gives the LLM enough context to mention related policies (sick leave, unpaid leave options) that the user might also need.
Design consideration: Choose parent/child size ratios carefully. A 4:1 ratio (e.g., 2048:512) is a good starting point. If parents are too large (10:1), you lose the benefit of precise retrieval. If too small (2:1), there’s little additional context to gain.
Source: LlamaIndex, “Auto-Merging Retriever,” https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/
Agentic Chunking
Use an LLM to determine chunk boundaries. Feed the document to an LLM and ask it to identify semantically complete units. The LLM considers context, topic shifts, and logical completeness in ways that heuristic methods cannot.
agentic_chunk_prompt = """Analyze this document and identify the natural
semantic boundaries. For each chunk, provide:
1. The start and end positions
2. A descriptive title summarizing the chunk's content
3. A list of key entities mentioned
The goal is to create chunks that are each self-contained — a reader
should be able to understand each chunk without needing the others.
Document:
{document_text}
"""
Trade-off: Significantly more expensive and slower at ingestion time — each document requires one or more LLM calls. At $3/MTok input for a frontier model, chunking a 1M-token corpus costs $3 just for the chunking pass. Best reserved for high-value documents where chunking quality has outsized impact (e.g., legal contracts, medical records, regulatory filings).
Real-world example: A complex merger agreement with nested cross-references. The LLM identifies that Section 4.2(a) references definitions in Section 1.1 and conditions in Section 7.3, and creates a chunk that includes the relevant cross-referenced text — something no heuristic method could achieve.
Advanced Chunking Techniques
Late Chunking (Embed First, Then Chunk)
Traditional chunking pipelines chunk first, then embed each chunk independently. Late chunking inverts this: embed the entire document using a long-context embedding model, then split the embedding sequence into chunks.
Why this matters: When you embed chunks independently, each chunk loses the context of the surrounding document. The sentence “It supports three modes” is meaningless in isolation — “it” could refer to anything. When you embed the full document first, the embedding for that sentence captures that “it” refers to “the S3 Transfer Acceleration feature” because the transformer’s attention mechanism has seen the full context.
# Conceptual late chunking pipeline
# Step 1: Embed the full document with a long-context model
token_embeddings = long_context_model.encode(full_document, output="token_embeddings")
# Step 2: Split token embeddings into chunk-level embeddings
# by averaging token embeddings within each chunk span
chunk_embeddings = []
for start, end in chunk_boundaries:
chunk_emb = token_embeddings[start:end].mean(dim=0)
chunk_embeddings.append(chunk_emb)
Requirements: You need an embedding model with a long enough context window to process entire documents (e.g., jina-embeddings-v2 with 8,192 tokens, or nomic-embed-text with 8,192 tokens). For documents exceeding the model’s context window, you can apply late chunking to sections rather than the full document.
Trade-off: Higher ingestion cost (embedding full documents is more expensive than embedding chunks) and requires a long-context embedding model. The quality improvement is most noticeable for documents with heavy co-referencing and pronoun usage.
Source: Günther, M. et al., “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” Jina AI, 2024, https://arxiv.org/abs/2409.04701
Proposition-Based Chunking
Instead of splitting text by size or structure, decompose each passage into atomic propositions — self-contained statements that each express a single fact.
Example transformation:
Original text: “Amazon S3 provides several storage classes designed for different use cases. S3 Standard offers high durability, availability, and performance for frequently accessed data. It is suitable for a wide variety of use cases including cloud applications, dynamic websites, and big data analytics.”
Propositions:
- “Amazon S3 provides several storage classes designed for different use cases.”
- “S3 Standard offers high durability for frequently accessed data.”
- “S3 Standard offers high availability for frequently accessed data.”
- “S3 Standard offers high performance for frequently accessed data.”
- “S3 Standard is suitable for cloud applications.”
- “S3 Standard is suitable for dynamic websites.”
- “S3 Standard is suitable for big data analytics.”
proposition_prompt = """Decompose the following passage into clear,
self-contained propositions. Each proposition should:
- Express a single, atomic fact
- Be understandable without additional context
- Resolve all pronouns and references to their specific nouns
- De-compound conjunctive statements into individual propositions
Passage: {text}
Propositions:"""
Advantage: Each proposition is a precise, self-contained retrieval unit. A query asking “Is S3 Standard suitable for big data?” will match proposition 7 with very high similarity, whereas the original paragraph would be a weaker match due to noise.
Trade-offs: Produces many small chunks (5-10x more than paragraph-level chunking), increasing storage and retrieval costs. The LLM decomposition step is expensive at ingestion time. Best used as child chunks in a hierarchical system — retrieve propositions, return the parent paragraph.
Source: Chen, S. et al., “Dense X Retrieval: What Retrieval Granularity Should We Use?,” 2023, https://arxiv.org/abs/2312.06648
Context-Enriched Chunking
A practical technique that addresses the “chunk in isolation” problem: prepend contextual information to each chunk so it can stand alone.
Approach 1: Prepend document/section metadata
# Before: raw chunk
chunk = "It supports three modes: standard, expedited, and bulk."
# After: context-enriched chunk
enriched_chunk = """Document: AWS S3 Glacier Developer Guide
Section: Data Retrieval Options
---
S3 Glacier supports three retrieval modes: standard, expedited, and bulk."""
Approach 2: LLM-generated contextual summary
Use a lightweight LLM to generate a brief contextual summary for each chunk, situating it within the broader document.
context_prompt = """Given the following document and a specific chunk
extracted from it, write a 1-2 sentence context that situates this
chunk within the broader document. The context should resolve any
ambiguous references and clarify what topic is being discussed.
Document title: {doc_title}
Section path: {heading_path}
Surrounding text: {prev_paragraph}... [CHUNK] ...{next_paragraph}
Chunk: {chunk_text}
Context:"""
# Prepend the generated context to the chunk before embedding
final_chunk = f"{generated_context}\n\n{chunk_text}"
Approach 3: Anthropic’s Contextual Retrieval method
Anthropic published a specific implementation of this pattern, reporting that prepending chunk-specific context (generated by Claude) reduced the top-20 retrieval failure rate by 35% on its own, and by 67% when combined with contextual BM25 and reranking — from a 5.7% failure rate to 1.9%.
context_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval
of the chunk. Answer only with the succinct context and nothing else."""
Trade-off: Requires an LLM call per chunk during ingestion. For a corpus of 100,000 chunks, this is significant. Use prompt caching to reduce cost when processing chunks from the same document (the document text in the prompt is repeated across calls): Anthropic reports that prompt caching dramatically reduces the cost of contextual enrichment, since cached document tokens are billed at a steep discount.
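A sketch of the ingestion-time enrichment call using the prompt above (the model ID is one illustrative choice; with Anthropic's API you would additionally mark the document block for prompt caching):
import boto3, json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def contextualize_chunk(whole_document: str, chunk_content: str) -> str:
    """Generate a short situating context for a chunk, then prepend it before embedding."""
    prompt = context_prompt.format(WHOLE_DOCUMENT=whole_document, CHUNK_CONTENT=chunk_content)
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    context = json.loads(response["body"].read())["content"][0]["text"].strip()
    return f"{context}\n\n{chunk_content}"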
Source: Anthropic, “Introducing Contextual Retrieval,” 2024, https://www.anthropic.com/news/contextual-retrieval
Document-Specific Chunking Strategies
Different document types have fundamentally different structures and require tailored chunking approaches.
PDF Tables
Tables are one of the most common failure points in RAG systems. Naive text extraction destroys table structure, turning rows and columns into meaningless strings.
Strategy:
- Use Amazon Textract with AnalyzeDocument (TABLES feature) to extract tables as structured data
- Store each table as a complete chunk — never split a table across chunks
- Include the table caption and any surrounding explanatory text
- Convert to a text representation that preserves structure:
# Option 1: Markdown table (good for embedding)
markdown_table = """
| Instance Type | vCPUs | Memory (GiB) | Price/hr |
|--------------|-------|-------------|----------|
| m5.large | 2 | 8 | $0.096 |
| m5.xlarge | 4 | 16 | $0.192 |
"""
# Option 2: Row-per-line with headers (good for dense tables)
text_rows = """
Table: EC2 Instance Pricing (US-East-1)
- m5.large: 2 vCPUs, 8 GiB Memory, $0.096/hr
- m5.xlarge: 4 vCPUs, 16 GiB Memory, $0.192/hr
"""
For very large tables (50+ rows): Consider splitting by row groups while repeating the header in each chunk, or create a summary chunk describing the table’s contents alongside the full table chunk.
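A sketch of row-group splitting for large markdown tables, repeating the header in every chunk so each group remains interpretable on its own:
def split_markdown_table(table_text: str, rows_per_chunk: int = 20) -> list[str]:
    """Split a markdown table into row groups, repeating the header and separator rows in each chunk."""
    lines = [l for l in table_text.strip().splitlines() if l.strip()]
    if len(lines) < 3:
        return [table_text]  # too small to split
    header, separator, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        group = rows[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator] + group))
    return chunks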
Source: AWS, “Amazon Textract AnalyzeDocument,” https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html
HTML Pages
HTML presents unique challenges: navigation elements, footers, sidebars, and ads surround the actual content.
Strategy:
- Strip boilerplate (nav, footer, scripts, ads) using readability algorithms or tools like trafilatura or BeautifulSoup with semantic filtering
- Parse the clean HTML DOM tree for structure (headings, lists, code blocks)
- Chunk by semantic HTML sections (article, section, heading hierarchy)
- Preserve links as metadata — a chunk about “S3 pricing” should retain the link to the pricing page
from trafilatura import extract
# Extract main content, stripping boilerplate
clean_text = extract(html_content, include_tables=True, include_links=True)
JSON and Structured Data
API responses, configuration files, and structured datasets need special handling.
Strategy:
- For flat JSON objects: each top-level key-value pair or logical group becomes a chunk
- For nested JSON: chunk at meaningful nesting levels (e.g., each item in an array of products)
- Always include the schema path as context: "product.pricing.tiers[0]" tells the model where this data sits
- Convert to natural language where appropriate:
# Raw JSON
{"instance_type": "m5.large", "vcpus": 2, "memory_gib": 8, "price_per_hour": 0.096}
# Natural language chunk (better for embedding)
"The m5.large EC2 instance type has 2 vCPUs, 8 GiB of memory, and costs $0.096 per hour."
Emails and Chat Logs
Conversational content has unique structure: turns, threads, quoted replies, signatures, and attachments.
Strategy for emails:
- Strip signatures, disclaimers, and reply chains (or chunk them separately)
- Each email becomes a chunk, with metadata: sender, recipients, date, subject, thread ID
- For long email threads: chunk each message individually but store the thread ID for retrieval of the full conversation
- Extract and separately chunk any inline content (tables, lists, action items)
Strategy for chat logs (Slack, Teams):
- Chunk by conversation thread, not by individual message — a single Slack message rarely has enough context
- Use time-based windowing: group messages within a conversation that are within N minutes of each other (see the sketch after this list)
- Include participant names and timestamps as metadata
- Flag and extract code snippets, shared links, and decisions as separate high-value chunks
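A sketch of the time-based windowing mentioned above: messages in the same thread are grouped into one chunk whenever the gap between consecutive messages stays under a threshold (message dicts with ts, user, and text fields are an assumed shape):
from datetime import timedelta
def window_thread(messages: list[dict], gap_minutes: int = 15) -> list[str]:
    """Group chat messages into chunks, splitting wherever the time gap exceeds the threshold."""
    messages = sorted(messages, key=lambda m: m["ts"])  # m["ts"] is a datetime
    groups, current = [], []
    for msg in messages:
        if current and msg["ts"] - current[-1]["ts"] > timedelta(minutes=gap_minutes):
            groups.append(current)
            current = []
        current.append(msg)
    if current:
        groups.append(current)
    return [
        "\n".join(f"[{m['ts']:%Y-%m-%d %H:%M}] {m['user']}: {m['text']}" for m in group)
        for group in groups
    ]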
Chunk Size Optimization
The Empirical Evidence
The impact of chunk size on RAG quality has been studied extensively. A comprehensive 2024 study evaluated chunk sizes across diverse datasets and tasks:
| Chunk Size | Retrieval Precision | Answer Quality | Best For | Risk |
|---|---|---|---|---|
| 128 tokens | Very High | Low | Factoid lookup, proposition indexing | Fragments lack context for generation |
| 256 tokens | High | Medium | Short-answer QA, FAQ matching | May split compound explanations |
| 512 tokens | Medium-High | High | General-purpose (most use cases) | Balanced trade-off |
| 1024 tokens | Medium | High | Complex explanations, how-to content | Some noise for narrow queries |
| 2048 tokens | Lower | Medium-High | Long-form synthesis, multi-topic answers | Significant noise dilution |
Source: NVIDIA, “Advanced RAG Techniques,” 2024, https://developer.nvidia.com/blog/advanced-rag-techniques/
Methodology for Finding Your Optimal Chunk Size
Do not blindly adopt published benchmarks — they tested different documents and queries than yours. Run your own experiments:
Step 1: Create a representative evaluation set. Select 50-100 queries that reflect your real user base. Include simple factoid questions, multi-part questions, and questions that require synthesizing information from multiple places.
Step 2: Chunk your corpus at multiple sizes. Test at least 4 sizes: 256, 512, 1024, and one document-structure-aware baseline.
Step 3: Measure retrieval and generation metrics separately.
chunk_sizes = [256, 512, 1024, 2048]
results = {}
for size in chunk_sizes:
chunks = chunk_corpus(documents, chunk_size=size, overlap=int(size * 0.15))
index = build_index(chunks)
retrieval_metrics = evaluate_retrieval(index, eval_queries) # Recall@5, Precision@5
generation_metrics = evaluate_generation(index, eval_queries) # Faithfulness, Completeness
results[size] = {
"recall@5": retrieval_metrics.recall,
"precision@5": retrieval_metrics.precision,
"faithfulness": generation_metrics.faithfulness,
"completeness": generation_metrics.completeness,
"avg_chunks_per_query": retrieval_metrics.avg_retrieved
}
Step 4: Analyze the trade-off curve. Plot retrieval precision vs. answer completeness. The optimal chunk size is where both metrics are acceptably high — typically a Pareto-optimal point where improving one metric would significantly degrade the other.
Step 5: Consider your embedding model’s training data. Most embedding models were trained on passages of a specific length. Titan Embeddings V2 was trained on passages up to 8,192 tokens but performs best on passages of 256-512 tokens. Cohere Embed v3 handles up to 512 tokens per input. Matching your chunk size to your embedding model’s sweet spot improves representation quality.
The practical recommendation: Start with 512 tokens and 10-20% overlap. Then evaluate with your actual queries and documents. There is no substitute for empirical testing on your data.
Chunking Strategy Comparison
| Strategy | Chunk Quality | Ingestion Cost | Complexity | Best For | Limitations |
|---|---|---|---|---|---|
| Fixed-size | Low-Medium | Very Low | Trivial | Homogeneous corpora, prototyping | Ignores document structure entirely |
| Recursive splitting | Medium | Very Low | Low | General-purpose, mixed documents | Size-driven, not meaning-driven |
| Semantic | High | Medium (embedding pass) | Medium | Documents with subtle topic shifts | Variable chunk sizes, embedding cost |
| Structure-aware | High | Low-Medium (parsing) | Medium | Docs with clear headings/sections | Requires structured input |
| Hierarchical (parent-child) | Very High | Low-Medium | Medium-High | Enterprise docs, knowledge bases | More complex retrieval logic |
| Late chunking | High | High (full-doc embedding) | High | Co-reference-heavy documents | Requires long-context embedding model |
| Proposition-based | Very High (precision) | Very High (LLM calls) | High | High-value factoid retrieval | Expensive, many small chunks |
| Context-enriched | Very High | High (LLM calls) | Medium | Any corpus (universal improvement) | Cost at scale, prompt caching helps |
| Agentic | Highest | Very High (LLM calls) | Very High | Legal, medical, complex cross-references | Slow, expensive, non-deterministic |
Decision guide:
- Prototyping or low-budget? → Recursive splitting with 512 tokens
- Structured enterprise docs? → Document-structure-aware + hierarchical
- Highest quality, cost not primary concern? → Context-enriched + hierarchical + reranking
- Factoid QA over dense content? → Proposition-based as child chunks with paragraph-level parents
- Documents with heavy pronouns and references? → Late chunking or context-enriched
Metadata Enrichment
Every chunk should carry metadata beyond just the text:
- Source document (title, URL, last updated)
- Section/heading hierarchy (where in the document this chunk lives)
- Document type (FAQ, policy, tutorial, API reference)
- Entity tags (products, services, concepts mentioned)
- Access control tags (department, classification level)
- Chunk position (beginning, middle, end of document — useful for summaries vs. details)
This metadata enables filtered retrieval — when a user asks about “S3 pricing,” you can filter to pricing documents before semantic search, dramatically improving precision.
chunk_metadata = {
"source": "s3-developer-guide-2025.pdf",
"source_url": "https://docs.aws.amazon.com/s3/latest/userguide/",
"section_path": "Storage Classes > S3 Standard",
"doc_type": "technical_documentation",
"entities": ["S3", "S3 Standard", "storage class"],
"last_updated": "2025-11-15",
"access_level": "public",
"chunk_index": 14,
"total_chunks": 87
}
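A sketch of filtered retrieval against a Bedrock Knowledge Base, assuming doc_type and access_level were ingested as chunk metadata (the knowledge base ID is a placeholder):
import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "How is S3 Standard storage priced?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 8,
            "filter": {
                "andAll": [
                    {"equals": {"key": "doc_type", "value": "technical_documentation"}},
                    {"equals": {"key": "access_level", "value": "public"}},
                ]
            },
        }
    },
)
for result in response["retrievalResults"]:
    print(result["score"], result["content"]["text"][:80])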
AWS: Bedrock Knowledge Base Chunking
Bedrock Knowledge Bases offers several chunking strategies out of the box:
- Default chunking: ~300 tokens with overlap (reasonable starting point)
- Fixed-size chunking: Configurable size and overlap
- Hierarchical chunking: Parent-child chunking with configurable parent and child sizes
- Semantic chunking: Groups text by semantic similarity using a configurable breakpoint threshold
- No chunking: Treats each file as a single chunk (useful for short documents or pre-chunked data)
- Custom transformation: Use a Lambda function for arbitrary chunking logic
For most production deployments, start with hierarchical or semantic chunking in Bedrock KB, then evaluate. If you need document-structure-aware parsing (especially for PDFs with complex layouts), use custom transformation with Amazon Textract or a third-party parser.
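A sketch of configuring hierarchical chunking when creating a data source — semantic and fixed-size chunking follow the same chunkingConfiguration pattern; the IDs, ARNs, and token sizes are placeholders to tune:
import boto3
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
response = bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    name="hierarchical-chunked-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-documents-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent chunks (returned to the LLM)
                    {"maxTokens": 300},   # child chunks (indexed for retrieval)
                ],
                "overlapTokens": 60,
            },
        }
    },
)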
Custom Transformation with Lambda
When the built-in strategies are insufficient, Bedrock KB’s custom transformation lets you implement any chunking logic via a Lambda function. The Lambda receives the parsed document and returns your custom chunks.
# Lambda function for custom Bedrock KB chunking
import json
import re
def lambda_handler(event, context):
    """
    Custom chunking Lambda for a Bedrock Knowledge Base.
    Simplified handler: assumes each entry in event["inputFiles"] carries its
    extracted text in "contentBody" plus per-file "metadata". In the actual
    service contract, content is exchanged in batches via the intermediate
    S3 bucket — adapt the I/O to your data source configuration.
    """
input_files = event.get("inputFiles", [])
output_files = []
for input_file in input_files:
content = input_file["contentBody"]
original_metadata = input_file.get("metadata", {})
# Custom logic: chunk by heading structure
chunks = chunk_by_headings(content)
chunk_results = []
for i, chunk in enumerate(chunks):
chunk_results.append({
"contentBody": chunk["text"],
"contentType": "text/plain",
"contentMetadata": {
**original_metadata,
"section_title": chunk.get("heading", ""),
"chunk_index": str(i),
"heading_level": str(chunk.get("level", 0))
}
})
output_files.append({
"originalFileLocation": input_file["originalFileLocation"],
"fileContents": chunk_results
})
return {"outputFiles": output_files}
def chunk_by_headings(text, max_chunk_size=1500):
"""Split document by heading hierarchy with size limits."""
# Split on markdown-style headings
heading_pattern = r'^(#{1,4})\s+(.+)$'
sections = []
current_section = {"heading": "Introduction", "level": 0, "text": ""}
for line in text.split("\n"):
match = re.match(heading_pattern, line)
if match:
if current_section["text"].strip():
sections.append(current_section)
level = len(match.group(1))
heading = match.group(2)
current_section = {
"heading": heading,
"level": level,
"text": f"{line}\n"
}
else:
current_section["text"] += line + "\n"
if current_section["text"].strip():
sections.append(current_section)
# Split oversized sections by paragraph
final_chunks = []
for section in sections:
if len(section["text"]) <= max_chunk_size:
final_chunks.append(section)
else:
paragraphs = section["text"].split("\n\n")
sub_chunk = {"heading": section["heading"], "level": section["level"], "text": ""}
for para in paragraphs:
if len(sub_chunk["text"]) + len(para) > max_chunk_size and sub_chunk["text"]:
final_chunks.append(sub_chunk)
sub_chunk = {
"heading": section["heading"],
"level": section["level"],
"text": ""
}
sub_chunk["text"] += para + "\n\n"
if sub_chunk["text"].strip():
final_chunks.append(sub_chunk)
return final_chunks
Setting up the custom transformation in Bedrock:
import boto3
bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')
response = bedrock_agent.create_data_source(
knowledgeBaseId='YOUR_KB_ID',
name='custom-chunked-source',
dataSourceConfiguration={
'type': 'S3',
's3Configuration': {
'bucketArn': 'arn:aws:s3:::my-documents-bucket',
'inclusionPrefixes': ['documents/']
}
},
vectorIngestionConfiguration={
'customTransformationConfiguration': {
            'intermediateStorage': {
                's3Location': {'uri': 's3://my-intermediate-bucket'}
            },
'transformations': [{
'stepToApply': 'POST_CHUNKING',
'transformationFunction': {
'transformationLambdaConfiguration': {
'lambdaArn': 'arn:aws:lambda:us-east-1:123456789012:function:custom-chunker'
}
}
}]
}
}
)
When to use custom transformation:
- Your documents have domain-specific structure the built-in parsers don’t handle (e.g., medical records with specific section codes, financial filings with XBRL tags)
- You need to integrate a specialized parser like Amazon Textract for tables and forms
- You want to implement proposition-based or context-enriched chunking
- You need to chain multiple processing steps (e.g., Textract → table extraction → heading-based chunking → metadata enrichment)
Source: AWS, “Chunking and parsing configurations,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html
Source: AWS, “Custom transformation with Lambda for Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html#kb-custom-transformation
Key Takeaways
- Start structure-aware. If your documents have headings and sections, use them. Document-structure-aware chunking outperforms fixed-size chunking with zero additional cost.
- Hierarchical is almost always worth it. The parent-child pattern (small chunks for retrieval, large chunks for generation) addresses the fundamental tension between precision and context.
- Context enrichment is the highest-leverage advanced technique. Prepending contextual summaries to chunks consistently improves retrieval across all document types and query patterns.
- Match chunk size to your embedding model. A 2,048-token chunk embedded by a model optimized for 512-token passages will have a degraded representation.
- Test with your data. Published benchmarks are a starting point. The optimal strategy depends on your documents, your queries, and your quality requirements. Run the experiment.
- Budget for iteration. Your first chunking strategy will not be your last. Build your pipeline to make strategy changes easy — chunking is the component you’ll revisit most often.
4. Embedding & Indexing
Embedding and indexing are the bridge between your chunked text and retrievable knowledge. The embedding model determines how well semantic meaning is captured; the vector store and index configuration determine how efficiently and accurately that meaning is searched at query time.
Embedding Model Selection
Your embedding model translates text into dense vector representations where semantic similarity maps to geometric proximity. The quality of these embeddings sets the ceiling for retrieval quality — no amount of reranking or query enhancement can compensate for fundamentally poor embeddings.
Key factors in choosing an embedding model:
- Domain alignment. Models trained on general web text may underperform on domain-specific jargon (medical, legal, financial). If a model scores well on MTEB but retrieval on your data is weak, domain mismatch is the likely culprit.
- Multilingual support. If your corpus includes multiple languages, you need a model explicitly trained for cross-lingual retrieval. Titan V2 and Cohere Embed v3 both handle this well.
- Dimension trade-off. Higher dimensions capture more nuance but increase storage, memory, and search latency. For most use cases, 1024 dimensions is the sweet spot. Below 512, you lose meaningful semantic distinctions; above 2048, the marginal quality gain rarely justifies the cost.
- Context window. Your chunks must fit within the model’s max input tokens. If you use 1024-token chunks, a model with a 512-token limit will silently truncate half the content.
| Model | Dimensions | Max Tokens | MTEB Avg (Retrieval) | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|---|
| Amazon Titan Embeddings V2 | 256/512/1024 | 8,192 | ~63 | $0.02 | Native Bedrock, configurable dims, good multilingual |
| Cohere Embed v3 | 1024 | 512 | ~66 | $0.10 | Top-tier search quality, int8/binary compression |
| Amazon Titan Text Embeddings V1 | 1536 | 8,192 | ~60 | $0.02 | Good baseline, fixed dimensions |
| BGE-M3 (open source) | 1024 | 8,192 | ~65 | Self-host | Multi-lingual, multi-granularity, dense+sparse |
| E5-Mistral-7B (open source) | 4096 | 32,768 | ~67 | Self-host | Instruction-tuned, excellent zero-shot |
| GTE-Qwen2-7B (open source) | 3584 | 131,072 | ~68 | Self-host | Very long context, strong multilingual |
Note: MTEB scores are approximate and vary by task subset. Always benchmark on your data.
Practical guidance: If you’re in the AWS ecosystem, Titan Embeddings V2 is the path of least resistance — no data leaves your VPC, pricing is straightforward ($0.02/1M tokens), and integration with Bedrock Knowledge Bases is seamless. The configurable dimensionality (256/512/1024) lets you trade quality for cost and speed. If retrieval quality is your bottleneck, benchmark Cohere Embed v3 — it consistently ranks at the top for search tasks on the MTEB leaderboard and is available on Bedrock.
For self-hosted models, BGE-M3 is a strong choice if you need both dense and sparse embeddings from a single model (useful for hybrid search without maintaining separate indexes). Deploy it on SageMaker with a ml.g5.xlarge instance for a good cost-performance balance.
import boto3
import json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def embed_text(text: str, dimensions: int = 1024, normalize: bool = True) -> list[float]:
"""Embed text using Titan Embeddings V2 via Bedrock."""
response = bedrock_runtime.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({
"inputText": text,
"dimensions": dimensions,
"normalize": normalize
})
)
return json.loads(response["body"].read())["embedding"]
# Embed a chunk
embedding = embed_text("Amazon S3 provides eleven nines of durability.", dimensions=1024)
# Returns a 1024-dimensional normalized vector
Matryoshka embeddings and dimension reduction. Some models (Titan V2, text-embedding-3-large) support Matryoshka Representation Learning (MRL), where the most important information is packed into the first N dimensions. You can truncate a 1024-dim embedding to 256 dims with modest quality loss (~2-5% drop in recall). This is useful for cost optimization: 256-dim vectors use 4× less storage and search ~3× faster.
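A sketch of MRL-style truncation: keep the first N dimensions and re-normalize so cosine similarity remains meaningful (only valid for models trained with a Matryoshka objective):
import numpy as np
def truncate_embedding(embedding: list[float], target_dims: int = 256) -> np.ndarray:
    """Truncate an MRL embedding to target_dims and re-normalize to unit length."""
    vec = np.asarray(embedding[:target_dims], dtype=np.float32)
    return vec / np.linalg.norm(vec)
small = truncate_embedding(embedding, target_dims=256)  # `embedding` from the Titan example above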
Source: Hugging Face, “MTEB Leaderboard,” https://huggingface.co/spaces/mteb/leaderboard
Source: Kusupati et al., “Matryoshka Representation Learning,” NeurIPS 2022, https://arxiv.org/abs/2205.13147
Vector Store Selection
The vector store is where your embeddings live and where retrieval queries execute. Choosing the right one depends on your scale, search requirements, and existing infrastructure.
| Vector Store | Managed on AWS | Hybrid Search | Metadata Filtering | Max Vectors | Best For |
|---|---|---|---|---|---|
| OpenSearch Serverless | Yes | Yes (BM25 + kNN) | Yes (complex filters) | Billions | Production hybrid search at scale |
| Aurora PostgreSQL (pgvector) | Yes | Limited (requires custom) | Yes (full SQL) | Millions | Teams already on Aurora, SQL joins |
| Amazon Neptune Analytics | Yes | Graph + vector | Yes (Gremlin/openCypher) | Millions | Knowledge graph + vector hybrid |
| Amazon MemoryDB | Yes | Yes (via VSS module) | Yes | Millions | Ultra-low latency, real-time |
| Pinecone | Third-party | Yes (sparse-dense) | Yes | Billions | Simplicity, fast iteration |
| FAISS | Self-managed | No | No | Billions (on large instances) | Prototyping, batch processing |
OpenSearch Serverless is the pragmatic choice for most AWS customers. It supports hybrid search (BM25 + kNN) natively, scales automatically, integrates with Bedrock KB, and handles complex metadata filtering with boolean logic. The downside is cost at low scale — there’s a minimum of 2 OCUs (~$350/month). For teams processing fewer than 100 queries per day, this minimum can dominate the cost profile.
Aurora PostgreSQL with pgvector is compelling when your application already uses Aurora, or when you need to join vector search results with relational data (e.g., “find similar products that are in stock and priced under $50”). The limitation is that pgvector’s HNSW implementation is less mature than OpenSearch’s, and true hybrid search requires application-level score fusion.
Amazon Neptune Analytics is the right choice when your data has rich graph relationships. It combines vector similarity search with graph traversal — retrieve chunks that are semantically similar and connected to a specific entity in your knowledge graph. This is Graph RAG territory (Section 8).
# OpenSearch Serverless — hybrid search example
hybrid_query = {
"size": 10,
"query": {
"hybrid": {
"queries": [
{
"match": { # BM25 (sparse)
"text": {"query": "S3 cross-region replication setup"}
}
},
{
"knn": { # Dense (vector)
"embedding": {
"vector": query_embedding,
"k": 20
}
}
}
]
}
}
}
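Note that OpenSearch executes hybrid queries through a search pipeline containing a normalization processor, which rescales and combines the BM25 and kNN scores. A sketch of the pipeline body — the weights and technique names are starting points to tune, and you should confirm hybrid query support for your OpenSearch Serverless collection version:
# PUT this body to _search/pipeline/hybrid-rag-pipeline,
# then pass ?search_pipeline=hybrid-rag-pipeline with the hybrid query above
hybrid_pipeline = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},  # BM25 weight, vector weight
                },
            }
        }
    ]
}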
Source: AWS, “Supported vector stores for Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-supported.html
Source: AWS, “Vector search for Amazon OpenSearch Serverless,” https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html
Indexing Parameters
For HNSW (Hierarchical Navigable Small World — the dominant indexing algorithm in production vector search), three parameters control the recall-speed-memory trade-off:
ef_construction (build-time quality): Controls the size of the dynamic candidate list during index building. Higher values produce a higher-quality graph (better recall) at the cost of slower indexing. Start with 256-512 for production. Values below 128 often produce noticeable recall degradation; values above 512 rarely improve recall enough to justify the indexing time.
M (connections per node): The number of bidirectional links per node in the graph. Higher M increases recall and memory usage. Start with 16-32. M=16 is sufficient for most corpora under 10M vectors; M=32 for larger or higher-dimensional indexes.
ef_search (query-time quality): Controls the size of the dynamic candidate list during search. Higher values improve recall at the cost of higher latency. Start with 128 and tune based on your recall-latency requirements. A common pattern is to set ef_search = 2 * top_k as a starting point.
| Parameter | Low Setting | High Setting | Impact on Recall | Impact on Speed |
|---|---|---|---|---|
| ef_construction | 128 | 512 | +5-10% recall | 2-3× slower indexing |
| M | 8 | 32 | +5-15% recall | 2-4× more memory |
| ef_search | 64 | 256 | +3-8% recall | 2-3× slower queries |
Practical tuning workflow: Build your index with ef_construction=512, M=16. Then test ef_search values from 64 to 512 on your evaluation set, plotting recall@10 vs. P95 latency. Pick the point where increasing ef_search no longer meaningfully improves recall — typically around 128-256 for most workloads.
Distance metric. Use cosine similarity for normalized embeddings (most embedding models normalize by default). If embeddings are not normalized, use inner product. Euclidean (L2) distance is rarely the best choice for text embeddings but is supported everywhere as a fallback.
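A sketch of an OpenSearch index mapping that sets these parameters — parameter names follow the OpenSearch k-NN plugin; verify which engines and settings your collection type supports:
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 128,   # query-time candidate list size
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
            "text": {"type": "text"},          # BM25 field for hybrid search
            "doc_type": {"type": "keyword"},   # metadata filter field
        }
    },
}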
Source: Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs,” 2018, https://arxiv.org/abs/1603.09320
5. Query Understanding & Enhancement
Most RAG tutorials show a simple flow: user query → embed → retrieve → generate. In production, this naive approach fails surprisingly often. Users ask ambiguous questions, use different terminology than your documents, or pose complex questions that require multiple retrieval passes.
A query understanding layer between the user and the retriever is one of the highest-ROI investments in a RAG system. This section covers every major technique in depth — with implementation patterns, trade-offs, and guidance on when to use each.
5.1 Query Rewriting / Reformulation
The user says: “How do I fix the timeout issue?” Your documents don’t contain the word “fix” — they say “troubleshoot” and “resolve.” A rewrite step bridges this vocabulary gap.
import boto3, json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def rewrite_query(original_query: str, domain_context: str = "") -> str:
"""Rewrite a user query to improve retrieval against technical docs."""
prompt = f"""Rewrite the following user question to be more specific and use
terminology likely found in technical documentation. Preserve the original intent.
Do not answer the question — only reformulate it.
Domain context: {domain_context}
Original question: {original_query}
Rewritten question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 150,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
# Example
original = "How do I fix the timeout issue?"
rewritten = rewrite_query(original, domain_context="AWS Lambda")
# → "How to troubleshoot and resolve AWS Lambda function timeout errors"
This is cheap (a single fast LLM call with Haiku at ~$0.00025 per rewrite) and often improves retrieval hit rate by 10–20%.
When it helps most: Technical domains where user vocabulary diverges from documentation vocabulary. When to skip: When users already use precise terminology (e.g., internal API consumers querying API docs).
Source: Ma et al., “Query Rewriting in Retrieval-Augmented Large Language Models,” 2023, https://arxiv.org/abs/2305.14283
5.2 Query Expansion
Query expansion enriches the original query with additional terms to increase recall. Two primary strategies:
Synonym Injection
Append synonyms or domain-equivalent terms to the query before embedding or keyword search:
# Domain-specific synonym map (can be auto-generated or curated)
SYNONYM_MAP = {
"timeout": ["timeout error", "request timeout", "connection timeout", "deadline exceeded"],
"slow": ["high latency", "performance degradation", "long response time"],
"crash": ["application crash", "unhandled exception", "segmentation fault", "OOM killed"],
"deploy": ["deployment", "release", "rollout", "ship"],
}
def expand_with_synonyms(query: str, synonym_map: dict, max_expansions: int = 3) -> str:
"""Expand query with domain-specific synonyms for improved keyword recall."""
expansions = []
query_lower = query.lower()
for term, synonyms in synonym_map.items():
if term in query_lower:
expansions.extend(synonyms[:max_expansions])
if expansions:
return f"{query} ({', '.join(expansions)})"
return query
# Example
expand_with_synonyms("Lambda timeout when processing large files")
# → "Lambda timeout when processing large files (timeout error, request timeout, connection timeout)"
Entity Expansion
Resolve abbreviations, acronyms, and shorthand references to their full forms:
ENTITY_MAP = {
"S3": "Amazon Simple Storage Service (S3)",
"EKS": "Amazon Elastic Kubernetes Service (EKS)",
"Lambda": "AWS Lambda",
"RDS": "Amazon Relational Database Service (RDS)",
"IAM": "AWS Identity and Access Management (IAM)",
"VPC": "Amazon Virtual Private Cloud (VPC)",
}
def expand_entities(query: str, entity_map: dict) -> str:
"""Expand acronyms and abbreviations to full names for retrieval."""
expanded = query
for abbrev, full_name in entity_map.items():
if abbrev in query and full_name not in query:
expanded = expanded.replace(abbrev, full_name, 1)
return expanded
# Example
expand_entities("How to connect S3 to Lambda")
# → "How to connect Amazon Simple Storage Service (S3) to AWS Lambda"
Trade-off: Synonym injection improves recall but can reduce precision — the expanded terms may match irrelevant documents. Use sparingly and combine with reranking to filter noise. Entity expansion is nearly always beneficial and carries low risk.
Source: Carpineto & Romano, “A Survey of Automatic Query Expansion in Information Retrieval,” ACM Computing Surveys, 2012, https://doi.org/10.1145/2071389.2071390
5.3 Query Fusion / RAG-Fusion
RAG-Fusion generates multiple variants of the user’s query, retrieves documents for each variant independently, then combines results using Reciprocal Rank Fusion (RRF). This dramatically improves recall by approaching the knowledge base from multiple angles.
import hashlib
from collections import defaultdict
def generate_query_variants(original_query: str, n_variants: int = 4) -> list[str]:
"""Use an LLM to generate diverse reformulations of the query."""
prompt = f"""Generate {n_variants} different versions of the following question.
Each version should approach the topic from a different angle or use different
phrasing, while preserving the original intent. Return one question per line.
Original question: {original_query}
Variants:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 300,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response["body"].read())["content"][0]["text"]
variants = [line.strip().lstrip("0123456789.-) ") for line in result.strip().split("\n") if line.strip()]
return [original_query] + variants[:n_variants]
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
"""Combine multiple ranked result lists using Reciprocal Rank Fusion.
Args:
ranked_lists: List of ranked document ID lists (one per query variant).
k: RRF constant (default 60, as in the original paper).
Returns:
Fused ranked list of document IDs.
"""
scores = defaultdict(float)
for ranked_list in ranked_lists:
for rank, doc_id in enumerate(ranked_list, start=1):
scores[doc_id] += 1.0 / (k + rank)
return sorted(scores, key=scores.get, reverse=True)
def rag_fusion_retrieve(query: str, retriever, n_variants: int = 4, top_k: int = 10) -> list[str]:
"""Full RAG-Fusion pipeline: generate variants, retrieve for each, fuse results."""
variants = generate_query_variants(query, n_variants)
# Retrieve in parallel for each variant
ranked_lists = []
for variant in variants:
results = retriever.search(variant, top_k=top_k)
ranked_lists.append([doc.id for doc in results])
# Fuse using RRF
fused_ids = reciprocal_rank_fusion(ranked_lists)
return fused_ids[:top_k]
Why RRF over simple score aggregation: Different query variants may use different retrieval paths (some hit dense search hard, others match keyword patterns). Raw scores aren’t comparable across these paths. RRF uses only rank positions, making it score-agnostic and robust.
Typical improvement: RAG-Fusion improves Recall@10 by 5–15% compared to single-query retrieval, at the cost of N× retrieval latency (mitigated by parallel execution).
Source: Raudaschl, “RAG-Fusion: a New Take on Retrieval-Augmented Generation,” 2023, https://arxiv.org/abs/2402.03367
Source: Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods,” SIGIR 2009, https://dl.acm.org/doi/10.1145/1571941.1572114
5.4 HyDE — Hypothetical Document Embeddings
Instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer for retrieval.
def hyde_retrieve(query: str, retriever, embed_fn) -> list:
"""Generate a hypothetical document, embed it, and retrieve similar real documents."""
hyde_prompt = f"""Write a short, factual passage (3-5 sentences) that would answer
this question as if it appeared in official technical documentation.
Do not hedge or add caveats — write as if stating established facts.
Question: {query}
Passage:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"messages": [{"role": "user", "content": hyde_prompt}]
})
)
hypothetical_doc = json.loads(response["body"].read())["content"][0]["text"]
# Embed the hypothetical document (not the original query)
hyde_embedding = embed_fn(hypothetical_doc)
# Retrieve using the HyDE embedding
return retriever.search_by_vector(hyde_embedding, top_k=10)
Why it works: The hypothetical answer occupies the same embedding space as your documents — it’s declarative, technical, and detailed — while user queries are short and interrogative. The embedding similarity between the hypothetical answer and real documents is often higher than between the raw query and the documents.
Caveat: HyDE adds one LLM call (~200–400ms with Haiku). It works best for technical and factual queries. For simple keyword lookups (“S3 pricing table”), it can actually hurt by introducing noise from the hypothetical generation. Use it selectively based on query classification (see Section 5.5).
Source: Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels,” 2023, https://arxiv.org/abs/2212.10496
5.5 Intent Classification & Query Routing
Not all queries should follow the same enhancement path. A routing layer classifies the query and directs it to the appropriate strategy.
Intent Classification Taxonomy
| Intent Category | Example Queries | Recommended Pipeline |
|---|---|---|
| Factoid | “What is the max item size in DynamoDB?” | Direct retrieval, no enhancement needed |
| How-to / Procedural | “How do I enable versioning on S3?” | Query rewriting + step-back prompting |
| Comparison | “DynamoDB vs Aurora for write-heavy workloads” | Query decomposition + parallel retrieval |
| Troubleshooting | “Lambda function timing out on large payloads” | Synonym expansion + HyDE |
| Conceptual / Explanatory | “Explain eventual consistency in DynamoDB” | HyDE + step-back prompting |
| Multi-hop / Analytical | “Total cost of a 3-node OpenSearch cluster for RAG” | Full decomposition + multi-hop retrieval |
| Conversational / Follow-up | “What about the pricing?” (after discussing S3) | Context condensation (see 5.7) |
| Out-of-scope | “What’s the weather today?” | Skip retrieval, respond directly or decline |
Implementation
from enum import Enum
class QueryIntent(Enum):
FACTOID = "factoid"
HOWTO = "howto"
COMPARISON = "comparison"
TROUBLESHOOTING = "troubleshooting"
CONCEPTUAL = "conceptual"
ANALYTICAL = "analytical"
CONVERSATIONAL = "conversational"
OUT_OF_SCOPE = "out_of_scope"
def classify_intent(query: str, conversation_history: list = None) -> QueryIntent:
"""Classify query intent to route to the appropriate enhancement pipeline."""
history_context = ""
if conversation_history:
history_context = f"\nConversation history: {conversation_history[-3:]}"
prompt = f"""Classify this query into exactly one category.
Categories: factoid, howto, comparison, troubleshooting, conceptual, analytical, conversational, out_of_scope
{history_context}
Query: {query}
Category:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 20,
"messages": [{"role": "user", "content": prompt}]
})
)
    category = json.loads(response["body"].read())["content"][0]["text"].strip().lower()
    try:
        return QueryIntent(category)
    except ValueError:
        # Fall back to a safe default if the model returns an unexpected label
        return QueryIntent.FACTOID
def route_query(query: str, intent: QueryIntent, retriever, embed_fn) -> list:
"""Route query to appropriate enhancement + retrieval pipeline based on intent."""
match intent:
case QueryIntent.FACTOID:
return retriever.search(query, top_k=5)
case QueryIntent.HOWTO:
rewritten = rewrite_query(query)
return retriever.search(rewritten, top_k=7)
case QueryIntent.COMPARISON:
# Decompose into sub-queries, retrieve for each
sub_queries = decompose_query(query)
all_results = []
for sq in sub_queries:
all_results.extend(retriever.search(sq, top_k=5))
return deduplicate_and_rank(all_results)
case QueryIntent.TROUBLESHOOTING:
expanded = expand_with_synonyms(query, SYNONYM_MAP)
return hyde_retrieve(expanded, retriever, embed_fn)
case QueryIntent.CONCEPTUAL:
return hyde_retrieve(query, retriever, embed_fn)
case QueryIntent.ANALYTICAL:
return rag_fusion_retrieve(query, retriever, n_variants=4)
        case QueryIntent.OUT_OF_SCOPE:
            return []  # Skip retrieval
        case _:
            # CONVERSATIONAL and any unmapped intents: condense upstream (see 5.7), then plain retrieval
            return retriever.search(query, top_k=5)
On AWS: For routing, a common approach is a Lambda-based classifier as a preprocessing step — lightweight, deterministic, and easy to debug. Bedrock Agents can also handle routing by defining multiple Knowledge Bases as tools, though in practice many teams prefer explicit orchestration for better control and observability (see Section 5.9).
Source: Jeong et al., “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity,” 2024, https://arxiv.org/abs/2403.14403
5.6 Query Decomposition
Complex questions need to be broken into sub-questions. A single retrieval pass cannot surface all the information needed for: “Compare the pricing and performance of DynamoDB and Aurora for a write-heavy workload with 10,000 TPS.”
def decompose_query(query: str, max_sub_queries: int = 5) -> list[str]:
"""Break a complex query into independently answerable sub-questions."""
prompt = f"""Break this complex question into 2-{max_sub_queries} simpler
sub-questions that can each be answered independently by searching a knowledge base.
Each sub-question should be self-contained (no references to other sub-questions).
Return one sub-question per line.
Complex question: {query}
Sub-questions:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 300,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response["body"].read())["content"][0]["text"]
return [line.strip().lstrip("0123456789.-) ") for line in result.strip().split("\n") if line.strip()]
# Example
decompose_query("Compare the pricing and performance of DynamoDB and Aurora for write-heavy workloads")
# → ["What is the pricing model for Amazon DynamoDB?",
# "What is the pricing model for Amazon Aurora?",
# "What are the write performance characteristics of DynamoDB?",
# "What are the write performance characteristics of Aurora?",
# "What are best practices for write-heavy workloads on AWS databases?"]
Retrieve for each sub-question independently, then feed all retrieved chunks to the LLM with the original complex question.
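A compact sketch of that flow, reusing decompose_query from above; retriever.search, the .text attribute on its results, and generate_fn (your LLM call wrapper) are placeholders:
def answer_complex_question(query: str, retriever, generate_fn, top_k: int = 5) -> str:
    """Decompose, retrieve per sub-question, then answer the original question in one pass."""
    sub_queries = decompose_query(query)
    chunks = []
    for sq in sub_queries:
        chunks.extend(retriever.search(sq, top_k=top_k))
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate_fn(prompt)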
Source: Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models,” 2023, https://arxiv.org/abs/2210.03350
5.7 Conversation Context Management for Multi-Turn RAG
In conversational RAG, the current query often depends on conversation history. Effective multi-turn handling requires more than simple context concatenation.
Context Condensation
Resolve pronouns and implicit references to produce a standalone query:
def condense_with_history(
conversation_history: list[dict],
current_query: str,
max_history_turns: int = 5
) -> str:
"""Rewrite a follow-up query as a standalone question using conversation history."""
# Trim to recent history to manage context and cost
recent_history = conversation_history[-max_history_turns:]
history_text = "\n".join(
f"{'User' if turn['role'] == 'user' else 'Assistant'}: {turn['content']}"
for turn in recent_history
)
prompt = f"""Given this conversation history and follow-up question, rewrite
the follow-up as a standalone question that can be understood without the conversation.
Preserve all specifics — do not generalize or lose detail.
Conversation history:
{history_text}
Follow-up question: {current_query}
Standalone question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 150,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
# Example
history = [
{"role": "user", "content": "Tell me about S3 storage classes."},
{"role": "assistant", "content": "S3 offers several storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant, Glacier Flexible, and Glacier Deep Archive."},
]
condense_with_history(history, "Which one is cheapest for infrequent access?")
# → "Which Amazon S3 storage class is cheapest for infrequently accessed data?"
Topic Drift Detection
In long conversations, the topic may shift. Detect when the user changes subject so you don’t carry stale context:
def detect_topic_shift(conversation_history: list[dict], current_query: str) -> bool:
"""Detect whether the current query represents a topic shift from the conversation."""
if len(conversation_history) < 2:
return False
recent_context = " ".join(turn["content"] for turn in conversation_history[-4:])
prompt = f"""Is the following new question a continuation of the prior conversation,
or a shift to a new topic? Answer only "continuation" or "new_topic".
Recent conversation context: {recent_context}
New question: {current_query}
Answer:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{"role": "user", "content": prompt}]
})
)
answer = json.loads(response["body"].read())["content"][0]["text"].strip().lower()
return "new_topic" in answer
If a topic shift is detected, skip context condensation and treat the query as standalone. This prevents contaminating retrieval with irrelevant conversational context.
On AWS: Bedrock Knowledge Bases’ RetrieveAndGenerate API supports session management natively via sessionId. The service handles context carryover for multi-turn conversations. For custom pipelines, implement condensation in a Lambda function upstream of retrieval.
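A minimal sketch of the managed multi-turn path, reusing the bedrock_agent_runtime client from earlier and assuming an existing Knowledge Base ID and model ARN; the sessionId returned by the first call is passed back on the follow-up so the service resolves references like "which one":
config = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "YOUR_KB_ID",
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
    },
}
# First turn: no sessionId yet
first = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Tell me about S3 storage classes."},
    retrieveAndGenerateConfiguration=config,
)
session_id = first["sessionId"]
# Follow-up turn: reuse the sessionId so the service carries the conversational context
follow_up = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Which one is cheapest for infrequent access?"},
    retrieveAndGenerateConfiguration=config,
    sessionId=session_id,
)
print(follow_up["output"]["text"])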
Source: AWS, “Amazon Bedrock RetrieveAndGenerate API,” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html
5.8 Query Caching Strategies (Semantic Cache)
Not every query needs a full retrieval pass. Semantic caching stores previous query–result pairs and returns cached results when a new query is semantically similar to a previous one.
How Semantic Caching Works
New query → Embed → Search cache (cosine similarity)
→ If similarity ≥ threshold: return cached result (cache hit)
→ If similarity < threshold: run the full retrieval pipeline (cache miss) and store the result in the cache
Implementation with ElastiCache + Embedding Similarity
import hashlib
import numpy as np
import redis
import json as json_module
class SemanticCache:
"""Semantic cache using Redis for storage and cosine similarity for matching."""
def __init__(self, redis_client: redis.Redis, embed_fn, similarity_threshold: float = 0.95):
self.redis = redis_client
self.embed_fn = embed_fn
self.threshold = similarity_threshold
self.ttl_seconds = 3600 # 1 hour default TTL
def get(self, query: str) -> dict | None:
"""Check cache for a semantically similar query."""
query_embedding = self.embed_fn(query)
# Scan cached embeddings (for production, use a vector index in Redis)
cached_keys = self.redis.keys("qcache:*")
best_match = None
best_similarity = 0.0
for key in cached_keys:
cached = json_module.loads(self.redis.get(key))
cached_embedding = np.array(cached["embedding"])
similarity = np.dot(query_embedding, cached_embedding) / (
np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
)
if similarity > best_similarity:
best_similarity = similarity
best_match = cached
if best_similarity >= self.threshold and best_match:
return best_match["result"]
return None
def put(self, query: str, result: dict) -> None:
"""Store a query-result pair in the cache."""
embedding = self.embed_fn(query)
cache_entry = {
"query": query,
"embedding": embedding.tolist(),
"result": result,
}
cache_key = f"qcache:{hash(query)}"
self.redis.setex(cache_key, self.ttl_seconds, json_module.dumps(cache_entry))
Production considerations:
- Threshold tuning: 0.95+ for high-precision caching (only near-identical queries). 0.90 for broader matching (higher hit rate, risk of stale results).
- TTL management: Set TTL based on how frequently your underlying documents change. For static knowledge bases, longer TTLs (hours/days); for dynamic content, shorter (minutes).
- Cache invalidation: Invalidate cache entries when the underlying knowledge base is updated. Tag cache entries with the data source version.
- Redis with vector search: For production scale, use Redis with the RediSearch module (or Amazon MemoryDB with vector search) instead of brute-force scanning.
Cost impact: At 1,000 queries/day with a 30% cache hit rate, you save ~300 embedding calls, ~300 retrieval operations, and (if caching final responses) ~300 LLM generation calls per day.
Source: Zhu et al., “GPTCache: An Open-Source Semantic Cache for LLM Applications,” 2023, https://arxiv.org/abs/2311.09820
Source: AWS, “Amazon MemoryDB for Redis,” https://aws.amazon.com/memorydb/
5.9 AWS Implementation: Query Enhancement Pipeline
Option A: Agent-Based Orchestration
An LLM-powered agent can handle query understanding by treating retrieval as a tool. The agent reasons about the query, decides whether to decompose it, routes sub-questions to different knowledge sources, and synthesizes results. This can be implemented with frameworks like LangGraph, CrewAI, or Strands Agents SDK, or with managed services like Amazon Bedrock Agents.
# Conceptual agent-based query pipeline
# The orchestrating agent:
# 1. Analyzes the query complexity
# 2. Decomposes into sub-questions if needed
# 3. Routes each sub-question to the relevant knowledge source
# 4. Retrieves and synthesizes a final answer
Pros: Flexible, handles novel query patterns without explicit rules. Cons: Less predictable than explicit pipelines; harder to debug when the agent makes suboptimal routing decisions; higher latency due to reasoning steps.
Option B: Lambda-Based Query Pipeline (Full Control)
For fine-grained control over every enhancement step, build a custom pipeline using Lambda:
# Lambda function: query_enhancer
# Triggered by API Gateway or Step Functions
import json
import boto3
bedrock_runtime = boto3.client("bedrock-runtime")
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
def lambda_handler(event, context):
query = event["query"]
conversation_history = event.get("history", [])
# Step 1: Context condensation (if multi-turn)
if conversation_history:
query = condense_with_history(conversation_history, query)
# Step 2: Intent classification
intent = classify_intent(query, conversation_history)
# Step 3: Check semantic cache
cached = semantic_cache.get(query)
if cached:
return {"statusCode": 200, "body": json.dumps(cached), "cache_hit": True}
# Step 4: Apply enhancement based on intent
match intent:
case QueryIntent.FACTOID:
enhanced_queries = [query]
case QueryIntent.COMPARISON | QueryIntent.ANALYTICAL:
enhanced_queries = decompose_query(query)
case QueryIntent.TROUBLESHOOTING:
enhanced_queries = [expand_with_synonyms(query, SYNONYM_MAP)]
case QueryIntent.CONCEPTUAL:
# Use HyDE — return the hypothetical doc for embedding
enhanced_queries = [generate_hypothetical_doc(query)]
case _:
enhanced_queries = [rewrite_query(query)]
# Step 5: Retrieve from Bedrock Knowledge Base for each enhanced query
all_results = []
for eq in enhanced_queries:
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId="KB_ID",
retrievalQuery={"text": eq},
retrievalConfiguration={
"vectorSearchConfiguration": {
"numberOfResults": 10,
"overrideSearchType": "HYBRID"
}
}
)
all_results.extend(response["retrievalResults"])
# Step 6: Deduplicate and rank
unique_results = deduplicate_by_content(all_results)
# Step 7: Cache and return
result = {"enhanced_queries": enhanced_queries, "results": unique_results}
semantic_cache.put(query, result)
return {"statusCode": 200, "body": json.dumps(result)}
Architecture with Step Functions: For complex orchestration (parallel retrieval, conditional branching), use AWS Step Functions to coordinate multiple Lambda functions:
API Gateway → Step Functions
→ Lambda: Condense Context (if multi-turn)
→ Lambda: Classify Intent
→ Choice State (branch by intent)
→ Parallel: Retrieve for each sub-query
→ Lambda: Fuse & Deduplicate Results
→ Lambda: Rerank
→ Lambda: Generate Response
Source: AWS, “Amazon Bedrock Agents,” https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html
Source: AWS, “Retrieve API — Amazon Bedrock,” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_Retrieve.html
5.10 Step-Back Prompting
For highly specific questions, “step back” to a broader question first:
Specific: “What’s the maximum item size in DynamoDB?”
Step-back: “What are the limits and quotas for DynamoDB?”
Retrieving against the broader question often surfaces a comprehensive limits document that contains the specific answer, whereas the specific query might miss it if the exact phrasing doesn’t match.
def step_back_query(specific_query: str) -> str:
"""Generate a broader 'step-back' version of a specific question."""
prompt = f"""Given this specific question, generate a broader question that would
retrieve a comprehensive document containing the answer. The broader question should
cover the general topic area of the specific question.
Specific question: {specific_query}
Broader question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 100,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
Best combined with the original query: Retrieve for both the specific query and the step-back query, then merge results. This captures both precise matches and comprehensive overview documents.
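A small sketch of that combination, reusing step_back_query above and the reciprocal_rank_fusion helper from Section 5.3; retriever.search is a placeholder for your retrieval client:
def retrieve_with_step_back(query: str, retriever, top_k: int = 10) -> list[str]:
    """Retrieve for both the specific and the step-back query, then fuse the two rankings."""
    broad_query = step_back_query(query)
    specific_ids = [doc.id for doc in retriever.search(query, top_k=top_k)]
    broad_ids = [doc.id for doc in retriever.search(broad_query, top_k=top_k)]
    return reciprocal_rank_fusion([specific_ids, broad_ids])[:top_k]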
Source: Zheng et al., “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models,” 2023, https://arxiv.org/abs/2310.06117
5.11 Technique Comparison: When to Use Which
| Technique | Latency Added | Quality Lift | Best For | Avoid When |
|---|---|---|---|---|
| Query Rewriting | ~200ms | +10–20% hit rate | Vocabulary mismatch between users and docs | Users already use precise terminology |
| Synonym Expansion | <10ms | +5–10% recall | Keyword-heavy hybrid search | Dense-only retrieval (adds noise) |
| Entity Expansion | <10ms | +5–15% recall | Acronym-heavy domains (AWS, medical) | Documents already use acronyms exclusively |
| RAG-Fusion | N × retrieval latency | +5–15% recall | Complex queries, diverse document corpus | Simple factoid queries (overkill) |
| HyDE | ~300ms | +10–25% for conceptual Qs | Technical/conceptual queries | Simple keyword lookups, time-sensitive queries |
| Query Decomposition | ~200ms + N × retrieval | +20–30% for complex Qs | Multi-part comparisons, analytical queries | Simple single-fact questions |
| Step-Back Prompting | ~200ms | +10–15% for specific Qs | Highly specific questions against broad docs | Already broad or vague queries |
| Context Condensation | ~200ms | Required for multi-turn | Any multi-turn conversation | Single-turn interactions |
| Semantic Cache | ~5ms (hit) | No quality change | High query repetition (FAQ-style) | Diverse, unique queries; rapidly changing docs |
| Intent Classification | ~150ms | Enables all above | Systems using 2+ techniques | Single-technique pipelines |
The recommended starting point for production:
- Intent classification (always — it’s the router)
- Query rewriting (high ROI, low cost)
- Context condensation (if multi-turn)
- Semantic cache (if >20% query repetition)
Add decomposition, HyDE, and RAG-Fusion incrementally based on evaluation data showing specific failure modes.
5.12 Evaluating Query Enhancement
How do you know your query enhancement is actually helping? Measure retrieval quality with and without each technique.
Key Metrics
| Metric | What It Measures | How to Compute |
|---|---|---|
| Δ Recall@K | Change in recall after enhancement | recall_enhanced - recall_baseline |
| Δ Hit Rate | Change in hit rate after enhancement | hit_rate_enhanced - hit_rate_baseline |
| Δ NDCG@K | Change in ranking quality | ndcg_enhanced - ndcg_baseline |
| Query latency overhead | Added latency from enhancement | p95_enhanced - p95_baseline |
| Cache hit rate | Fraction of queries served from cache | cache_hits / total_queries |
| Enhancement coverage | Fraction of queries that get enhanced | enhanced_queries / total_queries |
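As a sketch of how the first row can be computed over a labeled evaluation set (each item pairs a query with the IDs of its known-relevant chunks; retrieve_baseline and retrieve_enhanced are whatever two pipelines you are comparing, each returning ranked chunk IDs):
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def delta_recall(eval_set: list[dict], retrieve_baseline, retrieve_enhanced, k: int = 10) -> float:
    """Average Recall@K improvement of the enhanced pipeline over the baseline."""
    deltas = []
    for item in eval_set:
        relevant = set(item["relevant_ids"])
        baseline = recall_at_k(retrieve_baseline(item["query"]), relevant, k)
        enhanced = recall_at_k(retrieve_enhanced(item["query"]), relevant, k)
        deltas.append(enhanced - baseline)
    return sum(deltas) / len(deltas)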
A/B Testing Framework
import random
import time
def query_pipeline_ab_test(query: str, retriever, config: dict) -> dict:
"""Route queries to baseline or enhanced pipeline for A/B testing."""
# Random assignment with configurable traffic split
    use_enhanced = random.random() < config.get("enhanced_traffic_pct", 0.5)
    start = time.perf_counter()
    if use_enhanced:
        enhanced_query = rewrite_query(query)
        results = retriever.search(enhanced_query, top_k=10)
        pipeline = "enhanced"
    else:
        results = retriever.search(query, top_k=10)
        pipeline = "baseline"
    elapsed_ms = (time.perf_counter() - start) * 1000  # defined here so the log call below works
# Log for analysis
log_retrieval_event(
original_query=query,
pipeline=pipeline,
result_ids=[r.id for r in results],
latency_ms=elapsed_ms,
)
return {"results": results, "pipeline": pipeline}
Practical guidance: Run A/B tests for at least 500 queries per arm before drawing conclusions. Track both retrieval metrics (recall, NDCG) and downstream generation metrics (faithfulness, user satisfaction) — an enhancement that improves retrieval may not always improve the final answer if the LLM was already generating well from the baseline context.
Source: RAGAS Documentation, “Metrics,” https://docs.ragas.io/en/latest/concepts/metrics/
References (Section 5)
- Ma, X. et al. (2023). “Query Rewriting in Retrieval-Augmented Large Language Models.” https://arxiv.org/abs/2305.14283
- Carpineto, C. & Romano, G. (2012). “A Survey of Automatic Query Expansion in Information Retrieval.” ACM Computing Surveys. https://doi.org/10.1145/2071389.2071390
- Raudaschl, A. (2023). “RAG-Fusion: a New Take on Retrieval-Augmented Generation.” https://arxiv.org/abs/2402.03367
- Cormack, G. et al. (2009). “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods.” SIGIR 2009. https://dl.acm.org/doi/10.1145/1571941.1572114
- Gao, L. et al. (2023). “Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).” https://arxiv.org/abs/2212.10496
- Jeong, S. et al. (2024). “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” https://arxiv.org/abs/2403.14403
- Press, O. et al. (2023). “Measuring and Narrowing the Compositionality Gap in Language Models.” https://arxiv.org/abs/2210.03350
- Zheng, Z. et al. (2023). “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.” https://arxiv.org/abs/2310.06117
- Zhu, Z. et al. (2023). “GPTCache: An Open-Source Semantic Cache for LLM Applications.” https://arxiv.org/abs/2311.09820
- AWS Documentation. “Amazon Bedrock Agents.” https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html
- AWS Documentation. “Retrieve API — Amazon Bedrock.” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_Retrieve.html
- AWS Documentation. “RetrieveAndGenerate API.” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html
- AWS Documentation. “Amazon MemoryDB for Redis.” https://aws.amazon.com/memorydb/
- RAGAS Documentation. “Metrics.” https://docs.ragas.io/en/latest/concepts/metrics/
6. Retrieval Strategies
Retrieval is the core of RAG — get this wrong, and even the best generation model can’t produce a good answer. This section covers the major retrieval paradigms, how to combine them, and the optimization techniques that separate production systems from prototypes.
Dense vs. Sparse vs. Hybrid
Dense retrieval (semantic/vector search) embeds queries and documents into vector space and retrieves by cosine similarity. The embedding captures semantic meaning, so “automobile” matches “car” and “EC2 instance types for machine learning” matches “compute-optimized instances for ML workloads.” Dense retrieval excels at paraphrased questions, concept-level queries, and natural language that doesn’t exactly match the terminology in your corpus.
Sparse retrieval (keyword/BM25) uses term-frequency-based matching. BM25 scores documents based on how often query terms appear, adjusted for document length and term rarity. Sparse retrieval excels at exact matches — product codes (NR-502), error messages (AccessDeniedException), proper nouns, and technical identifiers. It also handles rare domain terms that embedding models may not have seen during training.
The failure modes are complementary. Dense retrieval fails on exact-match queries (an embedding for “p3.16xlarge” may not be close to a chunk containing that specific instance type). Sparse retrieval fails on semantic queries (searching “reduce costs” won’t match a document about “cost optimization strategies” unless those exact words appear).
Hybrid search combines both approaches, and in nearly every benchmark and production deployment, it outperforms either alone. The combination is not just additive — it’s multiplicative, because each method covers the other’s blind spots.
Score fusion strategies:
# Linear combination (simplest, most common)
final_score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha = 0.5-0.7 works for most use cases; tune on your eval set
# Reciprocal Rank Fusion (RRF) — rank-based, score-agnostic
# Works better when dense and sparse scores are on different scales
def reciprocal_rank_fusion(dense_ranks, sparse_ranks, k=60):
"""Combine rankings using RRF. k is a smoothing constant."""
fused_scores = {}
for doc_id, rank in dense_ranks.items():
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank)
for doc_id, rank in sparse_ranks.items():
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank)
return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
RRF is generally more robust than linear combination because it doesn’t require score normalization. OpenSearch uses a variant of this in its native hybrid search implementation.
On AWS: OpenSearch supports hybrid search natively via the hybrid query type, combining BM25 match queries with knn vector queries. Bedrock Knowledge Bases supports configurable hybrid retrieval: set overrideSearchType to HYBRID in the vector search configuration when calling Retrieve or RetrieveAndGenerate. The default behavior automatically balances dense and sparse scoring.
import boto3
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId='YOUR_KB_ID',
retrievalQuery={'text': 'How to configure S3 lifecycle policies?'},
retrievalConfiguration={
'vectorSearchConfiguration': {
'numberOfResults': 10,
'overrideSearchType': 'HYBRID' # Enable hybrid search
}
}
)
Source: Ma et al., “A Unified Full-Pipeline Approach to Dense and Sparse Retrieval,” 2024, https://arxiv.org/abs/2401.04055
Source: Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods,” 2009, https://dl.acm.org/doi/10.1145/1571941.1572114
Metadata Filtering
Metadata filtering narrows the search space before semantic matching, making retrieval both faster and more precise. This is critically underutilized — most teams embed metadata in chunk text and hope the embedding captures it. Explicit structured filtering is deterministic, faster, and more reliable.
Effective metadata fields to index:
- Document type: policy, FAQ, API reference, tutorial, changelog
- Date: creation date, last updated, effective date
- Access level: public, internal, confidential
- Product/service: the specific product the document covers
- Language: for multilingual corpora
- Source: which system the document came from (Confluence, SharePoint, S3)
- Version: document version for versioned content
# Bedrock KB — retrieve only from pricing documents updated after 2024
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId='YOUR_KB_ID',
retrievalQuery={'text': 'What are the current EC2 pricing tiers?'},
retrievalConfiguration={
'vectorSearchConfiguration': {
'numberOfResults': 10,
'filter': {
'andAll': [
{'equals': {'key': 'doc_type', 'value': 'pricing'}},
{'greaterThan': {'key': 'updated_year', 'value': 2024}}
]
}
}
}
)
Advanced pattern: dynamic filtering. Use the query understanding layer (Section 5) to automatically extract filter conditions from the user’s query. “What changed in the S3 pricing in January 2026?” → extract {doc_type: "pricing", service: "S3", date_range: "2026-01"} and apply as metadata filters before vector search. This dramatically improves precision for structured queries.
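As a sketch of how this extraction step might look, reusing the bedrock_runtime client from earlier (the metadata keys doc_type, service, and date_range are illustrative and must match fields you actually indexed); the returned dictionary can then be translated into the equals/greaterThan operators shown in the filter example above:
def extract_metadata_filters(query: str) -> dict:
    """Ask a fast model to pull structured filter conditions out of a natural-language query."""
    prompt = f"""Extract any filter conditions from this query as JSON with the keys
doc_type, service, and date_range. Use null for anything not mentioned. Return only JSON.
Query: {query}
JSON:"""
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": prompt}]
        })
    )
    raw = json.loads(response["body"].read())["content"][0]["text"]
    try:
        filters = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # fall back to unfiltered retrieval if the model output is not valid JSON
    return {k: v for k, v in filters.items() if v}  # keep only explicit conditions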
Source: AWS, “Metadata and filtering for Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html
Reranking — The Highest-ROI Optimization
If you implement only one optimization beyond basic RAG, make it reranking. Initial retrieval (dense, sparse, or hybrid) returns a candidate set of 20-50 chunks. A reranker re-scores these candidates using a more powerful model that sees the query and each candidate together.
Why reranking works so well: Bi-encoders (used in initial retrieval) embed query and document independently — they produce separate vectors and compare them by distance. This is fast (you can pre-compute document embeddings) but misses fine-grained query-document interactions. Cross-encoders see the query and document as a single concatenated input and can capture token-level interactions: negation, specificity, conditional relevance. The result is dramatically better relevance judgments.
The trade-off is compute: a cross-encoder is ~100x slower per comparison than a bi-encoder lookup. That’s why we use a two-stage pipeline — fast, approximate retrieval first (retrieve 50 candidates from millions), then precise reranking on the small candidate set.
| Reranker | Type | Availability | Latency (20 docs) | Quality |
|---|---|---|---|---|
| Cohere Rerank v3 | Cross-encoder | Bedrock native | ~80ms | Excellent — top-tier on BEIR |
| Amazon Rerank 1.0 | Cross-encoder | Bedrock native | ~60ms | Strong, optimized for Bedrock pipeline |
| BGE Reranker v2.5 | Cross-encoder | Self-host (SageMaker) | ~100ms | Near Cohere quality, open weights |
| ColBERT v2 | Late-interaction | Self-host | ~40ms | Fast, good for latency-sensitive apps |
| FlashRank | Cross-encoder (small) | Self-host | ~20ms | Lighter, suitable for edge/low-resource |
Late-interaction models (ColBERT) are a middle ground: they pre-compute per-token document embeddings (like bi-encoders) but perform token-level matching at query time (like cross-encoders). This gives most of the quality benefit of cross-encoders with significantly lower latency. ColBERT is worth evaluating if your latency budget is tight.
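For the self-hosted options above, a minimal two-stage sketch using the open-source sentence-transformers CrossEncoder (the model name is one common public checkpoint, not a recommendation; retriever and its result objects are placeholders):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, retriever, candidates: int = 50, top_k: int = 5) -> list:
    """Stage 1: fast approximate retrieval. Stage 2: precise cross-encoder rescoring."""
    docs = retriever.search(query, top_k=candidates)
    # The cross-encoder scores each (query, document) pair jointly, capturing token-level interactions
    scores = reranker.predict([(query, doc.text) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]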
Implementation with Bedrock:
# Retrieve-and-generate with a larger candidate pool and query decomposition;
# a reranking model can also be attached via the retrieval configuration (see the rerank docs cited below)
response = bedrock_agent_runtime.retrieve_and_generate(
input={'text': 'How do I set up cross-region replication?'},
retrieveAndGenerateConfiguration={
'type': 'KNOWLEDGE_BASE',
'knowledgeBaseConfiguration': {
'knowledgeBaseId': 'YOUR_KB_ID',
'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0',
'retrievalConfiguration': {
'vectorSearchConfiguration': {
'numberOfResults': 25, # Retrieve more candidates
'overrideSearchType': 'HYBRID'
}
},
'orchestrationConfiguration': {
'queryTransformationConfiguration': {
'type': 'QUERY_DECOMPOSITION'
}
}
}
}
)
Quantitative impact: In production systems, adding reranking to a basic retrieve→generate pipeline typically improves answer quality by 15-25% (measured by faithfulness and answer relevance). Latency impact is 50-100ms — negligible for most applications. Cost is minimal (Cohere Rerank on Bedrock is $1 per 1,000 search units). The ROI is almost always positive.
Source: AWS, “Rerank for more relevant RAG responses,” https://docs.aws.amazon.com/bedrock/latest/userguide/rerank.html
Source: Thakur et al., “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models,” 2021, https://arxiv.org/abs/2104.08663
Top-K Selection
How many chunks should you feed to the LLM? This decision involves more nuance than most guides suggest.
Factors that determine optimal K:
- Chunk size: Smaller chunks → higher K (10-15). Larger chunks → lower K (3-5). The goal is to provide 2,000-5,000 tokens of context for most queries.
- Query complexity: Simple factoid (“What is the max size of an S3 object?”) → 3-5 chunks. Complex analytical (“Compare the security models of DynamoDB and Aurora”) → 10-20 chunks.
- Model context window: Don’t exceed 30-40% of the context window with retrieved context. Leave room for the system prompt, conversation history, and generation. For a 200K context model, this is generous; for an 8K model, every token counts.
- Diminishing returns: Research consistently shows that adding chunks beyond position 5-7 has rapidly diminishing returns — and can actually hurt performance if the additional chunks introduce noise or contradictions.
The “lost in the middle” effect: Liu et al. (2023) demonstrated that LLMs are significantly better at using information at the beginning and end of their context window, and tend to miss information placed in the middle. This has practical implications for RAG: if you retrieve 10 chunks, the model may effectively ignore chunks 4-7. Mitigation strategies include (1) keeping K small, (2) reranking to ensure the most relevant chunks are first, and (3) placing the most important context at the beginning and end of the prompt.
Dynamic K: Rather than fixing K, adjust it based on retrieval confidence. If the top 3 chunks all have high reranker scores (>0.8), use K=3. If scores drop gradually, increase K until scores fall below a relevance threshold. This avoids both the “too little context” and “too much noise” failure modes.
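A small sketch of dynamic K, assuming reranked chunks arrive sorted by a score attribute; the thresholds are illustrative and should be tuned on your evaluation set:
def select_dynamic_k(reranked_chunks: list, min_k: int = 3, max_k: int = 10,
                     score_threshold: float = 0.5) -> list:
    """Keep adding chunks until the reranker score falls below the relevance threshold."""
    selected = list(reranked_chunks[:min_k])  # always provide a minimal amount of context
    for chunk in reranked_chunks[min_k:max_k]:
        if chunk.score < score_threshold:
            break
        selected.append(chunk)
    return selected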
Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
7. Generation & Post-Processing
The generation phase is where retrieved context becomes a user-facing answer. Getting this right requires careful prompt engineering, intelligent context management, robust citation, and safeguards against hallucination. These details separate a RAG demo from a production system.
Prompt Engineering for RAG
RAG prompts have a fundamentally different structure from general-purpose prompts. The model must ground its response in provided context while maintaining natural language fluency — and critically, it must know when the context is insufficient.
A production RAG system prompt template:
You are a {domain} assistant. Your task is to answer the user's question
based on the provided reference documents.
## Instructions
1. Answer ONLY using information found in the provided context below.
2. If the context does not contain enough information to fully answer
the question, explicitly state what information is missing rather
than guessing.
3. Cite specific sources using [Source N] notation for each claim.
4. If sources contain conflicting information, acknowledge the
discrepancy and present both perspectives.
5. Use direct quotes sparingly — paraphrase while preserving accuracy.
6. Structure your response with clear headings for complex answers.
## Context Documents
{retrieved_chunks_with_numbered_sources}
## User Question
{user_query}
Key design decisions:
- “Don’t know” instruction is critical. Without explicit instruction to acknowledge uncertainty, models hallucinate confidently when context is insufficient. In regulated industries (healthcare, finance, legal), a confident wrong answer is far worse than “I don’t have enough information.”
- Instruction placement matters. Research shows that placing instructions before the context (not after) leads to better instruction following. The model processes the context through the lens of the instructions.
- Context ordering affects quality. Place the most relevant chunks first and last (exploiting the primacy and recency effects discussed in Section 6’s “lost in the middle” finding). If your reranker provides confidence scores, sort chunks by decreasing relevance but consider duplicating the top chunk at the end.
- XML tags for structure. For Claude models, wrap the retrieved context in XML tags (e.g., <context> ... </context>) for more reliable parsing. For other models, clear markdown separators work well.
def format_context_for_prompt(chunks: list[dict], max_tokens: int = 4000) -> str:
"""Format retrieved chunks into a numbered context block."""
context_parts = []
token_count = 0
for i, chunk in enumerate(chunks, 1):
chunk_text = f"[Source {i}] ({chunk['metadata'].get('title', 'Unknown')})\n{chunk['text']}\n"
chunk_tokens = len(chunk_text.split()) * 1.3 # rough token estimate
if token_count + chunk_tokens > max_tokens:
break
context_parts.append(chunk_text)
token_count += chunk_tokens
return "\n---\n".join(context_parts)
Context Window Management
When retrieved context exceeds what can reasonably fit in the prompt, you need a strategy. The right approach depends on the query type and the number of relevant chunks.
Strategy 1: Stuffing (Simple, K ≤ 5)
Put all chunks directly in one prompt. This is the default for most simple RAG implementations and works well when K is small and chunks are relevant. No information is lost, but the model must process everything at once.
Strategy 2: Map-Reduce (Large K, Synthesis Queries)
For questions that require information from many documents (“Summarize all pricing changes in 2025”):
- Map phase: Send each chunk (or small groups of chunks) to the LLM with a focused extraction prompt: “Extract any information about pricing changes from this passage.”
- Reduce phase: Collect all extracted summaries and send them to the LLM with the final synthesis prompt.
This adds latency (multiple LLM calls) but handles arbitrarily large context.
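A compact sketch of the map-reduce flow; generate_fn stands in for whatever Bedrock invocation wrapper you use, and the prompts are illustrative:
def map_reduce_answer(question: str, chunks: list[str], generate_fn, group_size: int = 3) -> str:
    """Map: extract relevant facts from each group of chunks. Reduce: synthesize the final answer."""
    extractions = []
    for i in range(0, len(chunks), group_size):
        group = "\n\n".join(chunks[i:i + group_size])
        extractions.append(generate_fn(
            f"Extract only information relevant to this question from the passages below.\n"
            f"Question: {question}\n\nPassages:\n{group}"
        ))
    notes = "\n".join(extractions)
    return generate_fn(
        f"Using only these extracted notes, answer the question.\n"
        f"Question: {question}\n\nNotes:\n{notes}"
    )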
Strategy 3: Iterative Refinement (Sequential, K = 5-15)
Process chunks one at a time, progressively refining the answer:
- Generate an initial answer from the first chunk.
- Present the next chunk alongside the current answer: “Here is additional context. Update your answer if this adds relevant information.”
- Repeat until all chunks are processed.
This is effective when information is distributed across chunks and each chunk may modify the answer. The downside is high latency — one LLM call per chunk.
Strategy 4: Hierarchical Summarization (Very Large K)
For scenarios where dozens of chunks are relevant (e.g., searching across thousands of support tickets for patterns):
- Group chunks by topic/source.
- Summarize each group.
- Summarize the summaries.
This is essentially map-reduce with multiple levels, useful when information density is low and you’re mining for patterns rather than specific facts.
The “Lost in the Middle” Mitigation:
Regardless of strategy, be aware that LLMs tend to under-use information in the middle of long contexts. Practical mitigations:
- Keep context under 5,000 tokens when possible (K=3-5 with 512-token chunks)
- Place highest-relevance chunks at positions 1 and K (beginning and end)
- Use reranker scores to aggressively filter — 5 highly relevant chunks beat 15 moderately relevant ones
Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
Citation and Source Attribution
For enterprise RAG, every claim must trace back to a source. This is not optional in regulated industries — it’s a compliance requirement. Good citations also build user trust: users can verify answers and learn to calibrate their confidence in the system.
Implementation approaches:
Approach 1: Inline citation with numbered sources
Number chunks in the prompt, instruct the model to cite by number, then post-process to replace numbers with document links.
# Post-processing: replace citation numbers with actual links
import re
def resolve_citations(answer: str, chunks: list[dict]) -> str:
"""Replace [Source N] with actual document links."""
def replace_citation(match):
idx = int(match.group(1)) - 1
if idx < len(chunks):
title = chunks[idx]['metadata'].get('title', 'Document')
url = chunks[idx]['metadata'].get('url', '#')
return f"[{title}]({url})"
return match.group(0)
return re.sub(r'\[Source (\d+)\]', replace_citation, answer)
Approach 2: Bedrock Knowledge Bases native citations
The RetrieveAndGenerate API returns citations in the response, each containing the specific text segments (retrievedReferences) that informed part of the answer. This gives you span-level attribution without custom post-processing.
response = bedrock_agent_runtime.retrieve_and_generate(
input={'text': query},
retrieveAndGenerateConfiguration={...}
)
# Each citation maps a part of the answer to its source
for citation in response['citations']:
answer_span = citation['generatedResponsePart']['textResponsePart']['text']
for ref in citation['retrievedReferences']:
source_text = ref['content']['text']
source_uri = ref['location']['s3Location']['uri']
print(f"Claim: '{answer_span[:50]}...' → Source: {source_uri}")
Approach 3: Post-hoc verification
After generating the answer, run a separate verification step that checks each sentence against the retrieved chunks using NLI (natural language inference). Flag sentences that aren’t entailed by any chunk. This is more robust but adds latency and cost.
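A minimal sketch of that verification step, using an LLM as the entailment judge (a dedicated NLI model works the same way and is cheaper at scale); generate_fn is a placeholder for your model call, and the sentence splitting is deliberately naive:
def verify_answer_grounding(answer: str, chunks: list[str], generate_fn) -> list[dict]:
    """Check each sentence of the answer against the retrieved context and flag unsupported claims."""
    context = "\n\n".join(chunks)
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    results = []
    for sentence in sentences:
        verdict = generate_fn(
            f"Context:\n{context}\n\nClaim: {sentence}\n"
            "Is the claim fully supported by the context? Answer only 'supported' or 'unsupported'."
        ).strip().lower()
        results.append({"sentence": sentence, "supported": verdict.startswith("supported")})
    return results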
Guardrails and Hallucination Prevention
Even with perfect retrieval, models can hallucinate — generating plausible-sounding claims not supported by the context. Production RAG systems need multiple layers of defense.
Amazon Bedrock Guardrails provides configurable safeguards that can be applied to any Bedrock model call:
- Content filters: Block harmful, violent, sexually explicit, or inappropriate content with configurable strength levels.
- Denied topics: Define specific topics the model should refuse to discuss (e.g., “competitor pricing,” “legal advice”).
- Word filters: Block specific terms, profanity, or sensitive internal terminology from appearing in responses.
- PII detection and redaction: Automatically detect and mask personally identifiable information (names, addresses, SSNs, credit card numbers) in both input and output.
- Contextual grounding checks: Compare the generated response against the retrieved source documents and flag any claims not supported by the context. This is the most RAG-specific guardrail — it directly addresses hallucination.
- Automated Reasoning checks: Validate response logic using formal verification. Particularly useful for numerical claims and multi-step reasoning.
# Apply guardrails to a RAG generation call
response = bedrock_runtime.invoke_model(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
guardrailIdentifier='my-rag-guardrail',
guardrailVersion='DRAFT',
body=json.dumps({
'anthropic_version': 'bedrock-2023-05-31',
'messages': [{'role': 'user', 'content': rag_prompt}],
'max_tokens': 2048
})
)
# Check if guardrails intervened
if response.get('amazon-bedrock-guardrailAction') == 'INTERVENED':
# Response was modified or blocked by guardrails
handle_guardrail_intervention(response)
The contextual grounding check deserves special emphasis. It works by comparing each sentence in the generated response against the retrieved chunks using an entailment model. Sentences that aren’t supported by any chunk are flagged with a grounding score below the configured threshold. You can set the threshold based on your risk tolerance — stricter for medical/legal, more lenient for general knowledge bases.
Additional hallucination mitigation strategies:
- Temperature = 0 for factual RAG. Higher temperatures increase creativity but also hallucination risk.
- Instruct the model to quote directly when making specific claims. Quoted text is verifiable.
- Implement a “confidence signal” — ask the model to assess its confidence using natural language categories with clear definitions: “Rate your confidence as HIGH (answer is directly and fully supported by the provided context), MEDIUM (answer is partially supported or requires minor inference), or LOW (context is insufficient or tangentially related).” Avoid bare numeric scales (1-5) without descriptions — models don’t have an inherent understanding of what each number means, leading to inconsistent and poorly calibrated ratings. Low-confidence answers can trigger human review or a fallback retrieval pass.
Source: AWS, “Amazon Bedrock Guardrails,” https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html
Source: AWS, “Contextual grounding check in Guardrails,” https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-contextual-grounding-check.html
Streaming Responses
For user-facing applications, streaming the response token-by-token dramatically improves perceived latency. The user sees the answer forming in real-time rather than waiting 5-10 seconds for the complete response.
With Bedrock:
response = bedrock_runtime.invoke_model_with_response_stream(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
body=json.dumps({
'anthropic_version': 'bedrock-2023-05-31',
'messages': [{'role': 'user', 'content': rag_prompt}],
'max_tokens': 2048
})
)
for event in response['body']:
chunk = json.loads(event['chunk']['bytes'])
if chunk['type'] == 'content_block_delta':
print(chunk['delta']['text'], end='', flush=True)
Challenge with streaming + citations: When streaming, you don’t have the complete answer to post-process citations. Solutions: (1) process citations after stream completes, (2) instruct the model to use a citation format that’s self-contained (e.g., include document titles inline rather than numbered references), or (3) use Bedrock’s RetrieveAndGenerateStream API which handles this natively.
Source: AWS, “Invoke model with streaming,” https://docs.aws.amazon.com/bedrock/latest/userguide/inference-invoke.html
8. Advanced RAG Patterns
Standard RAG follows a fixed pipeline: retrieve → generate. Advanced patterns break this rigidity, introducing decision-making, iteration, and multi-source reasoning. These patterns are where RAG moves from “search + summarize” to genuine knowledge-intensive reasoning.
Agentic RAG
In traditional RAG, retrieval is deterministic — every query triggers the same retrieve→generate pipeline. Agentic RAG gives an AI agent autonomy over the retrieval process. The agent decides:
- Whether to retrieve at all. Some questions don’t need external knowledge (“What is 2+2?”). Unnecessary retrieval adds latency and noise.
- What query to use for retrieval. The agent can reformulate the user’s question, decompose it into sub-queries, or retrieve from different knowledge bases depending on the topic.
- When to retrieve again. After reviewing initial results, the agent may decide the information is insufficient and issue follow-up retrieval with refined queries.
- How to combine retrieval with other tools. The agent can interleave retrieval with calculations, API calls, database queries, or code execution.
Implementation approaches: Agentic RAG can be built with open-source agent frameworks (LangGraph, CrewAI, Strands Agents SDK, LlamaIndex Agents), custom orchestration using Step Functions or Lambda, or managed services. The core pattern is the same: the LLM’s reasoning loop decides when and how to invoke retrieval as a tool.
# Conceptual agentic RAG with tool-use pattern
tools = [
{
"name": "search_knowledge_base",
"description": "Search the financial reports knowledge base",
"parameters": {"query": "string"}
},
{
"name": "calculate",
"description": "Perform numerical calculations",
"parameters": {"expression": "string"}
}
]
# The agent's reasoning loop:
# 1. Analyze user query
# 2. Decide which tools to invoke (and in what order)
# 3. Execute tools, observe results
# 4. Decide if more information is needed
# 5. Synthesize final answer
The key advantage of agentic RAG is adaptability. A fixed pipeline applies the same processing to every query regardless of complexity. An agent can apply a simple single-pass retrieval for factoid questions, multi-hop retrieval for complex analytical queries, and skip retrieval entirely for general knowledge questions — all within the same system.
On AWS: Bedrock Knowledge Bases can serve as the retrieval tool in any agent framework — you call the Retrieve API from your agent’s tool function. For fully managed orchestration, Bedrock Agents is an option, though many production teams prefer explicit frameworks (LangGraph, Strands) for better control over the reasoning loop.
Source: AWS, “Strands Agents SDK,” https://github.com/strands-agents/sdk-python
Source: LangGraph, “Building Agentic RAG,” https://python.langchain.com/docs/tutorials/rag/
Multi-Hop RAG
Some questions cannot be answered with a single retrieval pass because the information needed spans multiple documents that are only connected through intermediate reasoning.
Example: “What’s the total monthly cost of running the OpenSearch cluster configuration recommended for a 10M-document RAG workload?”
No single document contains this answer. You need:
- Hop 1: Retrieve documents about recommended OpenSearch configurations for RAG at 10M-document scale → find: “r6g.2xlarge, 3 data nodes, 2 replicas”
- Hop 2: Retrieve pricing for r6g.2xlarge instances in the user’s region → find: “$0.718/hr per instance”
- Synthesize: 3 nodes × $0.718/hr × 730 hours/month = $1,572.42/month
Implementation patterns:
IRCoT (Interleaved Retrieval Chain-of-Thought): The model alternates between reasoning and retrieval. After each reasoning step, it generates a follow-up query based on what it’s learned so far. This maps naturally to agent frameworks with tool-use loops — the agent reasons, retrieves, reasons again, retrieves again, until it has sufficient information.
Query decomposition + parallel retrieval: Decompose the original question into independent sub-questions (Section 5), retrieve for each in parallel, then synthesize. This is faster than sequential multi-hop but works only when sub-questions are independent.
Recursive retrieval: Retrieve initial chunks, extract entities or follow-up questions from them, retrieve again, and repeat up to a maximum depth. Add a stopping condition based on the model’s assessment of whether it has sufficient information.
def multi_hop_retrieve(query: str, kb_id: str, max_hops: int = 3) -> list:
    """Iterative multi-hop retrieval with LLM-guided query generation."""
    all_chunks = []
    current_query = query
    for hop in range(max_hops):
        # Retrieve for current query
        chunks = retrieve_from_kb(current_query, kb_id, top_k=5)
        all_chunks.extend(chunks)
        # Ask LLM: do we have enough info, or need another hop?
        assessment = llm_assess(query, all_chunks)
        if assessment['sufficient']:
            break
        # Generate follow-up query based on what we've learned
        current_query = assessment['follow_up_query']
    return all_chunks
When to use multi-hop: When your evaluation shows that single-pass retrieval fails on questions requiring cross-document reasoning (typically 15-30% of queries in complex domains like finance, healthcare, and technical documentation).
Source: Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models,” 2023, https://arxiv.org/abs/2210.03350
Source: Trivedi et al., “Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions,” 2023, https://arxiv.org/abs/2212.10509
Graph RAG
When your knowledge has rich entity relationships — organizational hierarchies, product dependencies, regulatory cross-references, citation networks — standard vector retrieval misses structural information. Graph RAG combines knowledge graphs with vector retrieval to capture both semantic similarity and relational structure.
Microsoft’s Graph RAG approach (2024):
- Entity and relationship extraction: During ingestion, an LLM extracts entities (people, products, concepts) and their relationships from each document.
- Graph construction: Build a knowledge graph from extracted entities and relationships using a graph database (Amazon Neptune).
- Community detection: Apply graph algorithms (e.g., Leiden algorithm) to identify clusters of closely related entities.
- Community summarization: Generate natural language summaries for each community, capturing the key themes and relationships.
- Query-time retrieval: For a given query, retrieve both (a) vector-similar chunks and (b) relevant community summaries from the graph. Feed both to the LLM.
Why this matters: Standard RAG answers “What does document X say about topic Y?” well. Graph RAG also answers “How is entity A related to entity B?”, “What are the major themes across 10,000 documents?”, and “What are the downstream impacts of changing policy X?” — questions that require understanding relationships, not just content.
On AWS: Amazon Neptune Analytics supports vector similarity search alongside graph traversal (Gremlin and openCypher). You can store entity embeddings as node properties and combine graph patterns with vector similarity in a single query.
// Illustrative openCypher — find entities related to 'Amazon S3' within 2 hops
// that are also semantically similar to the query.
// Note: gds.similarity.cosine is Neo4j GDS syntax shown for readability;
// Neptune Analytics exposes its own vector similarity functions, so adapt the call.
MATCH path = (s:Service {name: 'Amazon S3'})-[*1..2]-(related)
WHERE related.embedding IS NOT NULL
WITH related, gds.similarity.cosine(related.embedding, $query_embedding) AS sim
WHERE sim > 0.7
RETURN related.name, related.type, sim
ORDER BY sim DESC LIMIT 10
When to invest in Graph RAG: Graph RAG adds significant ingestion complexity (entity extraction, graph maintenance, community detection). It’s worth it when:
- Users frequently ask relationship questions (“Who approved this policy?”, “What services depend on this component?”)
- Your documents form a natural graph (legal documents with cross-references, codebases with dependencies, organizational policies)
- You need thematic summarization across large corpora (“What are the top concerns across 5,000 customer support tickets?”)
For pure factoid QA over a relatively flat document corpus, standard RAG with hybrid search and reranking is sufficient and much simpler.
Source: Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization,” Microsoft Research, 2024, https://arxiv.org/abs/2404.16130
Self-RAG
Self-RAG introduces self-reflection into the generation process. Rather than blindly generating from retrieved context, the model actively evaluates its own behavior at each step using special reflection tokens:
- [Retrieve]: “Do I need to retrieve information for this query?” → Yes/No
- [IsRel]: “Is this retrieved passage relevant to the query?” → Relevant/Irrelevant
- [IsSup]: “Is my generated response supported by this passage?” → Fully Supported / Partially Supported / Not Supported
- [IsUse]: “Is this response useful to the user?” → Useful rating (1-5)
The self-correction loop:
- Model receives query, decides whether retrieval is needed.
- If yes, retrieves passages and evaluates each for relevance (filters irrelevant ones).
- Generates a response segment and checks if it’s supported by the passages.
- If not supported, regenerates with a different approach or retrieves additional passages.
- Evaluates overall usefulness and refines if needed.
This addresses a fundamental RAG failure mode: the system retrieves irrelevant documents, the generator treats them as authoritative, and the output is confidently wrong. Self-RAG catches this by evaluating relevance before generation and verifying support after generation.
Practical implementation: True Self-RAG requires a model fine-tuned with reflection tokens (the original paper fine-tuned Llama 2). A practical approximation for production: implement the reflection logic as explicit prompting steps in a multi-step pipeline — retrieve, check relevance (with a lightweight classifier or LLM call), generate, verify grounding (using Bedrock Guardrails’ contextual grounding check).
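A minimal sketch of that approximation is shown below. The helper functions (retrieve_from_kb, llm_judge_needs_retrieval, llm_judge_relevance, generate_answer, grounding_check) are assumptions standing in for your retriever, a lightweight judge model, and the grounding check.
# Prompt-based approximation of Self-RAG's reflection steps (helpers are assumed)
def self_reflective_rag(query: str, kb_id: str, max_attempts: int = 2) -> str:
    # [Retrieve]: skip retrieval for queries the model can answer directly
    if not llm_judge_needs_retrieval(query):
        return generate_answer(query, context=[])
    # [IsRel]: keep only passages a lightweight judge marks relevant
    chunks = retrieve_from_kb(query, kb_id, top_k=10)
    relevant = [c for c in chunks if llm_judge_relevance(query, c)]
    for _ in range(max_attempts):
        answer = generate_answer(query, context=relevant)
        # [IsSup]: verify the draft is grounded in the kept passages
        if grounding_check(answer, relevant):
            return answer
        # Not supported — pull in more context and try again
        relevant += retrieve_from_kb(query, kb_id, top_k=5)
    return "I could not find a well-supported answer in the knowledge base."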
Source: Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection,” 2023, https://arxiv.org/abs/2310.11511
Corrective RAG (CRAG)
CRAG adds a quality evaluator between retrieval and generation. After retrieving documents, an evaluator model scores their relevance and routes the pipeline accordingly:
- Correct (high confidence): Retrieved documents are clearly relevant → proceed directly to generation.
- Ambiguous (medium confidence): Relevance is uncertain → supplement retrieved documents with web search results to provide additional context.
- Incorrect (low confidence): Retrieved documents are irrelevant to the query → discard them entirely and fall back to web search.
Why this matters: Standard RAG has no mechanism to detect retrieval failure. If the knowledge base doesn’t contain the answer, the system still forces generation from the top-K results — which may be tangentially related at best. CRAG prevents the common failure mode of “confidently wrong answers from irrelevant context.”
Implementation:
def corrective_rag(query: str, kb_id: str) -> str:
    """CRAG: evaluate retrieval quality before generation."""
    # Step 1: Initial retrieval
    chunks = retrieve_from_kb(query, kb_id, top_k=10)
    # Step 2: Evaluate retrieval quality
    evaluation = evaluate_relevance(query, chunks)
    if evaluation['verdict'] == 'correct':
        # High confidence — use retrieved chunks directly
        context = chunks
    elif evaluation['verdict'] == 'ambiguous':
        # Medium confidence — supplement with web search
        web_results = web_search(query)
        context = chunks + web_results
    else:
        # Low confidence — fall back to web search only
        context = web_search(query)
    # Step 3: Generate from curated context
    return generate_answer(query, context)
def evaluate_relevance(query: str, chunks: list) -> dict:
    """Use an LLM to evaluate retrieval relevance."""
    # Evaluate only the top 3 chunks to keep the judge call cheap
    prompt = f"""Evaluate whether these retrieved passages are relevant
to answering the query. Rate as 'correct', 'ambiguous', or 'incorrect'.
Query: {query}
Passages: {format_chunks(chunks[:3])}
Verdict:"""
    # ... invoke a fast judge model and parse the response into
    # {'verdict': 'correct' | 'ambiguous' | 'incorrect'}
Trade-off: CRAG adds one LLM call for evaluation per query. In production, use a fast, small model (Haiku, Titan Lite) for the evaluation step to minimize latency impact.
Source: Yan et al., “Corrective Retrieval Augmented Generation,” 2024, https://arxiv.org/abs/2401.15884
Adaptive RAG
Different queries have fundamentally different complexity levels, and applying the same pipeline to all of them wastes resources on simple queries while under-serving complex ones. Adaptive RAG classifies queries by complexity and routes them to appropriate processing pipelines:
- Simple queries (factoid, single-fact): Single-pass retrieval, no decomposition, K=3. “What is the maximum object size in S3?”
- Moderate queries (comparison, multi-aspect): Query rewriting + hybrid retrieval + reranking, K=5-10. “Compare DynamoDB and Aurora for write-heavy workloads.”
- Complex queries (multi-hop, analytical): Full decomposition + multi-hop retrieval + synthesis, K=10-20. “Design a cost-optimized architecture for a real-time recommendation system serving 10M users.”
Complexity classification can be done with:
- Rule-based heuristics: Query length, presence of comparison words (“vs”, “compare”, “difference”), question type (who/what/how/why).
- Lightweight LLM classifier: A small model (Haiku) classifies query complexity in <100ms.
- Embedding-based classifier: Train a simple classifier on query embeddings labeled by complexity.
The key insight is that routing a simple factoid question through a complex multi-hop pipeline doesn’t improve the answer — it just adds latency and cost. Conversely, routing a complex analytical question through a simple pipeline produces shallow, incomplete answers.
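As a starting point, the rule-based heuristic can be a handful of lines. The keyword lists and thresholds below are illustrative and should be tuned against your own query logs.
# Illustrative rule-based complexity classifier for query routing
def classify_query_complexity(query: str) -> str:
    q = query.lower()
    comparison_words = {'compare', 'versus', ' vs ', 'difference between', 'trade-off'}
    analytical_words = {'design', 'architect', 'optimize', 'why', 'impact of'}
    if any(w in q for w in analytical_words) or len(q.split()) > 25:
        return 'complex'     # decomposition + multi-hop retrieval, K=10-20
    if any(w in q for w in comparison_words) or ' and ' in q:
        return 'moderate'    # rewriting + hybrid retrieval + reranking, K=5-10
    return 'simple'          # single-pass retrieval, K=3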
Source: Jeong et al., “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity,” 2024, https://arxiv.org/abs/2403.14403
Multi-Modal RAG
Production knowledge bases increasingly contain not just text but images (architecture diagrams, screenshots, charts), tables (financial data, specifications), and occasionally audio/video. Multi-modal RAG extends the pipeline to handle these content types.
Approaches:
- Image-to-text at ingestion: Use a vision model (Claude 3.5 Sonnet, GPT-4V) to generate detailed text descriptions of images during ingestion. Embed and retrieve the descriptions. This is the simplest approach and works well for diagrams and charts.
- Multi-modal embeddings: Use models that embed both text and images into the same vector space (e.g., Amazon Titan Multimodal Embeddings). At query time, the text query embedding is compared against both text and image embeddings.
- Table-aware RAG: Convert tables to structured text representations during chunking (Section 3), and use table-specific retrieval strategies (exact-match on column names + semantic on content).
On AWS: Titan Multimodal Embeddings supports both text and image inputs, producing embeddings in the same 1024-dimensional space. Bedrock Knowledge Bases supports multi-modal data sources including images and PDFs with embedded images.
# Titan Multimodal Embeddings — embed an image
import base64
import json
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

with open('architecture_diagram.png', 'rb') as f:
    image_bytes = base64.b64encode(f.read()).decode('utf-8')

response = bedrock_runtime.invoke_model(
    modelId='amazon.titan-embed-image-v1',
    body=json.dumps({
        'inputImage': image_bytes,
        'embeddingConfig': {'outputEmbeddingLength': 1024}
    })
)
image_embedding = json.loads(response['body'].read())['embedding']
Practical advice: Start with image-to-text at ingestion (approach 1). It’s the simplest to implement, works with any existing text-based RAG pipeline, and provides good results for most use cases. Move to multi-modal embeddings only if you have a large volume of images and text descriptions don’t capture the visual information adequately (e.g., complex technical diagrams where spatial relationships matter).
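A sketch of that ingestion step (approach 1) using the Bedrock Converse API with an image content block — the function name, model choice, and prompt wording are illustrative and should be adapted to your pipeline:
# Image-to-text at ingestion: describe a diagram, then embed the description downstream
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

def describe_image(image_path: str) -> str:
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    response = bedrock_runtime.converse(
        modelId='anthropic.claude-3-5-sonnet-20240620-v1:0',  # any Bedrock vision model
        messages=[{
            'role': 'user',
            'content': [
                {'image': {'format': 'png', 'source': {'bytes': image_bytes}}},
                {'text': 'Describe this architecture diagram in detail, '
                         'including components and how they connect.'}
            ]
        }]
    )
    return response['output']['message']['content'][0]['text']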
Source: AWS, “Amazon Titan Multimodal Embeddings G1,” https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html
Pattern Selection Guide
Choosing the right advanced pattern depends on your specific needs:
| Pattern | Best For | Complexity | Latency Impact |
|---|---|---|---|
| Agentic RAG | Dynamic queries, tool use | Medium | +500ms-2s (agent reasoning) |
| Multi-Hop | Cross-document reasoning | Medium-High | +1-5s (multiple retrievals) |
| Graph RAG | Relationship queries, theme extraction | High | +200-500ms (graph traversal) |
| Self-RAG | High-stakes, accuracy-critical | Medium | +500ms (reflection steps) |
| CRAG | Unreliable knowledge bases, fallback needed | Low-Medium | +200ms (evaluation step) |
| Adaptive RAG | Mixed query complexity | Low | Neutral (saves on simple queries) |
| Multi-Modal | Image/table-heavy corpora | Medium | Varies |
Start simple. Most teams get 80% of the value from standard RAG with hybrid search, reranking, and good chunking. Add advanced patterns incrementally based on where your evaluation shows failures — not because they sound impressive in blog posts.
9. RAG Evaluation — The Deep Dive
This is where most RAG projects either succeed or slowly, silently fail. Building a RAG prototype that “works on my demo” takes days. Building a RAG system with measured, reproducible quality takes months — and evaluation is the difference.
9.1 Why Evaluation Is the Hardest Part
The demo trap. Your RAG system answers your 10 test questions perfectly. You deploy it. Within a week, users are complaining about wrong answers, missing information, and hallucinations. What happened? Your 10 test questions weren’t representative of real usage patterns — they were cherry-picked queries where you already knew the documents contained good answers.
The ground truth problem. Unlike classification tasks where you have clear labels, RAG evaluation often lacks ground truth. What’s the “correct” answer to “Explain our refund policy”? There might be multiple valid formulations at different levels of detail, and the policy might span several documents. Unlike a math problem, there’s no single right answer.
The multi-dimensional problem. A RAG response can simultaneously be:
- Relevant to the query but unfaithful to the sources (hallucination)
- Faithful to the sources but irrelevant to the query (retrieval miss)
- Both relevant and faithful but incomplete (missed important context)
- Complete and accurate but too slow (latency) or too expensive (cost)
- Fast and accurate for the first query but degrading over time (drift)
You need metrics across all these dimensions, and optimizing one often hurts another.
9.2 Component-Level vs. System-Level Evaluation
Before diving into metrics, understand the two evaluation paradigms:
Component-level evaluation measures each piece of the pipeline independently:
- Is the retriever finding the right documents? (Retrieval metrics)
- Is the generator producing faithful answers? (Generation metrics)
- Is the query enhancement actually improving retrieval? (Enhancement metrics)
System-level evaluation measures the end-to-end experience:
- Is the user getting the right answer? (Correctness)
- Is the experience fast enough? (Latency)
- Is it cost-effective at scale? (Cost per query)
You need both. System-level metrics tell you if something is wrong. Component-level metrics tell you where it’s wrong and how to fix it.
9.3 The Three-Layer Evaluation Framework
Layer 1: Retrieval Evaluation
Measures whether the right documents are being retrieved. These metrics require relevance judgments — for each query, you need to know which chunks should be retrieved.
Context Precision / Precision@K
Of the K retrieved chunks, what fraction is actually relevant to the query?
Precision@K = |relevant ∩ retrieved| / K
Example: You retrieve 10 chunks, 6 are relevant → Precision@10 = 0.6
High precision means little noise in your retrieved context. Low precision means the LLM is wading through irrelevant text, which increases hallucination risk and wastes tokens.
Context Recall / Recall@K
Of all relevant chunks in the entire corpus, what fraction did you successfully retrieve?
Recall@K = |relevant ∩ retrieved| / |total_relevant|
Example: There are 8 relevant chunks total, you retrieved 6 → Recall@10 = 0.75
High recall means you’re not missing important information. Low recall means the LLM might generate incomplete answers because it never saw the critical context.
NDCG (Normalized Discounted Cumulative Gain)
Are the most relevant chunks ranked highest? NDCG penalizes relevant documents that appear lower in the ranking.
DCG@K = Σ(i=1 to K) rel_i / log2(i+1)
NDCG@K = DCG@K / IDCG@K (where IDCG is the ideal DCG)
Why it matters: LLMs pay more attention to context appearing earlier in the prompt. A relevant chunk at position 1 is far more valuable than the same chunk at position 10. NDCG captures this.
MRR (Mean Reciprocal Rank)
Where does the first relevant chunk appear in the results?
RR = 1 / rank_of_first_relevant
MRR = mean(RR across all queries)
MRR of 0.5 means the first relevant chunk typically appears at position 2. MRR of 1.0 means it’s always at position 1.
Hit Rate (the simplest health check)
For what fraction of queries does at least one relevant chunk appear in the Top-K?
Hit Rate = queries_with_at_least_one_hit / total_queries
If your hit rate is below 0.8, you have a fundamental retrieval problem — everything else is secondary.
Practical recommendation: Start monitoring Hit Rate and Recall@K — these tell you if your retriever is finding the right documents at all. Then add NDCG and Precision@K to optimize ranking quality and noise reduction.
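These metrics are simple to compute once you have relevance judgments. A minimal sketch, assuming each evaluation example records the set of chunk IDs judged relevant and the ranked list of retrieved chunk IDs:
# Minimal retrieval-metric calculator over a labeled evaluation set
def retrieval_metrics(examples: list, k: int = 10) -> dict:
    """Each example: {'relevant_ids': set of chunk IDs, 'retrieved_ids': ranked list}."""
    hits, precisions, recalls, rrs = [], [], [], []
    for ex in examples:
        retrieved_k = ex['retrieved_ids'][:k]
        relevant = ex['relevant_ids']
        found = [cid for cid in retrieved_k if cid in relevant]
        hits.append(1.0 if found else 0.0)
        precisions.append(len(found) / k)
        recalls.append(len(found) / len(relevant) if relevant else 0.0)
        # Reciprocal rank of the first relevant chunk (0 if none retrieved)
        rr = 0.0
        for rank, cid in enumerate(retrieved_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    n = len(examples)
    return {
        'hit_rate': sum(hits) / n,
        f'precision@{k}': sum(precisions) / n,
        f'recall@{k}': sum(recalls) / n,
        'mrr': sum(rrs) / n,
    }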
Layer 2: Generation Evaluation
Measures the quality of the LLM’s response given the retrieved context.
Faithfulness (Groundedness)
Does the response only contain information supported by the retrieved context? This is the hallucination metric and arguably the single most important metric for enterprise RAG.
Faithfulness is typically measured by:
- Decomposing the generated response into individual claims/statements
- For each claim, checking whether it is supported by the retrieved context
- Computing the ratio of supported claims to total claims
Faithfulness = |supported_claims| / |total_claims|
Example: The response makes 8 claims, 7 are supported by retrieved context → Faithfulness = 0.875
A faithfulness score below 0.8 is a red flag. It means roughly 1 in 5 statements in the response is not grounded in the provided context — the model is either hallucinating or drawing from its training data rather than your documents.
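A sketch of that measurement loop, assuming a hypothetical ask_judge helper that wraps your judge model and returns plain text:
# Sketch: faithfulness via claim decomposition + per-claim verification (ask_judge is assumed)
def measure_faithfulness(answer: str, context: str) -> float:
    claims = ask_judge(
        f"List each factual claim in the following answer as a separate line:\n{answer}"
    ).splitlines()
    claims = [c.strip() for c in claims if c.strip()]
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        verdict = ask_judge(
            "Answer YES or NO: is this claim fully supported by the context?\n"
            f"Claim: {claim}\nContext: {context}"
        )
        if verdict.strip().upper().startswith('YES'):
            supported += 1
    return supported / len(claims)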
Answer Relevancy
Does the response actually address the user’s question? A response can be perfectly faithful (everything it says is in the context) but completely miss the point of the question.
RAGAS measures this by generating synthetic questions from the answer and computing similarity to the original question. If the answer is relevant, questions generated from it should resemble the original query.
Answer Completeness
Does the response cover all aspects of the question? Especially important for multi-part questions:
“What are the S3 storage classes, their use cases, and pricing?” — a complete answer must address all three parts.
Completeness is typically measured against a reference answer or by decomposing the question into sub-questions and checking coverage.
Hallucination Rate
The inverse of faithfulness, expressed as a trackable rate:
Hallucination Rate = |unsupported_claims| / |total_claims|
Track this over time as a trending metric. If it’s climbing, something changed — new documents with different structure, a prompt regression, or a model update.
Layer 3: End-to-End Evaluation
Correctness
Is the final answer factually correct? This requires ground truth answers for comparison and can be measured by:
- Exact match (for factoid QA — rarely appropriate)
- Semantic similarity (embedding similarity between generated and reference answers)
- LLM-as-judge (ask a judge model to score correctness on a 1-5 scale)
- Human evaluation (the gold standard, but expensive)
Latency (P50 / P95 / P99)
Track response time percentiles separately for each pipeline stage:
- Query enhancement: typically 200-500ms (1 LLM call)
- Retrieval: typically 50-200ms (vector search)
- Reranking: typically 50-150ms
- Generation: typically 1-5 seconds (main bottleneck)
- Total: typically 2-6 seconds end-to-end
Users expect < 3 seconds for simple questions. For complex questions requiring multi-hop retrieval, set expectations with streaming responses.
Cost per Query
At scale, this determines viability. Break it down:
- Embedding the query: ~$0.0001
- Vector search: ~$0.0005
- Reranking (20 candidates): ~$0.001
- LLM generation (Sonnet): ~$0.003-0.01
- Query enhancement (Haiku): ~$0.0003
- Total: ~$0.005-0.015 per query
At 10,000 queries/day, that’s $50-150/day or $1,500-4,500/month just on inference. Evaluate whether each pipeline component justifies its cost through improved quality.
User Satisfaction Signals
The ultimate metric — but lagging and noisy:
- Thumbs up/down on responses
- Query reformulations (user rephrasing = original answer was unsatisfactory)
- Escalation to human agents
- Session abandonment rate
9.4 AWS Native: Bedrock Knowledge Base Evaluation
Amazon Bedrock provides built-in RAG evaluation capabilities that significantly reduce the engineering effort required to measure quality.
Evaluation Types
Retrieve-Only Evaluation: Tests retrieval in isolation.
- Metrics: Context Relevance (are retrieved chunks relevant?), Context Coverage (do retrieved chunks cover the expected answer?)
- Use when: Debugging retrieval quality, comparing vector stores or chunking strategies
Retrieve-and-Generate Evaluation: Tests the full RAG pipeline.
- Metrics: Correctness, Completeness, Faithfulness, Helpfulness
- Use when: Measuring end-to-end quality, comparing prompts or models
Evaluation Dataset Format
The evaluation dataset is a JSONL file in S3:
{
"conversationTurns": [{
"input": {
"content": [{"text": "What are the S3 storage classes?"}]
},
"referenceResponses": [{
"content": [{"text": "S3 offers six storage classes: S3 Standard..."}]
}],
"referenceContexts": [{
"content": [{"text": "Amazon S3 storage classes include..."}]
}]
}]
}
- referenceResponses — optional ground truth answers (required for Correctness)
- referenceContexts — optional ground truth chunks (required for Context Coverage in retrieve-only mode)
Tip: You need a minimum of ~50-100 evaluation examples for statistically meaningful results. Aim for 200+ to capture the diversity of real-world queries.
Setting Up an Evaluation Job
import boto3
from datetime import datetime
bedrock = boto3.client('bedrock', region_name='us-east-1')
response = bedrock.create_evaluation_job(
jobName=f'rag-eval-{datetime.now():%Y-%m-%d-%H-%M}',
roleArn='arn:aws:iam::123456789012:role/BedrockEvalRole',  # replace with your account's role ARN
evaluationConfig={
'automated': {
'datasetMetricConfigs': [{
'taskType': 'RetrieveAndGenerate',
'dataset': {'s3Uri': 's3://my-bucket/eval-dataset.jsonl'},
'metricNames': [
'Builtin.Correctness',
'Builtin.Completeness',
'Builtin.Faithfulness',
'Builtin.Helpfulness'
]
}]
}
},
inferenceConfig={
'ragConfigs': [{
'knowledgeBaseConfig': {
'knowledgeBaseId': 'YOUR_KB_ID',
'modelIdentifier': 'anthropic.claude-sonnet-4-6-v1'
}
}]
},
outputDataConfig={
's3Uri': 's3://my-bucket/eval-results/'
}
)
Interpreting Results — Common Failure Patterns
| Pattern | Diagnosis | Fix |
|---|---|---|
| Low faithfulness, high correctness | Model hallucinating but getting lucky — dangerous | Strengthen grounding prompt, add Guardrails contextual grounding check |
| High faithfulness, low completeness | Retrieval missing relevant documents | Review chunking strategy, increase Top-K, check for missing documents |
| Low correctness across the board | Fundamental retrieval failure | Audit ingestion pipeline, check embedding quality, verify data is indexed |
| High variability across queries | Some query types work well, others don’t | Stratify evaluation by query type, add query routing |
| Good metrics, bad user feedback | Evaluation dataset doesn’t reflect real usage | Collect real user queries, rebuild evaluation dataset |
Source: AWS, “Evaluate the performance of RAG sources using Amazon Bedrock,” https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-kb.html
Source: AWS ML Blog, “Evaluating RAG applications with Amazon Bedrock knowledge base evaluation,” https://aws.amazon.com/blogs/machine-learning/evaluating-rag-applications-with-amazon-bedrock-knowledge-base-evaluation/
9.5 Open-Source Evaluation Frameworks — Deep Comparison
RAGAS (RAG Assessment)
The most widely adopted open-source framework. Uses LLM-based metrics that don’t require ground truth for all metrics.
Core metrics and how they work:
- Faithfulness: Decomposes the answer into claims using an LLM, then checks each claim against the context. Cost: 2 LLM calls per evaluation.
- Answer Relevancy: Generates N questions from the answer, computes mean cosine similarity to original query. Cost: 1 LLM call + N embedding calls.
- Context Precision: Checks if ground-truth-relevant chunks are ranked higher in the retrieved set. Requires ground truth.
- Context Recall: Decomposes the ground truth answer into claims, checks if each can be attributed to the retrieved context. Requires ground truth.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
eval_dataset = Dataset.from_dict({
"question": questions,
"answer": generated_answers,
"contexts": retrieved_contexts,
"ground_truth": reference_answers
})
results = evaluate(dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results) # {'faithfulness': 0.87, 'answer_relevancy': 0.91, ...}
Strengths: Easy to start, large community, well-documented. Weaknesses: Each evaluation example costs 3-5 LLM calls (expensive at scale), metric scores can be noisy for individual examples, best used as aggregate scores over 100+ examples.
Source: RAGAS Documentation, https://docs.ragas.io/
DeepEval
Production-oriented framework with native CI/CD integration.
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
input="What is the refund policy?",
actual_output=rag_response,
retrieval_context=retrieved_chunks
)
faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [faithfulness, relevancy])
Killer feature: Pytest integration. Run deepeval test run test_rag.py in your CI/CD pipeline and fail the build if metrics drop below thresholds.
Source: DeepEval Documentation, https://docs.confident-ai.com/
TruLens
Developed by TruEra, focused on production feedback loops.
Differentiator: The “feedback function” abstraction — define custom evaluation functions that can use any combination of LLM judges, heuristics, and ground truth. Particularly strong for tracking metrics in production with its logging and dashboard capabilities.
Source: TruLens Documentation, https://www.trulens.org/
Phoenix / Arize
Observability-first evaluation platform.
Differentiator: Traces the entire RAG pipeline, letting you inspect individual queries end-to-end: what was retrieved, what was generated, where things went wrong. Best for debugging production issues rather than batch evaluation.
Source: Arize Phoenix, https://docs.arize.com/phoenix/
Giskard
Security and compliance-focused.
Differentiator: Automated adversarial testing — generates prompt injection attempts, tests for data leakage, checks OWASP LLM Top 10 vulnerabilities. Essential for regulated industries.
Source: Giskard Documentation, https://docs.giskard.ai/
Comprehensive Comparison
| Feature | RAGAS | DeepEval | TruLens | Phoenix/Arize | Braintrust | Giskard |
|---|---|---|---|---|---|---|
| LLM-based metrics | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| No ground truth needed | Partial | Partial | Partial | ✅ | Partial | ✅ |
| CI/CD integration | Community | Built-in | API | API | API | Built-in |
| Production monitoring | ❌ | Dashboard | ✅ | ✅ (core) | ✅ | ❌ |
| Human eval workflows | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Adversarial testing | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ (core) |
| Tracing / debugging | ❌ | ❌ | ✅ | ✅ (core) | ✅ | ❌ |
| Open source | ✅ | ✅ | ✅ | ✅ | Partial | ✅ |
| Best for | Quick eval, research | CI/CD pipelines | Prod feedback | Prod debugging | Team collab | Compliance |
9.6 Building Evaluation Datasets
Your evaluation dataset is the foundation of all measurement. A bad evaluation dataset is worse than no evaluation — it gives false confidence.
Synthetic Data Generation Pipeline
generation_prompt = """Given this document passage, generate 3 diverse
questions that can be answered using ONLY this passage.
For each question, provide:
1. The question
2. The expected answer (derived only from the passage)
3. Question type: FACTOID / HOW-TO / COMPARISON / REASONING / UNANSWERABLE
Vary the difficulty: 1 easy, 1 medium, 1 hard.
Passage: {chunk_text}
"""
Critical: Include unanswerable questions. Generate questions that look like they could be answered by your knowledge base but actually can’t. This tests the system’s ability to say “I don’t know” rather than hallucinating. Aim for 15-20% unanswerable questions in your dataset.
Recommended Methodology (Hybrid)
- Auto-generate 500+ QA pairs from your documents using an LLM
- Human experts review and filter to ~200 high-quality, diverse pairs
- Add real queries from production logs (if available) — 50-100 actual user questions
- Add adversarial examples — 20-30 tricky edge cases (ambiguous, multi-document, out-of-scope)
- Stratify by query type, difficulty, and document source
- Final dataset: 250-350 validated examples with ground truth answers and relevant chunks
Sample Size and Statistical Significance
| Dataset Size | What You Can Detect | Confidence |
|---|---|---|
| 50 examples | Major regressions (>15% drop) | Low |
| 100 examples | Moderate changes (~10% drop) | Medium |
| 200 examples | Small changes (~5% drop) | High (95%) |
| 500+ examples | Subtle changes (~2-3% drop) | Very high |
Rule of thumb: 200 examples is the minimum for production evaluation. Below that, you’re making decisions on noise.
9.7 Continuous Evaluation Pipeline
CI/CD Integration
Every change to your RAG system should trigger evaluation:
# GitHub Actions example
on:
  push:
    paths:
      - 'rag_config/**'
      - 'prompts/**'
      - 'chunking/**'
jobs:
  rag-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAG evaluation
        run: python evaluate_rag.py --config eval_config.yaml
      - name: Check regression
        run: |
          python check_regression.py \
            --baseline metrics/baseline.json \
            --current metrics/current.json \
            --threshold 0.05
A/B Testing RAG Configurations
When testing a new chunking strategy or prompt:
- Run evaluation suite on current configuration → baseline metrics
- Run evaluation suite on candidate configuration → candidate metrics
- Compute per-metric deltas
- Apply significance test (paired t-test or bootstrap confidence interval) — see the sketch after this list
- Deploy only if: no metric regresses >2% AND target metrics improve >3%
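A sketch of the significance check in step 4, using a paired bootstrap over per-example metric scores — the function is illustrative and would back a script like the check_regression.py step shown earlier:
# Paired bootstrap test: is the candidate's mean metric reliably different from baseline?
import random

def paired_bootstrap_delta(baseline: list, candidate: list,
                           n_resamples: int = 10000, seed: int = 42) -> dict:
    """baseline/candidate: per-example scores for the SAME evaluation queries."""
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    resampled_means = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]
        resampled_means.append(sum(sample) / len(sample))
    resampled_means.sort()
    lo = resampled_means[int(0.025 * n_resamples)]
    hi = resampled_means[int(0.975 * n_resamples)]
    return {'mean_delta': sum(deltas) / len(deltas), 'ci95': (lo, hi),
            'significant': lo > 0 or hi < 0}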
Production Monitoring Dashboard
Track these on a real-time dashboard (CloudWatch, Grafana, or Datadog):
- Faithfulness (sampled 5-10% of queries): LLM-as-judge scores
- Retrieval hit rate (100% of queries): did retrieval return results?
- Latency P50/P95/P99 (100% of queries): per-component breakdown
- Cost per query (100% of queries): token usage × pricing
- User negative feedback rate (100% of feedback): thumbs down / total rated
- Query reformulation rate (100% of sessions): consecutive queries on same topic = signal of failure
Alert thresholds (recommended starting points):
- Faithfulness < 0.7 → 🔴 Critical
- Hit rate < 0.8 → 🔴 Critical
- P95 latency > 8 seconds → 🟡 Warning
- Negative feedback rate > 20% → 🟡 Warning
- Cost per query > 2× baseline → 🟡 Warning
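Feeding the dashboard and the alert thresholds above takes only a few lines per query with CloudWatch custom metrics; the namespace and metric names in this sketch are illustrative.
# Emit per-query RAG metrics to CloudWatch for dashboards and alarms
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def publish_rag_metrics(latency_ms, cost_usd, faithfulness=None):
    metric_data = [
        {'MetricName': 'TotalLatency', 'Value': latency_ms, 'Unit': 'Milliseconds'},
        {'MetricName': 'CostPerQuery', 'Value': cost_usd, 'Unit': 'None'},
    ]
    if faithfulness is not None:  # only available for the sampled queries
        metric_data.append({'MetricName': 'Faithfulness', 'Value': faithfulness, 'Unit': 'None'})
    cloudwatch.put_metric_data(Namespace='RAG/Production', MetricData=metric_data)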
Drift Detection
Your documents change. Your users change. Your RAG system drifts. Schedule monthly evaluation runs against a stable, frozen evaluation dataset to catch drift:
- If metrics drop >5% month-over-month, investigate
- Common causes: new document types that chunking handles poorly, shifted user query patterns, vector store index degradation
9.8 The Cost of Evaluation (Often Overlooked)
Evaluation itself costs money — primarily in LLM calls for LLM-as-judge metrics.
Per-example evaluation cost (approximate):
- RAGAS (4 metrics): ~4-5 LLM calls per example → ~$0.02-0.05 per example
- Bedrock KB Evaluation: ~$0.01-0.03 per example (judge model calls)
- Human evaluation: ~$0.50-2.00 per example (annotator time)
For a 200-example evaluation suite:
- Automated (RAGAS/Bedrock): $4-10 per run
- Human evaluation: $100-400 per run
- Running daily automated + weekly human: roughly $500-2,000/month
This is cheap insurance. A single hallucinated answer in a customer-facing system can cost far more in trust, reputation, and remediation.
9.9 Common Anti-Patterns
“We tested it on 10 questions and it works great.”
→ 10 questions is anecdotal. You need 200+ for statistical validity.
“Our faithfulness is 0.95 so we’re good.”
→ On what dataset? If your eval set is too easy (simple factoid questions), 0.95 means nothing. Add adversarial and complex multi-hop queries.
“We use GPT-4 to judge GPT-4’s outputs.”
→ Same-model evaluation has known biases (verbosity preference, self-consistency). Use a different model as judge, or better, calibrate against human judgments.
“We evaluated once before launch.”
→ RAG quality degrades over time as documents change, usage patterns shift, and models update. Evaluation must be continuous.
“We optimized for faithfulness and declared victory.”
→ Faithfulness and completeness are in tension. A system that only answers when 100% certain will refuse many valid questions. Define your acceptable trade-off explicitly — for a medical FAQ, you might accept lower completeness for higher faithfulness; for a product recommendation bot, the opposite.
Source: Confident AI, “RAG Evaluation Metrics,” https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more
Source: RAGAS, “Metrics Documentation,” https://docs.ragas.io/en/latest/concepts/metrics/
Source: Braintrust, “RAG Evaluation Tools,” https://www.braintrust.dev/articles/best-rag-evaluation-tools
10. Production Checklist
From POC to Production
| Phase | Key Actions |
|---|---|
| POC | Default chunking, single embedding model, 10-20 test queries, no monitoring |
| Pilot | Optimized chunking for your document types, evaluation dataset (100+ queries), basic monitoring |
| Production | Hybrid search + reranking, query understanding layer, CI/CD evaluation pipeline, comprehensive monitoring, guardrails |
| Mature | Agentic RAG, adaptive retrieval, continuous evaluation, A/B testing framework, cost optimization |
Decision Table
| Scenario | Chunking | Retrieval | Enhancement | Evaluation |
|---|---|---|---|---|
| Simple FAQ bot | Fixed 512 | Dense only | None needed | Hit Rate + Faithfulness |
| Technical documentation | Hierarchical | Hybrid + Rerank | Query rewriting | Full 3-layer |
| Multi-domain enterprise | Structure-aware | Hybrid + Rerank + Route | Decomposition + Routing | Full 3-layer + A/B |
| Compliance/legal | Semantic | Hybrid + Rerank | Step-back + HyDE | Full + Adversarial (Giskard) |
| Conversational assistant | Hierarchical | Hybrid + Rerank | Context condensation | Full + User signals |
Cost Optimization
RAG system costs come from four main areas: embedding (ingestion + queries), vector storage and search, LLM generation, and optional components like reranking. Here are the highest-impact optimization strategies:
1. Tiered model routing. Not every step needs the most expensive model. Use a small, fast model (Haiku, Titan Lite) for query classification, routing, and rewriting. Reserve the larger model (Sonnet, Opus) for final answer generation. This alone can cut LLM costs by 40-60% without meaningful quality loss.
2. Semantic caching. Many RAG systems see significant query repetition — users ask the same or very similar questions. Cache the (query embedding → answer) mapping. For a new query, compute its embedding and check cosine similarity against cached query embeddings. If similarity exceeds a threshold (e.g., 0.95), return the cached answer without invoking retrieval or generation. Implement with ElastiCache or a simple in-memory store for low-volume applications (see the sketch after this list).
3. Embedding dimension reduction. If your embedding model supports Matryoshka representations (Titan V2, text-embedding-3-large), reduce dimensions from 1024 to 256 or 512. This cuts vector storage by 2-4× and speeds up search by ~3×, with only a 2-5% recall drop. Test on your eval set to confirm acceptable quality.
4. Right-size your vector store. OpenSearch Serverless has a minimum of 2 OCUs (~$350/month) regardless of usage. For low-volume applications (<100 queries/day), consider Aurora PostgreSQL with pgvector or even a FAISS index on a small EC2 instance — dramatically cheaper at small scale.
5. Cache frequent query embeddings. If the same query text appears repeatedly, skip the embedding API call and serve from cache. A simple LRU cache with TTL covers the common case.
6. Chunk-level deduplication. During ingestion, deduplicate near-identical chunks across documents. This reduces index size and prevents the retriever from wasting top-K slots on duplicate content.
7. Batch ingestion. Bedrock and most embedding APIs offer batch pricing or reduced per-token costs for batch operations. Schedule ingestion during off-peak hours and batch chunks rather than embedding them one at a time.
8. Monitor and alert on cost anomalies. Set CloudWatch billing alarms on Bedrock model invocation costs. A misconfigured pipeline (e.g., infinite retrieval loops in agentic RAG) can generate surprising bills quickly.
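To illustrate the semantic caching pattern from item 2, here is a minimal in-memory sketch. The 0.95 threshold mirrors the description above; a production system would back this with ElastiCache/Redis and an eviction policy.
# Minimal in-memory semantic cache: reuse answers for near-duplicate queries
import math
from typing import Optional

semantic_cache: list = []  # entries: {'embedding': [...], 'answer': '...'}

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cached_answer(query_embedding, threshold: float = 0.95) -> Optional[str]:
    for entry in semantic_cache:
        if cosine(query_embedding, entry['embedding']) >= threshold:
            return entry['answer']  # cache hit — skip retrieval and generation
    return None

def cache_answer(query_embedding, answer: str) -> None:
    semantic_cache.append({'embedding': query_embedding, 'answer': answer})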
Observability Setup
- CloudWatch: Custom metrics for retrieval latency, generation latency, chunk hit rate
- X-Ray: End-to-end tracing through the RAG pipeline (query → retrieval → generation)
- CloudWatch Logs: Log retrieved chunk IDs and relevance scores for debugging
- Dashboard: Combine latency, cost, and quality metrics in a single view
Conclusion
RAG is not a single technique — it’s an architecture with numerous decision points, each offering meaningful trade-offs. The difference between a demo and a production system lies in the details: how you chunk your documents, whether you enhance queries before retrieval, how you combine sparse and dense search, and — most critically — how rigorously you evaluate.
Start simple. Measure everything. Iterate based on data, not intuition. And remember: the best RAG system is one that knows when it doesn’t know.
📝 Note: The views, opinions, and technical recommendations expressed in this article are my own and do not represent the official position of any organization. All architecture patterns and code examples are for educational purposes — always validate against your specific requirements and the latest AWS documentation.
References
- Lewis, P. et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” https://arxiv.org/abs/2005.11401
- Gao, L. et al. (2023). “Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).” https://arxiv.org/abs/2212.10496
- Asai, A. et al. (2023). “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” https://arxiv.org/abs/2310.11511
- Yan, S. et al. (2024). “Corrective Retrieval Augmented Generation.” https://arxiv.org/abs/2401.15884
- Jeong, S. et al. (2024). “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” https://arxiv.org/abs/2403.14403
- Microsoft Research (2024). “GraphRAG: Unlocking LLM Discovery on Narrative Private Data.” https://arxiv.org/abs/2404.16130
- Press, O. et al. (2023). “Measuring and Narrowing the Compositionality Gap in Language Models.” https://arxiv.org/abs/2210.03350
- Zheng, Z. et al. (2023). “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.” https://arxiv.org/abs/2310.06117
- AWS Documentation. “Amazon Bedrock Knowledge Bases.” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
- AWS Documentation. “Evaluate RAG sources using Amazon Bedrock.” https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-kb.html
- AWS Machine Learning Blog. “Evaluating RAG applications with Amazon Bedrock knowledge base evaluation.” https://aws.amazon.com/blogs/machine-learning/evaluating-rag-applications-with-amazon-bedrock-knowledge-base-evaluation/
- RAGAS Documentation. https://docs.ragas.io/
- DeepEval Documentation. https://docs.confident-ai.com/
- Hugging Face MTEB Leaderboard. https://huggingface.co/spaces/mteb/leaderboard
- LlamaIndex. “Auto-Merging Retriever.” https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/