The Complete Guide to RAG on AWS — Architecture, Deep Dives & Evaluation
⚠️ Disclaimer: The code examples, architecture patterns, and configurations in this article are illustrative and intended for educational purposes only. Always review, test, and adapt them to your specific use case, security requirements, and AWS account configuration before deploying to production.
1. Why RAG Still Wins
Every few months, someone declares RAG dead. A new model with a million-token context window launches, and the argument goes: “Just stuff everything in the prompt — who needs retrieval?” And every few months, practitioners building real systems quietly disagree.
The empirical evidence is clear. A 2024 study by Leng et al. compared RAG against long-context approaches on multi-document question answering tasks. The result: RAG consistently outperformed long-context stuffing for corpora exceeding 50 documents, with the gap widening as corpus size increased. Long-context models showed significant degradation in faithfulness as the input exceeded 100K tokens — a phenomenon researchers call “lost in the middle,” where models attend heavily to the beginning and end of context while neglecting the middle.
Source: Liu, N. et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
Here’s why RAG remains the dominant architecture for enterprise AI applications:
Data freshness without retraining. Fine-tuning bakes knowledge into model weights. When your product documentation changes weekly, your compliance policies update quarterly, and your knowledge base grows daily, fine-tuning becomes an expensive treadmill. RAG retrieves from live data sources — update the document, and the next query reflects the change. This is not a theoretical advantage — it is the reason most enterprise knowledge assistants use RAG. A compliance team that needs answers grounded in last week’s regulatory update cannot wait for a fine-tuning cycle.
Access control and auditability. When a financial analyst asks about Q3 earnings, they should only see documents they’re authorized to access. RAG naturally supports document-level permissions because retrieval happens at query time against a permissioned index. You tag chunks with access control metadata during ingestion and filter at retrieval time — a pattern well-supported by OpenSearch Serverless and Bedrock Knowledge Bases. With long-context approaches, you’d need to pre-filter and reconstruct prompts per user — operationally painful and security-risky.
Cost at scale. Feeding 500,000 tokens into every API call is expensive. RAG retrieves the 5-10 most relevant chunks (typically 2,000-5,000 tokens) and sends only those. At thousands of queries per day, the cost difference is orders of magnitude. Consider a concrete comparison: a 10,000-document knowledge base where each document averages 5,000 tokens. Stuffing even 100 documents into a single prompt costs ~$1.50 per query with a frontier model at $3/MTok input. RAG retrieves 5-10 relevant chunks (~3,000 tokens total) for ~$0.009 per query. At 5,000 queries per day, that’s $7,500/day versus $45/day — a 166× difference.
Grounded, verifiable answers. RAG provides source attribution by design. The model’s answer can point to specific documents, paragraphs, and pages. This isn’t just nice to have — in regulated industries (healthcare, finance, legal), it’s a requirement. When an auditor asks “where did this answer come from?”, a RAG system can point to the exact chunk and document. A long-context system can only gesture at the 500K-token prompt.
Composability and modularity. A RAG pipeline is modular — you can swap the embedding model without changing retrieval logic, upgrade the vector store without touching the generation layer, or add a reranker without modifying anything else. This composability matters at enterprise scale, where different teams own different components and upgrades must be incremental. Long-context approaches couple everything into a single monolithic prompt, making iterative improvement difficult.
RAG + long context is not either/or. The most effective production systems combine both. Use RAG to retrieve the 10-20 most relevant chunks, then leverage a long-context model to reason over those chunks with full attention. This “retrieve then reason” pattern gets you the precision of RAG with the synthesis capabilities of large context windows. Amazon Bedrock Knowledge Bases supports this pattern natively through the RetrieveAndGenerate API with configurable context windows.
When RAG is NOT the answer: For tasks requiring deep reasoning over a small, stable corpus (e.g., analyzing a single contract), long-context approaches work well. For teaching a model new behaviors or styles, fine-tuning is appropriate. For simple classification or extraction from short texts, neither RAG nor fine-tuning is needed — a well-crafted prompt suffices. RAG excels when you need accurate, sourced answers over large, dynamic, access-controlled knowledge bases — which describes most enterprise use cases.
Source: Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020, https://arxiv.org/abs/2005.11401
Source: Gao, Y. et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” 2024, https://arxiv.org/abs/2312.10997
2. RAG Architecture Overview
A production RAG system has two pipelines: ingestion (offline) and query (online). Understanding both — and the design decisions at each stage — is essential for building systems that scale beyond a prototype.
Ingestion Pipeline (Offline)
The ingestion pipeline converts raw documents into searchable, retrievable units. It runs asynchronously — typically triggered when new documents are added or existing ones are updated.
INGESTION PIPELINE (Offline)

Data Sources (S3, Web, Confluence, SharePoint, Databases)
  → Parsing & Extraction (Textract, Unstructured, Tika)
  → Chunking Strategy (Fixed, Semantic, Hierarchical, Structure-aware)
  → Embedding Model
  → Vector Store + Metadata Index (OpenSearch, pgvector, etc.)
Each stage has meaningful design decisions:
- Data source connectors. Where do your documents live? S3 is the most common starting point, but production systems often pull from multiple sources — Confluence wikis, SharePoint sites, databases, and web crawlers. Bedrock Knowledge Bases supports S3, Confluence, SharePoint, Salesforce, and web crawlers as native data sources.
- Parsing and extraction. Raw documents must be converted to clean text. PDFs require layout-aware parsing (Amazon Textract for tables and forms, or third-party parsers like Unstructured.io). HTML requires boilerplate removal. Structured data (JSON, CSV) requires schema-aware extraction. This stage is often underinvested — poor parsing propagates errors through the entire pipeline.
- Chunking. The most impactful decision in the ingestion pipeline. How you split documents into retrieval units determines the upper bound of your system’s quality. Section 3 covers this in depth.
- Embedding. Each chunk is converted to a dense vector representation using an embedding model. The choice of model, dimensionality, and normalization directly affects retrieval quality. Section 4 covers this.
- Indexing and storage. Embeddings are stored in a vector store with metadata for filtering. The index structure (HNSW, IVF, flat) affects the recall-latency trade-off at query time.
Query Pipeline (Online)
The query pipeline handles real-time user requests. Every millisecond of latency is felt by the user, so efficiency matters.
QUERY PIPELINE (Online)

User Query
  → Query Understanding & Enhancement (Rewrite, Expand, Decompose, Route)
  → Retrieval (Hybrid: Dense + Sparse, + Metadata Filter)
  → Reranking
  → Context Assembly
  → LLM Generation
  → Guardrails (PII, grounding check)
  → Response + Citations
The query pipeline stages in detail:
- Query understanding and enhancement (Section 5). The user’s raw query is rarely optimal for retrieval. This stage rewrites, expands, decomposes, or routes the query based on intent classification. It is one of the highest-ROI investments in a production RAG system.
- Retrieval. The enhanced query is used to search the vector store. Production systems almost always use hybrid search — combining dense (semantic) and sparse (keyword/BM25) retrieval — to cover both semantic and lexical matches. Metadata filters narrow the search space before vector similarity is computed.
- Reranking. The initial retrieval returns a candidate set (typically 20-50 chunks). A cross-encoder reranker re-scores these candidates with a model that sees the query and each candidate together, producing a much more accurate relevance ranking. This step adds 50-150ms but typically improves answer quality by 15-25%.
- Context assembly. The top-ranked chunks are assembled into a prompt context. This involves deduplication (if multiple retrieval paths returned the same chunk), ordering (most relevant first), and truncation (ensuring the total context fits within the LLM’s effective window).
- LLM generation. The assembled context plus the user’s query are sent to the LLM with a system prompt that instructs grounded generation with source citation.
- Guardrails and post-processing. The generated response is validated for content safety, PII, grounding (is every claim supported by the context?), and formatting. Bedrock Guardrails supports all of these checks natively.
AWS Reference Architecture
On AWS, you have two primary approaches:
Option A: Fully Managed (Bedrock Knowledge Bases)
S3 / Confluence / SharePoint / Web Crawler
→ Bedrock Knowledge Base (managed parsing, chunking, embedding)
→ OpenSearch Serverless / Aurora pgvector / Pinecone / MongoDB Atlas / Neptune Analytics
→ Bedrock RetrieveAndGenerate API (managed retrieval + generation)
→ Bedrock Guardrails (content safety + grounding check)
→ Response with source attributions
This approach minimizes infrastructure management. Bedrock handles chunking, embedding, indexing, retrieval, and generation through a single API. You configure the chunking strategy, select an embedding model, and choose a vector store — Bedrock manages the rest. Best for teams that want to ship quickly and optimize later.
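As a sketch of how little code Option A requires, a single RetrieveAndGenerate call handles retrieval, prompt assembly, and generation (the knowledge base ID is a placeholder; the model ARN is one illustrative choice):
import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How do I enable cross-region replication for an S3 bucket?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])                    # generated answer
for citation in response.get("citations", []):       # source attributions
    for ref in citation["retrievedReferences"]:
        print(ref["location"])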
Option B: Custom Pipeline (Maximum Control)
S3 → Lambda (custom parsing) → Lambda (custom chunking)
→ Bedrock Embedding API (Titan V2 / Cohere Embed v3)
→ OpenSearch Serverless (self-managed index)
→ Lambda / ECS (custom query pipeline with enhancement, retrieval, reranking)
→ Bedrock InvokeModel API (generation)
→ Bedrock Guardrails
→ API Gateway → Client
This approach gives you full control over every stage. Use it when Bedrock Knowledge Bases’ built-in options don’t meet your requirements — for example, if you need custom document parsers (Amazon Textract for complex PDFs), non-standard chunking strategies, or specialized query routing logic. The trade-off is more infrastructure to manage and more code to maintain.
Option C: Hybrid (Common in Practice)
Most production teams land here: use Bedrock Knowledge Bases for ingestion and indexing (the offline pipeline), but build a custom query pipeline (the online pipeline) using Lambda or ECS. This gives you managed ingestion with custom query enhancement, reranking, and generation logic.
Ingestion: Bedrock KB (managed)
Query: API Gateway → Lambda (query enhancement)
→ Bedrock Retrieve API (search the KB's index)
→ Cohere Rerank on Bedrock (rerank candidates)
→ Bedrock InvokeModel (generation with custom prompt)
→ Bedrock Guardrails → Response
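A minimal sketch of that query path inside a Lambda handler, under assumptions: the knowledge base ID and model ID are placeholders, and the rerank step is a pass-through stub you would replace with a cross-encoder or the Bedrock Rerank API:
import boto3, json
agent_rt = boto3.client("bedrock-agent-runtime")
bedrock_rt = boto3.client("bedrock-runtime")
def handler(event, context):
    query = event["query"]
    # 1. Retrieve candidate chunks from the managed KB index
    retrieved = agent_rt.retrieve(
        knowledgeBaseId="YOUR_KB_ID",  # placeholder
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 25}},
    )["retrievalResults"]
    # 2. Rerank candidates (stubbed below)
    top_chunks = rerank(query, retrieved)[:5]
    # 3. Generate with a custom grounded-answer prompt
    context_block = "\n\n".join(c["content"]["text"] for c in top_chunks)
    prompt = f"Answer using only the context below. Cite your sources.\n\n{context_block}\n\nQuestion: {query}"
    resp = bedrock_rt.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return {"answer": json.loads(resp["body"].read())["content"][0]["text"]}
def rerank(query, candidates):
    # Placeholder pass-through — swap in a real reranker (Section 2, Option C)
    return candidates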
The choice between options depends on your team’s operational maturity, latency requirements, and customization needs. Start with Option A, identify the bottlenecks through evaluation (Section 9), and selectively move components to Option C as needed.
Source: AWS, “Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
Source: AWS, “Build a RAG-based generative AI application using Amazon Bedrock Knowledge Bases,” https://aws.amazon.com/blogs/machine-learning/build-a-rag-based-generative-ai-application-using-amazon-bedrock-knowledge-bases/
3. Data Ingestion & Chunking — The Foundation
If you get chunking wrong, nothing downstream can save you. The best embedding model and the most sophisticated retrieval strategy cannot compensate for poorly chunked documents. Chunking is where most RAG systems silently fail — and it’s where the highest-ROI optimizations live.
Why Chunking Matters
Chunking determines the granularity of your retrieval units. Too large, and retrieved chunks contain noise that dilutes the answer. Too small, and you lose context — the model sees fragments without enough information to generate a complete response.
The challenge is that there is no universal optimal chunk size. It depends on your document types, query patterns, embedding model’s sweet spot, and the nature of the questions your users ask. A system answering factoid questions about product specifications needs different chunking than one synthesizing answers from legal contracts.
Consider a concrete example: a 50-page AWS user guide. If you chunk it into 128-token pieces, a question like “How do I configure cross-region replication for S3?” might retrieve a chunk containing the command syntax but missing the prerequisite IAM permissions described two paragraphs earlier. If you chunk it into 2048-token pieces, that same query might retrieve a chunk covering three unrelated S3 features, burying the relevant content in noise. The art of chunking is finding the right granularity for your specific use case.
Core Chunking Strategies
Fixed-Size Chunking
The simplest approach: split text into chunks of N tokens with M tokens of overlap.
# Typical fixed-size chunking
chunk_size = 512 # tokens
chunk_overlap = 50 # tokens (~10% overlap)
When it works: Homogeneous documents with consistent structure — news articles, blog posts, transcripts, and any corpus where content density is relatively uniform. A news corpus of 10,000 articles, each roughly the same length and style, chunks well with fixed-size because the structure is inherently flat.
When it fails: Documents with hierarchical structure (technical manuals, legal contracts) where a fixed window arbitrarily splits a section mid-paragraph or separates a heading from its content. Consider a legal contract: a fixed-size chunk might start mid-sentence in one clause and end mid-sentence in another, rendering both fragments useless for answering questions about either clause.
Real-world example: A customer support FAQ database where each FAQ entry is 100-400 tokens. Fixed-size chunking at 512 tokens works well because most entries fit in a single chunk, and the few longer ones get split at natural points with overlap preserving continuity.
Overlap tuning: The overlap parameter is often under-considered. Too little overlap (0-5%), and you lose cross-boundary context. Too much (>25%), and you waste storage and create near-duplicate chunks that confuse retrieval. The 10-20% range is a reliable starting point, but for documents with long sentences (academic papers, legal text), 15-20% overlap prevents mid-sentence splits.
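A minimal sketch of fixed-size chunking with overlap — whitespace tokens stand in for a real tokenizer such as tiktoken, so counts will differ slightly in practice:
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size-token chunks with `overlap` tokens shared between neighbors."""
    tokens = text.split()  # stand-in for a real tokenizer
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks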
Recursive / Character Splitting
LangChain popularized this approach: split by a hierarchy of separators (\n\n → \n → ". " → " "), recursively subdividing until chunks are under the size limit.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
Advantage over fixed-size: Respects natural text boundaries (paragraphs, sentences). When the splitter encounters a paragraph break, it prefers to split there rather than mid-sentence.
Limitation: Still fundamentally size-driven — it doesn’t understand whether two paragraphs are semantically related. Two paragraphs discussing the same concept but separated by \n\n will be split into different chunks if the combined size exceeds the limit.
Real-world example: Technical documentation with mixed content — some sections are 200 words, others are 2,000. Recursive splitting adapts to this variance better than fixed-size because it preserves short paragraphs intact while subdividing long sections at natural boundaries.
Source: LangChain, “Text Splitters,” https://python.langchain.com/docs/how_to/#text-splitters
Semantic Chunking
Instead of splitting by size or structure, semantic chunking uses embedding similarity to determine split points. Adjacent sentences are embedded, and when the cosine similarity between consecutive sentence embeddings drops below a threshold, a chunk boundary is inserted.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_aws import BedrockEmbeddings
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
chunker = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile", # or "standard_deviation", "interquartile"
breakpoint_threshold_amount=75
)
chunks = chunker.create_documents([document_text])
How it works in detail: The algorithm embeds each sentence, then computes pairwise cosine similarity between consecutive sentences. It applies a breakpoint detection method — percentile-based (split when similarity drops below the Nth percentile), standard deviation-based (split when the drop exceeds N standard deviations from the mean), or interquartile range (split at statistical outliers). The result is chunks where every sentence within a chunk is semantically related.
Advantage: Chunks are semantically coherent — each chunk contains one “idea” or topic. This directly improves retrieval precision because a query about a specific topic is more likely to match a chunk that’s purely about that topic.
Trade-offs: Requires an embedding pass during ingestion (adds cost and latency). Chunk sizes vary significantly — you might get chunks ranging from 50 to 1,500 tokens. This inconsistency can affect retrieval: very small chunks may lack context, and very large ones may introduce noise. Consider adding min/max size constraints.
Real-world example: A research paper where the introduction smoothly transitions between motivation, related work, and contribution overview. Fixed-size chunking would arbitrarily split these transitions. Semantic chunking detects the topic shifts (e.g., from “related work” to “our approach”) and places boundaries there, producing chunks that each represent a coherent idea.
Source: Kamradt, “Semantic Chunking,” 2024, https://github.com/FullStackRetrieval-com/RetrievalTutorials
Document-Structure-Aware Chunking
For structured documents (HTML, Markdown, PDFs with headings), parse the document tree and chunk by structural units: sections, subsections, or logical blocks.
For Markdown/HTML: Split by heading hierarchy (H1 → H2 → H3), keeping each section as a chunk. If a section exceeds the size limit, subdivide by paragraphs within it. Preserve heading paths as metadata (e.g., “User Guide > Authentication > OAuth2 Configuration”).
For PDFs: Use layout-aware parsers (Amazon Textract, Unstructured.io, LlamaParse) that detect headings, tables, and figure captions rather than treating the PDF as flat text. A PDF rendered from a two-column layout will produce garbled text with naive extraction — layout-aware parsers reconstruct the reading order.
This is often the highest-ROI approach for enterprise documents. Most corporate knowledge bases have clear structure — user guides with sections, policies with numbered clauses, API docs with endpoints. Respecting that structure during chunking preserves the author’s intent.
Real-world example: An API reference with 200 endpoints, each documented with a description, request/response schema, parameters table, and code examples. Structure-aware chunking keeps each endpoint as a single chunk (or parent chunk), so a query about “PUT /users/{id} request body” retrieves the complete endpoint documentation rather than a fragment that contains the URL but not the parameters.
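For Markdown sources, LangChain's MarkdownHeaderTextSplitter is one way to implement the heading-path pattern — each chunk carries its heading hierarchy as metadata. A sketch (markdown_text is an assumed variable; combine with a size-based splitter for oversized sections):
from langchain.text_splitter import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(markdown_text)
# Each section's metadata holds the heading path,
# e.g. {"h1": "User Guide", "h2": "Authentication", "h3": "OAuth2 Configuration"}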
Source: Unstructured.io, “Document Parsing for LLMs,” https://unstructured.io/
Source: LlamaParse, “Document Parsing for LLM Applications,” https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/
Hierarchical Chunking (Parent-Child)
One of the most powerful techniques for production RAG. The idea: index small chunks for precise retrieval, but return the parent chunk (larger context) to the LLM.
Document
├── Section (parent chunk — 2000 tokens)
│   ├── Paragraph 1 (child chunk — 300 tokens) ← Retrieved by vector search
│   ├── Paragraph 2 (child chunk — 250 tokens)
│   └── Paragraph 3 (child chunk — 350 tokens)
When the retriever matches Paragraph 1, the system returns the entire Section to the LLM. This gives the model enough context to generate a complete answer while maintaining retrieval precision.
Implementation: Store both parent and child chunks with a parent-child relationship. Search against child chunks, then look up and return parent chunks. Deduplicate when multiple child chunks from the same parent are retrieved.
# LlamaIndex auto-merging retriever (index leaf chunks, merge into parents at query time)
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, child, grandchild
)
nodes = node_parser.get_nodes_from_documents(documents)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)  # store all levels for parent lookup
index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)  # index leaves only
retriever = AutoMergingRetriever(index.as_retriever(similarity_top_k=12), storage_context)
Real-world example: A 100-page employee handbook. Child chunks (individual policy clauses, 200-400 tokens) provide precise retrieval — “What’s the parental leave policy?” hits the exact clause. But the parent chunk (the full “Leave Policies” section, 2,000 tokens) gives the LLM enough context to mention related policies (sick leave, unpaid leave options) that the user might also need.
Design consideration: Choose parent/child size ratios carefully. A 4:1 ratio (e.g., 2048:512) is a good starting point. If parents are too large (10:1), you lose the benefit of precise retrieval. If too small (2:1), there’s little additional context to gain.
Source: LlamaIndex, “Auto-Merging Retriever,” https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/
Agentic Chunking
Use an LLM to determine chunk boundaries. Feed the document to an LLM and ask it to identify semantically complete units. The LLM considers context, topic shifts, and logical completeness in ways that heuristic methods cannot.
agentic_chunk_prompt = """Analyze this document and identify the natural
semantic boundaries. For each chunk, provide:
1. The start and end positions
2. A descriptive title summarizing the chunk's content
3. A list of key entities mentioned
The goal is to create chunks that are each self-contained — a reader
should be able to understand each chunk without needing the others.
Document:
{document_text}
"""
Trade-off: Significantly more expensive and slower at ingestion time — each document requires one or more LLM calls. At $3/MTok input for a frontier model, chunking a 1M-token corpus costs $3 just for the chunking pass. Best reserved for high-value documents where chunking quality has outsized impact (e.g., legal contracts, medical records, regulatory filings).
Real-world example: A complex merger agreement with nested cross-references. The LLM identifies that Section 4.2(a) references definitions in Section 1.1 and conditions in Section 7.3, and creates a chunk that includes the relevant cross-referenced text — something no heuristic method could achieve.
Advanced Chunking Techniques
Late Chunking (Embed First, Then Chunk)
Traditional chunking pipelines chunk first, then embed each chunk independently. Late chunking inverts this: embed the entire document using a long-context embedding model, then split the embedding sequence into chunks.
Why this matters: When you embed chunks independently, each chunk loses the context of the surrounding document. The sentence “It supports three modes” is meaningless in isolation — “it” could refer to anything. When you embed the full document first, the embedding for that sentence captures that “it” refers to “the S3 Transfer Acceleration feature” because the transformer’s attention mechanism has seen the full context.
# Conceptual late chunking pipeline
# Step 1: Embed the full document with a long-context model
token_embeddings = long_context_model.encode(full_document, output="token_embeddings")
# Step 2: Split token embeddings into chunk-level embeddings
# by averaging token embeddings within each chunk span
chunk_embeddings = []
for start, end in chunk_boundaries:
chunk_emb = token_embeddings[start:end].mean(dim=0)
chunk_embeddings.append(chunk_emb)
Requirements: You need an embedding model with a long enough context window to process entire documents (e.g., jina-embeddings-v2 with 8,192 tokens, or nomic-embed-text with 8,192 tokens). For documents exceeding the model’s context window, you can apply late chunking to sections rather than the full document.
Trade-off: Higher ingestion cost (embedding full documents is more expensive than embedding chunks) and requires a long-context embedding model. The quality improvement is most noticeable for documents with heavy co-referencing and pronoun usage.
Source: Günther, M. et al., “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” Jina AI, 2024, https://arxiv.org/abs/2409.04701
Proposition-Based Chunking
Instead of splitting text by size or structure, decompose each passage into atomic propositions — self-contained statements that each express a single fact.
Example transformation:
Original text: “Amazon S3 provides several storage classes designed for different use cases. S3 Standard offers high durability, availability, and performance for frequently accessed data. It is suitable for a wide variety of use cases including cloud applications, dynamic websites, and big data analytics.”
Propositions:
- “Amazon S3 provides several storage classes designed for different use cases.”
- “S3 Standard offers high durability for frequently accessed data.”
- “S3 Standard offers high availability for frequently accessed data.”
- “S3 Standard offers high performance for frequently accessed data.”
- “S3 Standard is suitable for cloud applications.”
- “S3 Standard is suitable for dynamic websites.”
- “S3 Standard is suitable for big data analytics.”
proposition_prompt = """Decompose the following passage into clear,
self-contained propositions. Each proposition should:
- Express a single, atomic fact
- Be understandable without additional context
- Resolve all pronouns and references to their specific nouns
- De-compound conjunctive statements into individual propositions
Passage: {text}
Propositions:"""
Advantage: Each proposition is a precise, self-contained retrieval unit. A query asking “Is S3 Standard suitable for big data?” will match proposition 7 with very high similarity, whereas the original paragraph would be a weaker match due to noise.
Trade-offs: Produces many small chunks (5-10x more than paragraph-level chunking), increasing storage and retrieval costs. The LLM decomposition step is expensive at ingestion time. Best used as child chunks in a hierarchical system — retrieve propositions, return the parent paragraph.
Source: Chen, S. et al., “Dense X Retrieval: What Retrieval Granularity Should We Use?,” 2023, https://arxiv.org/abs/2312.06648
Context-Enriched Chunking
A practical technique that addresses the “chunk in isolation” problem: prepend contextual information to each chunk so it can stand alone.
Approach 1: Prepend document/section metadata
# Before: raw chunk
chunk = "It supports three modes: standard, expedited, and bulk."
# After: context-enriched chunk
enriched_chunk = """Document: AWS S3 Glacier Developer Guide
Section: Data Retrieval Options
---
S3 Glacier supports three retrieval modes: standard, expedited, and bulk."""
Approach 2: LLM-generated contextual summary
Use a lightweight LLM to generate a brief contextual summary for each chunk, situating it within the broader document.
context_prompt = """Given the following document and a specific chunk
extracted from it, write a 1-2 sentence context that situates this
chunk within the broader document. The context should resolve any
ambiguous references and clarify what topic is being discussed.
Document title: {doc_title}
Section path: {heading_path}
Surrounding text: {prev_paragraph}... [CHUNK] ...{next_paragraph}
Chunk: {chunk_text}
Context:"""
# Prepend the generated context to the chunk before embedding
final_chunk = f"{generated_context}\n\n{chunk_text}"
Approach 3: Anthropic’s Contextual Retrieval method
Anthropic published a specific implementation of this pattern, reporting that prepending chunk-specific context (generated by Claude) reduced the top-20 retrieval failure rate by 35% on its own, and by 67% when combined with contextual BM25 and reranking — from a 5.7% failure rate to 1.9%.
context_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval
of the chunk. Answer only with the succinct context and nothing else."""
Trade-off: Requires an LLM call per chunk during ingestion. For a corpus of 100,000 chunks, this is significant. Use prompt caching to reduce cost when processing chunks from the same document (the document text in the prompt is repeated across calls): Anthropic reports that prompt caching dramatically reduces the cost of contextual enrichment, since cached document tokens are billed at a steep discount.
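A sketch of the ingestion-time enrichment call using the prompt above (the model ID is one illustrative choice; with Anthropic's API you would additionally mark the document block for prompt caching):
import boto3, json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def contextualize_chunk(whole_document: str, chunk_content: str) -> str:
    """Generate a short situating context for a chunk, then prepend it before embedding."""
    prompt = context_prompt.format(WHOLE_DOCUMENT=whole_document, CHUNK_CONTENT=chunk_content)
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 200,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    context = json.loads(response["body"].read())["content"][0]["text"].strip()
    return f"{context}\n\n{chunk_content}"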
Source: Anthropic, “Introducing Contextual Retrieval,” 2024, https://www.anthropic.com/news/contextual-retrieval
Document-Specific Chunking Strategies
Different document types have fundamentally different structures and require tailored chunking approaches.
PDF Tables
Tables are one of the most common failure points in RAG systems. Naive text extraction destroys table structure, turning rows and columns into meaningless strings.
Strategy:
- Use Amazon Textract with AnalyzeDocument (TABLES feature) to extract tables as structured data
- Store each table as a complete chunk — never split a table across chunks
- Include the table caption and any surrounding explanatory text
- Convert to a text representation that preserves structure:
# Option 1: Markdown table (good for embedding)
markdown_table = """
| Instance Type | vCPUs | Memory (GiB) | Price/hr |
|--------------|-------|-------------|----------|
| m5.large | 2 | 8 | $0.096 |
| m5.xlarge | 4 | 16 | $0.192 |
"""
# Option 2: Row-per-line with headers (good for dense tables)
text_rows = """
Table: EC2 Instance Pricing (US-East-1)
- m5.large: 2 vCPUs, 8 GiB Memory, $0.096/hr
- m5.xlarge: 4 vCPUs, 16 GiB Memory, $0.192/hr
"""
For very large tables (50+ rows): Consider splitting by row groups while repeating the header in each chunk, or create a summary chunk describing the table’s contents alongside the full table chunk.
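A sketch of row-group splitting for large markdown tables, repeating the header in every chunk so each group remains interpretable on its own:
def split_markdown_table(table_text: str, rows_per_chunk: int = 20) -> list[str]:
    """Split a markdown table into row groups, repeating the header and separator rows in each chunk."""
    lines = [l for l in table_text.strip().splitlines() if l.strip()]
    if len(lines) < 3:
        return [table_text]  # too small to split
    header, separator, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        group = rows[i:i + rows_per_chunk]
        chunks.append("\n".join([header, separator] + group))
    return chunks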
Source: AWS, “Amazon Textract AnalyzeDocument,” https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html
HTML Pages
HTML presents unique challenges: navigation elements, footers, sidebars, and ads surround the actual content.
Strategy:
- Strip boilerplate (nav, footer, scripts, ads) using readability algorithms or tools like trafilatura or BeautifulSoup with semantic filtering
- Parse the clean HTML DOM tree for structure (headings, lists, code blocks)
- Chunk by semantic HTML sections (article, section, heading hierarchy)
- Preserve links as metadata — a chunk about “S3 pricing” should retain the link to the pricing page
from trafilatura import extract
# Extract main content, stripping boilerplate
clean_text = extract(html_content, include_tables=True, include_links=True)
JSON and Structured Data
API responses, configuration files, and structured datasets need special handling.
Strategy:
- For flat JSON objects: each top-level key-value pair or logical group becomes a chunk
- For nested JSON: chunk at meaningful nesting levels (e.g., each item in an array of products)
- Always include the schema path as context: "product.pricing.tiers[0]" tells the model where this data sits
- Convert to natural language where appropriate:
# Raw JSON
{"instance_type": "m5.large", "vcpus": 2, "memory_gib": 8, "price_per_hour": 0.096}
# Natural language chunk (better for embedding)
"The m5.large EC2 instance type has 2 vCPUs, 8 GiB of memory, and costs $0.096 per hour."
Emails and Chat Logs
Conversational content has unique structure: turns, threads, quoted replies, signatures, and attachments.
Strategy for emails:
- Strip signatures, disclaimers, and reply chains (or chunk them separately)
- Each email becomes a chunk, with metadata: sender, recipients, date, subject, thread ID
- For long email threads: chunk each message individually but store the thread ID for retrieval of the full conversation
- Extract and separately chunk any inline content (tables, lists, action items)
Strategy for chat logs (Slack, Teams):
- Chunk by conversation thread, not by individual message — a single Slack message rarely has enough context
- Use time-based windowing: group messages within a conversation that are within N minutes of each other (see the sketch after this list)
- Include participant names and timestamps as metadata
- Flag and extract code snippets, shared links, and decisions as separate high-value chunks
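A sketch of the time-based windowing mentioned above: messages in the same thread are grouped into one chunk whenever the gap between consecutive messages stays under a threshold (message dicts with ts, user, and text fields are an assumed shape):
from datetime import timedelta
def window_thread(messages: list[dict], gap_minutes: int = 15) -> list[str]:
    """Group chat messages into chunks, splitting wherever the time gap exceeds the threshold."""
    messages = sorted(messages, key=lambda m: m["ts"])  # m["ts"] is a datetime
    groups, current = [], []
    for msg in messages:
        if current and msg["ts"] - current[-1]["ts"] > timedelta(minutes=gap_minutes):
            groups.append(current)
            current = []
        current.append(msg)
    if current:
        groups.append(current)
    return [
        "\n".join(f"[{m['ts']:%Y-%m-%d %H:%M}] {m['user']}: {m['text']}" for m in group)
        for group in groups
    ]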
Chunk Size Optimization
The Empirical Evidence
The impact of chunk size on RAG quality has been studied extensively. A comprehensive 2024 study evaluated chunk sizes across diverse datasets and tasks:
| Chunk Size | Retrieval Precision | Answer Quality | Best For | Risk |
|---|---|---|---|---|
| 128 tokens | Very High | Low | Factoid lookup, proposition indexing | Fragments lack context for generation |
| 256 tokens | High | Medium | Short-answer QA, FAQ matching | May split compound explanations |
| 512 tokens | Medium-High | High | General-purpose (most use cases) | Balanced trade-off |
| 1024 tokens | Medium | High | Complex explanations, how-to content | Some noise for narrow queries |
| 2048 tokens | Lower | Medium-High | Long-form synthesis, multi-topic answers | Significant noise dilution |
Source: NVIDIA, “Advanced RAG Techniques,” 2024, https://developer.nvidia.com/blog/advanced-rag-techniques/
Methodology for Finding Your Optimal Chunk Size
Do not blindly adopt published benchmarks — they tested different documents and queries than yours. Run your own experiments:
Step 1: Create a representative evaluation set. Select 50-100 queries that reflect your real user base. Include simple factoid questions, multi-part questions, and questions that require synthesizing information from multiple places.
Step 2: Chunk your corpus at multiple sizes. Test at least 4 sizes: 256, 512, 1024, and one document-structure-aware baseline.
Step 3: Measure retrieval and generation metrics separately.
chunk_sizes = [256, 512, 1024, 2048]
results = {}
for size in chunk_sizes:
chunks = chunk_corpus(documents, chunk_size=size, overlap=int(size * 0.15))
index = build_index(chunks)
retrieval_metrics = evaluate_retrieval(index, eval_queries) # Recall@5, Precision@5
generation_metrics = evaluate_generation(index, eval_queries) # Faithfulness, Completeness
results[size] = {
"recall@5": retrieval_metrics.recall,
"precision@5": retrieval_metrics.precision,
"faithfulness": generation_metrics.faithfulness,
"completeness": generation_metrics.completeness,
"avg_chunks_per_query": retrieval_metrics.avg_retrieved
}
Step 4: Analyze the trade-off curve. Plot retrieval precision vs. answer completeness. The optimal chunk size is where both metrics are acceptably high — typically a Pareto-optimal point where improving one metric would significantly degrade the other.
Step 5: Consider your embedding model’s training data. Most embedding models were trained on passages of a specific length. Titan Embeddings V2 was trained on passages up to 8,192 tokens but performs best on passages of 256-512 tokens. Cohere Embed v3 handles up to 512 tokens per input. Matching your chunk size to your embedding model’s sweet spot improves representation quality.
The practical recommendation: Start with 512 tokens and 10-20% overlap. Then evaluate with your actual queries and documents. There is no substitute for empirical testing on your data.
Chunking Strategy Comparison
| Strategy | Chunk Quality | Ingestion Cost | Complexity | Best For | Limitations |
|---|---|---|---|---|---|
| Fixed-size | Low-Medium | Very Low | Trivial | Homogeneous corpora, prototyping | Ignores document structure entirely |
| Recursive splitting | Medium | Very Low | Low | General-purpose, mixed documents | Size-driven, not meaning-driven |
| Semantic | High | Medium (embedding pass) | Medium | Documents with subtle topic shifts | Variable chunk sizes, embedding cost |
| Structure-aware | High | Low-Medium (parsing) | Medium | Docs with clear headings/sections | Requires structured input |
| Hierarchical (parent-child) | Very High | Low-Medium | Medium-High | Enterprise docs, knowledge bases | More complex retrieval logic |
| Late chunking | High | High (full-doc embedding) | High | Co-reference-heavy documents | Requires long-context embedding model |
| Proposition-based | Very High (precision) | Very High (LLM calls) | High | High-value factoid retrieval | Expensive, many small chunks |
| Context-enriched | Very High | High (LLM calls) | Medium | Any corpus (universal improvement) | Cost at scale, prompt caching helps |
| Agentic | Highest | Very High (LLM calls) | Very High | Legal, medical, complex cross-references | Slow, expensive, non-deterministic |
Decision guide:
- Prototyping or low-budget? → Recursive splitting with 512 tokens
- Structured enterprise docs? → Document-structure-aware + hierarchical
- Highest quality, cost not primary concern? → Context-enriched + hierarchical + reranking
- Factoid QA over dense content? → Proposition-based as child chunks with paragraph-level parents
- Documents with heavy pronouns and references? → Late chunking or context-enriched
Metadata Enrichment
Every chunk should carry metadata beyond just the text:
- Source document (title, URL, last updated)
- Section/heading hierarchy (where in the document this chunk lives)
- Document type (FAQ, policy, tutorial, API reference)
- Entity tags (products, services, concepts mentioned)
- Access control tags (department, classification level)
- Chunk position (beginning, middle, end of document — useful for summaries vs. details)
This metadata enables filtered retrieval — when a user asks about “S3 pricing,” you can filter to pricing documents before semantic search, dramatically improving precision.
chunk_metadata = {
"source": "s3-developer-guide-2025.pdf",
"source_url": "https://docs.aws.amazon.com/s3/latest/userguide/",
"section_path": "Storage Classes > S3 Standard",
"doc_type": "technical_documentation",
"entities": ["S3", "S3 Standard", "storage class"],
"last_updated": "2025-11-15",
"access_level": "public",
"chunk_index": 14,
"total_chunks": 87
}
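A sketch of filtered retrieval against a Bedrock Knowledge Base, assuming doc_type and access_level were ingested as chunk metadata (the knowledge base ID is a placeholder):
import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "How is S3 Standard storage priced?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 8,
            "filter": {
                "andAll": [
                    {"equals": {"key": "doc_type", "value": "technical_documentation"}},
                    {"equals": {"key": "access_level", "value": "public"}},
                ]
            },
        }
    },
)
for result in response["retrievalResults"]:
    print(result["score"], result["content"]["text"][:80])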
AWS: Bedrock Knowledge Base Chunking
Bedrock Knowledge Bases offers several chunking strategies out of the box:
- Default chunking: ~300 tokens with overlap (reasonable starting point)
- Fixed-size chunking: Configurable size and overlap
- Hierarchical chunking: Parent-child chunking with configurable parent and child sizes
- Semantic chunking: Groups text by semantic similarity using a configurable breakpoint threshold
- No chunking: Treats each file as a single chunk (useful for short documents or pre-chunked data)
- Custom transformation: Use a Lambda function for arbitrary chunking logic
For most production deployments, start with hierarchical or semantic chunking in Bedrock KB, then evaluate. If you need document-structure-aware parsing (especially for PDFs with complex layouts), use custom transformation with Amazon Textract or a third-party parser.
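A sketch of configuring hierarchical chunking when creating a data source — semantic and fixed-size chunking follow the same chunkingConfiguration pattern; the IDs, ARNs, and token sizes are placeholders to tune:
import boto3
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
response = bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    name="hierarchical-chunked-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-documents-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent chunks (returned to the LLM)
                    {"maxTokens": 300},   # child chunks (indexed for retrieval)
                ],
                "overlapTokens": 60,
            },
        }
    },
)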
Custom Transformation with Lambda
When the built-in strategies are insufficient, Bedrock KB’s custom transformation lets you implement any chunking logic via a Lambda function. The Lambda receives the parsed document and returns your custom chunks.
# Lambda function for custom Bedrock KB chunking
import json
import re
def lambda_handler(event, context):
    """
    Custom chunking Lambda for a Bedrock Knowledge Base.
    Simplified handler: assumes each entry in event["inputFiles"] carries its
    extracted text in "contentBody" plus per-file "metadata". In the actual
    service contract, content is exchanged in batches via the intermediate
    S3 bucket — adapt the I/O to your data source configuration.
    """
input_files = event.get("inputFiles", [])
output_files = []
for input_file in input_files:
content = input_file["contentBody"]
original_metadata = input_file.get("metadata", {})
# Custom logic: chunk by heading structure
chunks = chunk_by_headings(content)
chunk_results = []
for i, chunk in enumerate(chunks):
chunk_results.append({
"contentBody": chunk["text"],
"contentType": "text/plain",
"contentMetadata": {
**original_metadata,
"section_title": chunk.get("heading", ""),
"chunk_index": str(i),
"heading_level": str(chunk.get("level", 0))
}
})
output_files.append({
"originalFileLocation": input_file["originalFileLocation"],
"fileContents": chunk_results
})
return {"outputFiles": output_files}
def chunk_by_headings(text, max_chunk_size=1500):
"""Split document by heading hierarchy with size limits."""
# Split on markdown-style headings
heading_pattern = r'^(#{1,4})\s+(.+)$'
sections = []
current_section = {"heading": "Introduction", "level": 0, "text": ""}
for line in text.split("\n"):
match = re.match(heading_pattern, line)
if match:
if current_section["text"].strip():
sections.append(current_section)
level = len(match.group(1))
heading = match.group(2)
current_section = {
"heading": heading,
"level": level,
"text": f"{line}\n"
}
else:
current_section["text"] += line + "\n"
if current_section["text"].strip():
sections.append(current_section)
# Split oversized sections by paragraph
final_chunks = []
for section in sections:
if len(section["text"]) <= max_chunk_size:
final_chunks.append(section)
else:
paragraphs = section["text"].split("\n\n")
sub_chunk = {"heading": section["heading"], "level": section["level"], "text": ""}
for para in paragraphs:
if len(sub_chunk["text"]) + len(para) > max_chunk_size and sub_chunk["text"]:
final_chunks.append(sub_chunk)
sub_chunk = {
"heading": section["heading"],
"level": section["level"],
"text": ""
}
sub_chunk["text"] += para + "\n\n"
if sub_chunk["text"].strip():
final_chunks.append(sub_chunk)
return final_chunks
Setting up the custom transformation in Bedrock:
import boto3
bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')
response = bedrock_agent.create_data_source(
knowledgeBaseId='YOUR_KB_ID',
name='custom-chunked-source',
dataSourceConfiguration={
'type': 'S3',
's3Configuration': {
'bucketArn': 'arn:aws:s3:::my-documents-bucket',
'inclusionPrefixes': ['documents/']
}
},
vectorIngestionConfiguration={
'customTransformationConfiguration': {
            'intermediateStorage': {
                's3Location': {'uri': 's3://my-intermediate-bucket'}
            },
'transformations': [{
'stepToApply': 'POST_CHUNKING',
'transformationFunction': {
'transformationLambdaConfiguration': {
'lambdaArn': 'arn:aws:lambda:us-east-1:123456789012:function:custom-chunker'
}
}
}]
}
}
)
When to use custom transformation:
- Your documents have domain-specific structure the built-in parsers don’t handle (e.g., medical records with specific section codes, financial filings with XBRL tags)
- You need to integrate a specialized parser like Amazon Textract for tables and forms
- You want to implement proposition-based or context-enriched chunking
- You need to chain multiple processing steps (e.g., Textract → table extraction → heading-based chunking → metadata enrichment)
Source: AWS, “Chunking and parsing configurations,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html
Source: AWS, “Custom transformation with Lambda for Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking-parsing.html#kb-custom-transformation
Key Takeaways
- Start structure-aware. If your documents have headings and sections, use them. Document-structure-aware chunking outperforms fixed-size chunking with zero additional cost.
- Hierarchical is almost always worth it. The parent-child pattern (small chunks for retrieval, large chunks for generation) addresses the fundamental tension between precision and context.
- Context enrichment is the highest-leverage advanced technique. Prepending contextual summaries to chunks consistently improves retrieval across all document types and query patterns.
- Match chunk size to your embedding model. A 2,048-token chunk embedded by a model optimized for 512-token passages will have a degraded representation.
- Test with your data. Published benchmarks are a starting point. The optimal strategy depends on your documents, your queries, and your quality requirements. Run the experiment.
- Budget for iteration. Your first chunking strategy will not be your last. Build your pipeline to make strategy changes easy — chunking is the component you’ll revisit most often.
4. Embedding & Indexing
Embedding and indexing are the bridge between your chunked text and retrievable knowledge. The embedding model determines how well semantic meaning is captured; the vector store and index configuration determine how efficiently and accurately that meaning is searched at query time.
Embedding Model Selection
Your embedding model translates text into dense vector representations where semantic similarity maps to geometric proximity. The quality of these embeddings sets the ceiling for retrieval quality — no amount of reranking or query enhancement can compensate for fundamentally poor embeddings.
Key factors in choosing an embedding model:
- Domain alignment. Models trained on general web text may underperform on domain-specific jargon (medical, legal, financial). If a model scores well on MTEB but retrieval on your data is weak, domain mismatch is the likely culprit.
- Multilingual support. If your corpus includes multiple languages, you need a model explicitly trained for cross-lingual retrieval. Titan V2 and Cohere Embed v3 both handle this well.
- Dimension trade-off. Higher dimensions capture more nuance but increase storage, memory, and search latency. For most use cases, 1024 dimensions is the sweet spot. Below 512, you lose meaningful semantic distinctions; above 2048, the marginal quality gain rarely justifies the cost.
- Context window. Your chunks must fit within the model’s max input tokens. If you use 1024-token chunks, a model with a 512-token limit will silently truncate half the content.
| Model | Dimensions | Max Tokens | MTEB Avg (Retrieval) | Cost (per 1M tokens) | Strengths |
|---|---|---|---|---|---|
| Amazon Titan Embeddings V2 | 256/512/1024 | 8,192 | ~63 | $0.02 | Native Bedrock, configurable dims, good multilingual |
| Cohere Embed v3 | 1024 | 512 | ~66 | $0.10 | Top-tier search quality, int8/binary compression |
| Amazon Titan Text Embeddings V1 | 1536 | 8,192 | ~60 | $0.02 | Good baseline, fixed dimensions |
| BGE-M3 (open source) | 1024 | 8,192 | ~65 | Self-host | Multi-lingual, multi-granularity, dense+sparse |
| E5-Mistral-7B (open source) | 4096 | 32,768 | ~67 | Self-host | Instruction-tuned, excellent zero-shot |
| GTE-Qwen2-7B (open source) | 3584 | 131,072 | ~68 | Self-host | Very long context, strong multilingual |
Note: MTEB scores are approximate and vary by task subset. Always benchmark on your data.
Practical guidance: If you’re in the AWS ecosystem, Titan Embeddings V2 is the path of least resistance — no data leaves your VPC, pricing is straightforward ($0.02/1M tokens), and integration with Bedrock Knowledge Bases is seamless. The configurable dimensionality (256/512/1024) lets you trade quality for cost and speed. If retrieval quality is your bottleneck, benchmark Cohere Embed v3 — it consistently ranks at the top for search tasks on the MTEB leaderboard and is available on Bedrock.
For self-hosted models, BGE-M3 is a strong choice if you need both dense and sparse embeddings from a single model (useful for hybrid search without maintaining separate indexes). Deploy it on SageMaker with a ml.g5.xlarge instance for a good cost-performance balance.
import boto3
import json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def embed_text(text: str, dimensions: int = 1024, normalize: bool = True) -> list[float]:
"""Embed text using Titan Embeddings V2 via Bedrock."""
response = bedrock_runtime.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({
"inputText": text,
"dimensions": dimensions,
"normalize": normalize
})
)
return json.loads(response["body"].read())["embedding"]
# Embed a chunk
embedding = embed_text("Amazon S3 provides eleven nines of durability.", dimensions=1024)
# Returns a 1024-dimensional normalized vector
Matryoshka embeddings and dimension reduction. Some models (Titan V2, text-embedding-3-large) support Matryoshka Representation Learning (MRL), where the most important information is packed into the first N dimensions. You can truncate a 1024-dim embedding to 256 dims with modest quality loss (~2-5% drop in recall). This is useful for cost optimization: 256-dim vectors use 4× less storage and search ~3× faster.
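A sketch of MRL-style truncation: keep the first N dimensions and re-normalize so cosine similarity remains meaningful (only valid for models trained with a Matryoshka objective):
import numpy as np
def truncate_embedding(embedding: list[float], target_dims: int = 256) -> np.ndarray:
    """Truncate an MRL embedding to target_dims and re-normalize to unit length."""
    vec = np.asarray(embedding[:target_dims], dtype=np.float32)
    return vec / np.linalg.norm(vec)
small = truncate_embedding(embedding, target_dims=256)  # `embedding` from the Titan example above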
Source: Hugging Face, “MTEB Leaderboard,” https://huggingface.co/spaces/mteb/leaderboard
Source: Kusupati et al., “Matryoshka Representation Learning,” NeurIPS 2022, https://arxiv.org/abs/2205.13147
Vector Store Selection
The vector store is where your embeddings live and where retrieval queries execute. Choosing the right one depends on your scale, search requirements, and existing infrastructure.
| Vector Store | Managed on AWS | Hybrid Search | Metadata Filtering | Max Vectors | Best For |
|---|---|---|---|---|---|
| OpenSearch Serverless | Yes | Yes (BM25 + kNN) | Yes (complex filters) | Billions | Production hybrid search at scale |
| Aurora PostgreSQL (pgvector) | Yes | Limited (requires custom) | Yes (full SQL) | Millions | Teams already on Aurora, SQL joins |
| Amazon Neptune Analytics | Yes | Graph + vector | Yes (Gremlin/openCypher) | Millions | Knowledge graph + vector hybrid |
| Amazon MemoryDB | Yes | Yes (via VSS module) | Yes | Millions | Ultra-low latency, real-time |
| Pinecone | Third-party | Yes (sparse-dense) | Yes | Billions | Simplicity, fast iteration |
| FAISS | Self-managed | No | No | Billions (on large instances) | Prototyping, batch processing |
OpenSearch Serverless is the pragmatic choice for most AWS customers. It supports hybrid search (BM25 + kNN) natively, scales automatically, integrates with Bedrock KB, and handles complex metadata filtering with boolean logic. The downside is cost at low scale — there’s a minimum of 2 OCUs (~$350/month). For teams processing fewer than 100 queries per day, this minimum can dominate the cost profile.
Aurora PostgreSQL with pgvector is compelling when your application already uses Aurora, or when you need to join vector search results with relational data (e.g., “find similar products that are in stock and priced under $50”). The limitation is that pgvector’s HNSW implementation is less mature than OpenSearch’s, and true hybrid search requires application-level score fusion.
Amazon Neptune Analytics is the right choice when your data has rich graph relationships. It combines vector similarity search with graph traversal — retrieve chunks that are semantically similar and connected to a specific entity in your knowledge graph. This is Graph RAG territory (Section 8).
# OpenSearch Serverless — hybrid search example
hybrid_query = {
"size": 10,
"query": {
"hybrid": {
"queries": [
{
"match": { # BM25 (sparse)
"text": {"query": "S3 cross-region replication setup"}
}
},
{
"knn": { # Dense (vector)
"embedding": {
"vector": query_embedding,
"k": 20
}
}
}
]
}
}
}
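Note that OpenSearch executes hybrid queries through a search pipeline containing a normalization processor, which rescales and combines the BM25 and kNN scores. A sketch of the pipeline body — the weights and technique names are starting points to tune, and you should confirm hybrid query support for your OpenSearch Serverless collection version:
# PUT this body to _search/pipeline/hybrid-rag-pipeline,
# then pass ?search_pipeline=hybrid-rag-pipeline with the hybrid query above
hybrid_pipeline = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},  # BM25 weight, vector weight
                },
            }
        }
    ]
}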
Source: AWS, “Supported vector stores for Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-supported.html
Source: AWS, “Vector search for Amazon OpenSearch Serverless,” https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html
Indexing Parameters
For HNSW (Hierarchical Navigable Small World — the dominant indexing algorithm in production vector search), three parameters control the recall-speed-memory trade-off:
ef_construction (build-time quality): Controls the size of the dynamic candidate list during index building. Higher values produce a higher-quality graph (better recall) at the cost of slower indexing. Start with 256-512 for production. Values below 128 often produce noticeable recall degradation; values above 512 rarely improve recall enough to justify the indexing time.
M (connections per node): The number of bidirectional links per node in the graph. Higher M increases recall and memory usage. Start with 16-32. M=16 is sufficient for most corpora under 10M vectors; M=32 for larger or higher-dimensional indexes.
ef_search (query-time quality): Controls the size of the dynamic candidate list during search. Higher values improve recall at the cost of higher latency. Start with 128 and tune based on your recall-latency requirements. A common pattern is to set ef_search = 2 * top_k as a starting point.
| Parameter | Low Setting | High Setting | Impact on Recall | Impact on Speed |
|---|---|---|---|---|
| ef_construction | 128 | 512 | +5-10% recall | 2-3× slower indexing |
| M | 8 | 32 | +5-15% recall | 2-4× more memory |
| ef_search | 64 | 256 | +3-8% recall | 2-3× slower queries |
Practical tuning workflow: Build your index with ef_construction=512, M=16. Then test ef_search values from 64 to 512 on your evaluation set, plotting recall@10 vs. P95 latency. Pick the point where increasing ef_search no longer meaningfully improves recall — typically around 128-256 for most workloads.
Distance metric. Use cosine similarity for normalized embeddings (most embedding models normalize by default). If embeddings are not normalized, use inner product. Euclidean (L2) distance is rarely the best choice for text embeddings but is supported everywhere as a fallback.
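A sketch of an OpenSearch index mapping that sets these parameters — parameter names follow the OpenSearch k-NN plugin; verify which engines and settings your collection type supports:
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 128,   # query-time candidate list size
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
            "text": {"type": "text"},          # BM25 field for hybrid search
            "doc_type": {"type": "keyword"},   # metadata filter field
        }
    },
}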
Source: Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs,” 2018, https://arxiv.org/abs/1603.09320
5. Query Understanding & Enhancement
Most RAG tutorials show a simple flow: user query → embed → retrieve → generate. In production, this naive approach fails surprisingly often. Users ask ambiguous questions, use different terminology than your documents, or pose complex questions that require multiple retrieval passes.
A query understanding layer between the user and the retriever is one of the highest-ROI investments in a RAG system. This section covers every major technique in depth — with implementation patterns, trade-offs, and guidance on when to use each.
5.1 Query Rewriting / Reformulation
The user says: “How do I fix the timeout issue?” Your documents don’t contain the word “fix” — they say “troubleshoot” and “resolve.” A rewrite step bridges this vocabulary gap.
import boto3, json
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
def rewrite_query(original_query: str, domain_context: str = "") -> str:
"""Rewrite a user query to improve retrieval against technical docs."""
prompt = f"""Rewrite the following user question to be more specific and use
terminology likely found in technical documentation. Preserve the original intent.
Do not answer the question — only reformulate it.
Domain context: {domain_context}
Original question: {original_query}
Rewritten question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 150,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
# Example
original = "How do I fix the timeout issue?"
rewritten = rewrite_query(original, domain_context="AWS Lambda")
# → "How to troubleshoot and resolve AWS Lambda function timeout errors"
This is cheap (a single fast LLM call with Haiku at ~$0.00025 per rewrite) and often improves retrieval hit rate by 10–20%.
When it helps most: Technical domains where user vocabulary diverges from documentation vocabulary. When to skip: When users already use precise terminology (e.g., internal API consumers querying API docs).
Source: Ma et al., “Query Rewriting in Retrieval-Augmented Large Language Models,” 2023, https://arxiv.org/abs/2305.14283
5.2 Query Expansion
Query expansion enriches the original query with additional terms to increase recall. Two primary strategies:
Synonym Injection
Append synonyms or domain-equivalent terms to the query before embedding or keyword search:
# Domain-specific synonym map (can be auto-generated or curated)
SYNONYM_MAP = {
"timeout": ["timeout error", "request timeout", "connection timeout", "deadline exceeded"],
"slow": ["high latency", "performance degradation", "long response time"],
"crash": ["application crash", "unhandled exception", "segmentation fault", "OOM killed"],
"deploy": ["deployment", "release", "rollout", "ship"],
}
def expand_with_synonyms(query: str, synonym_map: dict, max_expansions: int = 3) -> str:
"""Expand query with domain-specific synonyms for improved keyword recall."""
expansions = []
query_lower = query.lower()
for term, synonyms in synonym_map.items():
if term in query_lower:
expansions.extend(synonyms[:max_expansions])
if expansions:
return f"{query} ({', '.join(expansions)})"
return query
# Example
expand_with_synonyms("Lambda timeout when processing large files")
# → "Lambda timeout when processing large files (timeout error, request timeout, connection timeout)"
Entity Expansion
Resolve abbreviations, acronyms, and shorthand references to their full forms:
ENTITY_MAP = {
"S3": "Amazon Simple Storage Service (S3)",
"EKS": "Amazon Elastic Kubernetes Service (EKS)",
"Lambda": "AWS Lambda",
"RDS": "Amazon Relational Database Service (RDS)",
"IAM": "AWS Identity and Access Management (IAM)",
"VPC": "Amazon Virtual Private Cloud (VPC)",
}
def expand_entities(query: str, entity_map: dict) -> str:
"""Expand acronyms and abbreviations to full names for retrieval."""
expanded = query
for abbrev, full_name in entity_map.items():
if abbrev in query and full_name not in query:
expanded = expanded.replace(abbrev, full_name, 1)
return expanded
# Example
expand_entities("How to connect S3 to Lambda")
# → "How to connect Amazon Simple Storage Service (S3) to AWS Lambda"
Trade-off: Synonym injection improves recall but can reduce precision — the expanded terms may match irrelevant documents. Use sparingly and combine with reranking to filter noise. Entity expansion is nearly always beneficial and carries low risk.
Source: Carpineto & Romano, “A Survey of Automatic Query Expansion in Information Retrieval,” ACM Computing Surveys, 2012, https://doi.org/10.1145/2071389.2071390
5.3 Query Fusion / RAG-Fusion
RAG-Fusion generates multiple variants of the user’s query, retrieves documents for each variant independently, then combines results using Reciprocal Rank Fusion (RRF). This dramatically improves recall by approaching the knowledge base from multiple angles.
import hashlib
from collections import defaultdict
def generate_query_variants(original_query: str, n_variants: int = 4) -> list[str]:
"""Use an LLM to generate diverse reformulations of the query."""
prompt = f"""Generate {n_variants} different versions of the following question.
Each version should approach the topic from a different angle or use different
phrasing, while preserving the original intent. Return one question per line.
Original question: {original_query}
Variants:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 300,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response["body"].read())["content"][0]["text"]
variants = [line.strip().lstrip("0123456789.-) ") for line in result.strip().split("\n") if line.strip()]
return [original_query] + variants[:n_variants]
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
"""Combine multiple ranked result lists using Reciprocal Rank Fusion.
Args:
ranked_lists: List of ranked document ID lists (one per query variant).
k: RRF constant (default 60, as in the original paper).
Returns:
Fused ranked list of document IDs.
"""
scores = defaultdict(float)
for ranked_list in ranked_lists:
for rank, doc_id in enumerate(ranked_list, start=1):
scores[doc_id] += 1.0 / (k + rank)
return sorted(scores, key=scores.get, reverse=True)
def rag_fusion_retrieve(query: str, retriever, n_variants: int = 4, top_k: int = 10) -> list[str]:
"""Full RAG-Fusion pipeline: generate variants, retrieve for each, fuse results."""
variants = generate_query_variants(query, n_variants)
# Retrieve in parallel for each variant
ranked_lists = []
for variant in variants:
results = retriever.search(variant, top_k=top_k)
ranked_lists.append([doc.id for doc in results])
# Fuse using RRF
fused_ids = reciprocal_rank_fusion(ranked_lists)
return fused_ids[:top_k]
Why RRF over simple score aggregation: Different query variants may use different retrieval paths (some hit dense search hard, others match keyword patterns). Raw scores aren’t comparable across these paths. RRF uses only rank positions, making it score-agnostic and robust.
Typical improvement: RAG-Fusion improves Recall@10 by 5–15% compared to single-query retrieval, at the cost of N× retrieval latency (mitigated by parallel execution).
Source: Raudaschl, “RAG-Fusion: a New Take on Retrieval-Augmented Generation,” 2023, https://arxiv.org/abs/2402.03367
Source: Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods,” SIGIR 2009, https://dl.acm.org/doi/10.1145/1571941.1572114
5.4 HyDE — Hypothetical Document Embeddings
Instead of embedding the query directly, ask the LLM to generate a hypothetical answer, then embed that answer for retrieval.
def hyde_retrieve(query: str, retriever, embed_fn) -> list:
"""Generate a hypothetical document, embed it, and retrieve similar real documents."""
hyde_prompt = f"""Write a short, factual passage (3-5 sentences) that would answer
this question as if it appeared in official technical documentation.
Do not hedge or add caveats — write as if stating established facts.
Question: {query}
Passage:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"messages": [{"role": "user", "content": hyde_prompt}]
})
)
hypothetical_doc = json.loads(response["body"].read())["content"][0]["text"]
# Embed the hypothetical document (not the original query)
hyde_embedding = embed_fn(hypothetical_doc)
# Retrieve using the HyDE embedding
return retriever.search_by_vector(hyde_embedding, top_k=10)
Why it works: The hypothetical answer occupies the same embedding space as your documents — it’s declarative, technical, and detailed — while user queries are short and interrogative. The embedding similarity between the hypothetical answer and real documents is often higher than between the raw query and the documents.
Caveat: HyDE adds one LLM call (~200–400ms with Haiku). It works best for technical and factual queries. For simple keyword lookups (“S3 pricing table”), it can actually hurt by introducing noise from the hypothetical generation. Use it selectively based on query classification (see Section 5.5).
Source: Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels,” 2023, https://arxiv.org/abs/2212.10496
5.5 Intent Classification & Query Routing
Not all queries should follow the same enhancement path. A routing layer classifies the query and directs it to the appropriate strategy.
Intent Classification Taxonomy
| Intent Category | Example Queries | Recommended Pipeline |
|---|---|---|
| Factoid | “What is the max item size in DynamoDB?” | Direct retrieval, no enhancement needed |
| How-to / Procedural | “How do I enable versioning on S3?” | Query rewriting + step-back prompting |
| Comparison | “DynamoDB vs Aurora for write-heavy workloads” | Query decomposition + parallel retrieval |
| Troubleshooting | “Lambda function timing out on large payloads” | Synonym expansion + HyDE |
| Conceptual / Explanatory | “Explain eventual consistency in DynamoDB” | HyDE + step-back prompting |
| Multi-hop / Analytical | “Total cost of a 3-node OpenSearch cluster for RAG” | Full decomposition + multi-hop retrieval |
| Conversational / Follow-up | “What about the pricing?” (after discussing S3) | Context condensation (see 5.7) |
| Out-of-scope | “What’s the weather today?” | Skip retrieval, respond directly or decline |
Implementation
from enum import Enum
class QueryIntent(Enum):
FACTOID = "factoid"
HOWTO = "howto"
COMPARISON = "comparison"
TROUBLESHOOTING = "troubleshooting"
CONCEPTUAL = "conceptual"
ANALYTICAL = "analytical"
CONVERSATIONAL = "conversational"
OUT_OF_SCOPE = "out_of_scope"
def classify_intent(query: str, conversation_history: list = None) -> QueryIntent:
"""Classify query intent to route to the appropriate enhancement pipeline."""
history_context = ""
if conversation_history:
history_context = f"\nConversation history: {conversation_history[-3:]}"
prompt = f"""Classify this query into exactly one category.
Categories: factoid, howto, comparison, troubleshooting, conceptual, analytical, conversational, out_of_scope
{history_context}
Query: {query}
Category:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 20,
"messages": [{"role": "user", "content": prompt}]
})
)
    category = json.loads(response["body"].read())["content"][0]["text"].strip().lower()
    try:
        return QueryIntent(category)
    except ValueError:
        # Fall back to a safe default if the model returns an unexpected label
        return QueryIntent.FACTOID
def route_query(query: str, intent: QueryIntent, retriever, embed_fn) -> list:
"""Route query to appropriate enhancement + retrieval pipeline based on intent."""
match intent:
case QueryIntent.FACTOID:
return retriever.search(query, top_k=5)
case QueryIntent.HOWTO:
rewritten = rewrite_query(query)
return retriever.search(rewritten, top_k=7)
case QueryIntent.COMPARISON:
# Decompose into sub-queries, retrieve for each
sub_queries = decompose_query(query)
all_results = []
for sq in sub_queries:
all_results.extend(retriever.search(sq, top_k=5))
return deduplicate_and_rank(all_results)
case QueryIntent.TROUBLESHOOTING:
expanded = expand_with_synonyms(query, SYNONYM_MAP)
return hyde_retrieve(expanded, retriever, embed_fn)
case QueryIntent.CONCEPTUAL:
return hyde_retrieve(query, retriever, embed_fn)
case QueryIntent.ANALYTICAL:
return rag_fusion_retrieve(query, retriever, n_variants=4)
        case QueryIntent.OUT_OF_SCOPE:
            return []  # Skip retrieval
        case _:
            # CONVERSATIONAL and any unmapped intents: condense upstream (see 5.7), then plain retrieval
            return retriever.search(query, top_k=5)
On AWS: For routing, a common approach is a Lambda-based classifier as a preprocessing step — lightweight, deterministic, and easy to debug. Bedrock Agents can also handle routing by defining multiple Knowledge Bases as tools, though in practice many teams prefer explicit orchestration for better control and observability (see Section 5.9).
Source: Jeong et al., “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity,” 2024, https://arxiv.org/abs/2403.14403
5.6 Query Decomposition
Complex questions need to be broken into sub-questions. A single retrieval pass cannot surface all the information needed for: “Compare the pricing and performance of DynamoDB and Aurora for a write-heavy workload with 10,000 TPS.”
def decompose_query(query: str, max_sub_queries: int = 5) -> list[str]:
"""Break a complex query into independently answerable sub-questions."""
prompt = f"""Break this complex question into 2-{max_sub_queries} simpler
sub-questions that can each be answered independently by searching a knowledge base.
Each sub-question should be self-contained (no references to other sub-questions).
Return one sub-question per line.
Complex question: {query}
Sub-questions:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 300,
"messages": [{"role": "user", "content": prompt}]
})
)
result = json.loads(response["body"].read())["content"][0]["text"]
return [line.strip().lstrip("0123456789.-) ") for line in result.strip().split("\n") if line.strip()]
# Example
decompose_query("Compare the pricing and performance of DynamoDB and Aurora for write-heavy workloads")
# → ["What is the pricing model for Amazon DynamoDB?",
# "What is the pricing model for Amazon Aurora?",
# "What are the write performance characteristics of DynamoDB?",
# "What are the write performance characteristics of Aurora?",
# "What are best practices for write-heavy workloads on AWS databases?"]
Retrieve for each sub-question independently, then feed all retrieved chunks to the LLM with the original complex question.
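A compact sketch of that flow, reusing decompose_query from above; retriever.search, the .text attribute on its results, and generate_fn (your LLM call wrapper) are placeholders:
def answer_complex_question(query: str, retriever, generate_fn, top_k: int = 5) -> str:
    """Decompose, retrieve per sub-question, then answer the original question in one pass."""
    sub_queries = decompose_query(query)
    chunks = []
    for sq in sub_queries:
        chunks.extend(retriever.search(sq, top_k=top_k))
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate_fn(prompt)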
Source: Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models,” 2023, https://arxiv.org/abs/2210.03350
5.7 Conversation Context Management for Multi-Turn RAG
In conversational RAG, the current query often depends on conversation history. Effective multi-turn handling requires more than simple context concatenation.
Context Condensation
Resolve pronouns and implicit references to produce a standalone query:
def condense_with_history(
conversation_history: list[dict],
current_query: str,
max_history_turns: int = 5
) -> str:
"""Rewrite a follow-up query as a standalone question using conversation history."""
# Trim to recent history to manage context and cost
recent_history = conversation_history[-max_history_turns:]
history_text = "\n".join(
f"{'User' if turn['role'] == 'user' else 'Assistant'}: {turn['content']}"
for turn in recent_history
)
prompt = f"""Given this conversation history and follow-up question, rewrite
the follow-up as a standalone question that can be understood without the conversation.
Preserve all specifics — do not generalize or lose detail.
Conversation history:
{history_text}
Follow-up question: {current_query}
Standalone question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 150,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
# Example
history = [
{"role": "user", "content": "Tell me about S3 storage classes."},
{"role": "assistant", "content": "S3 offers several storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant, Glacier Flexible, and Glacier Deep Archive."},
]
condense_with_history(history, "Which one is cheapest for infrequent access?")
# → "Which Amazon S3 storage class is cheapest for infrequently accessed data?"
Topic Drift Detection
In long conversations, the topic may shift. Detect when the user changes subject so you don’t carry stale context:
def detect_topic_shift(conversation_history: list[dict], current_query: str) -> bool:
"""Detect whether the current query represents a topic shift from the conversation."""
if len(conversation_history) < 2:
return False
recent_context = " ".join(turn["content"] for turn in conversation_history[-4:])
prompt = f"""Is the following new question a continuation of the prior conversation,
or a shift to a new topic? Answer only "continuation" or "new_topic".
Recent conversation context: {recent_context}
New question: {current_query}
Answer:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{"role": "user", "content": prompt}]
})
)
answer = json.loads(response["body"].read())["content"][0]["text"].strip().lower()
return "new_topic" in answer
If a topic shift is detected, skip context condensation and treat the query as standalone. This prevents contaminating retrieval with irrelevant conversational context.
On AWS: Bedrock Knowledge Bases’ RetrieveAndGenerate API supports session management natively via sessionId. The service handles context carryover for multi-turn conversations. For custom pipelines, implement condensation in a Lambda function upstream of retrieval.
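A minimal sketch of the managed multi-turn path, reusing the bedrock_agent_runtime client from earlier and assuming an existing Knowledge Base ID and model ARN; the sessionId returned by the first call is passed back on the follow-up so the service resolves references like "which one":
config = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "YOUR_KB_ID",
        "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
    },
}
# First turn: no sessionId yet
first = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Tell me about S3 storage classes."},
    retrieveAndGenerateConfiguration=config,
)
session_id = first["sessionId"]
# Follow-up turn: reuse the sessionId so the service carries the conversational context
follow_up = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Which one is cheapest for infrequent access?"},
    retrieveAndGenerateConfiguration=config,
    sessionId=session_id,
)
print(follow_up["output"]["text"])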
Source: AWS, “Amazon Bedrock RetrieveAndGenerate API,” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html
5.8 Query Caching Strategies (Semantic Cache)
Not every query needs a full retrieval pass. Semantic caching stores previous query–result pairs and returns cached results when a new query is semantically similar to a previous one.
How Semantic Caching Works
New query → Embed → Search cache (cosine similarity)
→ If similarity ≥ threshold: return cached result (cache hit)
→ If similarity < threshold: run the full retrieval pipeline (cache miss) and store the result in the cache
Implementation with ElastiCache + Embedding Similarity
import hashlib
import numpy as np
import redis
import json as json_module
class SemanticCache:
"""Semantic cache using Redis for storage and cosine similarity for matching."""
def __init__(self, redis_client: redis.Redis, embed_fn, similarity_threshold: float = 0.95):
self.redis = redis_client
self.embed_fn = embed_fn
self.threshold = similarity_threshold
self.ttl_seconds = 3600 # 1 hour default TTL
def get(self, query: str) -> dict | None:
"""Check cache for a semantically similar query."""
query_embedding = self.embed_fn(query)
# Scan cached embeddings (for production, use a vector index in Redis)
cached_keys = self.redis.keys("qcache:*")
best_match = None
best_similarity = 0.0
for key in cached_keys:
cached = json_module.loads(self.redis.get(key))
cached_embedding = np.array(cached["embedding"])
similarity = np.dot(query_embedding, cached_embedding) / (
np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
)
if similarity > best_similarity:
best_similarity = similarity
best_match = cached
if best_similarity >= self.threshold and best_match:
return best_match["result"]
return None
def put(self, query: str, result: dict) -> None:
"""Store a query-result pair in the cache."""
embedding = self.embed_fn(query)
cache_entry = {
"query": query,
"embedding": embedding.tolist(),
"result": result,
}
cache_key = f"qcache:{hash(query)}"
self.redis.setex(cache_key, self.ttl_seconds, json_module.dumps(cache_entry))
Production considerations:
- Threshold tuning: 0.95+ for high-precision caching (only near-identical queries). 0.90 for broader matching (higher hit rate, risk of stale results).
- TTL management: Set TTL based on how frequently your underlying documents change. For static knowledge bases, longer TTLs (hours/days); for dynamic content, shorter (minutes).
- Cache invalidation: Invalidate cache entries when the underlying knowledge base is updated. Tag cache entries with the data source version.
- Redis with vector search: For production scale, use Redis with the RediSearch module (or Amazon MemoryDB with vector search) instead of brute-force scanning.
Cost impact: At 1,000 queries/day with a 30% cache hit rate, you save ~300 embedding calls, ~300 retrieval operations, and (if caching final responses) ~300 LLM generation calls per day.
Source: Zhu et al., “GPTCache: An Open-Source Semantic Cache for LLM Applications,” 2023, https://arxiv.org/abs/2311.09820
Source: AWS, “Amazon MemoryDB for Redis,” https://aws.amazon.com/memorydb/
5.9 AWS Implementation: Query Enhancement Pipeline
Option A: Agent-Based Orchestration
An LLM-powered agent can handle query understanding by treating retrieval as a tool. The agent reasons about the query, decides whether to decompose it, routes sub-questions to different knowledge sources, and synthesizes results. This can be implemented with frameworks like LangGraph, CrewAI, or Strands Agents SDK, or with managed services like Amazon Bedrock Agents.
# Conceptual agent-based query pipeline
# The orchestrating agent:
# 1. Analyzes the query complexity
# 2. Decomposes into sub-questions if needed
# 3. Routes each sub-question to the relevant knowledge source
# 4. Retrieves and synthesizes a final answer
Pros: Flexible, handles novel query patterns without explicit rules. Cons: Less predictable than explicit pipelines; harder to debug when the agent makes suboptimal routing decisions; higher latency due to reasoning steps.
Option B: Lambda-Based Query Pipeline (Full Control)
For fine-grained control over every enhancement step, build a custom pipeline using Lambda:
# Lambda function: query_enhancer
# Triggered by API Gateway or Step Functions
import json
import boto3
bedrock_runtime = boto3.client("bedrock-runtime")
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
def lambda_handler(event, context):
query = event["query"]
conversation_history = event.get("history", [])
# Step 1: Context condensation (if multi-turn)
if conversation_history:
query = condense_with_history(conversation_history, query)
# Step 2: Intent classification
intent = classify_intent(query, conversation_history)
# Step 3: Check semantic cache
cached = semantic_cache.get(query)
if cached:
return {"statusCode": 200, "body": json.dumps(cached), "cache_hit": True}
# Step 4: Apply enhancement based on intent
match intent:
case QueryIntent.FACTOID:
enhanced_queries = [query]
case QueryIntent.COMPARISON | QueryIntent.ANALYTICAL:
enhanced_queries = decompose_query(query)
case QueryIntent.TROUBLESHOOTING:
enhanced_queries = [expand_with_synonyms(query, SYNONYM_MAP)]
case QueryIntent.CONCEPTUAL:
# Use HyDE — return the hypothetical doc for embedding
enhanced_queries = [generate_hypothetical_doc(query)]
case _:
enhanced_queries = [rewrite_query(query)]
# Step 5: Retrieve from Bedrock Knowledge Base for each enhanced query
all_results = []
for eq in enhanced_queries:
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId="KB_ID",
retrievalQuery={"text": eq},
retrievalConfiguration={
"vectorSearchConfiguration": {
"numberOfResults": 10,
"overrideSearchType": "HYBRID"
}
}
)
all_results.extend(response["retrievalResults"])
# Step 6: Deduplicate and rank
unique_results = deduplicate_by_content(all_results)
# Step 7: Cache and return
result = {"enhanced_queries": enhanced_queries, "results": unique_results}
semantic_cache.put(query, result)
return {"statusCode": 200, "body": json.dumps(result)}
Architecture with Step Functions: For complex orchestration (parallel retrieval, conditional branching), use AWS Step Functions to coordinate multiple Lambda functions:
API Gateway → Step Functions
→ Lambda: Condense Context (if multi-turn)
→ Lambda: Classify Intent
→ Choice State (branch by intent)
→ Parallel: Retrieve for each sub-query
→ Lambda: Fuse & Deduplicate Results
→ Lambda: Rerank
→ Lambda: Generate Response
Source: AWS, “Amazon Bedrock Agents,” https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html
Source: AWS, “Retrieve API — Amazon Bedrock,” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_Retrieve.html
5.10 Step-Back Prompting
For highly specific questions, “step back” to a broader question first:
Specific: “What’s the maximum item size in DynamoDB?”
Step-back: “What are the limits and quotas for DynamoDB?”
Retrieving against the broader question often surfaces a comprehensive limits document that contains the specific answer, whereas the specific query might miss it if the exact phrasing doesn’t match.
def step_back_query(specific_query: str) -> str:
"""Generate a broader 'step-back' version of a specific question."""
prompt = f"""Given this specific question, generate a broader question that would
retrieve a comprehensive document containing the answer. The broader question should
cover the general topic area of the specific question.
Specific question: {specific_query}
Broader question:"""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 100,
"messages": [{"role": "user", "content": prompt}]
})
)
return json.loads(response["body"].read())["content"][0]["text"].strip()
Best combined with the original query: Retrieve for both the specific query and the step-back query, then merge results. This captures both precise matches and comprehensive overview documents.
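A small sketch of that combination, reusing step_back_query above and the reciprocal_rank_fusion helper from Section 5.3; retriever.search is a placeholder for your retrieval client:
def retrieve_with_step_back(query: str, retriever, top_k: int = 10) -> list[str]:
    """Retrieve for both the specific and the step-back query, then fuse the two rankings."""
    broad_query = step_back_query(query)
    specific_ids = [doc.id for doc in retriever.search(query, top_k=top_k)]
    broad_ids = [doc.id for doc in retriever.search(broad_query, top_k=top_k)]
    return reciprocal_rank_fusion([specific_ids, broad_ids])[:top_k]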
Source: Zheng et al., “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models,” 2023, https://arxiv.org/abs/2310.06117
5.11 Technique Comparison: When to Use Which
| Technique | Latency Added | Quality Lift | Best For | Avoid When |
|---|---|---|---|---|
| Query Rewriting | ~200ms | +10–20% hit rate | Vocabulary mismatch between users and docs | Users already use precise terminology |
| Synonym Expansion | <10ms | +5–10% recall | Keyword-heavy hybrid search | Dense-only retrieval (adds noise) |
| Entity Expansion | <10ms | +5–15% recall | Acronym-heavy domains (AWS, medical) | Documents already use acronyms exclusively |
| RAG-Fusion | N × retrieval latency | +5–15% recall | Complex queries, diverse document corpus | Simple factoid queries (overkill) |
| HyDE | ~300ms | +10–25% for conceptual Qs | Technical/conceptual queries | Simple keyword lookups, time-sensitive queries |
| Query Decomposition | ~200ms + N × retrieval | +20–30% for complex Qs | Multi-part comparisons, analytical queries | Simple single-fact questions |
| Step-Back Prompting | ~200ms | +10–15% for specific Qs | Highly specific questions against broad docs | Already broad or vague queries |
| Context Condensation | ~200ms | Required for multi-turn | Any multi-turn conversation | Single-turn interactions |
| Semantic Cache | ~5ms (hit) | No quality change | High query repetition (FAQ-style) | Diverse, unique queries; rapidly changing docs |
| Intent Classification | ~150ms | Enables all above | Systems using 2+ techniques | Single-technique pipelines |
The recommended starting point for production:
- Intent classification (always — it’s the router)
- Query rewriting (high ROI, low cost)
- Context condensation (if multi-turn)
- Semantic cache (if >20% query repetition)
Add decomposition, HyDE, and RAG-Fusion incrementally based on evaluation data showing specific failure modes.
5.12 Evaluating Query Enhancement
How do you know your query enhancement is actually helping? Measure retrieval quality with and without each technique.
Key Metrics
| Metric | What It Measures | How to Compute |
|---|---|---|
| Δ Recall@K | Change in recall after enhancement | recall_enhanced - recall_baseline |
| Δ Hit Rate | Change in hit rate after enhancement | hit_rate_enhanced - hit_rate_baseline |
| Δ NDCG@K | Change in ranking quality | ndcg_enhanced - ndcg_baseline |
| Query latency overhead | Added latency from enhancement | p95_enhanced - p95_baseline |
| Cache hit rate | Fraction of queries served from cache | cache_hits / total_queries |
| Enhancement coverage | Fraction of queries that get enhanced | enhanced_queries / total_queries |
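As a sketch of how the first row can be computed over a labeled evaluation set (each item pairs a query with the IDs of its known-relevant chunks; retrieve_baseline and retrieve_enhanced are whatever two pipelines you are comparing, each returning ranked chunk IDs):
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def delta_recall(eval_set: list[dict], retrieve_baseline, retrieve_enhanced, k: int = 10) -> float:
    """Average Recall@K improvement of the enhanced pipeline over the baseline."""
    deltas = []
    for item in eval_set:
        relevant = set(item["relevant_ids"])
        baseline = recall_at_k(retrieve_baseline(item["query"]), relevant, k)
        enhanced = recall_at_k(retrieve_enhanced(item["query"]), relevant, k)
        deltas.append(enhanced - baseline)
    return sum(deltas) / len(deltas)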
A/B Testing Framework
import random
import time
def query_pipeline_ab_test(query: str, retriever, config: dict) -> dict:
"""Route queries to baseline or enhanced pipeline for A/B testing."""
# Random assignment with configurable traffic split
    use_enhanced = random.random() < config.get("enhanced_traffic_pct", 0.5)
    start = time.perf_counter()
    if use_enhanced:
        enhanced_query = rewrite_query(query)
        results = retriever.search(enhanced_query, top_k=10)
        pipeline = "enhanced"
    else:
        results = retriever.search(query, top_k=10)
        pipeline = "baseline"
    elapsed_ms = (time.perf_counter() - start) * 1000  # defined here so the log call below works
# Log for analysis
log_retrieval_event(
original_query=query,
pipeline=pipeline,
result_ids=[r.id for r in results],
latency_ms=elapsed_ms,
)
return {"results": results, "pipeline": pipeline}
Practical guidance: Run A/B tests for at least 500 queries per arm before drawing conclusions. Track both retrieval metrics (recall, NDCG) and downstream generation metrics (faithfulness, user satisfaction) — an enhancement that improves retrieval may not always improve the final answer if the LLM was already generating well from the baseline context.
Source: RAGAS Documentation, “Metrics,” https://docs.ragas.io/en/latest/concepts/metrics/
References (Section 5)
- Ma, X. et al. (2023). “Query Rewriting in Retrieval-Augmented Large Language Models.” https://arxiv.org/abs/2305.14283
- Carpineto, C. & Romano, G. (2012). “A Survey of Automatic Query Expansion in Information Retrieval.” ACM Computing Surveys. https://doi.org/10.1145/2071389.2071390
- Raudaschl, A. (2023). “RAG-Fusion: a New Take on Retrieval-Augmented Generation.” https://arxiv.org/abs/2402.03367
- Cormack, G. et al. (2009). “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods.” SIGIR 2009. https://dl.acm.org/doi/10.1145/1571941.1572114
- Gao, L. et al. (2023). “Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).” https://arxiv.org/abs/2212.10496
- Jeong, S. et al. (2024). “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” https://arxiv.org/abs/2403.14403
- Press, O. et al. (2023). “Measuring and Narrowing the Compositionality Gap in Language Models.” https://arxiv.org/abs/2210.03350
- Zheng, Z. et al. (2023). “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.” https://arxiv.org/abs/2310.06117
- Zhu, Z. et al. (2023). “GPTCache: An Open-Source Semantic Cache for LLM Applications.” https://arxiv.org/abs/2311.09820
- AWS Documentation. “Amazon Bedrock Agents.” https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html
- AWS Documentation. “Retrieve API — Amazon Bedrock.” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_Retrieve.html
- AWS Documentation. “RetrieveAndGenerate API.” https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html
- AWS Documentation. “Amazon MemoryDB for Redis.” https://aws.amazon.com/memorydb/
- RAGAS Documentation. “Metrics.” https://docs.ragas.io/en/latest/concepts/metrics/
6. Retrieval Strategies
Retrieval is the core of RAG — get this wrong, and even the best generation model can’t produce a good answer. This section covers the major retrieval paradigms, how to combine them, and the optimization techniques that separate production systems from prototypes.
Dense vs. Sparse vs. Hybrid
Dense retrieval (semantic/vector search) embeds queries and documents into vector space and retrieves by cosine similarity. The embedding captures semantic meaning, so “automobile” matches “car” and “EC2 instance types for machine learning” matches “compute-optimized instances for ML workloads.” Dense retrieval excels at paraphrased questions, concept-level queries, and natural language that doesn’t exactly match the terminology in your corpus.
Sparse retrieval (keyword/BM25) uses term-frequency-based matching. BM25 scores documents based on how often query terms appear, adjusted for document length and term rarity. Sparse retrieval excels at exact matches — product codes (NR-502), error messages (AccessDeniedException), proper nouns, and technical identifiers. It also handles rare domain terms that embedding models may not have seen during training.
The failure modes are complementary. Dense retrieval fails on exact-match queries (an embedding for “p3.16xlarge” may not be close to a chunk containing that specific instance type). Sparse retrieval fails on semantic queries (searching “reduce costs” won’t match a document about “cost optimization strategies” unless those exact words appear).
Hybrid search combines both approaches, and in nearly every benchmark and production deployment, it outperforms either alone. The combination is not just additive — it’s multiplicative, because each method covers the other’s blind spots.
Score fusion strategies:
# Linear combination (simplest, most common)
final_score = alpha * dense_score + (1 - alpha) * sparse_score
# alpha = 0.5-0.7 works for most use cases; tune on your eval set
# Reciprocal Rank Fusion (RRF) — rank-based, score-agnostic
# Works better when dense and sparse scores are on different scales
def reciprocal_rank_fusion(dense_ranks, sparse_ranks, k=60):
"""Combine rankings using RRF. k is a smoothing constant."""
fused_scores = {}
for doc_id, rank in dense_ranks.items():
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank)
for doc_id, rank in sparse_ranks.items():
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (k + rank)
return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
RRF is generally more robust than linear combination because it doesn’t require score normalization. OpenSearch uses a variant of this in its native hybrid search implementation.
On AWS: OpenSearch supports hybrid search natively via the hybrid query type, combining BM25 match queries with knn vector queries. Bedrock Knowledge Bases supports configurable hybrid retrieval: set overrideSearchType to HYBRID in the vector search configuration when calling Retrieve or RetrieveAndGenerate. The default behavior automatically balances dense and sparse scoring.
import boto3
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId='YOUR_KB_ID',
retrievalQuery={'text': 'How to configure S3 lifecycle policies?'},
retrievalConfiguration={
'vectorSearchConfiguration': {
'numberOfResults': 10,
'overrideSearchType': 'HYBRID' # Enable hybrid search
}
}
)
Source: Ma et al., “A Unified Full-Pipeline Approach to Dense and Sparse Retrieval,” 2024, https://arxiv.org/abs/2401.04055
Source: Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods,” 2009, https://dl.acm.org/doi/10.1145/1571941.1572114
Metadata Filtering
Metadata filtering narrows the search space before semantic matching, making retrieval both faster and more precise. This is critically underutilized — most teams embed metadata in chunk text and hope the embedding captures it. Explicit structured filtering is deterministic, faster, and more reliable.
Effective metadata fields to index:
- Document type: policy, FAQ, API reference, tutorial, changelog
- Date: creation date, last updated, effective date
- Access level: public, internal, confidential
- Product/service: the specific product the document covers
- Language: for multilingual corpora
- Source: which system the document came from (Confluence, SharePoint, S3)
- Version: document version for versioned content
# Bedrock KB — retrieve only from pricing documents updated after 2024
response = bedrock_agent_runtime.retrieve(
knowledgeBaseId='YOUR_KB_ID',
retrievalQuery={'text': 'What are the current EC2 pricing tiers?'},
retrievalConfiguration={
'vectorSearchConfiguration': {
'numberOfResults': 10,
'filter': {
'andAll': [
{'equals': {'key': 'doc_type', 'value': 'pricing'}},
{'greaterThan': {'key': 'updated_year', 'value': 2024}}
]
}
}
}
)
Advanced pattern: dynamic filtering. Use the query understanding layer (Section 5) to automatically extract filter conditions from the user’s query. “What changed in the S3 pricing in January 2026?” → extract {doc_type: "pricing", service: "S3", date_range: "2026-01"} and apply as metadata filters before vector search. This dramatically improves precision for structured queries.
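As a sketch of how this extraction step might look, reusing the bedrock_runtime client from earlier (the metadata keys doc_type, service, and date_range are illustrative and must match fields you actually indexed); the returned dictionary can then be translated into the equals/greaterThan operators shown in the filter example above:
def extract_metadata_filters(query: str) -> dict:
    """Ask a fast model to pull structured filter conditions out of a natural-language query."""
    prompt = f"""Extract any filter conditions from this query as JSON with the keys
doc_type, service, and date_range. Use null for anything not mentioned. Return only JSON.
Query: {query}
JSON:"""
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": prompt}]
        })
    )
    raw = json.loads(response["body"].read())["content"][0]["text"]
    try:
        filters = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # fall back to unfiltered retrieval if the model output is not valid JSON
    return {k: v for k, v in filters.items() if v}  # keep only explicit conditions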
Source: AWS, “Metadata and filtering for Amazon Bedrock Knowledge Bases,” https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html
Reranking — The Highest-ROI Optimization
If you implement only one optimization beyond basic RAG, make it reranking. Initial retrieval (dense, sparse, or hybrid) returns a candidate set of 20-50 chunks. A reranker re-scores these candidates using a more powerful model that sees the query and each candidate together.
Why reranking works so well: Bi-encoders (used in initial retrieval) embed query and document independently — they produce separate vectors and compare them by distance. This is fast (you can pre-compute document embeddings) but misses fine-grained query-document interactions. Cross-encoders see the query and document as a single concatenated input and can capture token-level interactions: negation, specificity, conditional relevance. The result is dramatically better relevance judgments.
The trade-off is compute: a cross-encoder is ~100x slower per comparison than a bi-encoder lookup. That’s why we use a two-stage pipeline — fast, approximate retrieval first (retrieve 50 candidates from millions), then precise reranking on the small candidate set.
| Reranker | Type | Availability | Latency (20 docs) | Quality |
|---|---|---|---|---|
| Cohere Rerank v3 | Cross-encoder | Bedrock native | ~80ms | Excellent — top-tier on BEIR |
| Amazon Rerank 1.0 | Cross-encoder | Bedrock native | ~60ms | Strong, optimized for Bedrock pipeline |
| BGE Reranker v2.5 | Cross-encoder | Self-host (SageMaker) | ~100ms | Near Cohere quality, open weights |
| ColBERT v2 | Late-interaction | Self-host | ~40ms | Fast, good for latency-sensitive apps |
| FlashRank | Cross-encoder (small) | Self-host | ~20ms | Lighter, suitable for edge/low-resource |
Late-interaction models (ColBERT) are a middle ground: they pre-compute per-token document embeddings (like bi-encoders) but perform token-level matching at query time (like cross-encoders). This gives most of the quality benefit of cross-encoders with significantly lower latency. ColBERT is worth evaluating if your latency budget is tight.
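For the self-hosted options above, a minimal two-stage sketch using the open-source sentence-transformers CrossEncoder (the model name is one common public checkpoint, not a recommendation; retriever and its result objects are placeholders):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, retriever, candidates: int = 50, top_k: int = 5) -> list:
    """Stage 1: fast approximate retrieval. Stage 2: precise cross-encoder rescoring."""
    docs = retriever.search(query, top_k=candidates)
    # The cross-encoder scores each (query, document) pair jointly, capturing token-level interactions
    scores = reranker.predict([(query, doc.text) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]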
Implementation with Bedrock:
# Retrieve-and-generate with a larger candidate pool and query decomposition;
# a reranking model can also be attached via the retrieval configuration (see the rerank docs cited below)
response = bedrock_agent_runtime.retrieve_and_generate(
input={'text': 'How do I set up cross-region replication?'},
retrieveAndGenerateConfiguration={
'type': 'KNOWLEDGE_BASE',
'knowledgeBaseConfiguration': {
'knowledgeBaseId': 'YOUR_KB_ID',
'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0',
'retrievalConfiguration': {
'vectorSearchConfiguration': {
'numberOfResults': 25, # Retrieve more candidates
'overrideSearchType': 'HYBRID'
}
},
'orchestrationConfiguration': {
'queryTransformationConfiguration': {
'type': 'QUERY_DECOMPOSITION'
}
}
}
}
)
Quantitative impact: In production systems, adding reranking to a basic retrieve→generate pipeline typically improves answer quality by 15-25% (measured by faithfulness and answer relevance). Latency impact is 50-100ms — negligible for most applications. Cost is minimal (Cohere Rerank on Bedrock is $1 per 1,000 search units). The ROI is almost always positive.
Source: AWS, “Rerank for more relevant RAG responses,” https://docs.aws.amazon.com/bedrock/latest/userguide/rerank.html
Source: Thakur et al., “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models,” 2021, https://arxiv.org/abs/2104.08663
Top-K Selection
How many chunks should you feed to the LLM? This decision involves more nuance than most guides suggest.
Factors that determine optimal K:
- Chunk size: Smaller chunks → higher K (10-15). Larger chunks → lower K (3-5). The goal is to provide 2,000-5,000 tokens of context for most queries.
- Query complexity: Simple factoid (“What is the max size of an S3 object?”) → 3-5 chunks. Complex analytical (“Compare the security models of DynamoDB and Aurora”) → 10-20 chunks.
- Model context window: Don’t exceed 30-40% of the context window with retrieved context. Leave room for the system prompt, conversation history, and generation. For a 200K context model, this is generous; for an 8K model, every token counts.
- Diminishing returns: Research consistently shows that adding chunks beyond position 5-7 has rapidly diminishing returns — and can actually hurt performance if the additional chunks introduce noise or contradictions.
The “lost in the middle” effect: Liu et al. (2023) demonstrated that LLMs are significantly better at using information at the beginning and end of their context window, and tend to miss information placed in the middle. This has practical implications for RAG: if you retrieve 10 chunks, the model may effectively ignore chunks 4-7. Mitigation strategies include (1) keeping K small, (2) reranking to ensure the most relevant chunks are first, and (3) placing the most important context at the beginning and end of the prompt.
Dynamic K: Rather than fixing K, adjust it based on retrieval confidence. If the top 3 chunks all have high reranker scores (>0.8), use K=3. If scores drop gradually, increase K until scores fall below a relevance threshold. This avoids both the “too little context” and “too much noise” failure modes.
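A small sketch of dynamic K, assuming reranked chunks arrive sorted by a score attribute; the thresholds are illustrative and should be tuned on your evaluation set:
def select_dynamic_k(reranked_chunks: list, min_k: int = 3, max_k: int = 10,
                     score_threshold: float = 0.5) -> list:
    """Keep adding chunks until the reranker score falls below the relevance threshold."""
    selected = list(reranked_chunks[:min_k])  # always provide a minimal amount of context
    for chunk in reranked_chunks[min_k:max_k]:
        if chunk.score < score_threshold:
            break
        selected.append(chunk)
    return selected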
Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
7. Generation & Post-Processing
The generation phase is where retrieved context becomes a user-facing answer. Getting this right requires careful prompt engineering, intelligent context management, robust citation, and safeguards against hallucination. These details separate a RAG demo from a production system.
Prompt Engineering for RAG
RAG prompts have a fundamentally different structure from general-purpose prompts. The model must ground its response in provided context while maintaining natural language fluency — and critically, it must know when the context is insufficient.
A production RAG system prompt template:
You are a {domain} assistant. Your task is to answer the user's question
based on the provided reference documents.
## Instructions
1. Answer ONLY using information found in the provided context below.
2. If the context does not contain enough information to fully answer
the question, explicitly state what information is missing rather
than guessing.
3. Cite specific sources using [Source N] notation for each claim.
4. If sources contain conflicting information, acknowledge the
discrepancy and present both perspectives.
5. Use direct quotes sparingly — paraphrase while preserving accuracy.
6. Structure your response with clear headings for complex answers.
## Context Documents
{retrieved_chunks_with_numbered_sources}
## User Question
{user_query}
Key design decisions:
- “Don’t know” instruction is critical. Without explicit instruction to acknowledge uncertainty, models hallucinate confidently when context is insufficient. In regulated industries (healthcare, finance, legal), a confident wrong answer is far worse than “I don’t have enough information.”
- Instruction placement matters. Research shows that placing instructions before the context (not after) leads to better instruction following. The model processes the context through the lens of the instructions.
- Context ordering affects quality. Place the most relevant chunks first and last (exploiting the primacy and recency effects discussed in Section 6’s “lost in the middle” finding). If your reranker provides confidence scores, sort chunks by decreasing relevance but consider duplicating the top chunk at the end.
- XML tags for structure. For Claude models, wrap the retrieved context in XML tags (e.g., <context> ... </context>) for more reliable parsing. For other models, clear markdown separators work well.
def format_context_for_prompt(chunks: list[dict], max_tokens: int = 4000) -> str:
"""Format retrieved chunks into a numbered context block."""
context_parts = []
token_count = 0
for i, chunk in enumerate(chunks, 1):
chunk_text = f"[Source {i}] ({chunk['metadata'].get('title', 'Unknown')})\n{chunk['text']}\n"
chunk_tokens = len(chunk_text.split()) * 1.3 # rough token estimate
if token_count + chunk_tokens > max_tokens:
break
context_parts.append(chunk_text)
token_count += chunk_tokens
return "\n---\n".join(context_parts)
Context Window Management
When retrieved context exceeds what can reasonably fit in the prompt, you need a strategy. The right approach depends on the query type and the number of relevant chunks.
Strategy 1: Stuffing (Simple, K ≤ 5)
Put all chunks directly in one prompt. This is the default for most simple RAG implementations and works well when K is small and chunks are relevant. No information is lost, but the model must process everything at once.
Strategy 2: Map-Reduce (Large K, Synthesis Queries)
For questions that require information from many documents (“Summarize all pricing changes in 2025”):
- Map phase: Send each chunk (or small groups of chunks) to the LLM with a focused extraction prompt: “Extract any information about pricing changes from this passage.”
- Reduce phase: Collect all extracted summaries and send them to the LLM with the final synthesis prompt.
This adds latency (multiple LLM calls) but handles arbitrarily large context.
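A compact sketch of the map-reduce flow; generate_fn stands in for whatever Bedrock invocation wrapper you use, and the prompts are illustrative:
def map_reduce_answer(question: str, chunks: list[str], generate_fn, group_size: int = 3) -> str:
    """Map: extract relevant facts from each group of chunks. Reduce: synthesize the final answer."""
    extractions = []
    for i in range(0, len(chunks), group_size):
        group = "\n\n".join(chunks[i:i + group_size])
        extractions.append(generate_fn(
            f"Extract only information relevant to this question from the passages below.\n"
            f"Question: {question}\n\nPassages:\n{group}"
        ))
    notes = "\n".join(extractions)
    return generate_fn(
        f"Using only these extracted notes, answer the question.\n"
        f"Question: {question}\n\nNotes:\n{notes}"
    )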
Strategy 3: Iterative Refinement (Sequential, K = 5-15)
Process chunks one at a time, progressively refining the answer:
- Generate an initial answer from the first chunk.
- Present the next chunk alongside the current answer: “Here is additional context. Update your answer if this adds relevant information.”
- Repeat until all chunks are processed.
This is effective when information is distributed across chunks and each chunk may modify the answer. The downside is high latency — one LLM call per chunk.
Strategy 4: Hierarchical Summarization (Very Large K)
For scenarios where dozens of chunks are relevant (e.g., searching across thousands of support tickets for patterns):
- Group chunks by topic/source.
- Summarize each group.
- Summarize the summaries.
This is essentially map-reduce with multiple levels, useful when information density is low and you’re mining for patterns rather than specific facts.
The “Lost in the Middle” Mitigation:
Regardless of strategy, be aware that LLMs tend to under-use information in the middle of long contexts. Practical mitigations:
- Keep context under 5,000 tokens when possible (K=3-5 with 512-token chunks)
- Place highest-relevance chunks at positions 1 and K (beginning and end)
- Use reranker scores to aggressively filter — 5 highly relevant chunks beat 15 moderately relevant ones
Source: Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” 2023, https://arxiv.org/abs/2307.03172
Citation and Source Attribution
For enterprise RAG, every claim must trace back to a source. This is not optional in regulated industries — it’s a compliance requirement. Good citations also build user trust: users can verify answers and learn to calibrate their confidence in the system.
Implementation approaches:
Approach 1: Inline citation with numbered sources
Number chunks in the prompt, instruct the model to cite by number, then post-process to replace numbers with document links.
# Post-processing: replace citation numbers with actual links
import re
def resolve_citations(answer: str, chunks: list[dict]) -> str:
"""Replace [Source N] with actual document links."""
def replace_citation(match):
idx = int(match.group(1)) - 1
if idx < len(chunks):
title = chunks[idx]['metadata'].get('title', 'Document')
url = chunks[idx]['metadata'].get('url', '#')
return f"[{title}]({url})"
return match.group(0)
return re.sub(r'\[Source (\d+)\]', replace_citation, answer)
Approach 2: Bedrock Knowledge Bases native citations
The RetrieveAndGenerate API returns citations in the response, each containing the specific text segments (retrievedReferences) that informed part of the answer. This gives you span-level attribution without custom post-processing.
response = bedrock_agent_runtime.retrieve_and_generate(
input={'text': query},
retrieveAndGenerateConfiguration={...}
)
# Each citation maps a part of the answer to its source
for citation in response['citations']:
answer_span = citation['generatedResponsePart']['textResponsePart']['text']
for ref in citation['retrievedReferences']:
source_text = ref['content']['text']
source_uri = ref['location']['s3Location']['uri']
print(f"Claim: '{answer_span[:50]}...' → Source: {source_uri}")
Approach 3: Post-hoc verification
After generating the answer, run a separate verification step that checks each sentence against the retrieved chunks using NLI (natural language inference). Flag sentences that aren’t entailed by any chunk. This is more robust but adds latency and cost.
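A minimal sketch of that verification step, using an LLM as the entailment judge (a dedicated NLI model works the same way and is cheaper at scale); generate_fn is a placeholder for your model call, and the sentence splitting is deliberately naive:
def verify_answer_grounding(answer: str, chunks: list[str], generate_fn) -> list[dict]:
    """Check each sentence of the answer against the retrieved context and flag unsupported claims."""
    context = "\n\n".join(chunks)
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    results = []
    for sentence in sentences:
        verdict = generate_fn(
            f"Context:\n{context}\n\nClaim: {sentence}\n"
            "Is the claim fully supported by the context? Answer only 'supported' or 'unsupported'."
        ).strip().lower()
        results.append({"sentence": sentence, "supported": verdict.startswith("supported")})
    return results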
Guardrails and Hallucination Prevention
Even with perfect retrieval, models can hallucinate — generating plausible-sounding claims not supported by the context. Production RAG systems need multiple layers of defense.
Amazon Bedrock Guardrails provides configurable safeguards that can be applied to any Bedrock model call:
- Content filters: Block harmful, violent, sexually explicit, or inappropriate content with configurable strength levels.
- Denied topics: Define specific topics the model should refuse to discuss (e.g., “competitor pricing,” “legal advice”).
- Word filters: Block specific terms, profanity, or sensitive internal terminology from appearing in responses.
- PII detection and redaction: Automatically detect and mask personally identifiable information (names, addresses, SSNs, credit card numbers) in both input and output.
- Contextual grounding checks: Compare the generated response against the retrieved source documents and flag any claims not supported by the context. This is the most RAG-specific guardrail — it directly addresses hallucination.
- Automated Reasoning checks: Validate response logic using formal verification. Particularly useful for numerical claims and multi-step reasoning.
# Apply guardrails to a RAG generation call
response = bedrock_runtime.invoke_model(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
guardrailIdentifier='my-rag-guardrail',
guardrailVersion='DRAFT',
body=json.dumps({
'anthropic_version': 'bedrock-2023-05-31',
'messages': [{'role': 'user', 'content': rag_prompt}],
'max_tokens': 2048
})
)
# Check if guardrails intervened
if response.get('amazon-bedrock-guardrailAction') == 'INTERVENED':
# Response was modified or blocked by guardrails
handle_guardrail_intervention(response)
The contextual grounding check deserves special emphasis. It works by comparing each sentence in the generated response against the retrieved chunks using an entailment model. Sentences that aren’t supported by any chunk are flagged with a grounding score below the configured threshold. You can set the threshold based on your risk tolerance — stricter for medical/legal, more lenient for general knowledge bases.
Additional hallucination mitigation strategies:
- Temperature = 0 for factual RAG. Higher temperatures increase creativity but also hallucination risk.
- Instruct the model to quote directly when making specific claims. Quoted text is verifiable.
- Implement a “confidence signal” — ask the model to assess its confidence using natural language categories with clear definitions: “Rate your confidence as HIGH (answer is directly and fully supported by the provided context), MEDIUM (answer is partially supported or requires minor inference), or LOW (context is insufficient or tangentially related).” Avoid bare numeric scales (1-5) without descriptions — models don’t have an inherent understanding of what each number means, leading to inconsistent and poorly calibrated ratings. Low-confidence answers can trigger human review or a fallback retrieval pass.
Source: AWS, “Amazon Bedrock Guardrails,” https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html
Source: AWS, “Contextual grounding check in Guardrails,” https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-contextual-grounding-check.html
Streaming Responses
For user-facing applications, streaming the response token-by-token dramatically improves perceived latency. The user sees the answer forming in real-time rather than waiting 5-10 seconds for the complete response.
With Bedrock:
response = bedrock_runtime.invoke_model_with_response_stream(
modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
body=json.dumps({
'anthropic_version': 'bedrock-2023-05-31',
'messages': [{'role': 'user', 'content': rag_prompt}],
'max_tokens': 2048
})
)
for event in response['body']:
chunk = json.loads(event['chunk']['bytes'])
if chunk['type'] == 'content_block_delta':
print(chunk['delta']['text'], end='', flush=True)
Challenge with streaming + citations: When streaming, you don’t have the complete answer to post-process citations. Solutions: (1) process citations after stream completes, (2) instruct the model to use a citation format that’s self-contained (e.g., include document titles inline rather than numbered references), or (3) use Bedrock’s RetrieveAndGenerateStream API which handles this natively.
Source: AWS, “Invoke model with streaming,” https://docs.aws.amazon.com/bedrock/latest/userguide/inference-invoke.html
8. Advanced RAG Patterns
Standard RAG follows a fixed pipeline: retrieve → generate. Advanced patterns break this rigidity, introducing decision-making, iteration, and multi-source reasoning. These patterns are where RAG moves from “search + summarize” to genuine knowledge-intensive reasoning.
Agentic RAG
In traditional RAG, retrieval is deterministic — every query triggers the same retrieve→generate pipeline. Agentic RAG gives an AI agent autonomy over the retrieval process. The agent decides:
- Whether to retrieve at all. Some questions don’t need external knowledge (“What is 2+2?”). Unnecessary retrieval adds latency and noise.
- What query to use for retrieval. The agent can reformulate the user’s question, decompose it into sub-queries, or retrieve from different knowledge bases depending on the topic.
- When to retrieve again. After reviewing initial results, the agent may decide the information is insufficient and issue follow-up retrieval with refined queries.
- How to combine retrieval with other tools. The agent can interleave retrieval with calculations, API calls, database queries, or code execution.
Implementation approaches: Agentic RAG can be built with open-source agent frameworks (LangGraph, CrewAI, Strands Agents SDK, LlamaIndex Agents), custom orchestration using Step Functions or Lambda, or managed services. The core pattern is the same: the LLM’s reasoning loop decides when and how to invoke retrieval as a tool.
# Conceptual agentic RAG with tool-use pattern
tools = [
{
"name": "search_knowledge_base",
"description": "Search the financial reports knowledge base",
"parameters": {"query": "string"}
},
{
"name": "calculate",
"description": "Perform numerical calculations",
"parameters": {"expression": "string"}
}
]
# The agent's reasoning loop:
# 1. Analyze user query
# 2. Decide which tools to invoke (and in what order)
# 3. Execute tools, observe results
# 4. Decide if more information is needed
# 5. Synthesize final answer
The key advantage of agentic RAG is adaptability. A fixed pipeline applies the same processing to every query regardless of complexity. An agent can apply a simple single-pass retrieval for factoid questions, multi-hop retrieval for complex analytical queries, and skip retrieval entirely for general knowledge questions — all within the same system.
On AWS: Bedrock Knowledge Bases can serve as the retrieval tool in any agent framework — you call the Retrieve API from your agent’s tool function. For fully managed orchestration, Bedrock Agents is an option, though many production teams prefer explicit frameworks (LangGraph, Strands) for better control over the reasoning loop.
Source: AWS, “Strands Agents SDK,” https://github.com/strands-agents/sdk-python
Source: LangGraph, “Building Agentic RAG,” https://python.langchain.com/docs/tutorials/rag/
Multi-Hop RAG
Some questions cannot be answered with a single retrieval pass because the information needed spans multiple documents that are only connected through intermediate reasoning.
Example: “What’s the total monthly cost of running the OpenSearch cluster configuration recommended for a 10M-document RAG workload?”
No single document contains this answer. You need:
- Hop 1: Retrieve documents about recommended OpenSearch configurations for RAG at 10M-document scale → find: “r6g.2xlarge, 3 data nodes, 2 replicas”
- Hop 2: Retrieve pricing for r6g.2xlarge instances in the user’s region → find: “$0.718/hr per instance”
- Synthesize: 3 nodes × $0.718/hr × 730 hours/month = $1,572.42/month
Implementation patterns:
IRCoT (Interleaved Retrieval Chain-of-Thought): The model alternates between reasoning and retrieval. After each reasoning step, it generates a follow-up query based on what it’s learned so far. This maps naturally to agent frameworks with tool-use loops — the agent reasons, retrieves, reasons again, retrieves again, until it has sufficient information.
Query decomposition + parallel retrieval: Decompose the original question into independent sub-questions (Section 5), retrieve for each in parallel, then synthesize. This is faster than sequential multi-hop but works only when sub-questions are independent.
Recursive retrieval: Retrieve initial chunks, extract entities or follow-up questions from them, retrieve again, and repeat up to a maximum depth. Add a stopping condition based on the model’s assessment of whether it has sufficient information.
def multi_hop_retrieve(query: str, kb_id: str, max_hops: int = 3) -> list:
    """Iterative multi-hop retrieval with LLM-guided query generation."""
    all_chunks = []
    current_query = query
    for hop in range(max_hops):
        # Retrieve for current query
        chunks = retrieve_from_kb(current_query, kb_id, top_k=5)
        all_chunks.extend(chunks)
        # Ask LLM: do we have enough info, or need another hop?
        assessment = llm_assess(query, all_chunks)
        if assessment['sufficient']:
            break
        # Generate follow-up query based on what we've learned
        current_query = assessment['follow_up_query']
    return all_chunks
When to use multi-hop: When your evaluation shows that single-pass retrieval fails on questions requiring cross-document reasoning (typically 15-30% of queries in complex domains like finance, healthcare, and technical documentation).
Source: Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models,” 2023, https://arxiv.org/abs/2210.03350
Source: Trivedi et al., “Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions,” 2023, https://arxiv.org/abs/2212.10509
Graph RAG
When your knowledge has rich entity relationships — organizational hierarchies, product dependencies, regulatory cross-references, citation networks — standard vector retrieval misses structural information. Graph RAG combines knowledge graphs with vector retrieval to capture both semantic similarity and relational structure.
Microsoft’s Graph RAG approach (2024):
- Entity and relationship extraction: During ingestion, an LLM extracts entities (people, products, concepts) and their relationships from each document.
- Graph construction: Build a knowledge graph from extracted entities and relationships using a graph database (Amazon Neptune).
- Community detection: Apply graph algorithms (e.g., Leiden algorithm) to identify clusters of closely related entities.
- Community summarization: Generate natural language summaries for each community, capturing the key themes and relationships.
- Query-time retrieval: For a given query, retrieve both (a) vector-similar chunks and (b) relevant community summaries from the graph. Feed both to the LLM.
Why this matters: Standard RAG answers “What does document X say about topic Y?” well. Graph RAG also answers “How is entity A related to entity B?”, “What are the major themes across 10,000 documents?”, and “What are the downstream impacts of changing policy X?” — questions that require understanding relationships, not just content.
On AWS: Amazon Neptune Analytics supports vector similarity search alongside graph traversal (Gremlin and openCypher). You can store entity embeddings as node properties and combine graph patterns with vector similarity in a single query.
// Illustrative openCypher — find entities related to 'Amazon S3' within 2 hops
// that are also semantically similar to the query.
// Note: gds.similarity.cosine is Neo4j GDS syntax shown for readability;
// Neptune Analytics exposes its own vector similarity functions, so adapt the call.
MATCH path = (s:Service {name: 'Amazon S3'})-[*1..2]-(related)
WHERE related.embedding IS NOT NULL
WITH related, gds.similarity.cosine(related.embedding, $query_embedding) AS sim
WHERE sim > 0.7
RETURN related.name, related.type, sim
ORDER BY sim DESC LIMIT 10
When to invest in Graph RAG: Graph RAG adds significant ingestion complexity (entity extraction, graph maintenance, community detection). It’s worth it when:
- Users frequently ask relationship questions (“Who approved this policy?”, “What services depend on this component?”)
- Your documents form a natural graph (legal documents with cross-references, codebases with dependencies, organizational policies)
- You need thematic summarization across large corpora (“What are the top concerns across 5,000 customer support tickets?”)
For pure factoid QA over a relatively flat document corpus, standard RAG with hybrid search and reranking is sufficient and much simpler.
Source: Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization,” Microsoft Research, 2024, https://arxiv.org/abs/2404.16130
Self-RAG
Self-RAG introduces self-reflection into the generation process. Rather than blindly generating from retrieved context, the model actively evaluates its own behavior at each step using special reflection tokens:
- [Retrieve]: “Do I need to retrieve information for this query?” → Yes/No
- [IsRel]: “Is this retrieved passage relevant to the query?” → Relevant/Irrelevant
- [IsSup]: “Is my generated response supported by this passage?” → Fully Supported / Partially Supported / Not Supported
- [IsUse]: “Is this response useful to the user?” → Useful rating (1-5)
The self-correction loop:
- Model receives query, decides whether retrieval is needed.
- If yes, retrieves passages and evaluates each for relevance (filters irrelevant ones).
- Generates a response segment and checks if it’s supported by the passages.
- If not supported, regenerates with a different approach or retrieves additional passages.
- Evaluates overall usefulness and refines if needed.
This addresses a fundamental RAG failure mode: the system retrieves irrelevant documents, the generator treats them as authoritative, and the output is confidently wrong. Self-RAG catches this by evaluating relevance before generation and verifying support after generation.
Practical implementation: True Self-RAG requires a model fine-tuned with reflection tokens (the original paper fine-tuned Llama 2). A practical approximation for production: implement the reflection logic as explicit prompting steps in a multi-step pipeline — retrieve, check relevance (with a lightweight classifier or LLM call), generate, verify grounding (using Bedrock Guardrails’ contextual grounding check).
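A minimal sketch of that approximation is shown below. The helper functions (retrieve_from_kb, llm_judge_needs_retrieval, llm_judge_relevance, generate_answer, grounding_check) are assumptions standing in for your retriever, a lightweight judge model, and the grounding check.
# Prompt-based approximation of Self-RAG's reflection steps (helpers are assumed)
def self_reflective_rag(query: str, kb_id: str, max_attempts: int = 2) -> str:
    # [Retrieve]: skip retrieval for queries the model can answer directly
    if not llm_judge_needs_retrieval(query):
        return generate_answer(query, context=[])
    # [IsRel]: keep only passages a lightweight judge marks relevant
    chunks = retrieve_from_kb(query, kb_id, top_k=10)
    relevant = [c for c in chunks if llm_judge_relevance(query, c)]
    for _ in range(max_attempts):
        answer = generate_answer(query, context=relevant)
        # [IsSup]: verify the draft is grounded in the kept passages
        if grounding_check(answer, relevant):
            return answer
        # Not supported — pull in more context and try again
        relevant += retrieve_from_kb(query, kb_id, top_k=5)
    return "I could not find a well-supported answer in the knowledge base."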
Source: Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection,” 2023, https://arxiv.org/abs/2310.11511
Corrective RAG (CRAG)
CRAG adds a quality evaluator between retrieval and generation. After retrieving documents, an evaluator model scores their relevance and routes the pipeline accordingly:
- Correct (high confidence): Retrieved documents are clearly relevant → proceed directly to generation.
- Ambiguous (medium confidence): Relevance is uncertain → supplement retrieved documents with web search results to provide additional context.
- Incorrect (low confidence): Retrieved documents are irrelevant to the query → discard them entirely and fall back to web search.
Why this matters: Standard RAG has no mechanism to detect retrieval failure. If the knowledge base doesn’t contain the answer, the system still forces generation from the top-K results — which may be tangentially related at best. CRAG prevents the common failure mode of “confidently wrong answers from irrelevant context.”
Implementation:
def corrective_rag(query: str, kb_id: str) -> str:
    """CRAG: evaluate retrieval quality before generation."""
    # Step 1: Initial retrieval
    chunks = retrieve_from_kb(query, kb_id, top_k=10)
    # Step 2: Evaluate retrieval quality
    evaluation = evaluate_relevance(query, chunks)
    if evaluation['verdict'] == 'correct':
        # High confidence — use retrieved chunks directly
        context = chunks
    elif evaluation['verdict'] == 'ambiguous':
        # Medium confidence — supplement with web search
        web_results = web_search(query)
        context = chunks + web_results
    else:
        # Low confidence — fall back to web search only
        context = web_search(query)
    # Step 3: Generate from curated context
    return generate_answer(query, context)
def evaluate_relevance(query: str, chunks: list) -> dict:
    """Use an LLM to evaluate retrieval relevance."""
    # Evaluate only the top 3 chunks to keep the judge call cheap
    prompt = f"""Evaluate whether these retrieved passages are relevant
to answering the query. Rate as 'correct', 'ambiguous', or 'incorrect'.
Query: {query}
Passages: {format_chunks(chunks[:3])}
Verdict:"""
    # ... invoke a fast judge model and parse the response into
    # {'verdict': 'correct' | 'ambiguous' | 'incorrect'}
Trade-off: CRAG adds one LLM call for evaluation per query. In production, use a fast, small model (Haiku, Titan Lite) for the evaluation step to minimize latency impact.
Source: Yan et al., “Corrective Retrieval Augmented Generation,” 2024, https://arxiv.org/abs/2401.15884
Adaptive RAG
Different queries have fundamentally different complexity levels, and applying the same pipeline to all of them wastes resources on simple queries while under-serving complex ones. Adaptive RAG classifies queries by complexity and routes them to appropriate processing pipelines:
- Simple queries (factoid, single-fact): Single-pass retrieval, no decomposition, K=3. “What is the maximum object size in S3?”
- Moderate queries (comparison, multi-aspect): Query rewriting + hybrid retrieval + reranking, K=5-10. “Compare DynamoDB and Aurora for write-heavy workloads.”
- Complex queries (multi-hop, analytical): Full decomposition + multi-hop retrieval + synthesis, K=10-20. “Design a cost-optimized architecture for a real-time recommendation system serving 10M users.”
Complexity classification can be done with:
- Rule-based heuristics: Query length, presence of comparison words (“vs”, “compare”, “difference”), question type (who/what/how/why).
- Lightweight LLM classifier: A small model (Haiku) classifies query complexity in <100ms.
- Embedding-based classifier: Train a simple classifier on query embeddings labeled by complexity.
The key insight is that routing a simple factoid question through a complex multi-hop pipeline doesn’t improve the answer — it just adds latency and cost. Conversely, routing a complex analytical question through a simple pipeline produces shallow, incomplete answers.
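As a starting point, the rule-based heuristic can be a handful of lines. The keyword lists and thresholds below are illustrative and should be tuned against your own query logs.
# Illustrative rule-based complexity classifier for query routing
def classify_query_complexity(query: str) -> str:
    q = query.lower()
    comparison_words = {'compare', 'versus', ' vs ', 'difference between', 'trade-off'}
    analytical_words = {'design', 'architect', 'optimize', 'why', 'impact of'}
    if any(w in q for w in analytical_words) or len(q.split()) > 25:
        return 'complex'     # decomposition + multi-hop retrieval, K=10-20
    if any(w in q for w in comparison_words) or ' and ' in q:
        return 'moderate'    # rewriting + hybrid retrieval + reranking, K=5-10
    return 'simple'          # single-pass retrieval, K=3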
Source: Jeong et al., “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity,” 2024, https://arxiv.org/abs/2403.14403
Multi-Modal RAG
Production knowledge bases increasingly contain not just text but images (architecture diagrams, screenshots, charts), tables (financial data, specifications), and occasionally audio/video. Multi-modal RAG extends the pipeline to handle these content types.
Approaches:
- Image-to-text at ingestion: Use a vision model (Claude 3.5 Sonnet, GPT-4V) to generate detailed text descriptions of images during ingestion. Embed and retrieve the descriptions. This is the simplest approach and works well for diagrams and charts.
- Multi-modal embeddings: Use models that embed both text and images into the same vector space (e.g., Amazon Titan Multimodal Embeddings). At query time, the text query embedding is compared against both text and image embeddings.
- Table-aware RAG: Convert tables to structured text representations during chunking (Section 3), and use table-specific retrieval strategies (exact-match on column names + semantic on content).
On AWS: Titan Multimodal Embeddings supports both text and image inputs, producing embeddings in the same 1024-dimensional space. Bedrock Knowledge Bases supports multi-modal data sources including images and PDFs with embedded images.
# Titan Multimodal Embeddings — embed an image
import base64
import json
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

with open('architecture_diagram.png', 'rb') as f:
    image_bytes = base64.b64encode(f.read()).decode('utf-8')

response = bedrock_runtime.invoke_model(
    modelId='amazon.titan-embed-image-v1',
    body=json.dumps({
        'inputImage': image_bytes,
        'embeddingConfig': {'outputEmbeddingLength': 1024}
    })
)
image_embedding = json.loads(response['body'].read())['embedding']
Practical advice: Start with image-to-text at ingestion (approach 1). It’s the simplest to implement, works with any existing text-based RAG pipeline, and provides good results for most use cases. Move to multi-modal embeddings only if you have a large volume of images and text descriptions don’t capture the visual information adequately (e.g., complex technical diagrams where spatial relationships matter).
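A sketch of that ingestion step (approach 1) using the Bedrock Converse API with an image content block — the function name, model choice, and prompt wording are illustrative and should be adapted to your pipeline:
# Image-to-text at ingestion: describe a diagram, then embed the description downstream
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

def describe_image(image_path: str) -> str:
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    response = bedrock_runtime.converse(
        modelId='anthropic.claude-3-5-sonnet-20240620-v1:0',  # any Bedrock vision model
        messages=[{
            'role': 'user',
            'content': [
                {'image': {'format': 'png', 'source': {'bytes': image_bytes}}},
                {'text': 'Describe this architecture diagram in detail, '
                         'including components and how they connect.'}
            ]
        }]
    )
    return response['output']['message']['content'][0]['text']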
Source: AWS, “Amazon Titan Multimodal Embeddings G1,” https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html
Pattern Selection Guide
Choosing the right advanced pattern depends on your specific needs:
| Pattern | Best For | Complexity | Latency Impact |
|---|---|---|---|
| Agentic RAG | Dynamic queries, tool use | Medium | +500ms-2s (agent reasoning) |
| Multi-Hop | Cross-document reasoning | Medium-High | +1-5s (multiple retrievals) |
| Graph RAG | Relationship queries, theme extraction | High | +200-500ms (graph traversal) |
| Self-RAG | High-stakes, accuracy-critical | Medium | +500ms (reflection steps) |
| CRAG | Unreliable knowledge bases, fallback needed | Low-Medium | +200ms (evaluation step) |
| Adaptive RAG | Mixed query complexity | Low | Neutral (saves on simple queries) |
| Multi-Modal | Image/table-heavy corpora | Medium | Varies |
Start simple. Most teams get 80% of the value from standard RAG with hybrid search, reranking, and good chunking. Add advanced patterns incrementally based on where your evaluation shows failures — not because they sound impressive in blog posts.
9. RAG Evaluation — The Deep Dive
This is where most RAG projects either succeed or slowly, silently fail. Building a RAG prototype that “works on my demo” takes days. Building a RAG system with measured, reproducible quality takes months — and evaluation is the difference.
9.1 Why Evaluation Is the Hardest Part
The demo trap. Your RAG system answers your 10 test questions perfectly. You deploy it. Within a week, users are complaining about wrong answers, missing information, and hallucinations. What happened? Your 10 test questions weren’t representative of real usage patterns — they were cherry-picked queries where you already knew the documents contained good answers.
The ground truth problem. Unlike classification tasks where you have clear labels, RAG evaluation often lacks ground truth. What’s the “correct” answer to “Explain our refund policy”? There might be multiple valid formulations at different levels of detail, and the policy might span several documents. Unlike a math problem, there’s no single right answer.
The multi-dimensional problem. A RAG response can simultaneously be:
- Relevant to the query but unfaithful to the sources (hallucination)
- Faithful to the sources but irrelevant to the query (retrieval miss)
- Both relevant and faithful but incomplete (missed important context)
- Complete and accurate but too slow (latency) or too expensive (cost)
- Fast and accurate for the first query but degrading over time (drift)
You need metrics across all these dimensions, and optimizing one often hurts another.
9.2 Component-Level vs. System-Level Evaluation
Before diving into metrics, understand the two evaluation paradigms:
Component-level evaluation measures each piece of the pipeline independently:
- Is the retriever finding the right documents? (Retrieval metrics)
- Is the generator producing faithful answers? (Generation metrics)
- Is the query enhancement actually improving retrieval? (Enhancement metrics)
System-level evaluation measures the end-to-end experience:
- Is the user getting the right answer? (Correctness)
- Is the experience fast enough? (Latency)
- Is it cost-effective at scale? (Cost per query)
You need both. System-level metrics tell you if something is wrong. Component-level metrics tell you where it’s wrong and how to fix it.
9.3 The Three-Layer Evaluation Framework
Layer 1: Retrieval Evaluation
Measures whether the right documents are being retrieved. These metrics require relevance judgments — for each query, you need to know which chunks should be retrieved.
Context Precision / Precision@K
Of the K retrieved chunks, what fraction is actually relevant to the query?
Precision@K = |relevant ∩ retrieved| / K
Example: You retrieve 10 chunks, 6 are relevant → Precision@10 = 0.6
High precision means little noise in your retrieved context. Low precision means the LLM is wading through irrelevant text, which increases hallucination risk and wastes tokens.
Context Recall / Recall@K
Of all relevant chunks in the entire corpus, what fraction did you successfully retrieve?
Recall@K = |relevant ∩ retrieved| / |total_relevant|
Example: There are 8 relevant chunks total, you retrieved 6 → Recall@10 = 0.75
High recall means you’re not missing important information. Low recall means the LLM might generate incomplete answers because it never saw the critical context.
NDCG (Normalized Discounted Cumulative Gain)
Are the most relevant chunks ranked highest? NDCG penalizes relevant documents that appear lower in the ranking.
DCG@K = Σ(i=1 to K) rel_i / log2(i+1)
NDCG@K = DCG@K / IDCG@K (where IDCG is the ideal DCG)
Why it matters: LLMs pay more attention to context appearing earlier in the prompt. A relevant chunk at position 1 is far more valuable than the same chunk at position 10. NDCG captures this.
MRR (Mean Reciprocal Rank)
Where does the first relevant chunk appear in the results?
RR = 1 / rank_of_first_relevant
MRR = mean(RR across all queries)
MRR of 0.5 means the first relevant chunk typically appears at position 2. MRR of 1.0 means it’s always at position 1.
Hit Rate (the simplest health check)
For what fraction of queries does at least one relevant chunk appear in the Top-K?
Hit Rate = queries_with_at_least_one_hit / total_queries
If your hit rate is below 0.8, you have a fundamental retrieval problem — everything else is secondary.
Practical recommendation: Start monitoring Hit Rate and Recall@K — these tell you if your retriever is finding the right documents at all. Then add NDCG and Precision@K to optimize ranking quality and noise reduction.
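These metrics are simple to compute once you have relevance judgments. A minimal sketch, assuming each evaluation example records the set of chunk IDs judged relevant and the ranked list of retrieved chunk IDs:
# Minimal retrieval-metric calculator over a labeled evaluation set
def retrieval_metrics(examples: list, k: int = 10) -> dict:
    """Each example: {'relevant_ids': set of chunk IDs, 'retrieved_ids': ranked list}."""
    hits, precisions, recalls, rrs = [], [], [], []
    for ex in examples:
        retrieved_k = ex['retrieved_ids'][:k]
        relevant = ex['relevant_ids']
        found = [cid for cid in retrieved_k if cid in relevant]
        hits.append(1.0 if found else 0.0)
        precisions.append(len(found) / k)
        recalls.append(len(found) / len(relevant) if relevant else 0.0)
        # Reciprocal rank of the first relevant chunk (0 if none retrieved)
        rr = 0.0
        for rank, cid in enumerate(retrieved_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    n = len(examples)
    return {
        'hit_rate': sum(hits) / n,
        f'precision@{k}': sum(precisions) / n,
        f'recall@{k}': sum(recalls) / n,
        'mrr': sum(rrs) / n,
    }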
Layer 2: Generation Evaluation
Measures the quality of the LLM’s response given the retrieved context.
Faithfulness (Groundedness)
Does the response only contain information supported by the retrieved context? This is the hallucination metric and arguably the single most important metric for enterprise RAG.
Faithfulness is typically measured by:
- Decomposing the generated response into individual claims/statements
- For each claim, checking whether it is supported by the retrieved context
- Computing the ratio of supported claims to total claims
Faithfulness = |supported_claims| / |total_claims|
Example: The response makes 8 claims, 7 are supported by retrieved context → Faithfulness = 0.875
A faithfulness score below 0.8 is a red flag. It means roughly 1 in 5 statements in the response is not grounded in the provided context — the model is either hallucinating or drawing from its training data rather than your documents.
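A sketch of that measurement loop, assuming a hypothetical ask_judge helper that wraps your judge model and returns plain text:
# Sketch: faithfulness via claim decomposition + per-claim verification (ask_judge is assumed)
def measure_faithfulness(answer: str, context: str) -> float:
    claims = ask_judge(
        f"List each factual claim in the following answer as a separate line:\n{answer}"
    ).splitlines()
    claims = [c.strip() for c in claims if c.strip()]
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        verdict = ask_judge(
            "Answer YES or NO: is this claim fully supported by the context?\n"
            f"Claim: {claim}\nContext: {context}"
        )
        if verdict.strip().upper().startswith('YES'):
            supported += 1
    return supported / len(claims)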
Answer Relevancy
Does the response actually address the user’s question? A response can be perfectly faithful (everything it says is in the context) but completely miss the point of the question.
RAGAS measures this by generating synthetic questions from the answer and computing similarity to the original question. If the answer is relevant, questions generated from it should resemble the original query.
Answer Completeness
Does the response cover all aspects of the question? Especially important for multi-part questions:
“What are the S3 storage classes, their use cases, and pricing?” — a complete answer must address all three parts.
Completeness is typically measured against a reference answer or by decomposing the question into sub-questions and checking coverage.
Hallucination Rate
The inverse of faithfulness, expressed as a trackable rate:
Hallucination Rate = |unsupported_claims| / |total_claims|
Track this over time as a trending metric. If it’s climbing, something changed — new documents with different structure, a prompt regression, or a model update.
Layer 3: End-to-End Evaluation
Correctness
Is the final answer factually correct? This requires ground truth answers for comparison and can be measured by:
- Exact match (for factoid QA — rarely appropriate)
- Semantic similarity (embedding similarity between generated and reference answers)
- LLM-as-judge (ask a judge model to score correctness on a 1-5 scale)
- Human evaluation (the gold standard, but expensive)
Latency (P50 / P95 / P99)
Track response time percentiles separately for each pipeline stage:
- Query enhancement: typically 200-500ms (1 LLM call)
- Retrieval: typically 50-200ms (vector search)
- Reranking: typically 50-150ms
- Generation: typically 1-5 seconds (main bottleneck)
- Total: typically 2-6 seconds end-to-end
Users expect < 3 seconds for simple questions. For complex questions requiring multi-hop retrieval, set expectations with streaming responses.
Cost per Query
At scale, this determines viability. Break it down:
- Embedding the query: ~$0.0001
- Vector search: ~$0.0005
- Reranking (20 candidates): ~$0.001
- LLM generation (Sonnet): ~$0.003-0.01
- Query enhancement (Haiku): ~$0.0003
- Total: ~$0.005-0.015 per query
At 10,000 queries/day, that’s $50-150/day or $1,500-4,500/month just on inference. Evaluate whether each pipeline component justifies its cost through improved quality.
User Satisfaction Signals
The ultimate metric — but lagging and noisy:
- Thumbs up/down on responses
- Query reformulations (user rephrasing = original answer was unsatisfactory)
- Escalation to human agents
- Session abandonment rate
9.4 AWS Native: Bedrock Knowledge Base Evaluation
Amazon Bedrock provides built-in RAG evaluation capabilities that significantly reduce the engineering effort required to measure quality.
Evaluation Types
Retrieve-Only Evaluation: Tests retrieval in isolation.
- Metrics: Context Relevance (are retrieved chunks relevant?), Context Coverage (do retrieved chunks cover the expected answer?)
- Use when: Debugging retrieval quality, comparing vector stores or chunking strategies
Retrieve-and-Generate Evaluation: Tests the full RAG pipeline.
- Metrics: Correctness, Completeness, Faithfulness, Helpfulness
- Use when: Measuring end-to-end quality, comparing prompts or models
Evaluation Dataset Format
The evaluation dataset is a JSONL file in S3:
{
"conversationTurns": [{
"input": {
"content": [{"text": "What are the S3 storage classes?"}]
},
"referenceResponses": [{
"content": [{"text": "S3 offers six storage classes: S3 Standard..."}]
}],
"referenceContexts": [{
"content": [{"text": "Amazon S3 storage classes include..."}]
}]
}]
}
- referenceResponses — optional ground truth answers (required for Correctness)
- referenceContexts — optional ground truth chunks (required for Context Coverage in retrieve-only mode)
Tip: You need a minimum of ~50-100 evaluation examples for statistically meaningful results. Aim for 200+ to capture the diversity of real-world queries.
Setting Up an Evaluation Job
import boto3
from datetime import datetime
bedrock = boto3.client('bedrock', region_name='us-east-1')
response = bedrock.create_evaluation_job(
jobName=f'rag-eval-{datetime.now():%Y-%m-%d-%H-%M}',
roleArn='arn:aws:iam::123456789012:role/BedrockEvalRole',  # replace with your account's role ARN
evaluationConfig={
'automated': {
'datasetMetricConfigs': [{
'taskType': 'RetrieveAndGenerate',
'dataset': {'s3Uri': 's3://my-bucket/eval-dataset.jsonl'},
'metricNames': [
'Builtin.Correctness',
'Builtin.Completeness',
'Builtin.Faithfulness',
'Builtin.Helpfulness'
]
}]
}
},
inferenceConfig={
'ragConfigs': [{
'knowledgeBaseConfig': {
'knowledgeBaseId': 'YOUR_KB_ID',
'modelIdentifier': 'anthropic.claude-sonnet-4-6-v1'
}
}]
},
outputDataConfig={
's3Uri': 's3://my-bucket/eval-results/'
}
)
Interpreting Results — Common Failure Patterns
| Pattern | Diagnosis | Fix |
|---|---|---|
| Low faithfulness, high correctness | Model hallucinating but getting lucky — dangerous | Strengthen grounding prompt, add Guardrails contextual grounding check |
| High faithfulness, low completeness | Retrieval missing relevant documents | Review chunking strategy, increase Top-K, check for missing documents |
| Low correctness across the board | Fundamental retrieval failure | Audit ingestion pipeline, check embedding quality, verify data is indexed |
| High variability across queries | Some query types work well, others don’t | Stratify evaluation by query type, add query routing |
| Good metrics, bad user feedback | Evaluation dataset doesn’t reflect real usage | Collect real user queries, rebuild evaluation dataset |
Source: AWS, “Evaluate the performance of RAG sources using Amazon Bedrock,” https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-kb.html
Source: AWS ML Blog, “Evaluating RAG applications with Amazon Bedrock knowledge base evaluation,” https://aws.amazon.com/blogs/machine-learning/evaluating-rag-applications-with-amazon-bedrock-knowledge-base-evaluation/
9.5 Open-Source Evaluation Frameworks — Deep Comparison
RAGAS (RAG Assessment)
The most widely adopted open-source framework. Uses LLM-based metrics that don’t require ground truth for all metrics.
Core metrics and how they work:
- Faithfulness: Decomposes the answer into claims using an LLM, then checks each claim against the context. Cost: 2 LLM calls per evaluation.
- Answer Relevancy: Generates N questions from the answer, computes mean cosine similarity to original query. Cost: 1 LLM call + N embedding calls.
- Context Precision: Checks if ground-truth-relevant chunks are ranked higher in the retrieved set. Requires ground truth.
- Context Recall: Decomposes the ground truth answer into claims, checks if each can be attributed to the retrieved context. Requires ground truth.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
eval_dataset = Dataset.from_dict({
"question": questions,
"answer": generated_answers,
"contexts": retrieved_contexts,
"ground_truth": reference_answers
})
results = evaluate(dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results) # {'faithfulness': 0.87, 'answer_relevancy': 0.91, ...}
Strengths: Easy to start, large community, well-documented. Weaknesses: Each evaluation example costs 3-5 LLM calls (expensive at scale), metric scores can be noisy for individual examples, best used as aggregate scores over 100+ examples.
Source: RAGAS Documentation, https://docs.ragas.io/
DeepEval
Production-oriented framework with native CI/CD integration.
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
input="What is the refund policy?",
actual_output=rag_response,
retrieval_context=retrieved_chunks
)
faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [faithfulness, relevancy])
Killer feature: Pytest integration. Run deepeval test run test_rag.py in your CI/CD pipeline and fail the build if metrics drop below thresholds.
Source: DeepEval Documentation, https://docs.confident-ai.com/
TruLens
Developed by TruEra, focused on production feedback loops.
Differentiator: The “feedback function” abstraction — define custom evaluation functions that can use any combination of LLM judges, heuristics, and ground truth. Particularly strong for tracking metrics in production with its logging and dashboard capabilities.
Source: TruLens Documentation, https://www.trulens.org/
Phoenix / Arize
Observability-first evaluation platform.
Differentiator: Traces the entire RAG pipeline, letting you inspect individual queries end-to-end: what was retrieved, what was generated, where things went wrong. Best for debugging production issues rather than batch evaluation.
Source: Arize Phoenix, https://docs.arize.com/phoenix/
Giskard
Security and compliance-focused.
Differentiator: Automated adversarial testing — generates prompt injection attempts, tests for data leakage, checks OWASP LLM Top 10 vulnerabilities. Essential for regulated industries.
Source: Giskard Documentation, https://docs.giskard.ai/
Comprehensive Comparison
| Feature | RAGAS | DeepEval | TruLens | Phoenix/Arize | Braintrust | Giskard |
|---|---|---|---|---|---|---|
| LLM-based metrics | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| No ground truth needed | Partial | Partial | Partial | ✅ | Partial | ✅ |
| CI/CD integration | Community | Built-in | API | API | API | Built-in |
| Production monitoring | ❌ | Dashboard | ✅ | ✅ (core) | ✅ | ❌ |
| Human eval workflows | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Adversarial testing | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ (core) |
| Tracing / debugging | ❌ | ❌ | ✅ | ✅ (core) | ✅ | ❌ |
| Open source | ✅ | ✅ | ✅ | ✅ | Partial | ✅ |
| Best for | Quick eval, research | CI/CD pipelines | Prod feedback | Prod debugging | Team collab | Compliance |
9.6 Building Evaluation Datasets
Your evaluation dataset is the foundation of all measurement. A bad evaluation dataset is worse than no evaluation — it gives false confidence.
Synthetic Data Generation Pipeline
generation_prompt = """Given this document passage, generate 3 diverse
questions that can be answered using ONLY this passage.
For each question, provide:
1. The question
2. The expected answer (derived only from the passage)
3. Question type: FACTOID / HOW-TO / COMPARISON / REASONING / UNANSWERABLE
Vary the difficulty: 1 easy, 1 medium, 1 hard.
Passage: {chunk_text}
"""
Critical: Include unanswerable questions. Generate questions that look like they could be answered by your knowledge base but actually can’t. This tests the system’s ability to say “I don’t know” rather than hallucinating. Aim for 15-20% unanswerable questions in your dataset.
Recommended Methodology (Hybrid)
- Auto-generate 500+ QA pairs from your documents using an LLM
- Human experts review and filter to ~200 high-quality, diverse pairs
- Add real queries from production logs (if available) — 50-100 actual user questions
- Add adversarial examples — 20-30 tricky edge cases (ambiguous, multi-document, out-of-scope)
- Stratify by query type, difficulty, and document source
- Final dataset: 250-350 validated examples with ground truth answers and relevant chunks
Sample Size and Statistical Significance
| Dataset Size | What You Can Detect | Confidence |
|---|---|---|
| 50 examples | Major regressions (>15% drop) | Low |
| 100 examples | Moderate changes (~10% drop) | Medium |
| 200 examples | Small changes (~5% drop) | High (95%) |
| 500+ examples | Subtle changes (~2-3% drop) | Very high |
Rule of thumb: 200 examples is the minimum for production evaluation. Below that, you’re making decisions on noise.
9.7 Continuous Evaluation Pipeline
CI/CD Integration
Every change to your RAG system should trigger evaluation:
# GitHub Actions example
on:
  push:
    paths:
      - 'rag_config/**'
      - 'prompts/**'
      - 'chunking/**'
jobs:
  rag-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAG evaluation
        run: python evaluate_rag.py --config eval_config.yaml
      - name: Check regression
        run: |
          python check_regression.py \
            --baseline metrics/baseline.json \
            --current metrics/current.json \
            --threshold 0.05
A/B Testing RAG Configurations
When testing a new chunking strategy or prompt:
- Run evaluation suite on current configuration → baseline metrics
- Run evaluation suite on candidate configuration → candidate metrics
- Compute per-metric deltas
- Apply significance test (paired t-test or bootstrap confidence interval) — see the sketch after this list
- Deploy only if: no metric regresses >2% AND target metrics improve >3%
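A sketch of the significance check in step 4, using a paired bootstrap over per-example metric scores — the function is illustrative and would back a script like the check_regression.py step shown earlier:
# Paired bootstrap test: is the candidate's mean metric reliably different from baseline?
import random

def paired_bootstrap_delta(baseline: list, candidate: list,
                           n_resamples: int = 10000, seed: int = 42) -> dict:
    """baseline/candidate: per-example scores for the SAME evaluation queries."""
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    resampled_means = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]
        resampled_means.append(sum(sample) / len(sample))
    resampled_means.sort()
    lo = resampled_means[int(0.025 * n_resamples)]
    hi = resampled_means[int(0.975 * n_resamples)]
    return {'mean_delta': sum(deltas) / len(deltas), 'ci95': (lo, hi),
            'significant': lo > 0 or hi < 0}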
Production Monitoring Dashboard
Track these on a real-time dashboard (CloudWatch, Grafana, or Datadog):
- Faithfulness (sampled 5-10% of queries): LLM-as-judge scores
- Retrieval hit rate (100% of queries): did retrieval return results?
- Latency P50/P95/P99 (100% of queries): per-component breakdown
- Cost per query (100% of queries): token usage × pricing
- User negative feedback rate (100% of feedback): thumbs down / total rated
- Query reformulation rate (100% of sessions): consecutive queries on same topic = signal of failure
Alert thresholds (recommended starting points):
- Faithfulness < 0.7 → 🔴 Critical
- Hit rate < 0.8 → 🔴 Critical
- P95 latency > 8 seconds → 🟡 Warning
- Negative feedback rate > 20% → 🟡 Warning
- Cost per query > 2× baseline → 🟡 Warning
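Feeding the dashboard and the alert thresholds above takes only a few lines per query with CloudWatch custom metrics; the namespace and metric names in this sketch are illustrative.
# Emit per-query RAG metrics to CloudWatch for dashboards and alarms
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def publish_rag_metrics(latency_ms, cost_usd, faithfulness=None):
    metric_data = [
        {'MetricName': 'TotalLatency', 'Value': latency_ms, 'Unit': 'Milliseconds'},
        {'MetricName': 'CostPerQuery', 'Value': cost_usd, 'Unit': 'None'},
    ]
    if faithfulness is not None:  # only available for the sampled queries
        metric_data.append({'MetricName': 'Faithfulness', 'Value': faithfulness, 'Unit': 'None'})
    cloudwatch.put_metric_data(Namespace='RAG/Production', MetricData=metric_data)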
Drift Detection
Your documents change. Your users change. Your RAG system drifts. Schedule monthly evaluation runs against a stable, frozen evaluation dataset to catch drift:
- If metrics drop >5% month-over-month, investigate
- Common causes: new document types that chunking handles poorly, shifted user query patterns, vector store index degradation
9.8 The Cost of Evaluation (Often Overlooked)
Evaluation itself costs money — primarily in LLM calls for LLM-as-judge metrics.
Per-example evaluation cost (approximate):
- RAGAS (4 metrics): ~4-5 LLM calls per example → ~$0.02-0.05 per example
- Bedrock KB Evaluation: ~$0.01-0.03 per example (judge model calls)
- Human evaluation: ~$0.50-2.00 per example (annotator time)
For a 200-example evaluation suite:
- Automated (RAGAS/Bedrock): $4-10 per run
- Human evaluation: $100-400 per run
- Running daily automated + weekly human: roughly $500-2,000/month
This is cheap insurance. A single hallucinated answer in a customer-facing system can cost far more in trust, reputation, and remediation.
9.9 Common Anti-Patterns
“We tested it on 10 questions and it works great.”
→ 10 questions is anecdotal. You need 200+ for statistical validity.
“Our faithfulness is 0.95 so we’re good.”
→ On what dataset? If your eval set is too easy (simple factoid questions), 0.95 means nothing. Add adversarial and complex multi-hop queries.
“We use GPT-4 to judge GPT-4’s outputs.”
→ Same-model evaluation has known biases (verbosity preference, self-consistency). Use a different model as judge, or better, calibrate against human judgments.
“We evaluated once before launch.”
→ RAG quality degrades over time as documents change, usage patterns shift, and models update. Evaluation must be continuous.
“We optimized for faithfulness and declared victory.”
→ Faithfulness and completeness are in tension. A system that only answers when 100% certain will refuse many valid questions. Define your acceptable trade-off explicitly — for a medical FAQ, you might accept lower completeness for higher faithfulness; for a product recommendation bot, the opposite.
Source: Confident AI, “RAG Evaluation Metrics,” https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more
Source: RAGAS, “Metrics Documentation,” https://docs.ragas.io/en/latest/concepts/metrics/
Source: Braintrust, “RAG Evaluation Tools,” https://www.braintrust.dev/articles/best-rag-evaluation-tools
10. Production Checklist
From POC to Production
| Phase | Key Actions |
|---|---|
| POC | Default chunking, single embedding model, 10-20 test queries, no monitoring |
| Pilot | Optimized chunking for your document types, evaluation dataset (100+ queries), basic monitoring |
| Production | Hybrid search + reranking, query understanding layer, CI/CD evaluation pipeline, comprehensive monitoring, guardrails |
| Mature | Agentic RAG, adaptive retrieval, continuous evaluation, A/B testing framework, cost optimization |
Decision Table
| Scenario | Chunking | Retrieval | Enhancement | Evaluation |
|---|---|---|---|---|
| Simple FAQ bot | Fixed 512 | Dense only | None needed | Hit Rate + Faithfulness |
| Technical documentation | Hierarchical | Hybrid + Rerank | Query rewriting | Full 3-layer |
| Multi-domain enterprise | Structure-aware | Hybrid + Rerank + Route | Decomposition + Routing | Full 3-layer + A/B |
| Compliance/legal | Semantic | Hybrid + Rerank | Step-back + HyDE | Full + Adversarial (Giskard) |
| Conversational assistant | Hierarchical | Hybrid + Rerank | Context condensation | Full + User signals |
Cost Optimization
RAG system costs come from four main areas: embedding (ingestion + queries), vector storage and search, LLM generation, and optional components like reranking. Here are the highest-impact optimization strategies:
1. Tiered model routing. Not every step needs the most expensive model. Use a small, fast model (Haiku, Titan Lite) for query classification, routing, and rewriting. Reserve the larger model (Sonnet, Opus) for final answer generation. This alone can cut LLM costs by 40-60% without meaningful quality loss.
2. Semantic caching. Many RAG systems see significant query repetition — users ask the same or very similar questions. Cache the (query embedding → answer) mapping. For a new query, compute its embedding and check cosine similarity against cached query embeddings. If similarity exceeds a threshold (e.g., 0.95), return the cached answer without invoking retrieval or generation. Implement with ElastiCache or a simple in-memory store for low-volume applications (see the sketch after this list).
3. Embedding dimension reduction. If your embedding model supports Matryoshka representations (Titan V2, text-embedding-3-large), reduce dimensions from 1024 to 256 or 512. This cuts vector storage by 2-4× and speeds up search by ~3×, with only a 2-5% recall drop. Test on your eval set to confirm acceptable quality.
4. Right-size your vector store. OpenSearch Serverless has a minimum of 2 OCUs (~$350/month) regardless of usage. For low-volume applications (<100 queries/day), consider Aurora PostgreSQL with pgvector or even a FAISS index on a small EC2 instance — dramatically cheaper at small scale.
5. Cache frequent query embeddings. If the same query text appears repeatedly, skip the embedding API call and serve from cache. A simple LRU cache with TTL covers the common case.
6. Chunk-level deduplication. During ingestion, deduplicate near-identical chunks across documents. This reduces index size and prevents the retriever from wasting top-K slots on duplicate content.
7. Batch ingestion. Bedrock and most embedding APIs offer batch pricing or reduced per-token costs for batch operations. Schedule ingestion during off-peak hours and batch chunks rather than embedding them one at a time.
8. Monitor and alert on cost anomalies. Set CloudWatch billing alarms on Bedrock model invocation costs. A misconfigured pipeline (e.g., infinite retrieval loops in agentic RAG) can generate surprising bills quickly.
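To illustrate the semantic caching pattern from item 2, here is a minimal in-memory sketch. The 0.95 threshold mirrors the description above; a production system would back this with ElastiCache/Redis and an eviction policy.
# Minimal in-memory semantic cache: reuse answers for near-duplicate queries
import math
from typing import Optional

semantic_cache: list = []  # entries: {'embedding': [...], 'answer': '...'}

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cached_answer(query_embedding, threshold: float = 0.95) -> Optional[str]:
    for entry in semantic_cache:
        if cosine(query_embedding, entry['embedding']) >= threshold:
            return entry['answer']  # cache hit — skip retrieval and generation
    return None

def cache_answer(query_embedding, answer: str) -> None:
    semantic_cache.append({'embedding': query_embedding, 'answer': answer})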
Observability Setup
- CloudWatch: Custom metrics for retrieval latency, generation latency, chunk hit rate
- X-Ray: End-to-end tracing through the RAG pipeline (query → retrieval → generation)
- CloudWatch Logs: Log retrieved chunk IDs and relevance scores for debugging
- Dashboard: Combine latency, cost, and quality metrics in a single view
Conclusion
RAG is not a single technique — it’s an architecture with numerous decision points, each offering meaningful trade-offs. The difference between a demo and a production system lies in the details: how you chunk your documents, whether you enhance queries before retrieval, how you combine sparse and dense search, and — most critically — how rigorously you evaluate.
Start simple. Measure everything. Iterate based on data, not intuition. And remember: the best RAG system is one that knows when it doesn’t know.
📝 Note: The views, opinions, and technical recommendations expressed in this article are my own and do not represent the official position of any organization. All architecture patterns and code examples are for educational purposes — always validate against your specific requirements and the latest AWS documentation.
References
- Lewis, P. et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” https://arxiv.org/abs/2005.11401
- Gao, L. et al. (2023). “Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).” https://arxiv.org/abs/2212.10496
- Asai, A. et al. (2023). “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” https://arxiv.org/abs/2310.11511
- Yan, S. et al. (2024). “Corrective Retrieval Augmented Generation.” https://arxiv.org/abs/2401.15884
- Jeong, S. et al. (2024). “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” https://arxiv.org/abs/2403.14403
- Microsoft Research (2024). “GraphRAG: Unlocking LLM Discovery on Narrative Private Data.” https://arxiv.org/abs/2404.16130
- Press, O. et al. (2023). “Measuring and Narrowing the Compositionality Gap in Language Models.” https://arxiv.org/abs/2210.03350
- Zheng, Z. et al. (2023). “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.” https://arxiv.org/abs/2310.06117
- AWS Documentation. “Amazon Bedrock Knowledge Bases.” https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
- AWS Documentation. “Evaluate RAG sources using Amazon Bedrock.” https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-kb.html
- AWS Machine Learning Blog. “Evaluating RAG applications with Amazon Bedrock knowledge base evaluation.” https://aws.amazon.com/blogs/machine-learning/evaluating-rag-applications-with-amazon-bedrock-knowledge-base-evaluation/
- RAGAS Documentation. https://docs.ragas.io/
- DeepEval Documentation. https://docs.confident-ai.com/
- Hugging Face MTEB Leaderboard. https://huggingface.co/spaces/mteb/leaderboard
- LlamaIndex. “Auto-Merging Retriever.” https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/