Context Engineering and the Real Agent Problem

📅 17 April 2026 · 📖 ~14 min read · Context Engineering · AI Agents · Production AI · Bedrock AgentCore
⚠️ Disclaimer: The analysis and architectural patterns discussed in this post are based on publicly available documentation and my own experience building production agent systems. Platform comparisons reflect the state of these services at the time of writing and may change as products evolve. Always consult official documentation for the latest specifications.
The bottleneck for production AI agents is no longer model capability. It’s what you feed the model. Context engineering is the discipline of designing the full information system that determines what tokens the model sees at inference time, and when. This post breaks down why it matters, how the major platforms approach it, and six principles that hold up in real-world deployments.

Last month, I watched a coding agent read the same file four times in a row. Not because the file had changed. Because by turn 40 of its tool-call loop, the earlier file contents had been pushed so far back in the context window that the model couldn’t “see” them anymore. It re-read the file, processed it, took an action, and then lost track of what it had just done. Three loops later, it read the file again.

This wasn’t a broken model. Claude Sonnet is more than capable of reading a file and remembering it. The problem was that nobody had designed what the model sees at each step. The conversation history, tool outputs, system instructions, and retrieved documents had all been dumped into the context window with no curation, no prioritization, and no strategy for what to keep and what to drop.

This is what kills agents in production. Not model intelligence. Context management.

The timing to talk about this feels right. In the past few weeks, Anthropic shipped its Managed Agents platform with a specific architecture for separating reasoning from execution context. The Manus team published production data showing a 100:1 input-to-output token ratio in their agent workflows. UCSD researchers released a survey cataloging how agent memory systems fail at scale. And Google, AWS, and several startups all shipped memory infrastructure that treats context as a first-class engineering problem rather than a prompt engineering afterthought.

All of these point in the same direction: the bottleneck for production agents is no longer model capability. It’s what you feed the model.

Context Engineering vs. Prompt Engineering

Prompt engineering is about writing good instructions. You craft a system prompt, add some examples, maybe include a chain-of-thought hint, and iterate until the model gives good responses.

Context engineering is broader. It’s about designing the full system that determines what tokens the model sees at inference time, and when.

That system includes:

  • The system prompt and its instructions
  • The conversation history (which grows with every turn)
  • Tool definitions (which take up space even when not used)
  • Tool outputs from previous steps
  • Retrieved documents from RAG or search
  • User preferences and facts pulled from long-term memory
  • Structured output schemas

Every one of these consumes tokens. Every token competes for the model’s attention with every other token. And here’s what catches people off guard: more context usually makes things worse, not better. Including every piece of possibly-relevant information feels safe. In practice, it dilutes the signal. The model spends attention on a stale document from three turns ago instead of the fresh tool output it just received.

Andrej Karpathy put it simply: context engineering is about providing “all the context for the task to be plausibly solvable by the LLM.” All the context. Not all the information. There’s a big difference.

Why Agents Break Where Chatbots Don’t

A single-turn chatbot has it easy. The user sends a message, you retrieve some context, the model responds. If the response is bad, the user asks again. The context window starts fresh each time.

Agents run in loops. They take an action, observe the result, decide what to do next, take another action. Each iteration adds to the context. After 20 or 30 steps, the context window is full of tool call records, observation data, intermediate reasoning, and accumulated conversation. And unlike a chatbot, the agent can’t ask the user to “try again.” It has to keep going.

This creates failure modes that don’t exist in chatbot scenarios:

Context rot. As the agent loops, older information gets pushed further from the model’s attention. The model starts repeating actions it already took because it can’t recall its own recent history. This is exactly what happened with my coding agent reading the same file four times. I’ve also seen agents fall into a subtler version of this: self few-shotting. When the context contains a long sequence of action-observation pairs that follow the same pattern, the model locks into that pattern and stops adapting. It keeps calling the same API with the same parameters even when the situation has changed, because the repeated structure in its own history acts as implicit few-shot examples. The Manus team identified this as a real production issue and found that introducing structured variation into the agent’s action format helps break the pattern.

Context poisoning. An early hallucination or bad tool output enters the context. Every subsequent step now reasons on top of that bad information. In a chatbot, the user catches the error. In an agent loop, nobody’s watching turn by turn.

Context confusion. Too many tool definitions or irrelevant documents crowd the window. The model picks the wrong tool because it’s distracted by options it doesn’t need. Or it follows instructions from a retrieved document that contradicts its system prompt.

Context blow-up. A single tool observation, like scraping a web page or reading a PDF, dumps 50K tokens into the context. Suddenly there’s no room for the conversation history, and the model loses track of the task entirely.

The Economics of Agent Context

The cost structure explains why this matters so much. The Manus team published a number that should get your attention: in their production agent system, the average input-to-output token ratio is approximately 100:1. For every token the model generates, it processes 100 tokens of context.

This means the vast majority of your inference cost comes from context processing, not generation. And with most model providers, there’s a massive cost difference between cached and uncached input tokens. With Claude Sonnet, cached input costs $0.30 per million tokens while uncached costs $3.00. Whether your context is cache-friendly can swing your bill by an order of magnitude.

This makes KV-cache hit rate one of the most important operational metrics for a production agent. And it has direct implications for how you design your context:

Keep your prefix stable. LLMs process context left to right. If your system prompt changes between turns (even something small like a timestamp precise to the second), the cache invalidates from that point forward. Everything after the change gets reprocessed at full price.

Make context append-only. Don’t modify or reorder previous entries. Append new actions and observations to the end. Ensure your JSON serialization is deterministic. Some languages don’t guarantee key ordering, and that’s enough to break the cache.
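
To make those two rules concrete, here's a minimal Python sketch. The `AppendOnlyContext` class and its method names are mine, not from any framework:

```python
import json

def serialize_entry(entry: dict) -> str:
    # sort_keys gives deterministic key order (some languages and libraries
    # don't guarantee one); compact separators avoid whitespace drift.
    # Any byte-level change in an earlier entry invalidates the KV-cache
    # from that point forward.
    return json.dumps(entry, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

class AppendOnlyContext:
    """Context as an append-only log: never reorder or rewrite old entries."""

    def __init__(self, system_prompt: str):
        # Stable prefix: no timestamps or per-request values in here.
        self.entries = [system_prompt]

    def append(self, entry: dict) -> None:
        self.entries.append(serialize_entry(entry))

    def render(self) -> str:
        return "\n".join(self.entries)
```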

Don’t swap tools mid-workflow. Tool definitions typically sit near the front of the serialized context. If you dynamically add or remove tools between iterations, everything after the tool definitions becomes uncached. Manus learned this the hard way and switched to a masking approach: all tools stay in the context permanently, but a state machine controls which tools the model is allowed to select at each step by constraining the output token logits. This preserves the cache while still narrowing the action space.
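
A rough sketch of the masking idea follows. The state names and tools are invented for illustration; note that raw logit masking is only possible with self-hosted decoding, while hosted APIs offer approximations like allowed-tool constraints or response prefill:

```python
import math

# All tool definitions stay serialized in the context permanently; only the
# *selectable* set changes as the workflow moves through states.
STATE_MASK = {
    "planning":  {"search_web", "read_file"},
    "executing": {"run_code", "write_file"},
    "reporting": {"send_message"},
}

def mask_tool_logits(tool_logits: dict[str, float], state: str) -> dict[str, float]:
    # Push disallowed tools to -inf so they can never be sampled, while the
    # (cached) tool definitions in the context never change.
    allowed = STATE_MASK[state]
    return {tool: (logit if tool in allowed else -math.inf)
            for tool, logit in tool_logits.items()}
```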

Memory Architecture: Three Layers, Not One

Most agent frameworks treat memory as “keep the conversation history.” That’s one layer of a system that needs at least three.

Short-term memory is the current context window. It’s the model’s workspace for this turn: current conversation, recent tool outputs, active instructions. This layer needs aggressive curation. Summarize older turns. Drop tool outputs after they’ve been processed. When an observation is 20K tokens of raw HTML, extract the relevant data and discard the rest before it enters the context.

The Manus team takes this further with a file-as-context pattern. Instead of keeping large outputs in the context window, they write them to the file system and include only a reference (file path + summary) in the context. The model can read the file again if it needs the full content, but most of the time the summary is enough. This effectively gives the agent unlimited external memory while keeping the context window lean.
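
A minimal version of the pattern might look like this. The function name and workspace path are hypothetical, and `summarize` stands in for whatever compression step fits your stack — a cheap model call, a structured extractor:

```python
import hashlib
from pathlib import Path

WORKSPACE = Path("/tmp/agent_workspace")  # illustrative location

def externalize_observation(raw: str, summarize) -> dict:
    # Write the large output to disk instead of the context window.
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    path = WORKSPACE / f"obs_{hashlib.sha256(raw.encode()).hexdigest()[:12]}.txt"
    path.write_text(raw)
    # Only this small reference enters the context; the agent can re-read
    # the file later if the summary turns out to be insufficient.
    return {"type": "file_ref", "path": str(path), "summary": summarize(raw)}
```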

Working memory tracks state across a multi-step task. Think of it as the agent’s scratchpad. It holds the current plan, decisions already made, constraints discovered, intermediate results. Without working memory, agents re-derive information they already found, waste tokens on redundant tool calls, and sometimes contradict their own earlier reasoning.

In practice, working memory often lives in a structured format outside the conversation: a task state file, a JSON scratchpad, or a dedicated section of the system prompt that gets updated between iterations. The key property is that it persists across turns but not across sessions.

Manus implements a clever variant of this: a todo.md file that the agent rewrites at every step. The file contains the full task plan with completed and remaining items. By reading this file at the start of each iteration, the agent pushes its global plan back into recent attention, preventing the “lost the plot” problem that plagues long-running workflows. It’s a form of structured self-reminder that costs very few tokens relative to keeping the full action history.
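
Here's a bare-bones reconstruction of the mechanism. Manus hasn't published the exact format, so treat this as the shape of the idea rather than their implementation:

```python
from pathlib import Path

TODO = Path("todo.md")

def rewrite_plan(done: list[str], remaining: list[str]) -> str:
    # Rewriting the whole file each step keeps the global plan in the most
    # recent part of the context, where attention is strongest.
    lines = ["# Task plan"]
    lines += [f"- [x] {item}" for item in done]
    lines += [f"- [ ] {item}" for item in remaining]
    TODO.write_text("\n".join(lines))
    return TODO.read_text()  # appended to context at the start of the next step
```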

Long-term memory stores information that survives beyond a single session. User preferences, past decisions, domain knowledge, learned patterns. This layer needs to be indexed and searchable, not just appended to context.

The retrieval layer between long-term memory and the active context is where the real engineering challenge lives. You need to decide, for each turn: which stored memories are relevant enough to promote into the context window? Pull too many and you dilute the signal. Pull too few and the model misses information it needs.

AWS’s Bedrock AgentCore Memory service breaks long-term memory into four extraction strategies that map well to real use cases: semantic facts (“the user’s company has 500 employees”), user preferences (“prefers Python over TypeScript”), conversation summaries (compressed session recaps), and episodic memory (multi-turn interaction patterns with reflections). Each strategy has its own extraction, consolidation, and retrieval pipeline.

The episodic strategy is the most sophisticated. It doesn’t just record what happened; it detects when an interaction episode is complete, merges multi-turn extractions into a single episode record, and then generates cross-episode reflections that identify success patterns and failure patterns. When the agent encounters a similar situation later, it can retrieve not just “what happened last time” but “what we learned from what happened last time.” This is the closest thing production systems have to procedural memory today.

The HOT/COLD Path Split

One architectural decision that matters more than people expect: memory reads and memory writes must live on different paths.

The HOT path is synchronous. When the agent needs to generate its next response, memory retrieval happens on this path. The user is waiting. Latency matters. You query long-term memory, pull relevant records into context, and the model generates. This path should be read-only with respect to long-term memory.

The COLD path is asynchronous. After the agent responds, the conversation events get written to short-term memory (fast, synchronous). Then, in the background, extraction strategies analyze those events and produce long-term memory records. This can take a minute or more. The user doesn’t wait for it.

When I was building the memory pipeline for our Strands-based agent on AgentCore, we initially had memory extraction on the synchronous path. Every turn, the agent would pause for 30-60 seconds while AgentCore extracted and consolidated memories. Users thought the agent had frozen. Moving extraction to the async path cut perceived latency by 80% with no loss in memory quality, since the memories from turn N aren’t needed until turn N+2 at the earliest.

AgentCore enforces this split by design. CreateEvent (writing raw events to short-term memory) is synchronous and fast. The extraction pipeline that produces long-term records runs asynchronously in the background, triggered by configurable conditions: message count thresholds, token count thresholds, or session idle timeouts.
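
A sketch of the two paths against the AgentCore data plane via boto3. The method names `create_event` and `retrieve_memory_records` match my reading of the API documentation, but treat the parameter and response shapes as approximate rather than authoritative:

```python
from datetime import datetime, timezone
import boto3

agentcore = boto3.client("bedrock-agentcore")  # data-plane client

def cold_path_write(memory_id, actor_id, session_id, user_msg, agent_msg):
    # Fast synchronous append of raw events. Long-term extraction then runs
    # asynchronously inside the service; the user never waits on it.
    agentcore.create_event(
        memoryId=memory_id,
        actorId=actor_id,
        sessionId=session_id,
        eventTimestamp=datetime.now(timezone.utc),
        payload=[
            {"conversational": {"role": "USER", "content": {"text": user_msg}}},
            {"conversational": {"role": "ASSISTANT", "content": {"text": agent_msg}}},
        ],
    )

def hot_path_read(memory_id, actor_id, query):
    # Synchronous, read-only retrieval before generation.
    resp = agentcore.retrieve_memory_records(
        memoryId=memory_id,
        namespace=f"/actors/{actor_id}",
        searchCriteria={"searchQuery": query, "topK": 5},
    )
    return resp.get("memoryRecordSummaries", [])  # response field name approximate
```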

Multi-Tenancy: Where Context Architecture Gets Hard

If you’re building agents for a single user, context management is a design problem. If you’re building agents for thousands of users on a shared platform, it becomes an infrastructure problem.

The core challenge is isolation. When multiple users share the same underlying infrastructure, their contexts, memories, and cached prefixes must not leak between tenants. The failure modes are subtler than you’d expect:

Vector database neighbor leakage. If you’re using vector search for memory retrieval, the nearest-neighbor query might return results from another user’s namespace. Pre-filtering by tenant ID before the vector search is the correct approach, but some implementations do post-filtering (search first, then discard wrong-tenant results), which both leaks information and wastes compute.
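
The difference is easy to see in code. This uses a hypothetical vector-index client; the `filter` syntax mirrors Pinecone-style metadata filters, but most vector stores offer an equivalent:

```python
def retrieve_memories(index, query_vector, tenant_id, top_k=5):
    # Correct: pre-filter. The nearest-neighbor search only ever sees this
    # tenant's vectors.
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter={"tenant_id": {"$eq": tenant_id}},
    )

def retrieve_memories_leaky(index, query_vector, tenant_id, top_k=5):
    # Anti-pattern: post-filter. Searches across all tenants, then discards.
    # Cross-tenant results transit your application code, and the
    # over-fetching wastes compute.
    hits = index.query(vector=query_vector, top_k=top_k * 10)
    return [h for h in hits if h["metadata"]["tenant_id"] == tenant_id][:top_k]
```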

KV-cache boundary violations. If you’re self-hosting models, the prefix cache can become a side channel. User A’s cached prefix might influence what gets cached for User B if the cache key doesn’t include tenant isolation. Most managed API providers handle this correctly, but self-hosted setups need explicit attention.

Memory consolidation collisions. When long-term memory extraction runs asynchronously, there’s a window where one user’s memory write could interfere with another’s if namespace isolation isn’t enforced at the storage layer.

AgentCore Memory addresses this with namespace-based isolation. Every memory record lives under a path like /strategy/{strategyId}/actors/{actorId}/, and retrieval queries are scoped to the requesting actor’s namespace. The actor ID comes from the authenticated request (JWT), not from the agent’s context, so the model can’t be tricked into querying another user’s memory through prompt injection.
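
A sketch of the principle using PyJWT; the helper name and key-handling details are mine:

```python
import jwt  # PyJWT

def actor_namespace(token: str, strategy_id: str, public_key, audience: str) -> str:
    # Identity comes from the *verified* token, never from model output, so
    # prompt injection can't redirect retrieval to another user's namespace.
    claims = jwt.decode(token, public_key, algorithms=["RS256"], audience=audience)
    return f"/strategy/{strategy_id}/actors/{claims['sub']}/"
```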

There are broadly three architecture patterns for multi-tenant agent memory:

Shared infrastructure, logical isolation. One vector database, one memory service, tenant isolation via namespaces and access controls. Cheapest and simplest. Works when tenants trust the platform operator and don’t need strict data residency.

Dedicated storage, shared compute. Each tenant gets their own memory store, but shares the inference and orchestration layer. Better isolation guarantees, moderate cost.

Full isolation. Dedicated everything per tenant. Most expensive, but required in regulated industries (healthcare, finance) where data cannot share infrastructure.

Most teams start with pattern one and discover they need pattern two when their first enterprise customer asks about data isolation.

One isolation dimension that’s easy to miss: credentials. When agents call external APIs on behalf of users, the agent code should never see raw tokens or API keys. The clean pattern is to route external calls through a credential proxy (or use scoped OAuth via something like MCP) where the agent submits a request and the proxy injects the user’s credentials at the network layer. Anthropic’s managed agents enforce this explicitly: credentials never enter the sandbox where the agent’s code runs. If the agent is compromised or the model is jailbroken, the attacker gets the agent’s output, not the user’s API keys.
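
The proxy-side logic can be tiny. A sketch with hypothetical names (the vault lookup would really be a secrets-manager call):

```python
import requests

# Proxy-side code; the agent process never imports or sees this module.
CREDENTIAL_VAULT: dict[str, str] = {}  # populated from a secrets manager,
                                       # never from anything the agent sends

def proxied_call(tenant_id: str, method: str, url: str, body: dict | None = None):
    # The agent submits (method, url, body); the proxy injects the secret.
    # A jailbroken agent sees responses, never the Authorization header.
    headers = {"Authorization": f"Bearer {CREDENTIAL_VAULT[tenant_id]}"}
    return requests.request(method, url, json=body, headers=headers, timeout=30)
```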

How the Major Platforms Handle Context

The approaches from Anthropic, Google, and AWS reveal different philosophies about where context management responsibility should live.

Anthropic’s Managed Agents introduce a “Brain and Hands” architecture that makes a clean separation between reasoning and execution.

The Brain is the reasoning component. It sees the full conversation context, maintains the plan, and decides what to do next. The Hands are execution components: specialized sub-agents or tools that carry out specific actions. Each Hand gets its own scoped context containing only what it needs for its specific task, not the full conversation history.

But the most interesting layer is the Session. In Anthropic’s design, a Session is not the same thing as a context window. The Session is a durable event log: a complete record of everything that happened during an interaction. Every message, every tool call, every result, every decision. The context window is a view into that log, constructed fresh for each model call by selecting which events are relevant right now.

This distinction matters for production systems in ways that aren’t obvious at first:

  • Model upgrades don’t break history. You can switch from Sonnet to Opus mid-session. The session log is model-agnostic. Only the context view needs to be reconstructed for the new model’s format.
  • Crash recovery becomes replay. If the agent process dies, you reconstruct the full state from the session log, like replaying a write-ahead log in a database. The context window is ephemeral; the session is durable.
  • Context views can be specialized. The Brain gets one view (high-level plan + recent decisions). A Hand gets a different view (specific instructions + relevant tool outputs). Same underlying session, different context projections.

This is a fundamentally different architecture from “stuff the conversation history into the context window and hope for the best,” which is what most agent frameworks still do.
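
To make the session/context distinction concrete, here's my own illustrative sketch of the pattern in Python. The event kinds and projection logic are invented; this is not Anthropic's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str      # e.g. "message", "tool_call", "tool_result", "decision"
    payload: dict

@dataclass
class Session:
    """Durable, model-agnostic event log. Context windows are views over it."""
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)  # write-ahead-log style: append only

    def brain_view(self, recent: int = 10) -> list[Event]:
        # High-level projection: every plan decision, plus recent activity.
        decisions = [e for e in self.events if e.kind == "decision"]
        return decisions + self.events[-recent:]

    def hand_view(self, tool: str) -> list[Event]:
        # Scoped projection: only what this sub-agent needs for its task.
        return [e for e in self.events
                if e.kind in ("tool_call", "tool_result")
                and e.payload.get("tool") == tool]
```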

Google’s Agent Development Kit (ADK) takes a pipeline approach: context flows through composable processors that add, remove, or transform entries before the model sees them. Its memory model separates session state, user memory, and global knowledge, with the processor pipeline deciding what gets promoted into the active context window.
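
The processor pattern is easy to sketch independently of ADK's actual API (which I won't reproduce here): a processor is just a function from context entries to context entries, and the pipeline is function composition:

```python
from typing import Callable

ContextEntry = dict
Processor = Callable[[list[ContextEntry]], list[ContextEntry]]

def drop_stale_tool_outputs(entries: list[ContextEntry]) -> list[ContextEntry]:
    # Example processor: keep only the most recent output per tool; older
    # duplicates just dilute attention.
    seen: set = set()
    kept = []
    for e in reversed(entries):
        key = (e.get("type"), e.get("tool"))
        if e.get("type") == "tool_output" and key in seen:
            continue
        seen.add(key)
        kept.append(e)
    return list(reversed(kept))

def build_context(entries: list[ContextEntry],
                  pipeline: list[Processor]) -> list[ContextEntry]:
    for process in pipeline:
        entries = process(entries)
    return entries
```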

AWS’s Bedrock AgentCore treats memory as managed infrastructure. You don’t build your own memory extraction pipeline; you configure strategies and let the service handle the rest. The four built-in strategies (semantic, preference, summary, episodic) cover the most common use cases. For advanced cases, self-managed strategies let you plug in your own extraction logic while still using AgentCore’s storage and retrieval infrastructure.

The philosophical difference: Anthropic gives you architectural patterns (Brain/Hands/Session) and trusts you to build the implementation. Google gives you a processing pipeline with composable middleware and lets you customize each stage. AWS gives you managed infrastructure with configurable strategies that handle the heavy lifting.

In practice, teams with strong infrastructure skills gravitate toward Anthropic’s approach. Teams that want flexibility with guardrails prefer Google’s pipeline model. Teams that want to focus on agent logic rather than memory plumbing choose AWS. Most production systems end up combining ideas from all three: the Session-as-durable-log concept from Anthropic is sound regardless of platform, the processor pipeline from Google is a useful pattern for any context construction system, and the managed extraction strategies from AWS solve a real operational problem that nobody wants to rebuild from scratch.

|  | Anthropic Managed Agents | Google ADK | AWS AgentCore Memory |
|---|---|---|---|
| Architecture | Brain/Hands/Session | Processor pipeline | Managed memory service |
| Context scoping | Per-component (Brain vs. Hands) | Per-processor transform | Per-strategy namespace |
| Memory types | Session-managed | Session/User/Global | STM/LTM with 4 strategies |
| Multi-tenancy | Application-level | Application-level | Built-in namespace isolation |
| Customization | Build your own | Configure processors | Configure strategies or bring your own |

Six Principles for Production Context Engineering

Based on what I’ve seen work (and fail) across multiple production agent systems:

1. Design around the KV-cache, not the context window. Your context window size is a ceiling. Your KV-cache hit rate determines your actual cost and latency. Every design decision should consider cache impact first. Stable prefixes, append-only history, deterministic serialization.
2. Mask tools instead of swapping them. When you need to narrow the agent’s action space, don’t remove tool definitions from the context. Leave them in place and constrain the output at the decoding level. This preserves cache hits and avoids confusing the model with disappearing tool references.
3. Use the file system as extended memory. Large tool outputs (web pages, documents, data files) should be written to disk, not stuffed into the context window. Include a path and summary in context. Let the agent re-read the file if it needs the full content.
4. Keep errors in context. When an agent takes a wrong action, don’t strip it from the history to “clean up” the context. Leave the error and the correction visible. The model learns from its own mistakes within a session. Removing errors causes it to repeat them.
5. Separate memory reads from memory writes. Memory retrieval can happen synchronously (you need the data before generating a response). Memory extraction and consolidation should happen asynchronously (the user shouldn’t wait for background processing). This keeps the agent responsive while still building long-term memory.
6. Budget your context explicitly. Allocate token budgets for each context component: N tokens for instructions, M for history, K for tool definitions, J for retrieved context. When one grows, another shrinks. Don't let any single component crowd out the others (a minimal budgeting sketch follows this list).
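
For principle six, a trivial enforcement sketch. The allocations are illustrative, and production systems should summarize the overflow rather than blindly truncate it:

```python
BUDGET = {
    "instructions": 2_000,  # illustrative allocations, not recommendations
    "tools":        3_000,
    "retrieved":    2_000,
    "history":      9_000,
}

def enforce_budget(component: str, text: str, count_tokens) -> str:
    # `count_tokens` is any tokenizer callable for your model of choice.
    limit = BUDGET[component]
    while count_tokens(text) > limit:
        text = text[: int(len(text) * 0.9)]  # crude 10% trims until it fits
    return text
```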

What’s Still Missing

For all the progress in the past year, there are gaps that nobody has solved well yet:

Cross-agent memory consistency. When multiple agents collaborate on a task, there’s no standard protocol for sharing memory state or resolving contradictions between their individual memories. Each agent maintains its own context, and coordinating across agents requires custom plumbing.

Procedural memory. Agents can remember facts and preferences (declarative memory). They’re bad at remembering how to do things (procedural memory). Multi-step workflows that worked last time should be replayable, but most memory systems only store what happened, not the decision logic that produced it.

Context budget negotiation. When an agent has more relevant context than fits in its window, there’s no principled way to decide what to cut. Most systems use heuristics (oldest first, lowest similarity score first). We don’t have good frameworks for making these tradeoffs explicit.

I’m honestly not sure how some of these will get solved. Cross-agent memory, in particular, feels like it might need something analogous to distributed consensus protocols, and the history of distributed systems suggests that’s going to be harder than anyone expects. Procedural memory might turn out to be less about “remembering how” and more about building reliable skill extraction pipelines that convert successful execution traces into reusable workflows.

But the fact that these are the questions we’re asking tells you something about where the field is. A year ago, the conversation was about which model to use. Now it’s about how to architect what the model sees. That shift, from model selection to context engineering, is where the real work of building production agents lives.


📝 Note: This blog post represents my personal views and experiences and does not represent the views of my employer. Any recommendations or architectural patterns discussed are based on publicly available documentation and my own analysis.

Melanie Li is an AWS Solutions Architect specializing in machine learning and AI. She writes about building production AI systems at melanieli.com.au.
