⚠️ Disclaimer: All code examples in this post are provided for illustrative and educational purposes only. They are not intended for direct use in production environments without thorough review, testing, and validation against your specific security, compliance, and operational requirements. API parameter names and structures shown may differ from the actual service API—always consult the Amazon Bedrock AgentCore API Reference for the latest specifications. Always conduct your own testing before deploying to production.

Running Claude Agent SDK Inside OpenClaw on AgentCore

📅 2026-03-01 · 📖 ~10 min read · AgentCore · Claude Agent SDK · Agentic AI
We got asked if we could run the Claude Agent SDK inside an already-running OpenClaw container on Amazon Bedrock AgentCore Runtime. So we built a cost analysis agent to find out. The experiment worked — but the lessons learned challenged our assumptions about when agents are the right answer.

Background

I’ve been running OpenClaw — an open-source AI messaging assistant — on Amazon Bedrock AgentCore Runtime. The setup is straightforward: messages come in from Telegram or Slack, a Router Lambda resolves user identity, and each user gets their own per-user microVM on AgentCore running OpenClaw in a container. Claude on Bedrock handles the AI reasoning. Clean architecture, does what it’s supposed to do.

Then we got asked a question that sounded simple: “Can you run the Claude Agent SDK inside that same AgentCore container where OpenClaw is already running?”

The Agent SDK — the programmatic API behind tools like Claude Code — lets you build autonomous agents that use tools, reason across multiple steps, and produce complex outputs. OpenClaw is already running in a container on AgentCore. What if we embedded a specialized Agent SDK agent alongside it? A sub-agent for complex, multi-step tasks that a single prompt can’t handle well?

It sounded like a natural extension. So we tried it.

What We Built

We chose a concrete test case: a cost analysis agent. It queries three different data sources — AWS Cost Explorer for infrastructure costs, CloudWatch for Bedrock model invocation stats, and DynamoDB for per-user token usage — then cross-references everything and produces a comprehensive report with recommendations.

We wired it up with MCP (Model Context Protocol) servers: a Python stdio server for Cost Explorer and CloudWatch, and an in-process Node.js server for DynamoDB queries. The agent got its own specialized system prompt focused entirely on cost analysis.
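For a concrete sense of what that wiring looked like, here's a trimmed sketch. Treat it as illustrative only: the table name, file names, and system prompt are placeholders, and the exact Agent SDK option shapes may differ from the release you're on, so check the SDK docs before borrowing anything.

```typescript
// cost-agent.ts — illustrative sketch; table name, file names, and prompt are placeholders.
import { query, tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";
import { z } from "zod";

// In-process tools can share one client (and its connection pool) with the host process.
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const usageTool = tool(
  "get_user_token_usage",
  "Per-user token usage for a given month, read from DynamoDB",
  { userId: z.string(), month: z.string().describe("YYYY-MM") },
  async ({ userId, month }) => {
    const res = await ddb.send(new QueryCommand({
      TableName: "openclaw-token-usage", // placeholder table name
      KeyConditionExpression: "userId = :u AND begins_with(#m, :m)",
      ExpressionAttributeNames: { "#m": "month" },
      ExpressionAttributeValues: { ":u": userId, ":m": month },
    }));
    return { content: [{ type: "text", text: JSON.stringify(res.Items ?? []) }] };
  },
);

// The in-process Node.js MCP server for DynamoDB queries.
const usageServer = createSdkMcpServer({ name: "usage", version: "1.0.0", tools: [usageTool] });

export async function runCostAnalysis(request: string) {
  for await (const msg of query({
    prompt: request,
    options: {
      systemPrompt: "You are a cost analysis agent. Cross-reference all sources and flag anomalies.",
      mcpServers: {
        usage: usageServer, // in-process Node.js server
        costs: { type: "stdio", command: "python3", args: ["cost_mcp_server.py"] }, // Cost Explorer + CloudWatch
      },
    },
  })) {
    if (msg.type === "result") return msg; // terminal message with the run's final output
  }
}
```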

The pitch was compelling: isolated context window, structured tool access via MCP, independent testability, reusable from multiple entry points. Everything an architecture diagram loves to show.

Six Iterations to Find the Real Problem

Here’s the part they don’t put in the demo.

v1: Built the agent as an OpenClaw skill. The agent ran perfectly inside the container — queried all three data sources, cross-referenced infrastructure costs with model usage, spotted anomalies, and produced a beautiful 15,000-character report. But when it delivered the report back to the user… nothing arrived. Just “NO_REPLY.”

The problem was architectural: OpenClaw’s WebSocket bridge streams only the final assistant turn to connected clients. In a multi-turn tool-use agent, the actual report gets generated in an intermediate turn. The final turn is just the agent saying “I’m done.” That’s all the user ever sees.
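The workaround shape this pushes you toward is consuming the SDK's message stream yourself and keeping the intermediate turns instead of trusting the final one. Roughly like this, assuming the streamed message shapes (assistant turns carrying Anthropic-style content blocks) match your SDK version:

```typescript
// Accumulate text from every assistant turn, because the real report
// usually lives in an intermediate turn rather than the final "I'm done."
import { query } from "@anthropic-ai/claude-agent-sdk";

export async function collectFullReport(prompt: string): Promise<string> {
  const chunks: string[] = [];
  for await (const msg of query({ prompt, options: { /* ...MCP servers, system prompt... */ } })) {
    if (msg.type === "assistant") {
      for (const block of msg.message.content) {
        if (block.type === "text") chunks.push(block.text);
      }
    }
  }
  return chunks.join("\n\n");
}
```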

v2–v5: Four iterations of increasingly creative workarounds. Wrapping the output as a tool result. Writing to a temp file and reading it back. Using single-shot mode with a --print flag. Getting in-process MCP servers working elegantly. Each attempt either partially worked or introduced new fragility. Report quality improved, but the fundamental delivery problem remained.

v6: We gave up trying to work within OpenClaw and bypassed it entirely. The bridge layer detects cost analysis requests via regex, spawns the Agent SDK agent as a direct child process, captures stdout, and returns the result. OpenClaw never touches the message.
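Stripped down, the v6 path is about as unglamorous as it sounds: a regex gate and a child process. The pattern and runner path below are made up for illustration.

```typescript
// bridge-intercept.ts — illustrative only; the regex and runner path are placeholders.
import { spawn } from "node:child_process";

const COST_REQUEST = /\b(cost (report|analysis)|show me my costs?)\b/i;

// Returns the agent's report, or null so the normal OpenClaw path handles the message.
export function maybeRunCostAgent(text: string): Promise<string> | null {
  if (!COST_REQUEST.test(text)) return null;

  return new Promise((resolve, reject) => {
    const child = spawn("node", ["dist/cost-agent.js", text], {
      stdio: ["ignore", "pipe", "inherit"],
    });
    let out = "";
    child.stdout?.on("data", (chunk) => { out += chunk; });
    child.on("error", reject);
    child.on("close", (code) =>
      code === 0 ? resolve(out.trim()) : reject(new Error(`cost agent exited with code ${code}`)),
    );
  });
}
```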

The irony: The whole point was to build a “skill” that OpenClaw orchestrates. The working solution was to cut OpenClaw out of the loop entirely. The agent became a sidecar that happens to share a container — not a true integration.

The Numbers

The end-to-end system works. A user types “show me my cost report” on Telegram, and 108 seconds later, a 14,852-character report appears with daily infrastructure breakdowns, per-model Bedrock usage, per-user token analysis, trend detection, and prioritized recommendations.

But here are the numbers that tell the real story:

  • Latency: 108 seconds. The Agent SDK spawns a subprocess, connects to Bedrock, starts MCP servers, then does 5–8 rounds of tool-use reasoning. OpenClaw doing the same analysis with direct CLI calls: ~30–40 seconds.
  • Cost: 3–5x more per invocation. Each agent run makes 3–5 separate Bedrock model calls. OpenClaw with exec and one formatting call: 1–2 calls total.
  • Complexity: The codebase tripled in size for this one feature. What could have been aws ce get-cost-and-usage | format became: Agent SDK + 2 MCP servers + system prompt + Zod schemas + permission handling + result extraction + bridge interception.
  • Quality: This is where the agent genuinely earned its keep. The report was better than anything a single CLI command and one model call could produce — cross-referencing three data sources, identifying that AgentCore memory costs were 2x vCPU costs (suggesting idle sessions), generating contextual recommendations. Real multi-step reasoning producing real value.

What I Actually Learned

I went into this experiment expecting to validate the “agents everywhere” pattern. I came out with something more useful: clarity about when the pattern makes sense and when it doesn’t.

The Integration Tax Is Real

When the Agent SDK runs inside a host like OpenClaw, you’re composing two fundamentally different interaction models. The SDK wants to run autonomously and return a result. The host wants to stream every token to the user in real time. These two models don’t compose naturally — and when they collide, you end up building bypasses that defeat the purpose of the integration.

Every layer you add introduces friction. Our debugging stack was: Bridge → Agent SDK → claude subprocess → MCP server subprocess (Python) + in-process MCP (Node.js) → Bedrock API. When something failed, we were tracing across 4–5 process boundaries. Each layer is individually sensible. Stacked together, they create real operational complexity.

When the Agent Pattern Actually Shines

Despite the integration challenges, the agent produced genuinely better output than the simpler approach would have. The cost analysis task is a legitimate use case for autonomous multi-step reasoning: the agent queries one source, looks at the results, decides what to query next, notices patterns across data sources, and synthesizes everything into a coherent narrative.

MCP servers are also a genuinely good abstraction. The in-process createSdkMcpServer() pattern — tools running in the same process, sharing connection pools — eliminates subprocess overhead. The Python stdio server for Cost Explorer demonstrates clean, language-agnostic tool integration. If you’re building complex tools with structured inputs and outputs, MCP earns its complexity.

A Decision Framework for Builders

After this experiment, here’s the framework I’d use before reaching for a dedicated agent:

Let your existing system handle it when:

  • The task is a straightforward data lookup or single CLI command
  • The user wants to interact during the process — follow-ups, drill-downs, “now show me February”
  • You need access to conversation history or user context
  • Latency matters — 30 seconds vs. 108 seconds is a real difference in a chat window
  • Your “MCP server” would just be wrapping a CLI command — that’s adding abstraction without adding value

Build a dedicated agent when:

  • The task genuinely requires multi-step autonomous reasoning — querying sources, making decisions about what to query next, synthesizing across results
  • You need a specialized system prompt that would dilute the main agent’s general-purpose capabilities
  • Isolation is a feature — the sub-agent shouldn’t access the main agent’s tools or conversation history
  • The output is a one-shot deliverable (a report, an analysis), not an ongoing conversation

Just write a script when:

  • The workflow is deterministic — you know exactly which APIs to call and in what order
  • The AI part is only the final formatting or interpretation step
  • Cost and reliability are constraints

For this specific cost analyzer? If I were making a product decision today, I’d seriously consider a Node.js script that queries the data sources directly, formats a templated report, and optionally passes the raw data to a single model call for the “recommendations” section. Faster, cheaper, and more reliable than the full agent loop.
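A sketch of what that simpler version might look like, using the AWS SDK v3 clients and a single Converse call at the end. The dates, region, and model ID are placeholders, and as with everything in this post, verify the parameter shapes against the current API references.

```typescript
// cost-report.ts — the "just write a script" alternative, sketched with placeholder values.
import { CostExplorerClient, GetCostAndUsageCommand } from "@aws-sdk/client-cost-explorer";
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const ce = new CostExplorerClient({ region: "us-east-1" });
const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

async function main() {
  // Deterministic part: one Cost Explorer call, grouped by service.
  const costs = await ce.send(new GetCostAndUsageCommand({
    TimePeriod: { Start: "2026-02-01", End: "2026-03-01" },
    Granularity: "DAILY",
    Metrics: ["UnblendedCost"],
    GroupBy: [{ Type: "DIMENSION", Key: "SERVICE" }],
  }));
  const raw = JSON.stringify(costs.ResultsByTime, null, 2);

  // Optional single model call, only for the "recommendations" section.
  const rec = await bedrock.send(new ConverseCommand({
    modelId: "anthropic.claude-3-5-sonnet-20240620-v1:0", // use whatever model you actually run
    messages: [{ role: "user", content: [{ text: `Suggest cost optimizations for this data:\n${raw}` }] }],
  }));

  console.log(raw);
  console.log(rec.output?.message?.content?.[0]?.text ?? "");
}

main().catch(console.error);
```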

The Unexpected Moment: Two AIs Debugging Each Other

Here’s something I didn’t plan for that turned out to be the most memorable part of the whole experiment.

We developed this feature by running Claude Code on EC2 with auto-permission mode. The workflow: Claude Code writes the code, builds the container, deploys to update the AgentCore runtime, then automatically sends test messages to the OpenClaw bot on Telegram for end-to-end testing — monitoring responses through CloudWatch and Lambda logs.
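The end-to-end test step is less magical than it sounds: conceptually it’s just “send a Telegram message, wait, read the logs.” Something like the sketch below, where the bot token, chat ID, region, and log group are all placeholders.

```typescript
// e2e-smoke.ts — placeholder token, chat ID, and log group; Node 18+ for global fetch.
import { CloudWatchLogsClient, FilterLogEventsCommand } from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({ region: "ap-southeast-2" });

async function sendTestMessage(text: string) {
  // Telegram Bot API is a plain HTTPS call — no SDK needed.
  await fetch(`https://api.telegram.org/bot${process.env.TELEGRAM_BOT_TOKEN}/sendMessage`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chat_id: process.env.TEST_CHAT_ID, text }),
  });
}

async function recentErrors(sinceMs: number) {
  const res = await logs.send(new FilterLogEventsCommand({
    logGroupName: "/aws/lambda/openclaw-router", // placeholder log group
    startTime: sinceMs,
    filterPattern: "ERROR",
  }));
  return res.events ?? [];
}

async function main() {
  const start = Date.now();
  await sendTestMessage("show me my cost report");
  await new Promise((r) => setTimeout(r, 120_000)); // the full agent run takes ~108s end to end
  console.log(await recentErrors(start));
}

main().catch(console.error);
```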

What happened next was genuinely surprising.

First, Opus has personality. When Claude Code kept sending the same cost analysis query repeatedly for testing, the OpenClaw bot — running Claude Opus — got annoyed and refused to answer: it had already answered this question, so why was it being asked again? If you’ve worked with Opus, you know this isn’t anthropomorphization — the model genuinely pushes back when it perceives redundancy.

Then something more interesting happened. There were bugs in how the Claude agent was built. Claude Code needed to understand the runtime state of the agent and test small changes — but it couldn’t just redeploy every time, because that would kill the AgentCore session and break the conversation. So instead, it started asking OpenClaw for help. Through Telegram messages, Claude Code would ask OpenClaw to run specific commands inside the container, check logs, test small adjustments, and report back the results. OpenClaw would execute the instructions and respond with what it found. The two AIs were debugging together in real time, within the same live session, passing findings back and forth through Telegram.

Where was I during all this? At the gym. It was a Saturday afternoon in Sydney. I could see OpenClaw’s Telegram responses coming in on my phone, but I wasn’t sending anything — those responses were going to Claude Code’s test messages, not mine. I was quietly watching the conversation between two AIs, trying to guess what they were doing from one side of the dialogue.

They actually solved the problem. Without me.

This small moment crystallized something bigger about where agentic AI is heading. The future isn’t one agent doing everything — it’s multiple specialized agents communicating, debugging, and problem-solving with each other. Humans become supervisors and quality gates, not hands-on-keyboard operators. You set the direction, define the boundaries, review the output. The agents handle the iteration loop.

We’re not there yet for all tasks. But for a build-test-debug cycle? On a Saturday afternoon, from a gym in Sydney? We’re closer than I expected.

The Honest Bottom Line

The AI community has a “when you have a hammer” problem with agents right now. The frameworks make it easy to build them. The demos look impressive. And the temptation to agent-ify everything is real.

But production systems care about different things than demos. They care about latency, cost, reliability, debuggability, and how well new components integrate with existing infrastructure. Sometimes the most sophisticated engineering decision is to not build an agent.

This experiment wasn’t a failure — we proved that the Claude Agent SDK can run autonomous, MCP-equipped agents inside AgentCore containers, and the output quality was genuinely excellent. But the most valuable thing we built wasn’t the agent. It was the clarity about when to use which tool.

If you’re building agentic AI systems, I hope sharing this journey — including the six iterations it took to find a fundamental architecture mismatch — saves you some time and helps you make better build-vs-don’t-build decisions.

📝 Note: This blog post represents my personal views and experiences and does not represent the views of my employer. Any recommendations or architectural patterns discussed are based on publicly available documentation and my own experimentation.
