OpenFang on AWS (Part 2): Security Review, Deployment & Lessons Learned
Part 2 of the OpenFang on AWS series. Read Part 1: OpenFang Agent OS on AWS with Bedrock first.
7. Security: What the WAF Review Revealed
Before deploying any AI agent system beyond a proof of concept, run it through the AWS Well-Architected Framework Security Pillar. Here is what we found — and the lessons apply to any autonomous agent deployment, not just OpenFang.
What Was Already Good
The deployment scored 7/10 for both Identity and Access Management and Infrastructure Protection:
- Zero inbound ports: The security group allows no ingress traffic. Nothing on the internet can connect to this instance.
- Bedrock via PrivateLink: All Bedrock API calls route through a VPC Endpoint with a dedicated security group, never traversing the public internet or NAT Gateway.
- IMDSv2 enforced: Prevents SSRF-based credential theft from the instance metadata service.
- Instance profile: No static AWS credentials anywhere in the system. Temporary credentials rotate automatically.
- Least-privilege IAM: The policy allows only bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on specific model families.
- Encrypted EBS: Data at rest is encrypted by default.
- Localhost-bound Docker ports: Both OpenFang (50051) and LiteLLM (4000) bind to 127.0.0.1, not 0.0.0.0.
- No SSH keys: Access is exclusively through SSM Session Manager.
ℹ️ Baseline: If you are deploying agents on AWS, this is the baseline to aim for.
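To make the least-privilege item concrete, here is a minimal sketch of a policy document restricted to the two Bedrock actions named in the list. The resource pattern (`anthropic.claude-*`) and the function name `build_bedrock_policy` are assumptions for illustration; the actual stack may scope its resources differently.

```python
import json

# The only two actions the article says the policy grants.
ALLOWED_ACTIONS = [
    "bedrock:InvokeModel",
    "bedrock:InvokeModelWithResponseStream",
]


def build_bedrock_policy(model_patterns):
    """Return an IAM policy document that allows invoking only the given
    foundation-model ID patterns (the patterns themselves are hypothetical)."""
    resources = [
        # Foundation-model ARNs carry no account ID, hence the empty field.
        f"arn:aws:bedrock:*::foundation-model/{pattern}"
        for pattern in model_patterns
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ALLOWED_ACTIONS,
                "Resource": resources,
            }
        ],
    }


policy = build_bedrock_policy(["anthropic.claude-*"])
print(json.dumps(policy, indent=2))
```

If cross-region inference profiles are in use, the corresponding inference-profile ARNs would likely need to be listed as resources as well.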
What Was Missing
The deployment scored 3/10 for Detection Controls and 2/10 for Incident Response:
| Finding | Severity | Issue |
|---|---|---|
| No alerting infrastructure | CRITICAL | No SNS topic, no email notifications. If the instance dies, no one knows. |
| No VPC Flow Logs | HIGH | No network traffic visibility for an agent with web_fetch and shell_exec capabilities. |
| No centralized logging | HIGH | Container logs exist only on the local EBS volume. |
| No CloudWatch Alarms | HIGH | CPU spikes, status check failures, and disk pressure go undetected. |
| Hardcoded LiteLLM master key | HIGH | The same static string sk-litellm-openfang-internal appeared in four places. |
| Unpinned LiteLLM image | HIGH | ghcr.io/berriai/litellm:main-latest is a moving target — supply chain risk. |
| No backup plan | MEDIUM | The knowledge graph and research data sit on a single EBS volume with no snapshots. |
How We Fixed the Critical and High Findings
Using AWS CDK, we added the following to the stack:
- SNS Topic + CloudWatch Alarms: A StatusCheckFailed alarm (period 60s, threshold 1) and a High CPU alarm (>80%, period 300s) both notify an SNS topic. A CfnParameter accepts an alert email at deploy time.
- VPC Flow Logs: All traffic is logged to a CloudWatch Log Group with 30-day retention, using a dedicated IAM role for the VPC Flow Logs service.
- Dynamic secret generation: Both the LiteLLM master key and the OpenFang API key are generated at runtime with openssl rand -hex 32. No static secrets remain in UserData or config files.
- Pinned container image: The LiteLLM image is pinned to ghcr.io/berriai/litellm:main-v1.65.0.
- Termination protection: Enabled on the EC2 instance so that it cannot be terminated by accident, for example through an unintended stack deletion.
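The two alarms can also be written down as plain data. The sketch below builds keyword arguments in the shape accepted by boto3's `put_metric_alarm`, using the periods and thresholds stated in the list. The alarm names, the `Statistic` and `EvaluationPeriods` values, and the placeholder instance ID and topic ARN are assumptions, not values taken from the stack.

```python
def alarm_kwargs(instance_id, topic_arn):
    """Build put_metric_alarm arguments for the two alarms described above."""
    common = {
        "Namespace": "AWS/EC2",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "EvaluationPeriods": 1,        # assumption: alarm on the first breach
        "AlarmActions": [topic_arn],   # notify the SNS topic
    }
    status_check = {
        **common,
        "AlarmName": "openfang-status-check-failed",  # hypothetical name
        "MetricName": "StatusCheckFailed",
        "Statistic": "Maximum",        # assumption
        "Period": 60,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
    high_cpu = {
        **common,
        "AlarmName": "openfang-high-cpu",             # hypothetical name
        "MetricName": "CPUUtilization",
        "Statistic": "Average",        # assumption
        "Period": 300,
        "Threshold": 80,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    return [status_check, high_cpu]


alarms = alarm_kwargs(
    "i-0123456789abcdef0",                                 # placeholder
    "arn:aws:sns:us-west-2:111122223333:openfang-alerts",  # placeholder
)
for kwargs in alarms:
    print(kwargs["AlarmName"], kwargs["MetricName"], kwargs["Period"], kwargs["Threshold"])
```

With credentials available, each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`; the CDK stack expresses the same settings declaratively.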
After remediation, the scores improved:
| Security Area | Before | After |
|---|---|---|
| Identity and Access Management | 7/10 | 7/10 |
| Detection Controls | 3/10 | 6/10 |
| Infrastructure Protection | 7/10 | 8/10 |
| Data Protection | 5/10 | 6/10 |
| Incident Response | 2/10 | 5/10 |
| Overall | 4.8/10 | 6.4/10 |
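The overall rows are consistent with an unweighted mean of the five area scores. The short check below reproduces both figures from the table; treating the overall score as a simple average is an inference from the numbers, not something the review states.

```python
from statistics import mean

# Area scores copied from the table above: (before, after).
scores = {
    "Identity and Access Management": (7, 7),
    "Detection Controls": (3, 6),
    "Infrastructure Protection": (7, 8),
    "Data Protection": (5, 6),
    "Incident Response": (2, 5),
}

overall_before = round(mean(before for before, _ in scores.values()), 1)
overall_after = round(mean(after for _, after in scores.values()), 1)

print(overall_before, overall_after)  # 4.8 6.4
```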
The Takeaway for Any Agent Deployment
🚨 Critical: AI agents with autonomous internet access and shell execution present a different threat model from a web application. An agent that can run shell_exec and web_fetch can reach internal networks, exfiltrate data, or be manipulated via prompt injection into taking unintended actions.
Detection controls — logging, monitoring, alerting — are non-negotiable for production agent deployments. Treat the agent like an intern with root access: trust, but verify, and keep the audit trail comprehensive.
8. Deploying It Yourself — CDK Walkthrough
Prerequisites
- Node.js >= 18
- AWS CDK CLI (npm install -g aws-cdk)
- AWS credentials with VPC, EC2, IAM, CloudWatch, and SNS permissions
- CDK bootstrapped in the target account/region (cdk bootstrap)
- SSM Session Manager plugin installed locally
Two Deployment Modes
Mode 1 — New VPC (full self-contained deployment):
git clone https://github.com/melanie531/openfang-on-aws.git
cd openfang-on-aws
npm install
npx cdk deploy --parameters [email protected]
This creates the VPC, subnets, NAT Gateway, Bedrock Runtime VPC Endpoint (PrivateLink), EC2 instance, IAM role, security groups, flow logs, alarms, and SNS topic. Estimated cost: ~$102/month.
Mode 2 — Existing VPC (reuse your networking):
npx cdk deploy -c vpcId=vpc-xxx --parameters [email protected]
This creates the EC2 instance, IAM role, security groups, Bedrock Runtime VPC Endpoint (PrivateLink), and monitoring resources within your existing VPC. Estimated cost: ~$70/month.
Context Variables
| Variable | Default | Description |
|---|---|---|
| vpcId | (none; creates new VPC) | Existing VPC ID to reuse |
| instanceType | t3.xlarge | EC2 instance type |
| bedrockRegion | (stack region) | Region for Bedrock API calls (configurable) |
Connecting
# Shell access
aws ssm start-session --target <instance-id> --region us-west-2
# Port forward to dashboard
aws ssm start-session --target <instance-id> \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["50051"],"localPortNumber":["4200"]}' \
--region us-west-2
# Get the auto-generated API key (run on the instance, inside the SSM shell session)
cat /opt/openfang/.env
Cost Breakdown
| Component | Mode 1 (New VPC) | Mode 2 (Existing VPC) |
|---|---|---|
| EC2 t3.xlarge | ~$60/month | ~$60/month |
| NAT Gateway | ~$33/month | $0 (existing) |
| VPC Endpoint (PrivateLink) | ~$7.30/month | ~$7.30/month |
| EBS 30GB gp3 | ~$2.40/month | ~$2.40/month |
| CloudWatch Logs | ~$0.50/month | ~$0.50/month |
| Bedrock tokens | Variable (~$3/month light use) | Variable |
| Total | ~$102/month | ~$70/month |
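As a sanity check, the fixed line items can be summed per mode. This sketch leaves the variable Bedrock token spend out, and the dictionary keys are simply labels for the table rows. The Mode 1 sum comes out at about $103, roughly a dollar above the rounded ~$102 headline, while Mode 2 matches the ~$70 figure.

```python
# Fixed monthly estimates (USD) from the table: (Mode 1, Mode 2).
fixed_costs = {
    "EC2 t3.xlarge": (60.00, 60.00),
    "NAT Gateway": (33.00, 0.00),  # Mode 2 reuses an existing NAT Gateway
    "VPC Endpoint (PrivateLink)": (7.30, 7.30),
    "EBS 30GB gp3": (2.40, 2.40),
    "CloudWatch Logs": (0.50, 0.50),
}

mode1_total = round(sum(new_vpc for new_vpc, _ in fixed_costs.values()), 2)
mode2_total = round(sum(existing for _, existing in fixed_costs.values()), 2)

print(f"Mode 1: ${mode1_total:.2f}/month, Mode 2: ${mode2_total:.2f}/month")
```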
Tear Down
npx cdk destroy
This removes everything. The S3 temp bucket (if created) may need manual cleanup.
9. What We Learned — Practical Gotchas
- […] cargo build process gets OOM-killed silently during the Docker build. The instance reboots, Docker restarts, and the build fails again: an infinite loop that consumes credits. Solution: Upgrade to t3.xlarge (16 GB) and add a 4 GB swap file in UserData before the Docker build. For production, pre-build the Docker image and push it to ECR to avoid compilation on the instance entirely.
- […] anthropic.claude-sonnet-4-6. Bedrock expects anthropic.claude-sonnet-4-6-v1. Cross-region inference profiles use us.anthropic.claude-sonnet-4-6. Getting any layer wrong produces cryptic 400 errors from Bedrock with no clear indication of what went wrong. Map all three layers explicitly in LiteLLM's model_list.
- […] 127.0.0.1:50051 as its listen address, ignoring the listen_addr configuration field. In standard Docker bridge mode, this means the port is unreachable from outside the container, including from the host and from other containers. The fix is network_mode: host, which lets OpenFang bind to the host's loopback interface directly.
- […] NoCredentialsError because boto3 cannot obtain instance profile credentials. One line in CDK fixes this: HttpPutResponseHopLimit: 2.
- […] python3-minimal (~25 MB) in the Dockerfile avoids the edge case.
10. Conclusion
The Agent OS is an emerging category. OpenFang (v0.1.0, February 2026) represents one approach — autonomous agents that run as OS-level daemons, not chatbot sessions. It is early, opinionated, and not yet battle-tested in production at scale. But the core concept — agents as managed workloads with scheduling, sandboxing, and audit trails — points to where the industry is heading.
AWS provides a natural foundation for these systems. Bedrock delivers managed LLM access without GPU provisioning. IAM instance profiles replace static credentials with automatically rotated temporary ones. VPC isolation, SSM, and PrivateLink for Bedrock create a zero-trust access model. CloudWatch and VPC Flow Logs provide the detection controls that autonomous agents require.
Three patterns from this deployment are reusable beyond OpenFang:
- The LiteLLM sidecar: Any OpenAI-compatible agent framework can use this pattern to access Bedrock with IAM authentication.
- The WAF security review template: The five-area Security Pillar review we conducted applies to any agent deployment. Detection controls and incident response are where most teams underinvest.
- The CDK two-mode pattern: Supporting both new-VPC and existing-VPC deployments via context variables makes the stack reusable across environments.
The code is open source: github.com/melanie531/openfang-on-aws. Deploy it. Run your own WAF review. And if you build something interesting with OpenFang’s Hands — we would like to hear about it.