Agent Foundry and Why Heavy Skills Backfire
Introduction
Foundation models like Claude are remarkably capable out of the box. They can review code for security vulnerabilities, assess cloud architectures against best practices, and generate technical documentation — all without domain-specific customization. But a natural question follows: can we make them better by equipping them with domain expertise?
The intuition seems obvious. A security expert who has memorized OWASP Top 10 should outperform a generalist. An architect who carries the AWS Well-Architected Framework in their head should catch more issues. It follows that an AI agent armed with the same specialized knowledge and structured reporting tools should outperform a bare foundation model.
We set out to test this assumption rigorously. What we found surprised us.
Adding domain expertise to agents does not universally improve performance. In fact, the most comprehensive skill designs — the ones packed with detailed checklists and mandatory structured reporting tools — consistently degraded agent performance across every domain we tested. Meanwhile, a lighter approach that injects domain knowledge without constraining agent behavior produced meaningful improvements.
This article presents our findings from the Agent Foundry project: an open-source framework for building, evaluating, and comparing domain expert agents. We describe the experimental setup, share precise metrics across nine agent-domain combinations, and distill what we learned into practical guidance for anyone designing AI agent skills.
The Agent Foundry Framework
Agent Foundry is built on the Claude Agent SDK, running Claude Sonnet on Amazon Bedrock. The framework defines three agent configurations and provides an evaluation harness for comparing them head-to-head.
Agent Configurations
Foundation Agent — The baseline. Claude Sonnet with core developer tools (Read, Write, Edit, Bash) and no domain-specific instructions. This represents what the model can do on its own, with general-purpose tool access but no specialized knowledge injection.
Expert-Heavy Agent — Foundation plus a comprehensive skill pack. Each Heavy skill includes:
- A detailed system prompt with domain-specific methodology (e.g., systematic OWASP Top 10 scanning for security, six-pillar Well-Architected analysis for architecture)
- Exhaustive checklists that the agent is instructed to work through
- Mandatory MCP (Model Context Protocol) tools for structured reporting (e.g., `report_vulnerability`, `report_architecture_issue`, `doc_section`)
- Required summary tools that enforce structured output (e.g., `review_summary`, `architecture_summary`, `doc_coverage`)
Expert-Light Agent — Foundation plus a knowledge-only skill pack. Each Light skill includes:
- A focused system prompt that injects domain knowledge and evaluation frameworks
- Guidance language (“consider these areas,” “be aware of”) rather than directives (“check each of these items”)
- No MCP tools — the agent reports findings in natural language
- Explicit instructions to prioritize quality over quantity and to only report issues supported by evidence
The critical design difference: Heavy skills tell the agent what to check and how to report it. Light skills tell the agent what to know and let it decide what to report.
Skill Composition
Skills are composable system prompt extensions loaded dynamically at runtime. The framework supports a @SkillLoader.register("domain-name") pattern that allows skills to be mixed and matched. Each skill optionally provides an MCP server instance, enabling domain-specific tools to be injected into the agent’s tool set alongside the standard Read/Write/Edit/Bash tools.
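To make the composition model concrete, here is a minimal sketch of what that registration pattern could look like. Only the @SkillLoader.register("domain-name") decorator and the optional MCP server hook come from the description above; the class structure, method names, and prompt text are illustrative assumptions rather than the actual Agent Foundry implementation.

```python
# Illustrative sketch of skill registration and composition. Only the
# @SkillLoader.register(...) decorator and the optional MCP server hook are
# taken from the article; everything else is assumed for exposition.

class SkillLoader:
    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(skill_cls):
            cls._registry[name] = skill_cls
            return skill_cls
        return decorator

    @classmethod
    def load(cls, *names):
        return [cls._registry[name]() for name in names]


@SkillLoader.register("security-light")
class SecurityLightSkill:
    # Knowledge-only: a prompt extension and no MCP tools.
    system_prompt = (
        "You are reviewing code for security issues. Be aware of the OWASP Top 10 "
        "categories. Only report issues supported by evidence in the code; "
        "prioritize quality over quantity."
    )
    mcp_server = None


@SkillLoader.register("security-heavy")
class SecurityHeavySkill:
    # Checklist plus mandatory structured reporting via an MCP server.
    system_prompt = (
        "Systematically evaluate the code against every OWASP Top 10 item below. "
        "For each finding you MUST call report_vulnerability, and you MUST finish "
        "with review_summary."
        # ... detailed checklist would follow ...
    )
    mcp_server = "security-reporting-server"  # stand-in for a real MCP server instance


def build_agent_config(base_prompt, skill_names):
    """Compose the system prompt and MCP servers for a run from the selected skills."""
    skills = SkillLoader.load(*skill_names)
    prompt = "\n\n".join([base_prompt, *(s.system_prompt for s in skills)])
    servers = [s.mcp_server for s in skills if s.mcp_server is not None]
    return {
        "system_prompt": prompt,
        "tools": ["Read", "Write", "Edit", "Bash"],
        "mcp_servers": servers,
    }
```

The structural point to notice is the mcp_server hook: a Heavy skill supplies a server of mandatory reporting tools, a Light skill leaves it empty, and the Foundation agent simply loads no skills at all.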
Evaluation Methodology
Domain Selection
We chose three domains to represent fundamentally different types of expert tasks:
- Security Code Review (detection task) — The agent must identify real vulnerabilities in application code while avoiding false positives. This tests pattern recognition and judgment about what constitutes a genuine security risk.
- AWS Architecture Review (knowledge/rules task) — The agent must assess CloudFormation templates against Well-Architected Framework best practices. This tests domain knowledge application and the ability to map abstract principles to concrete infrastructure configurations.
- AWS Technical Documentation (creation/coverage task) — The agent must evaluate and generate API documentation, including runnable code examples. This tests creative generation combined with technical accuracy.
Task Design
Each domain includes four tasks: three evaluation tasks with known expected findings, plus one false-positive trap task.
The trap tasks are deliberately clean — secure code, well-architected templates, or complete documentation — with zero expected findings. Any findings reported on a trap task are false positives. These trap tasks are excluded from aggregate metrics to avoid skewing averages but are reported separately as a measure of agent calibration.
Security tasks include a Flask application with SQL injection and weak hashing (12 expected findings), a Node.js financial API with hardcoded secrets and race conditions (11 expected findings), a Python data processor with deserialization vulnerabilities (8 expected findings), and a clean Python application using parameterized queries and proper cryptographic practices (0 expected findings).
Architecture tasks include templates with open security groups and hardcoded passwords (20 expected findings), missing VPC and encryption configurations (14 expected findings), cost optimization gaps (14 expected findings), and a clean production infrastructure template (0 expected findings).
Documentation tasks include S3 Quick Start documentation (6 operations), Bedrock Runtime documentation, and additional AWS service documentation, plus a well-documented service with complete coverage (0 expected findings).
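For illustration, a domain's task layout can be pictured as a small manifest like the one below. The field names are hypothetical, but the counts mirror the security tasks described above.

```python
# Hypothetical task manifest for the security domain. Field names are
# illustrative; the expected-finding counts mirror the descriptions above.
security_tasks = [
    {"id": "flask-app",        "expected_findings": 12, "is_trap": False},
    {"id": "node-financial",   "expected_findings": 11, "is_trap": False},
    {"id": "python-processor", "expected_findings": 8,  "is_trap": False},
    {"id": "clean-app",        "expected_findings": 0,  "is_trap": True},  # false-positive trap
]

# Trap tasks are excluded from aggregate precision/recall/F1; any finding
# reported on them is counted separately as a false positive.
evaluated_tasks = [t for t in security_tasks if not t["is_trap"]]
trap_tasks = [t for t in security_tasks if t["is_trap"]]
```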
Metrics
For each agent-domain combination, we measure:
- Precision: Of the findings the agent reported, how many were correct?
- Recall: Of the expected findings, how many did the agent identify?
- F1 Score: The harmonic mean of precision and recall — our primary comparison metric.
- Token Usage: Total input + output tokens consumed across all tasks, measuring computational cost.
- False Positive Count: Findings reported on the trap task (where zero findings are expected).
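For reference, the sketch below shows how these metrics combine once findings have been matched against ground truth. It is a minimal illustration of the arithmetic, not the evaluation harness's actual code.

```python
def score(matched, reported, expected):
    """Precision, recall, and F1 from counts of matched, reported, and expected findings."""
    precision = matched / reported if reported else 0.0
    recall = matched / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the agent reports 10 findings, 8 of which match the 12 expected ones.
print(score(matched=8, reported=10, expected=12))
# {'precision': 0.8, 'recall': 0.667, 'f1': 0.727} (values rounded)
```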
Experimental Controls
All comparisons use identical conditions:
- Same model: Claude Sonnet on Amazon Bedrock
- Same max turns per task (controlled across all three agent configurations)
- Same evaluation dataset and ground truth
- Same fuzzy matching algorithm for extracting and comparing findings
- Foundation agent findings are extracted from free-text output using keyword and title matching; Expert-Heavy findings are extracted from structured MCP tool calls
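The fuzzy matching step can be pictured as normalizing finding titles and comparing them against ground truth with a similarity threshold. The sketch below is a simplified stand-in using Python's standard-library difflib; the harness's actual algorithm may differ.

```python
import difflib
import re

def normalize(title):
    """Lowercase and strip punctuation so 'SQL Injection!' and 'sql injection' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def matches_expected(reported_title, expected_titles, threshold=0.6):
    """True if a reported finding is similar enough to any expected finding."""
    candidate = normalize(reported_title)
    return any(
        difflib.SequenceMatcher(None, candidate, normalize(expected)).ratio() >= threshold
        for expected in expected_titles
    )

# Example: a free-text Foundation finding matched against a ground-truth entry.
print(matches_expected("Possible SQL injection in /login handler",
                       ["SQL injection in login endpoint"]))  # True
```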
A Note on Measurement
During the project, we discovered and fixed a precision measurement bias in our structured text parser. The original parser inflated Foundation precision to 1.0 by being overly conservative in what it counted as a Foundation “finding” — essentially only counting findings that exactly matched expected items. After fixing the parser to more faithfully extract Foundation findings from free-text output, Foundation precision dropped to realistic levels (e.g., 0.833 in security, 0.604 in architecture). All results reported here use the corrected parser.
Results — The Surprising Finding
Complete Results Table
| Domain | Metric | Foundation | Expert-Heavy | Expert-Light |
|---|---|---|---|---|
| Security | Precision | 0.833 | 0.665 | 0.815 |
| Security | Recall | 0.710 | 0.715 | 0.601 |
| Security | F1 | 0.763 | 0.686 | 0.690 |
| Security | Tokens | 3,339 | 7,752 | 2,674 |
| Architecture | Precision | 0.604 | 0.524 | 0.750 |
| Architecture | Recall | 0.516 | 0.479 | 0.629 |
| Architecture | F1 | 0.557 | 0.499 | 0.684 |
| Architecture | Tokens | 4,868 | 13,914 | 4,198 |
| Documentation | Precision | 0.750 | 0.089 | 0.750 |
| Documentation | Recall | 0.261 | 0.056 | 0.366 |
| Documentation | F1 | 0.358 | 0.068 | 0.480 |
| Documentation | Tokens | 17,761 | 27,067 | 9,688 |
False Positive Analysis (Trap Tasks)
| Domain | Foundation | Expert-Heavy | Expert-Light |
|---|---|---|---|
| Security (clean code) | 6 | 6 | 4 |
| Architecture (clean template) | 0 | 10 | 0 |
| Documentation (complete docs) | 0 | 10 | 0 |
Token Efficiency
| Domain | Foundation | Expert-Heavy | Expert-Light | Heavy vs Foundation | Light vs Foundation |
|---|---|---|---|---|---|
| Security | 3,339 | 7,752 | 2,674 | +132% | −20% |
| Architecture | 4,868 | 13,914 | 4,198 | +186% | −14% |
| Documentation | 17,761 | 27,067 | 9,688 | +52% | −45% |
Finding 1: Heavy Skills Degraded Performance in Every Domain
This is the headline result. Expert-Heavy agents performed worse than the bare Foundation agent in all three domains:
- Security: F1 dropped 10% (0.763 → 0.686)
- Architecture: F1 dropped 10% (0.557 → 0.499)
- Documentation: F1 dropped 81% (0.358 → 0.068)
Not a single domain benefited from the Heavy skill approach. The agent with the most domain knowledge and the most sophisticated tooling consistently underperformed the agent with no domain customization at all.
Finding 2: Light Skills Improved Performance in Two of Three Domains
Expert-Light agents told a different story:
- Architecture: F1 improved 23% (0.557 → 0.684) — Light beat Foundation on all metrics: higher precision (+0.146), higher recall (+0.113), and lower token usage (−14%).
- Documentation: F1 improved 34% (0.358 → 0.480) — Light matched Foundation’s precision (0.750) while substantially increasing recall (+0.105), using 45% fewer tokens.
- Security: F1 dropped 10% (0.763 → 0.690) — Here, the Foundation baseline was already strong, and adding even lightweight knowledge didn’t help. The Foundation agent’s existing security understanding was sufficient.
Finding 3: The Damage Scales with Domain Complexity
The gap between Heavy and Foundation widened as tasks became more complex:
- Simple detection (Security): Heavy F1 penalty was −10%. The structured detection task partially aligned with the checklist approach, limiting the damage.
- Knowledge-intensive (Architecture): Heavy F1 penalty was −10%. The six-pillar checklist caused moderate over-reporting while providing some useful structure.
- Creative generation (Documentation): Heavy F1 penalty was −81%. The documentation task requires the agent to understand what to cover and how deeply — exactly the kind of judgment that checklists and mandatory reporting tools destroy.
Finding 4: Heavy Skills Create Systematic False Positives
The false positive trap results are particularly telling. On the architecture clean template, Foundation reported 0 false findings and Light reported 0 — both correctly identified that the template followed best practices. Expert-Heavy reported 10 false findings, inventing issues that did not exist. The same pattern appeared in documentation: Foundation and Light reported 0 false findings on the well-documented service, while Heavy reported 10.
In security, all three agents reported some false findings on the clean code (Foundation: 6, Heavy: 6, Light: 4), suggesting that security false-positive avoidance is inherently harder. Even so, Light demonstrated the best calibration with the fewest false positives across all three domains.
Finding 5: Light Skills Are More Token-Efficient Than Foundation
This was unexpected. Not only did Light skills improve performance in most domains — they did so while consuming fewer tokens than the Foundation agent:
- Security: 2,674 tokens (Light) vs. 3,339 (Foundation) — 20% reduction
- Architecture: 4,198 tokens (Light) vs. 4,868 (Foundation) — 14% reduction
- Documentation: 9,688 tokens (Light) vs. 17,761 (Foundation) — 45% reduction
Meanwhile, Heavy skills consumed roughly 1.5–2.9× the tokens of Foundation, depending on the domain. The token cost of mandatory tool calls and structured reporting is substantial, and it delivers worse results.
The explanation: Light skills help the agent focus. By providing domain knowledge upfront, the agent spends less time exploring and more time analyzing. It knows what matters, so it gets to the point faster.
Analysis — Why Heavy Skills Fail
Three root causes explain why comprehensive, checklist-driven skills consistently degrade agent performance.
1. Checklist-Driven Hallucination
When a Heavy skill provides a 50-item checklist and instructs the agent to systematically evaluate each item, the agent treats this as a “find-something-for-every-item” directive. If the code or template doesn’t have a particular issue, the agent stretches to find one anyway — because the checklist says to look there.
This is visible in the precision numbers. Heavy skills had the lowest precision in every domain (0.665, 0.524, 0.089), meaning a large proportion of their reported findings were incorrect. The agent was reporting things it thought the checklist wanted it to report, rather than things that were actually present.
In the architecture domain, Heavy precision was 0.524 — nearly half of all reported findings were wrong. In documentation, Heavy precision collapsed to 0.089 — fewer than 1 in 10 reported findings were accurate.
2. Tool Constraint Overhead
Mandatory MCP tool calls are expensive. Each call to report_vulnerability or report_architecture_issue requires the agent to populate structured fields: pillar, severity, resource type, remediation guidance (including CloudFormation snippets). This structured reporting consumes turns and token budget that would otherwise be available for actual analysis.
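To make that overhead concrete, the sketch below shows roughly what the input schema for such a tool looks like. The exact field names in Agent Foundry may differ, but the shape is the point: every finding becomes a fully populated record rather than a one-sentence observation, and the remediation snippet alone can cost more tokens than describing the issue in prose.

```python
# Hypothetical input schema for a mandatory structured-reporting tool, in the
# spirit of report_architecture_issue. Field names are illustrative assumptions.
report_architecture_issue_schema = {
    "type": "object",
    "required": ["pillar", "severity", "resource_type", "description", "remediation_cfn"],
    "properties": {
        "pillar": {"enum": ["operational_excellence", "security", "reliability",
                            "performance_efficiency", "cost_optimization", "sustainability"]},
        "severity": {"enum": ["critical", "high", "medium", "low"]},
        "resource_type": {"type": "string"},    # e.g. "AWS::EC2::SecurityGroup"
        "description": {"type": "string"},
        "remediation_cfn": {"type": "string"},  # CloudFormation snippet; often the largest field
    },
}
```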
With a 12-turn limit, an agent that spends 8 turns filling in structured report fields has only 4 turns left for analysis. The Foundation agent, unconstrained by reporting overhead, uses all 12 turns for reading code, reasoning about it, and writing findings.
The token numbers tell this story clearly: Heavy agents consumed 7,752 to 27,067 tokens per domain, while Foundation used 3,339 to 17,761 — and Light achieved comparable or better results with 2,674 to 9,688.
3. Judgment Replacement
Perhaps the most important failure mode: Heavy skills replace the agent’s independent judgment with checklist execution. The agent stops thinking about what’s in front of it and starts looking for what the checklist says should be there.
This is the difference between an analyst and an auditor. An analyst examines the evidence and forms conclusions. An auditor works through a predetermined checklist. For tasks that require genuine understanding — “Is this architecture well-designed?” or “Does this documentation adequately serve developers?” — checklist execution is fundamentally the wrong approach.
Light skills avoid this trap. They provide knowledge (“Here are the six Well-Architected pillars and what each one entails”) without prescribing behavior (“Check each of these 30 items and report your findings using the structured tool”). The agent absorbs the knowledge and applies its own judgment about what matters for the specific artifact it’s reviewing.
The “Enhance, Don’t Replace” Principle
Our results converge on a clear principle: the best skills enhance an agent’s existing capabilities without constraining its judgment.
Light skills work because they operate at the knowledge layer, not the behavior layer. They answer the question “What should I know?” rather than “What should I do?” This distinction sounds subtle but drives a large performance difference.
Consider the difference between these two instructions:
Heavy approach: “Systematically evaluate this CloudFormation template against each of the following 30 items. For each item, use the report_architecture_issue tool to file a structured finding with pillar, severity, resource_type, and remediation_cfn fields.”
Light approach: “You are reviewing a CloudFormation template. Be aware of the six Well-Architected pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. Focus on issues that represent genuine risk. Only report findings supported by evidence in the template.”
The Heavy instruction creates a compliant executor. The Light instruction creates an informed analyst. The data shows which one produces better results.
This principle — enhance, don’t replace — extends beyond our specific experiments. It reflects a broader insight about working with foundation models: they already have strong reasoning and judgment capabilities. The goal of skill design should be to inform that judgment, not to override it.
Light skills also make agents more efficient. By front-loading domain knowledge, the agent doesn’t need to reason from first principles about what might matter. It already knows the relevant frameworks and can proceed directly to analysis. This explains the counterintuitive token efficiency result: more knowledge in, fewer tokens out.
Implications for Skill Design
Based on our findings, here are practical recommendations for anyone designing skills for AI agents.
Do
- Inject domain knowledge and evaluation frameworks. Tell the agent what the relevant standards, best practices, and common patterns are. Give it the mental models an expert would use.
- Use guidance language. Phrases like “consider these areas,” “be aware of,” and “pay attention to” provide direction without constraining behavior.
- Let the agent decide what to report. The agent should determine which findings are significant based on its analysis of the specific artifact, not based on a checklist requirement.
- Emphasize quality over quantity. Explicitly instruct the agent that fewer accurate findings are better than many questionable ones. The Light skills included language like “only report issues supported by evidence.”
- Provide context about severity. Help the agent calibrate by explaining what constitutes a high-severity vs. low-severity issue, but let it apply that calibration to the specific situation.
Don’t
- Create exhaustive checklists that must be followed. Checklists transform analysis into compliance, triggering hallucinated findings when the checklist doesn’t match reality.
- Force all findings through structured reporting tools. Mandatory structured output consumes turns and tokens while constraining the agent’s natural reasoning. The overhead often exceeds the benefit of structured data.
- Require the agent to address every item in a list. If you provide a list of 30 things to check, the agent will find 30 things — whether or not 30 things exist.
Consider
- Making tools optional. Instead of "You MUST use `report_vulnerability` for each finding," try "You MAY use `report_vulnerability` for structured output when it helps organize complex findings." This preserves the benefit of structured data without the cost of mandatory compliance.
- Layering skills incrementally. Start with knowledge injection alone. Add optional tools only if the knowledge-only approach falls short for specific use cases. Our data suggests that knowledge-only is a strong default.
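Using the illustrative SkillLoader pattern sketched earlier, a skill that follows this advice might look like the snippet below: knowledge injection plus an optional reporting tool, with MAY rather than MUST language. This is a hypothetical design we did not evaluate; it corresponds to the Balanced variant mentioned under future work below.

```python
# (SkillLoader here refers to the illustrative pattern sketched earlier in this article.)
@SkillLoader.register("architecture-balanced")
class ArchitectureBalancedSkill:
    # Knowledge injection plus an *optional* reporting tool: guidance, not directives.
    system_prompt = (
        "You are reviewing a CloudFormation template. Be aware of the six "
        "Well-Architected pillars. Focus on issues that represent genuine risk and "
        "only report findings supported by evidence in the template. You MAY use "
        "report_architecture_issue when structured output helps organize complex "
        "findings; otherwise, report in natural language."
    )
    mcp_server = "architecture-reporting-server"  # available to the agent, but optional to use
```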
Limitations and Future Work
Our findings are directional, not definitive. Several limitations apply.
Public domain knowledge bias. Our test domains — security best practices, AWS Well-Architected Framework, and AWS API documentation — are extensively represented in the model’s training data. The Foundation agent’s strong baseline performance likely reflects pre-existing knowledge acquired during training, not general reasoning ability alone. For proprietary or domain-specific knowledge that the model has not encountered during training (e.g., internal API specifications, company-specific coding standards, specialized compliance requirements), skills are expected to provide substantially more value. These results should be interpreted as guidance on skill design patterns, not as evidence that skills are unnecessary.
Other limitations:
Small evaluation dataset. Each domain includes only four tasks (three evaluated plus one trap). While the patterns are consistent across domains, a larger and more diverse task set would strengthen the conclusions.
Single model tested. All experiments used Claude Sonnet on Amazon Bedrock. Different models may respond differently to skill designs — models with weaker baseline capabilities might benefit more from Heavy-style guidance, while stronger models might be even more degraded by it.
Text extraction methodology. Foundation agent findings are extracted from free-text output using keyword and title matching, while Expert-Heavy findings are extracted from structured MCP tool calls. This asymmetry in extraction methods could introduce measurement differences, though the corrected parser mitigates the most significant bias.
No quality scoring. We measure whether the agent found the right issues (precision and recall) but not the quality of its analysis. An LLM-as-judge approach could evaluate the depth and accuracy of individual findings, not just whether they were present.
Future directions include testing a Balanced skill variant (Light knowledge with optional rather than mandatory tools), expanding to more diverse task types, testing across multiple models and model sizes, and implementing quality-aware scoring.
💡 Key Takeaways
- “Add skills = better agent” is not universally true. Heavy skills degraded performance in all three domains we tested, by 10–81%.
- Skill design matters more than skill quantity. Light skills improved F1 by 23–34% in two of three domains, while Heavy skills hurt F1 in all three — using the same domain knowledge.
- Checklists cause hallucination. Mandatory checklists drive agents to report non-existent issues. Heavy agents produced 10 false findings on clean artifacts where Foundation and Light produced zero.
- Less tooling can mean better results. Light skills with zero MCP tools outperformed Heavy skills with mandatory structured reporting in every domain.
- Knowledge injection improves efficiency. Light skills used 14–45% fewer tokens than Foundation while achieving equal or better results — domain knowledge helps the agent focus.
- Enhance, don’t replace. The best skills amplify an agent’s existing judgment by informing it, not by constraining it. Provide knowledge, not checklists. Offer guidance, not directives.
Conclusion
We started this project with a straightforward hypothesis: agents equipped with domain expertise should outperform general-purpose agents. The reality is more nuanced.
Skill design is a design problem, not a knowledge-stuffing problem. The most feature-rich, comprehensive skills we built — complete with structured reporting tools, detailed checklists, and mandatory methodology — produced the worst results. They consumed up to nearly 3× the tokens of the baseline and delivered lower precision, lower F1 scores, and dramatically more false positives.
The skills that worked were simple. They injected domain knowledge into the system prompt and stepped back. They informed the agent’s judgment without constraining its behavior. They told the agent what an expert would know, not what an expert would do. And they achieved this with fewer tokens than even the unaugmented Foundation agent.
The principle we’ve distilled — Enhance, Don’t Replace — is both a finding and a design guideline. When building skills for AI agents, resist the temptation to overspecify. Foundation models are capable reasoners. Give them the knowledge they need, and let them reason.
Agent Foundry is open source and available for the community to build on. We hope these findings help practitioners design more effective AI agent systems — and encourage the broader community to rigorously evaluate their skill designs rather than assuming that more is better.
Melanie Li is a Senior Generative AI Specialist Solutions Architect at AWS. The Agent Foundry project is open source and available on GitHub.