Deep Dive: Agent Skills & Skill Evaluation — How to Build, Measure, and Trust What Your AI Agent Can Do
⚠️ Disclaimer: The code examples, architecture patterns, and tool references in this article are illustrative and intended for educational purposes only. Always review, test, and adapt them to your specific use case before deploying to production. The agent skill ecosystem is evolving rapidly — verify current documentation for the latest specifications.
- What Are Agent Skills?
- The Eval Problem: Why Testing Agent Skills Is Hard
- Anthropic’s Approach: Eval-Driven Skill Development
- OpenAI’s Approach: Systematic Testing with Codex Evals
- The Convergence: What Both Got Right
- Practical Guide: Evaluating Your Own Skills
- Known Limitations of Skill Evaluation
- Beyond Coding: Why Skill Eval Is a Governance Question
- What’s Next
1. What Are Agent Skills?
An agent skill is a packaged set of instructions that teaches an AI agent how to perform a specific task reliably. At its simplest, a skill is a directory containing a SKILL.md file with YAML front matter (name, description) and Markdown instructions:
skill-name/
├── SKILL.md       # Required: instructions + metadata
├── scripts/       # Optional: executable code for deterministic tasks
├── references/    # Optional: docs loaded into context as needed
└── assets/        # Optional: templates, icons, fonts
The SKILL.md front matter is critical, since it's how the agent decides whether to invoke the skill:
---
name: setup-demo-app
description: Scaffold a Vite + React + Tailwind demo app with a small,
  consistent project structure. Use when the user needs a fresh demo app
  for quick UI experiments or reproductions.
---
Skills use a three-level progressive disclosure system:
- Metadata (name + description): always in context. This is what triggers skill invocation.
- SKILL.md body: loaded when the skill triggers. Ideally under 500 lines.
- Bundled resources: loaded on demand. Scripts can execute without being loaded into context.
What makes skills powerful is that they're portable. The same SKILL.md format works across Claude Code, Codex, OpenClaw, Cursor, and Gemini CLI. Write once, use everywhere (at least in theory).
But portability without quality assurance is just portability of problems. Which brings us to evaluation.
2. The Eval Problem: Why Testing Agent Skills Is Hard
Traditional software testing is built on a simple assumption: given the same input, you get the same output. Agent skills violate this assumption fundamentally.
The same skill, same prompt, same model can produce different outputs each time. An agent might:
- Choose different file structures while meeting the same requirements
- Run commands in a different order
- Use slightly different variable names or code patterns
- Take more or fewer intermediate steps
This non-determinism means you can't just write unit tests with exact expected outputs. But you also can't rely on "vibes," the subjective feeling that a skill is working.
The challenge breaks down into three types of testing that skills require:
Trigger testing. Does the skill activate when it should? Does it stay quiet when it shouldn’t? A skill for “scaffolding a new React app” shouldn’t trigger when the user asks to “add Tailwind to an existing app.” Getting trigger behavior right requires testing both positive and negative cases.
Execution testing. When the skill runs, does it follow the intended process? Did it run npm install? Did it create the expected file structure? These are closer to traditional integration tests, but against a non-deterministic actor.
Quality testing. Is the output good? Does the code follow conventions? Is the architecture sound? This is inherently subjective and can’t be captured by simple assertions.
Both Anthropic and OpenAI had to solve all three. Here’s how they approached it.
3. Anthropic’s Approach: Eval-Driven Skill Development
Anthropic's approach is integrated into their skill-creator, the meta-skill that helps you build other skills. The key design decision: evals are part of the default skill creation workflow, not an afterthought.
The process follows a structured loop:
Step 1: Capture Intent → Write Skill → Define Evals
Before writing any evaluation code, you define what success looks like. The skill-creator walks through:
- What should this skill enable the agent to do?
- When should this skill trigger?
- What’s the expected output format?
- What does “correct” look like in terms you can measure?
Test cases are saved to evals/evals.json:
{
  "skill_name": "setup-demo-app",
  "evals": [
    {
      "id": 1,
      "prompt": "Create a demo app using the setup-demo-app skill",
      "expected_output": "Vite + React + Tailwind project with Header and Card components",
      "assertions": []
    }
  ]
}
Step 2: Run With-Skill vs. Without-Skill Baselines
This is where Anthropic’s approach gets interesting. For every test case, the system spawns two runs simultaneously:
- With-skill run: The agent executes the task with the skill loaded
- Baseline run: The same prompt, same agent, no skill
This comparison answers the fundamental question: does the skill actually help, or would the agent have done fine without it?
Both runs save outputs to a structured workspace:
skill-workspace/
├── iteration-1/
│   ├── eval-01-scaffold-basic/
│   │   ├── with_skill/
│   │   │   ├── outputs/
│   │   │   ├── eval_metadata.json
│   │   │   └── timing.json
│   │   └── without_skill/
│   │       ├── outputs/
│   │       └── timing.json
│   └── eval-02-implicit-trigger/
│       └── ...
└── iteration-2/
    └── ...
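To make the mechanics concrete, here is a minimal sketch of what such a dual-run harness could look like. The `run_agent` helper and `AgentResult` shape are hypothetical stand-ins for whatever runner your platform provides (a CLI subprocess or an SDK call); only the directory layout mirrors the workspace above.

```python
import json
import time
from dataclasses import dataclass
from pathlib import Path

@dataclass
class AgentResult:
    text: str
    total_tokens: int

def run_agent(prompt: str, skill: str | None) -> AgentResult:
    """Hypothetical agent runner; wire this to your platform's CLI or SDK."""
    raise NotImplementedError

def run_eval_case(eval_id: str, prompt: str, skill: str | None, workspace: Path) -> None:
    """Execute one prompt and persist outputs + timing into the workspace layout."""
    variant = "with_skill" if skill else "without_skill"
    out_dir = workspace / eval_id / variant
    (out_dir / "outputs").mkdir(parents=True, exist_ok=True)

    start = time.monotonic()
    result = run_agent(prompt, skill)
    duration_ms = int((time.monotonic() - start) * 1000)

    (out_dir / "outputs" / "response.md").write_text(result.text)
    (out_dir / "timing.json").write_text(
        json.dumps({"duration_ms": duration_ms, "total_tokens": result.total_tokens})
    )

# Every test case runs twice against the same prompt: once with the skill, once without.
workspace = Path("skill-workspace/iteration-1")
for case in json.loads(Path("evals/evals.json").read_text())["evals"]:
    eval_id = f"eval-{case['id']:02d}"
    run_eval_case(eval_id, case["prompt"], "skills/setup-demo-app", workspace)
    run_eval_case(eval_id, case["prompt"], None, workspace)
```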
Step 3: Grade + Aggregate + Analyze (Deterministic Layer)
Once runs complete, a grading step evaluates each assertion against the outputs and saves results to grading.json. Deterministic checks form the first evaluation layer: fast, reliable, debuggable assertions that answer "did it do the basics?"
# NOTE: This code is an illustrative synthesis of patterns from both
# Anthropic and OpenAI's frameworks, not a direct copy from either.
# Adapt to your specific framework's API.
# Example: Deterministic assertion checking
import os

def check_file_exists(workspace, path):
    """Check if an expected output file was created."""
    return os.path.exists(os.path.join(workspace, path))

def check_command_executed(trace_events, command_substring):
    """Check if a specific command was run during execution."""
    return any(
        command_substring in event.get("command", "")
        for event in trace_events
        if event.get("type") == "command_execution"
    )

# Run checks (output_dir and trace come from the captured run)
results = {
    "package_json_exists": check_file_exists(output_dir, "package.json"),
    "ran_npm_install": check_command_executed(trace, "npm install"),
    "has_tailwind_config": check_file_exists(output_dir, "tailwind.config"),
    "components_created": all(
        check_file_exists(output_dir, f"src/components/{c}.tsx")
        for c in ["Header", "Card"]
    ),
}
When these fail, you know exactly what went wrong. No interpretation needed.
An aggregation script then produces benchmark data with pass rates, timing, and token usage, reporting mean ± standard deviation and deltas between with-skill and without-skill runs.
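A minimal sketch of that aggregation step, assuming each run's timing.json carries the duration_ms and total_tokens fields from the workspace example above (the field names are illustrative, not a published schema):

```python
import json
import statistics
from pathlib import Path

def summarize(values: list[float]) -> dict:
    """Mean ± standard deviation for a list of per-run measurements."""
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"mean": round(statistics.mean(values), 1), "stdev": round(stdev, 1)}

def aggregate(iteration_dir: Path, variant: str) -> dict:
    """Collect timing and token stats across all evals for one variant."""
    timings = [
        json.loads(p.read_text())
        for p in iteration_dir.glob(f"eval-*/{variant}/timing.json")
    ]
    return {
        "duration_ms": summarize([t["duration_ms"] for t in timings]),
        "total_tokens": summarize([t["total_tokens"] for t in timings]),
    }

iteration = Path("skill-workspace/iteration-1")
with_skill = aggregate(iteration, "with_skill")
baseline = aggregate(iteration, "without_skill")
delta = with_skill["duration_ms"]["mean"] - baseline["duration_ms"]["mean"]
print(f"duration delta vs. baseline: {delta:+.1f} ms")
```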
The system even includes an analyst pass that looks for patterns the aggregate stats might hide:
- Assertions that always pass regardless of skill (non-discriminating: your test isn't testing anything)
- High-variance evals (possibly flaky)
- Time/token tradeoffs (does the skill save time but use more tokens, or vice versa?)
Step 4: Review with Eval Viewer (Qualitative Layer)
Anthropic provides eval-viewer/generate_review.py, a self-contained tool that reads the workspace, embeds all output data into a standalone HTML page, and serves it locally. Reviewers can examine outputs side-by-side, add qualitative feedback, and export their assessments.
This is the second evaluation layer: qualitative, human-in-the-loop assessment. Automated checks tell you if the scaffolding worked; the review viewer lets you assess if the code is actually good. The two layers together (deterministic checks for speed and reliability, human review for depth and nuance) give you a complete picture.
Step 5: Iterate
Based on quantitative benchmarks and qualitative feedback, revise the skill and run again. Expand the test set over time. The workspace maintains iteration history so you can track improvement across versions.
4. OpenAI’s Approach: Systematic Testing with Codex Evals
OpenAI's approach, detailed in their "Testing Agent Skills Systematically with Evals" guide, is more modular: it provides a pattern you implement rather than an integrated workflow.
Their core insight is articulated cleanly:
An eval is: a prompt → a captured run (trace + artifacts) → a small set of checks → a score you can compare over time.
Define Success Across Four Dimensions
OpenAI recommends splitting checks into four categories before writing the skill. Synthesizing their framework with Anthropic’s approach, these four dimensions cover the full scope of skill evaluation:
Dimension 1: Outcome. Did the task complete correctly? Check if the expected files exist, if the code compiles, if the tests pass. Outcome checks are almost always deterministic: Does package.json exist? Does npm run dev start successfully? Do the expected components exist in the right directories?
Dimension 2: Process. Did the agent follow the intended workflow? Did it invoke the right skill? Follow the steps in the right order? Avoid unnecessary operations? Process evaluation requires trace data; OpenAI's --json flag and Anthropic's transcript capture both serve this purpose.
Dimension 3: Style. Does the output meet qualitative standards? Code style, naming conventions, documentation quality can't be captured by boolean assertions. The key insight: structure the judge's output. Don't ask "is this code good?" Ask for specific criteria with pass/fail and evidence strings. Structured rubrics are reproducible; subjective impressions are not.
Dimension 4: Efficiency. Did the agent get there without waste? Token usage, execution time, number of retries. Anthropic captures total_tokens and duration_ms per run. OpenAI's traces contain command counts and retry patterns. Comparing efficiency with-skill vs. without-skill reveals whether the skill saves resources or just shifts them.
Small, Targeted Prompt Sets
Rather than building large benchmarks, OpenAI recommends starting with 10-20 prompts that cover distinct scenarios:
id,should_trigger,prompt
test-01,true,"Create a demo app named devday-demo using the $setup-demo-app skill"
test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"
test-03,true,"Create a small demo app to showcase the Responses API"
test-04,false,"Add Tailwind styling to my existing React app"
Each case tests something different:
- Explicit invocation (test-01): Direct skill reference. Does it work when asked by name?
- Implicit invocation (test-02): Describes the task without naming the skill. Is the description good enough?
- Contextual invocation (test-03): Adds domain noise. Does the skill still trigger in realistic prompts?
- Negative control (test-04): Should NOT trigger. Catches false positives where the skill activates too eagerly.
This prompt taxonomy is one of OpenAI's strongest contributions. Most developers only test the happy path. The negative control is what catches the subtle bugs: a skill that fires when it shouldn't can be worse than a skill that doesn't fire when it should.
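Grading trigger behavior can then be one deterministic check per prompt: did the observed invocation match the should_trigger flag? The sketch below assumes a captured trace in which skill invocations appear as events with a skill_invocation type and a name field, plus per-case trace files under traces/; both are placeholders, since real trace schemas differ by platform.

```python
import csv
import json

def skill_triggered(trace_events: list[dict], skill_name: str) -> bool:
    """Did any trace event invoke the named skill? (placeholder event shape)"""
    return any(
        event.get("type") == "skill_invocation" and event.get("name") == skill_name
        for event in trace_events
    )

def grade_trigger_case(case: dict, trace_events: list[dict]) -> bool:
    """Pass when observed behavior matches the CSV's should_trigger flag."""
    expected = case["should_trigger"] == "true"
    return skill_triggered(trace_events, "setup-demo-app") == expected

with open("trigger-prompts.csv", newline="") as f:
    for case in csv.DictReader(f):
        with open(f"traces/{case['id']}.json") as t:  # hypothetical per-case trace file
            trace = json.load(t)
        print(case["id"], "PASS" if grade_trigger_case(case, trace) else "FAIL")
```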
Deterministic Graders with codex exec --json
OpenAI leverages codex exec --json to produce structured JSONL traces of every agent action. This enables deterministic, code-based checks:
import { existsSync } from "node:fs";
import path from "node:path";

// Did the agent run npm install?
function checkRanNpmInstall(events) {
  return events.some(
    (e) =>
      (e.type === "item.started" || e.type === "item.completed") &&
      e.item?.type === "command_execution" &&
      typeof e.item?.command === "string" &&
      e.item.command.includes("npm install")
  );
}

// Did package.json get created?
function checkPackageJsonExists(projectDir) {
  return existsSync(path.join(projectDir, "package.json"));
}
The value is debuggability. When a check fails, you open the JSONL trace and see exactly what happened: every command, in order. No guesswork.
LLM-as-Judge for Qualitative Checks
For style, conventions, and architecture quality, OpenAI adds a second evaluation layer using the model itself as a judge. Like Anthropic's eval viewer, this provides the qualitative depth that deterministic checks can't capture, but automates it rather than relying on human review. The key is structuring the output:
codex exec --full-auto --output-format json \
"Evaluate the demo-app repository against these requirements:
- Vite + React + TypeScript project exists
- Tailwind configured via Vite plugin
- Functional components only (no class components)
- Consistent file naming
Respond in JSON with: criterion, pass (bool), evidence (string)"
This produces a structured rubric score that your harness can aggregate across runs:
# NOTE: This code is an illustrative synthesis combining patterns from
# both frameworks into a single example. It is not taken directly from
# either Anthropic's or OpenAI's codebase. Adapt to your needs.
rubric = {
    "criteria": [
        {
            "name": "typescript_usage",
            "description": "All components use TypeScript with proper type annotations",
            "weight": 1.0,
        },
        {
            "name": "functional_components",
            "description": "No class components; all components are functional with hooks",
            "weight": 1.0,
        },
        {
            "name": "tailwind_only",
            "description": "Styling uses only Tailwind utility classes, no CSS modules or inline styles",
            "weight": 1.0,
        },
        {
            "name": "code_organization",
            "description": "Clean file structure with logical component separation",
            "weight": 0.5,
        },
    ]
}
# The judge model evaluates each criterion and returns:
# { "criterion": "typescript_usage", "pass": true, "evidence": "All .tsx files use proper..." }
Both frameworks converge on this two-layer evaluation architecture: deterministic checks for speed and reliability (Layer 1), and qualitative assessment, whether via human review or LLM-as-judge, for depth and nuance (Layer 2). Neither layer alone is sufficient. Deterministic checks catch regressions immediately but miss subtle quality issues. Qualitative grading catches those issues but is slower and less reproducible.
5. The Convergence: What Both Got Right
Despite developing independently, Anthropic and OpenAI converged on the same fundamental pattern:
| Aspect | Anthropic (skill-creator) | OpenAI (Codex evals) |
|---|---|---|
| Core loop | prompt → run → grade → iterate | prompt → trace → checks → score |
| Success definition | Upfront, before writing the skill | Upfront, four-category framework |
| Test cases | evals.json with assertions | CSV with should_trigger flag |
| Deterministic checks | Grading scripts + grading.json | Node.js graders over JSONL traces |
| Qualitative eval | eval-viewer with human review | LLM-as-judge with structured rubrics |
| Baselines | With-skill vs. without-skill | Implicit (not built into the framework) |
| Iteration tracking | Workspace with iteration directories | Manual (up to the developer) |
| Trigger testing | Implicit through test prompts | Explicit with negative controls |
The convergence on these principles is more significant than any individual implementation choice:
- Define success before writing the skill. Both require upfront specification.
- Test with real prompts, not unit tests. The input is natural language, not function arguments.
- Layer deterministic and qualitative evaluation. Neither alone is sufficient.
- Make results comparable over time. Scores must be trackable across iterations.
- Treat skills as software with a development lifecycle. Draft → test → iterate → expand test suite.
6. Practical Guide: Evaluating Your Own Skills
Here’s a concrete workflow for adding evaluation to any agent skill, synthesized from both frameworks:
Step 1: Define Success Criteria
Before touching the skill, write down what “correct” means:
## Success Criteria for my-skill
### Must Pass (Deterministic)
- [ ] Creates expected output files
- [ ] Runs without errors
- [ ] Follows specified file structure
### Should Pass (Qualitative)
- [ ] Code follows project conventions
- [ ] Documentation is clear
- [ ] No unnecessary dependencies added
### Efficiency Targets
- [ ] Completes in < 60 seconds
- [ ] Uses < 50K tokens
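Targets like these are easiest to keep honest when they live next to the harness as data. A minimal sketch, assuming the timing.json format used earlier; the thresholds are the illustrative ones from the checklist, not recommended values:

```python
import json
from pathlib import Path

EFFICIENCY_TARGETS = {"duration_ms": 60_000, "total_tokens": 50_000}

def check_efficiency(timing_path: Path) -> dict:
    """Compare one run's measurements against the declared targets."""
    timing = json.loads(timing_path.read_text())
    return {metric: timing[metric] <= limit for metric, limit in EFFICIENCY_TARGETS.items()}

print(check_efficiency(Path("skill-workspace/iteration-1/eval-01-scaffold-basic/with_skill/timing.json")))
```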
Step 2: Write 10 Test Prompts
Cover the four trigger categories:
| Category | Count | Purpose |
|---|---|---|
| Explicit invocation | 2-3 | Skill works when directly referenced |
| Implicit invocation | 3-4 | Skill triggers from natural descriptions |
| Contextual invocation | 2-3 | Skill works with domain-specific noise |
| Negative control | 2-3 | Skill does NOT trigger for adjacent requests |
Step 3: Run and Capture Traces
Use whatever trace format your agent supports:
- Claude Code: Transcript capture in workspace directories
- Codex: codex exec --json for JSONL traces
- Other agents: Structured logging of actions and outputs
Step 4: Grade
Run your deterministic checks first. Then apply LLM-as-judge rubrics for qualitative assessment. Save everything to a structured format for comparison.
Step 5: Compare Against Baselines
Run the same prompts without the skill. Compare:
- Pass rates (with-skill vs. without-skill)
- Execution time
- Token usage
- Output quality scores
If the with-skill run doesn't measurably outperform the baseline, your skill needs work, or might not be needed at all.
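A sketch of that comparison, assuming each run's boolean check results were saved to a grading.json file in its variant directory (the filename follows the grading step from Section 3; the file's shape, a flat dict of booleans, is an assumption):

```python
import json
from pathlib import Path

def pass_rate(iteration_dir: Path, variant: str) -> float:
    """Fraction of all boolean assertions that passed across one variant's evals."""
    outcomes = []
    for grading_file in iteration_dir.glob(f"eval-*/{variant}/grading.json"):
        outcomes.extend(json.loads(grading_file.read_text()).values())
    return sum(outcomes) / len(outcomes)

iteration = Path("skill-workspace/iteration-1")
with_skill = pass_rate(iteration, "with_skill")
baseline = pass_rate(iteration, "without_skill")
print(f"with-skill {with_skill:.0%} vs. baseline {baseline:.0%} ({with_skill - baseline:+.0%})")
```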
Step 6: Iterate and Expand
Fix issues, rerun, compare against your previous iteration. Once stable, expand the test set with edge cases you discovered during development.
7. Known Limitations of Skill Evaluation
Both frameworks represent significant progress, but skill evaluation is still a young discipline with real limitations practitioners should understand:
LLM-as-judge reliability. When you use a model to grade another model's output, you inherit the judge's biases. LLM judges can be inconsistent across runs, biased toward verbose outputs, or unable to detect subtle domain-specific errors. In high-stakes domains, LLM-as-judge should complement, not replace, human review.
Cost of dual runs. Anthropic's with-skill vs. without-skill baseline comparison is powerful but expensive. Every eval effectively doubles your inference cost: two full agent runs per test prompt. For skill developers iterating frequently on a suite of 10-20 prompts, this adds up fast. Teams need to budget for eval costs the same way they budget for CI compute.
Reproducibility challenges with non-deterministic outputs. Even with the same model, temperature, and prompt, agent outputs vary between runs. A skill might pass 8/10 evals one run and 6/10 the next, not because anything changed, but because of inherent stochasticity. This makes it hard to distinguish genuine regressions from noise. Running multiple trials per prompt helps but multiplies cost further.
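One practical mitigation is to treat each suite run's pass rate as a sample statistic rather than a point value. A minimal sketch using a normal-approximation confidence interval (the trial counts match the 8/10 vs. 6/10 example above; the approximation is rough at small n):

```python
import math

def pass_rate_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (normal approximation)."""
    p = passes / trials
    margin = z * math.sqrt(p * (1 - p) / trials)
    return (max(0.0, p - margin), min(1.0, p + margin))

# 8/10 on one run and 6/10 on the next produce heavily overlapping intervals,
# so the difference is consistent with noise rather than a regression.
print(pass_rate_interval(8, 10))  # ≈ (0.55, 1.00)
print(pass_rate_interval(6, 10))  # ≈ (0.30, 0.90)
```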
Environment-dependent flakiness. Skills that invoke external tools (npm install, git clone, API calls) depend on local environment state. An eval that passes on your machine may fail in CI or on a teammate’s laptop. As one practitioner noted: “local eval runs can be fragile and memory-heavy; reproducibility gets messy when evals depend on local env state.”
Assertion design is its own skill. Writing good eval assertions is harder than it looks. Assertions that are too strict break on valid output variations. Assertions that are too loose pass on garbage. Finding the right level of specificity (checking for the right files without mandating exact content) requires iteration and judgment.
These limitations don't invalidate skill evaluation; they make it more important to approach it with realistic expectations. Start with deterministic checks where you can, use LLM-as-judge for qualitative dimensions you can't automate, and run enough trials to distinguish signal from noise.
8. Beyond Coding: Why Skill Eval Is a Governance Question
Agent skills are rapidly expanding beyond coding agents. Skills now exist for:
- Financial operations: processing invoices, reconciling accounts
- Content creation: writing in specific brand voices, formatting for platforms
- Infrastructure management: deploying services, managing configurations
- Customer support: handling specific ticket types with defined workflows
- Data analysis: running standardized reports with specific methodologies
As agents gain access to real-world actions, such as making payments (as we saw with the DBS-Visa agentic commerce pilot), booking services, and modifying infrastructure, the question "does this skill actually work?" stops being a developer-convenience question and becomes a governance question.
A concrete example: Imagine an agent skill for processing expense reports. An employee submits a $100,000 charge with minimal documentation. The skill is supposed to flag expenses over $10,000 for manager approval, but due to a prompt edge case, it approves the charge automatically. Without an eval, you don't catch this until an audit months later, after the money has moved. With an eval, you catch it in the dev loop: a test prompt with an oversized expense amount, a deterministic assertion that the output includes a flag for manual review, and a negative control confirming routine expenses flow through normally. The cost of building that eval is trivial compared to the cost of a six-figure misapproval.
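That eval fits the same deterministic-check pattern used throughout this article. A sketch, where the threshold comes from the scenario above and the output shape (a status field) is an assumption about how the skill reports its decision:

```python
APPROVAL_THRESHOLD = 10_000  # expenses above this must be flagged, per the skill's policy

def check_expense_handling(result: dict, amount: float) -> bool:
    """Deterministic assertion: oversized expenses carry a manual-review flag."""
    if amount > APPROVAL_THRESHOLD:
        return result.get("status") == "flagged_for_manual_review"
    return result.get("status") == "approved"

# Positive case: the $100,000 charge must be flagged, never auto-approved.
assert check_expense_handling({"status": "flagged_for_manual_review"}, 100_000)
# Negative control: a routine expense flows through normally.
assert check_expense_handling({"status": "approved"}, 250)
```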
Consider the parallel to traditional software:
- In the 2000s, open-source packages had no standard testing. npm packages shipped without test suites. The ecosystem matured when testing became expected.
- In the 2020s, API security and compliance testing became mandatory for production systems. OWASP, SOC2, and ISO 27001 drove standardization.
- In 2026, agent skills are at the same inflection point. The ecosystem is exploding, quality is variable, and systematic evaluation is just now emerging.
A Cloud Security Alliance survey from early 2026 found that only 18% of security leaders are confident their identity and access management systems can handle AI agents, and 80% of organizations deploying autonomous AI can't trace what their agents are doing in real time. You can't govern what you can't measure, and skill evaluation is how you start measuring.
As one practitioner from the tessl.io community put it:
“The first version of a skill often feels helpful but isn’t measurable. Evals force you to define scenarios + assertions, which is how you discover whether your skill actually changes outcomes or just adds tokens.”
This captures the governance problem precisely: without evaluation, you’re deploying capabilities you can’t verify into workflows with real consequences.
The companies and teams that build eval into their skill development workflow now will have an asymmetric advantage. Not because their agents are smarter, but because they can prove they work correctly, consistently, and within defined boundaries.
9. What’s Next
The skill ecosystem went from zero to open standard in 90 days. The evaluation ecosystem is roughly 90 days behind. Here’s what I expect to see in the next 6-12 months:
Standardized eval formats. Just as SKILL.md became a cross-platform standard, expect evals.json or a similar format to standardize how skill tests are defined and shared. SkillsBench, the first benchmark specifically designed to evaluate how agents use skills, is an early indicator of this trend.
CI/CD integration. Skill evals will become part of continuous integration pipelines. When you update a skill, automated tests verify it hasn't regressed, exactly as software tests run on every PR today.
Skill marketplaces with quality scores. Community skill repositories (like the 100+ on awesome-agent-skills) will start showing evaluation metrics: pass rates, efficiency scores, platform compatibility results. Skills without evals will be treated like npm packages without tests: usable but risky.
Cross-model evaluation. As the same skill runs across Claude, GPT, Gemini, and open-source models, evaluation frameworks will need to handle model-specific behaviors. A skill that works perfectly with Claude Opus might fail with GPT-4o, and evaluation needs to surface this.
Governance frameworks. For enterprise adoption, skill evaluation will integrate with broader AI governance: audit trails, compliance checks, access control policies. The trust layer for agentic commerce that Visa (Trusted Agent Protocol) and Mastercard (Verifiable Intent) are building will eventually need to verify not just agent identity, but agent capability.
Key Takeaways
- Agent skills are software. Treat them accordingly. They need tests, version control, and quality assurance, not just vibes.
- Define success before you write the skill. Both Anthropic and OpenAI converge on this: upfront specification prevents ambiguity.
- Use two-layer evaluation. Deterministic checks for speed and reliability. LLM-as-judge rubrics for depth and nuance. Neither alone is sufficient.
- Test triggers, not just outputs. A skill that fires at the wrong time can be worse than a skill that doesn’t fire at all. Include negative test cases.
- Measure against baselines. The only way to know if a skill helps is to compare with-skill vs. without-skill performance on the same prompts.
- Start small, expand over time. 10-20 test prompts is enough to catch regressions. Grow the suite as you discover edge cases.
- Know the limits. LLM-as-judge is imperfect, dual runs are expensive, and non-determinism makes reproducibility hard. Plan for these realities.
- The eval gap is a governance gap. As agents take real-world actions, skill evaluation becomes a requirement, not an optimization.
References
- Anthropic skill-creator with eval framework: github.com/anthropics/skills/skills/skill-creator
- OpenAI, “Testing Agent Skills Systematically with Evals”: developers.openai.com/blog/eval-skills
- Agent Skills open standard history: laurentkempe.com, "Agent Skills: From Claude to Open Standard"
- OpenAI’s quiet adoption of skills (Dec 2025): simonwillison.net
- Awesome Agent Skills (100+ curated): github.com/VoltAgent/awesome-agent-skills
- DBS Bank and Visa Intelligent Commerce pilot: dbs.com/newsroom
- Cloud Security Alliance, Agent Identity Governance survey: strata.io/blog
Note: The views and opinions expressed in this article are my own and do not represent those of my employer. This article is written for educational purposes and reflects my personal exploration of the AI agent ecosystem.