Is That Agent Skill Any Good? A Complete Framework for Evaluating AI Agent Skills
- The Trust Problem: Agent Skills Are the New npm Packages
- Four Dimensions of Skill Evaluation
- For Skill Users: Evaluating Third-Party Skills
- For Skill Authors: Proving Your Skill Works
- Continuous Quality with CI/CD
- Real-World Proof: The PR Naming Convention Story
- How We Know It Works: Meta-Evaluation
- The Bigger Picture
- Get Started
You just installed a skill for your AI agent. It fetches weather data, summarizes documents, manages your calendar — exactly what you needed. It works on the first try. Ship it.
But how do you actually know it’s any good?
Not just safe — though that matters. Does it produce correct output? Does it activate when it should and stay quiet when it shouldn’t? Will it still work next week after you update your agent framework? And if you built the skill yourself, how do you prove to your users that it actually does what you claim?
These aren’t hypothetical questions. As agent skill marketplaces grow, they’re the questions every team asks — and most answer with vibes instead of data.
The Trust Problem: Agent Skills Are the New npm Packages
Agent skills — also called plugins, tools, or extensions — are the fastest-growing way to extend AI agents. They let third-party code access your agent’s tools, your files, and your data. Think of them as npm packages for the AI era: incredibly useful, wildly under-evaluated.
The trust problem goes beyond security. When you npm install a package, you’re asking three questions:
- Is it safe? Does it contain malware, leak credentials, or execute arbitrary code?
- Does it work? Does it actually do what the README claims, and do it correctly?
- Is it reliable? Will it keep working as my project evolves?
We’ve built sophisticated answers for npm. We have npm audit for security, test suites for correctness, and CI pipelines for regression. Agent skills have none of this infrastructure — until now.
The gap is especially dangerous because agent skills operate with more authority than a typical npm package. An npm module runs in a sandboxed Node.js process. A malicious or broken agent skill can instruct your agent to read private files, execute shell commands, send data to external servers, and modify system settings — all through the natural language interface the agent was designed to obey.
You need data, not vibes.
Four Dimensions of Skill Evaluation
A complete skill evaluation framework must cover four dimensions. Security alone isn’t enough — a perfectly safe skill that gives wrong answers is still a bad skill. And a skill that works today but breaks silently next month is a ticking time bomb.
1. Safety (Audit)
The first gate: does this skill contain anything dangerous?
Static analysis catches hardcoded secrets, injection surfaces, dangerous shell patterns, supply chain risks, and over-privileged permission requests. Every finding gets a severity level (critical, warning, info) and a concrete fix suggestion.
What the audit detects:
| Category | What It Catches |
|---|---|
| Secrets (SEC-001) | API keys, tokens, passwords, database connection strings |
| Exfiltration (SEC-002) | External URLs that could be data exfiltration channels |
| Shell execution (SEC-003) | subprocess.run(), os.system(), shell=True |
| Supply chain (SEC-004) | curl \| bash, unpinned pip install |
| Prompt injection (SEC-005) | User input referenced in executable contexts |
| Deserialization (SEC-006) | pickle.load(), yaml.load() without safe loaders |
| Dynamic imports (SEC-007) | importlib.import_module(), __import__() |
| Obfuscation (SEC-008) | Base64-encoded executable payloads |
| MCP servers (SEC-009) | External Model Context Protocol servers |
| Structure (STR-*) | Malformed frontmatter, missing fields, naming violations |
| Permissions (PERM-*) | Bash(*), sudo access, excessive tool grants |
Scoring starts at 100 and deducts per finding: critical (-25), warning (-10), info (-2). The result maps to a letter grade: A (90+), B (80+), C (70+), D (60+), F (<60).
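The deduction model is simple enough to sketch in a few lines of Python (a simplified illustration of the scheme described above, not the tool's actual implementation):

```python
# Simplified sketch of the audit scoring model: start at 100, deduct
# per finding by severity, floor at 0, map to a letter grade.
DEDUCTIONS = {"critical": 25, "warning": 10, "info": 2}
GRADE_BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def audit_score(findings):
    """findings: list of severity strings, e.g. ["critical", "info"]."""
    return max(100 - sum(DEDUCTIONS[sev] for sev in findings), 0)

def letter_grade(score):
    return next((grade for floor, grade in GRADE_BANDS if score >= floor), "F")

# Two info-level findings: 100 - 2 - 2 = 96, grade A.
assert letter_grade(audit_score(["info", "info"])) == "A"
```

Note the floor at zero: a skill riddled with critical findings bottoms out at 0/F rather than going negative.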
Scan scope matters. By default, the audit scans only the files an agent actually reads and executes: SKILL.md, scripts/, and agents/. This matches the agentskills.io definition of skill content and avoids false positives from test fixtures or documentation that describes security patterns without being vulnerable.
Use --include-all to scan the entire directory tree — useful for full repository security reviews, but expect findings from test data and examples. skill-eval itself demonstrates this perfectly: the default audit scores 96/A, while --include-all drops to 0/F because our test fixtures intentionally contain every security anti-pattern we detect. That’s by design — you need bad examples to test a security scanner.
2. Quality (Functional Evaluation)
The second gate: does this skill actually make your agent better?
This is the dimension most people skip — and arguably the most important. A skill that passes every security check but produces wrong output is worse than useless; it gives you false confidence.
Functional evaluation runs test cases with the skill installed and without it, then grades the difference. Each eval case includes a prompt, expected behavior, and assertions. The grading covers four sub-dimensions:
- Outcome: Did the agent produce the correct result?
- Process: Did it follow a reasonable approach?
- Style: Is the output well-formatted and usable?
- Efficiency: Did it avoid unnecessary steps?
The with-vs-without comparison is critical. It answers the fundamental question: does installing this skill actually improve agent behavior? If the agent performs just as well without the skill, the skill isn’t adding value.
3. Reliability (Trigger Evaluation)
The third gate: does this skill activate at the right times?
A skill that triggers on every prompt is noisy. A skill that never triggers is useless. Trigger evaluation tests activation precision across two sets of queries:
- Positive queries: prompts that should activate the skill (e.g., “What’s the weather in Seattle?” for a weather skill)
- Negative queries: prompts that should not activate it (e.g., “Write me a poem” should not trigger a weather skill)
The pass rate measures how often the skill correctly activates (or correctly stays silent). Low precision means your skill is hijacking unrelated conversations. Low recall means it’s missing the prompts it was built for.
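Precision, recall, and the pass rate all fall out of a simple tally over the two query sets. A minimal sketch of that tally (illustrative, not the tool's code):

```python
# Each result pairs the expected activation (positive vs. negative query)
# with what actually happened when the prompt ran.
def trigger_metrics(results):
    """results: list of (should_activate, did_activate) boolean pairs."""
    tp = sum(1 for exp, act in results if exp and act)        # correct fires
    fp = sum(1 for exp, act in results if not exp and act)    # hijacks
    fn = sum(1 for exp, act in results if exp and not act)    # misses
    precision = tp / (tp + fp) if tp + fp else 1.0  # no fires = no hijacks
    recall = tp / (tp + fn) if tp + fn else 1.0
    pass_rate = sum(1 for exp, act in results if exp == act) / len(results)
    return {"precision": precision, "recall": recall, "pass_rate": pass_rate}
```

Low precision means false positives (hijacked conversations); low recall means false negatives (missed prompts).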
4. Regression & Lifecycle (Snapshot, Diff & Version Tracking)
The fourth gate: does this skill stay good over time?
Skills change. Dependencies update. Agent frameworks evolve. What worked last sprint might not work this sprint. Regression testing captures a baseline snapshot of your skill’s evaluation results and compares future runs against it.
A regression is detected when:
- New critical findings appear that weren’t in the baseline
- The audit score drops by more than 5 points
Beyond snapshots, lifecycle management tracks skill versions through SHA-256 fingerprinting. Every file in the skill directory gets hashed, and changes are detected at file-level granularity — added, modified, or deleted. This answers the question: “what changed since the last time I evaluated this skill?”
# Save a version checkpoint
skill-eval lifecycle ./my-skill --save --label v1.0
# Later: detect what changed
skill-eval lifecycle ./my-skill
# → Changes detected: SKILL.md modified, scripts/helper.py added
# Auto-trigger regression when changes are detected
skill-eval lifecycle ./my-skill --auto-regression
This turns evaluation from a one-time check into a continuous quality gate — with full version history so you can trace exactly when and what changed.
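The fingerprinting itself amounts to hashing every file and diffing manifests. A minimal sketch of the idea (not the tool's actual code):

```python
import hashlib
from pathlib import Path

def fingerprint(skill_dir):
    """Map each file's relative path to its SHA-256 hex digest."""
    root = Path(skill_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def diff_manifests(old, new):
    """File-level change detection between two saved fingerprints."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    modified = sorted(f for f in old.keys() & new.keys() if old[f] != new[f])
    return {"added": added, "modified": modified, "deleted": deleted}
```

Because the manifest is keyed by relative path, any rename shows up as a delete plus an add, and any content change shows up as a modification, regardless of timestamps.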
The Unified Grade
Three dimensions combine into a single weighted score. The fourth — regression and lifecycle — acts as an independent pass/fail gate that can block deployment regardless of score:
| Dimension | Weight | What It Measures |
|---|---|---|
| Audit (Safety) | 40% | Security, structure, permissions |
| Functional (Quality) | 40% | Correctness with-skill vs without-skill |
| Trigger (Reliability) | 20% | Activation precision and recall |
The result is a 0–100 score with an A–F letter grade. If a phase is skipped (e.g., no eval cases defined), its weight redistributes to the remaining phases. Regression and lifecycle checks are tracked separately as pass/fail gates — they don’t affect the weighted score but can block deployment independently.
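The weighting and redistribution logic can be sketched in a few lines (an illustration of the scheme above, not the tool's implementation):

```python
# Weighted combination with redistribution: skipped phases simply aren't
# present in the input, and their weight flows to the phases that ran.
WEIGHTS = {"audit": 0.40, "functional": 0.40, "trigger": 0.20}

def unified_score(phase_scores):
    """phase_scores: dict of phase name -> normalized 0-1 score."""
    ran = {p: WEIGHTS[p] for p in phase_scores}
    total = sum(ran.values())
    return sum(phase_scores[p] * w / total for p, w in ran.items())

# All three phases ran: 0.4*0.96 + 0.4*0.91 + 0.2*0.88 = 0.924
assert abs(unified_score({"audit": 0.96, "functional": 0.91, "trigger": 0.88}) - 0.924) < 1e-9
```

With trigger evaluation skipped, audit and functional each carry 50% of the weight, so a 0.9/0.8 pair yields 0.85.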
Cost Efficiency: The Hidden Fifth Dimension
There’s a dimension that doesn’t appear in the weighted score but matters enormously in practice: cost efficiency. A skill that improves quality by 5% but triples token usage may not be worth deploying. Conversely, a skill that maintains quality while reducing token consumption is a clear operational win.
When running functional evaluation (with-skill vs without-skill), track both the quality delta and the token delta. The combination reveals the true value of a skill:
| Classification | Quality | Cost | Verdict |
|---|---|---|---|
| 🟢 Pareto Better | Same or better | Same or lower | Install — pure upside |
| 🟡 Tradeoff | Better | Higher | Evaluate ROI — is the quality gain worth the cost? |
| 🟠 Cheaper but Weaker | Worse | Lower | Not recommended unless cost is the primary constraint |
| 🔴 Pareto Worse | Same or worse | Same or higher | Do not install — no upside |
Quality remains the gate: if a skill degrades output quality beyond a threshold, it should be flagged regardless of cost savings. But when quality is maintained, cost efficiency becomes the tiebreaker. The best skills improve both — they make agents smarter and more efficient, because well-designed knowledge injection helps the agent reason more directly instead of exploring from first principles.
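The classification table reduces to a small decision function. A sketch, where the deltas are with-skill minus without-skill and the quality_floor threshold is an illustrative assumption of ours:

```python
# Pareto classification of a skill from its quality and cost deltas.
# quality_delta: positive = better output; cost_delta: positive = more tokens.
def classify_skill(quality_delta, cost_delta, quality_floor=-0.05):
    if quality_delta < quality_floor:
        return "flagged"             # quality is the gate, regardless of cost
    if quality_delta >= 0 and cost_delta <= 0:
        return "pareto_better"       # install: pure upside
    if quality_delta > 0 and cost_delta > 0:
        return "tradeoff"            # evaluate ROI
    if quality_delta < 0 and cost_delta < 0:
        return "cheaper_but_weaker"  # only if cost is the primary constraint
    return "pareto_worse"            # do not install
```

The first branch encodes the gate: a large enough quality regression is flagged even when it saves tokens.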
For Skill Users: Evaluating Third-Party Skills
You found a skill on a marketplace. Before you install it, run one command:
skill-eval report /path/to/skill
This produces a unified report across all dimensions:
═══════════════════════════════════════════
Unified Skill Report
═══════════════════════════════════════════
Skill: weather-skill
Overall Grade: A (0.92)
───────────────────────────────────────────
Audit: 96/100 (A) ████████████████░░
Functional: 0.91 (A) ████████████████░░
Trigger: 0.88 (B) ███████████████░░░
───────────────────────────────────────────
Result: PASSED
═══════════════════════════════════════════
Now compare with a suspicious skill:
═══════════════════════════════════════════
Unified Skill Report
═══════════════════════════════════════════
Skill: Bad_Skill
Overall Grade: F (0.12)
───────────────────────────────────────────
Audit: 0/100 (F) ░░░░░░░░░░░░░░░░░░
Functional: 0.35 (F) ██████░░░░░░░░░░░░
Trigger: 0.25 (F) ████░░░░░░░░░░░░░░
───────────────────────────────────────────
Result: FAILED
═══════════════════════════════════════════
🔴 [SEC-001] Secret detected: Generic Password
File: SKILL.md:20
Fix: Remove the secret. Use environment variables
or a secrets manager instead.
🔴 [SEC-004] Unsafe install: curl | sh
File: SKILL.md:29
Fix: Pin dependencies in a requirements file.
Never pipe curl output to shell.
The grade difference tells the story immediately: A vs F. But you can also drill into any dimension.
Reading the Report
- Audit score < 60 (F): Don’t install. Critical security findings present.
- Functional score < 0.5: The skill doesn’t improve agent behavior — or makes it worse.
- Trigger rate < 0.5: The skill fires on the wrong prompts or misses the right ones.
- Overall grade < C: The skill has significant issues in at least one dimension.
Setting Team Policies
For teams managing shared agent configurations, skill-eval’s exit codes enable policy enforcement:
# Exit code 0 = passed, 1 = warnings/regressions, 2 = critical findings
skill-eval audit /path/to/skill --fail-on-warning
# Enforce a minimum grade: check the report output, or parse --format json
skill-eval report /path/to/skill
A simple team policy: “No skill with an overall grade below B gets installed in production.” The --format json flag makes scores parseable by scripts and dashboards.
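A minimal grade gate might look like this in Python. The overall_grade field is the same one the jq-based CI check reads; the ordering logic here is illustrative:

```python
# Hypothetical policy check over a parsed `skill-eval report --format json`
# result. Grades later in GRADE_ORDER are better.
GRADE_ORDER = "FDCBA"

def meets_policy(report, minimum="B"):
    """True if the report's overall grade meets the team minimum."""
    return GRADE_ORDER.index(report["overall_grade"]) >= GRADE_ORDER.index(minimum)

assert meets_policy({"overall_grade": "A"})
assert not meets_policy({"overall_grade": "C"})
```

A dashboard or pre-install hook can call this on every candidate skill and refuse anything below the team's bar.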
For Skill Authors: Proving Your Skill Works
Building a skill that earns an A requires more than avoiding security anti-patterns. You need to prove it works, prove it triggers correctly, and prove it stays working.
Start with Anthropic’s skill-creator
Anthropic’s skill-creator provides a structured workflow for building skills. It includes its own evaluation tooling — run_eval.py for running evals, aggregate_benchmark.py for scoring, and improve_description.py for iterating on your skill description based on eval results.
skill-eval complements skill-creator. Where skill-creator helps you build and iterate on a skill, skill-eval helps you evaluate and gate it — adding security audit, unified grading, regression testing, and CI integration. The data schemas are compatible: skill-eval reads the same evals.json format that skill-creator produces.
The workflow: create with skill-creator → evaluate with skill-eval → iterate → deploy with CI gates.
Scaffold Your Eval Files
Once your skill exists, generate the evaluation structure:
skill-eval init /path/to/your-skill
This creates two files:
- evals/evals.json — functional test cases with prompts, expected outputs, and assertions
- evals/eval_queries.json — trigger queries tagged as positive (should activate) or negative (should not)
The generated templates use your skill’s name and description from its frontmatter. They’re starting points — you’ll want to replace them with meaningful test cases.
Write Meaningful Eval Cases
The default scaffolds are placeholders. Good eval cases test the substance of your skill, not just “does it respond.” For a weather skill:
{
"id": "weather-current-city",
"prompt": "What is the current weather in Seattle?",
"expected_output": "Current weather conditions for Seattle",
"assertions": [
{"type": "contains", "value": "Seattle"},
{"type": "contains", "value": "temperature"},
{"type": "regex", "value": "\\d+°[CF]"},
{"type": "min_lines", "value": 3}
]
}
Assertions can be deterministic (contains, regex, JSON structure checks) or semantic (LLM-graded for subjective quality). See examples/golden-evals/ in the repo for templates and patterns.
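For the deterministic kinds, a checker is only a few lines. A sketch covering the assertion types in the example above (the real tool also handles semantic, LLM-graded assertions):

```python
import re

# Minimal checker for the deterministic assertion shapes shown above.
def check_assertion(output, assertion):
    kind, value = assertion["type"], assertion["value"]
    if kind == "contains":
        return value in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "min_lines":
        return len(output.splitlines()) >= value
    raise ValueError(f"unsupported assertion type: {kind}")

sample = "Weather for Seattle\nTemperature: 54°F\nConditions: light rain"
assert all(check_assertion(sample, a) for a in [
    {"type": "contains", "value": "Seattle"},
    {"type": "regex", "value": r"\d+°[CF]"},
    {"type": "min_lines", "value": 3},
])
```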
Run Functional Evaluation
skill-eval functional /path/to/your-skill
This runs each eval case twice — once with the skill installed, once without — and grades both. The comparison reveals whether your skill is actually adding value. If the without-skill baseline scores nearly as high, your skill isn’t contributing enough to justify installation.
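The comparison logic reduces to an average per-case delta. A sketch, where the min_lift threshold is an illustrative assumption:

```python
# With-vs-without comparison: the skill must beat the no-skill baseline
# by a meaningful margin to justify installation.
def skill_lift(with_scores, without_scores):
    """Mean per-case improvement; near zero means the skill adds little."""
    deltas = [w - wo for w, wo in zip(with_scores, without_scores)]
    return sum(deltas) / len(deltas)

def adds_value(with_scores, without_scores, min_lift=0.05):
    return skill_lift(with_scores, without_scores) >= min_lift

# Agent does nearly as well without the skill: not worth installing.
assert not adds_value([0.80, 0.82], [0.79, 0.81])
```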
Run Trigger Evaluation
skill-eval trigger /path/to/your-skill
This tests each query in eval_queries.json and checks whether the skill activates correctly. The pass rate tells you how precise your skill’s trigger conditions are.
Common issues trigger evaluation catches:
- Over-broad triggers: skill activates on generic prompts like “help me with something”
- Under-specific triggers: skill misses prompts that clearly fall within its domain
- Keyword collision: skill triggers on prompts that contain related words but aren’t actually relevant
The Create → Evaluate → Improve Loop
The development workflow looks like this:
1. Build the skill with skill-creator
2. Scaffold evals: skill-eval init your-skill
3. Write meaningful eval cases and trigger queries
4. Audit: skill-eval audit your-skill — fix all critical findings
5. Test: skill-eval report your-skill — verify functional quality and trigger reliability
6. Snapshot: skill-eval snapshot your-skill — save your audit baseline
7. Track: skill-eval lifecycle your-skill --save --label v1.0 — record version fingerprint
8. Iterate: use skill-creator’s improve_description.py to refine, re-run skill-eval report to verify
9. Ship: add CI workflow and merge with confidence
Each iteration should improve your unified grade. The snapshot baseline ensures you never silently regress.
Continuous Quality with CI/CD
Evaluation isn’t a one-time activity. Skills change, dependencies update, and agent frameworks evolve. CI integration turns skill evaluation into a continuous quality gate.
Regression & Lifecycle Gate
After establishing a baseline with skill-eval snapshot and a version checkpoint with skill-eval lifecycle --save, every subsequent change can be checked:
# Detect file-level changes since last version
skill-eval lifecycle /path/to/skill
# Run regression against audit baseline
skill-eval regression /path/to/skill
══════════════════════════════════════════════════════════
Regression Check Report
══════════════════════════════════════════════════════════
Baseline: v1.2.0 (96/A)
Current: 88/B
Delta: -8 points
──────────────────────────────────────────────────────────
Result: ❌ FAILED — Regression detected: 1 new critical
finding, score 96 → 88
══════════════════════════════════════════════════════════
🔴 New findings (1):
[CRITICAL] SEC-003: Shell execution via subprocess
File: SKILL.md:45
✅ Resolved findings (0):
Summary: 1 new | 0 resolved | 4 unchanged
New critical findings or significant score drops fail the gate. The 5-point tolerance for score drops avoids false alarms from minor info-level changes.
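That gate logic is easy to state precisely. A sketch of the decision rule (the field names are illustrative):

```python
# Regression gate: fail on any new critical finding, or on a score drop
# beyond the 5-point tolerance.
def regression_detected(baseline, current, tolerance=5):
    """baseline/current: {"score": int, "criticals": set of finding IDs}."""
    new_criticals = current["criticals"] - baseline["criticals"]
    score_drop = baseline["score"] - current["score"]
    return bool(new_criticals) or score_drop > tolerance

# The failed report above: 96 -> 88 with one new SEC-003 critical.
assert regression_detected(
    {"score": 96, "criticals": set()},
    {"score": 88, "criticals": {"SEC-003"}},
)
```

Note that the two conditions are independent: a single new critical fails the gate even if the overall score barely moves.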
GitHub Actions Integration
Add skill evaluation as a reusable workflow:
# .github/workflows/skill-eval.yml
name: Skill Evaluation
on:
push:
paths: ['skills/**']
pull_request:
paths: ['skills/**']
jobs:
evaluate:
uses: aws-samples/sample-agent-skill-eval/.github/workflows/skill-eval.yml@main
with:
skill_path: "path/to/your-skill"
run_functional: true
run_trigger: true
The reusable workflow outputs passed, grade, and score, which you can use in downstream jobs. Exit codes make integration straightforward: 0 means passed, 1 means warnings or regressions, 2 means critical findings.
Team Governance
Combine the tools into an automated quality gate:
# In your PR workflow
- name: Evaluate skill
run: skill-eval report skills/my-skill --format json > eval-result.json
- name: Check regression
run: skill-eval regression skills/my-skill
- name: Enforce minimum grade
run: |
grade=$(jq -r '.overall_grade' eval-result.json)
if [[ "$grade" == "F" || "$grade" == "D" ]]; then
echo "Skill grade $grade is below minimum. Blocking merge."
exit 1
fi
This gives teams a repeatable, automated standard: no skill merges to production without passing evaluation.
Real-World Proof: The PR Naming Convention Story
Theory is nice. Does it actually work? We tested skill-eval end-to-end using a realistic scenario: building a company-internal PR naming convention validator.
Every company has its own PR rules — [TEAM-42] feat: add auth or platform/TEAM-42-add-auth. These are exactly the kind of domain-specific knowledge that AI models don’t know, making them an ideal test case.
We built three versions:
| Version | How It Was Built | Score | Key Issues |
|---|---|---|---|
| v1 | Developer wrote it themselves | 39/F | Hardcoded token, eval(), shell=True |
| v2 | Rebuilt using Anthropic Skill Creator | 98/A | Pure regex, clean structure |
| v3 | Someone added a feature | 61/D | pickle.load + shell=True regression |
The most compelling result was functional evaluation. Without the skill, the agent applied generic conventions and approved titles that violated company rules. With the skill, it correctly enforced the format — +17% to +33% functional improvement.
The v2→v3 transition shows regression detection in action: security degraded while features appeared to improve. Without automated evaluation, this would slip through code review.
Full demo: examples/lifecycle-demo/ in the repository.
How We Know It Works: Meta-Evaluation
We validated skill-eval against ground truth across all dimensions.
| Dimension | Accuracy | Notes |
|---|---|---|
| Audit accuracy | 100% | Deterministic (regex, AST, YAML). 582 unit tests. |
| Functional grading | 100% / ~90% | 100% for deterministic assertions. ~90% for LLM-judged. |
| Trigger specificity | 100% | Zero false positives. |
| Trigger recall | 25–100% | Varies. Known framework limitation for CLI-based skills. |
Full methodology in examples/self-eval/.
The Bigger Picture
This is eval-driven development for agents.
The concept isn’t new. Test-driven development transformed how we write code: define expected behavior first, then implement until the tests pass. Eval-driven development applies the same discipline to agent capabilities: define what “good” looks like across safety, quality, reliability, and regression — then build until you meet that bar.
The tooling ecosystem is maturing. Anthropic’s skill-creator helps you build skills with structure and best practices. skill-eval helps you evaluate and gate them with data. Together they form a complete lifecycle:
skill-creator (build) → skill-eval (evaluate) → iterate → deploy with CI → lifecycle (monitor)
The agent ecosystem is where the web ecosystem was fifteen years ago: moving fast, building trust gradually, learning from incidents. The difference is we don’t have to repeat the same mistakes. We can build evaluation infrastructure now, before the first major supply chain incident in an agent marketplace.
Get Started
- Repository: github.com/aws-samples/sample-agent-skill-eval
- Tutorial: See docs/tutorial.md for guided walkthroughs (consumer and author paths)
- Core Concepts: See docs/concepts.md for architecture, scoring, and rule details
- Golden Evals: See examples/golden-evals/ for eval case templates
- Lifecycle Demo: See examples/lifecycle-demo/ for the PR naming convention end-to-end test
- Meta-Evaluation: See examples/self-eval/ for validation methodology
- Anthropic skill-creator: github.com/anthropics/skills/tree/main/skills/skill-creator
- Agent Skills Spec: agentskills.io
Install it. Evaluate your skills. Get data, not vibes.
pip install -e .
skill-eval report /path/to/skill
Related Posts
- Agent Skills: The Quiet Revolution
- Deep Dive: Agent Skills & Skill Evaluation
- Enhance, Don’t Replace: Building Domain Expert Agents
- MCP vs Skills