Is That Agent Skill Any Good? A Complete Framework for Evaluating AI Agent Skills
- The Trust Problem: Agent Skills Are the New npm Packages
- Four Dimensions of Skill Evaluation
- For Skill Users: Evaluating Third-Party Skills
- For Skill Authors: Proving Your Skill Works
- Continuous Quality with CI/CD
- Real-World Proof: The PR Naming Convention Story
- How We Know It Works: Meta-Evaluation
- The Bigger Picture
- Get Started
You just installed a skill for your AI agent. It fetches weather data, summarizes documents, manages your calendar — exactly what you needed. It works on the first try. Ship it.
But how do you actually know it’s any good?
Not just safe — though that matters. Does it produce correct output? Does it activate when it should and stay quiet when it shouldn’t? Will it still work next week after you update your agent framework? And if you built the skill yourself, how do you prove to your users that it actually does what you claim?
These aren’t hypothetical questions. As agent skill marketplaces grow, they’re the questions every team asks — and most answer with vibes instead of data.
The Trust Problem: Agent Skills Are the New npm Packages
Agent skills — also called plugins, tools, or extensions — are the fastest-growing way to extend AI agents. They let third-party code access your agent’s tools, your files, and your data. Think of them as npm packages for the AI era: incredibly useful, wildly under-evaluated.
The trust problem goes beyond security. When you npm install a package, you’re asking three questions:
- Is it safe? Does it contain malware, leak credentials, or execute arbitrary code?
- Does it work? Does it actually do what the README claims, and do it correctly?
- Is it reliable? Will it keep working as my project evolves?
We’ve built sophisticated answers for npm. We have npm audit for security, test suites for correctness, and CI pipelines for regression. Agent skills have none of this infrastructure — until now.
The gap is especially dangerous because agent skills operate with more authority than a typical npm package. An npm module runs in a sandboxed Node.js process. A malicious or broken agent skill can instruct your agent to read private files, execute shell commands, send data to external servers, and modify system settings — all through the natural language interface the agent was designed to obey.
You need data, not vibes.
Four Dimensions of Skill Evaluation
A complete skill evaluation framework must cover four dimensions. Security alone isn’t enough — a perfectly safe skill that gives wrong answers is still a bad skill. And a skill that works today but breaks silently next month is a ticking time bomb.
1. Safety (Audit)
The first gate: does this skill contain anything dangerous?
Static analysis catches hardcoded secrets, injection surfaces, dangerous shell patterns, supply chain risks, and over-privileged permission requests. Every finding gets a severity level (critical, warning, info) and a concrete fix suggestion.
What the audit detects:
| Category | What It Catches |
|---|---|
| Secrets (SEC-001) | API keys, tokens, passwords, database connection strings |
| Exfiltration (SEC-002) | External URLs that could be data exfiltration channels |
| Shell execution (SEC-003) | subprocess.run(), os.system(), shell=True |
| Supply chain (SEC-004) | curl \| bash, unpinned pip install |
| Prompt injection (SEC-005) | User input referenced in executable contexts |
| Deserialization (SEC-006) | pickle.load(), yaml.load() without safe loaders |
| Dynamic imports (SEC-007) | importlib.import_module(), __import__() |
| Obfuscation (SEC-008) | Base64-encoded executable payloads |
| MCP servers (SEC-009) | External Model Context Protocol servers |
| Structure (STR-*) | Malformed frontmatter, missing fields, naming violations |
| Permissions (PERM-*) | Bash(*), sudo access, excessive tool grants |
Scoring starts at 100 and deducts per finding: critical (-25), warning (-10), info (-2). The result maps to a letter grade: A (90+), B (80+), C (70+), D (60+), F (<60).
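The deduction model is simple enough to sketch in a few lines of Python (a simplified illustration of the scheme described above, not the tool's actual implementation):

```python
# Simplified sketch of the audit scoring model: start at 100, deduct
# per finding by severity, floor at 0, map to a letter grade.
DEDUCTIONS = {"critical": 25, "warning": 10, "info": 2}
GRADE_BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def audit_score(findings):
    """findings: list of severity strings, e.g. ["critical", "info"]."""
    return max(100 - sum(DEDUCTIONS[sev] for sev in findings), 0)

def letter_grade(score):
    return next((grade for floor, grade in GRADE_BANDS if score >= floor), "F")

# Two info-level findings: 100 - 2 - 2 = 96, grade A.
assert letter_grade(audit_score(["info", "info"])) == "A"
```

Note the floor at zero: a skill riddled with critical findings bottoms out at 0/F rather than going negative.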
Scan scope matters. By default, the audit scans only the files an agent actually reads and executes: SKILL.md, scripts/, and agents/. This matches the agentskills.io definition of skill content and avoids false positives from test fixtures or documentation that describes security patterns without being vulnerable.
Use --include-all to scan the entire directory tree — useful for full repository security reviews, but expect findings from test data and examples. skill-eval itself demonstrates this perfectly: the default audit scores 96/A, while --include-all drops to 0/F because our test fixtures intentionally contain every security anti-pattern we detect. That’s by design — you need bad examples to test a security scanner.
2. Quality (Functional Evaluation)
The second gate: does this skill actually make your agent better?
This is the dimension most people skip — and arguably the most important. A skill that passes every security check but produces wrong output is worse than useless; it gives you false confidence.
Functional evaluation runs test cases with the skill installed and without it, then grades the difference. Each eval case includes a prompt, expected behavior, and assertions. The grading covers four sub-dimensions:
- Outcome: Did the agent produce the correct result?
- Process: Did it follow a reasonable approach?
- Style: Is the output well-formatted and usable?
- Efficiency: Did it avoid unnecessary steps?
The with-vs-without comparison is critical. It answers the fundamental question: does installing this skill actually improve agent behavior? If the agent performs just as well without the skill, the skill isn’t adding value.
3. Reliability (Trigger Evaluation)
The third gate: does this skill activate at the right times?
A skill that triggers on every prompt is noisy. A skill that never triggers is useless. Trigger evaluation tests activation precision across two sets of queries:
- Positive queries: prompts that should activate the skill (e.g., “What’s the weather in Seattle?” for a weather skill)
- Negative queries: prompts that should not activate it (e.g., “Write me a poem” should not trigger a weather skill)
The pass rate measures how often the skill correctly activates (or correctly stays silent). Low precision means your skill is hijacking unrelated conversations. Low recall means it’s missing the prompts it was built for.
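Precision, recall, and the pass rate all fall out of a simple tally over the two query sets. A minimal sketch of that tally (illustrative, not the tool's code):

```python
# Each result pairs the expected activation (positive vs. negative query)
# with what actually happened when the prompt ran.
def trigger_metrics(results):
    """results: list of (should_activate, did_activate) boolean pairs."""
    tp = sum(1 for exp, act in results if exp and act)        # correct fires
    fp = sum(1 for exp, act in results if not exp and act)    # hijacks
    fn = sum(1 for exp, act in results if exp and not act)    # misses
    precision = tp / (tp + fp) if tp + fp else 1.0  # no fires = no hijacks
    recall = tp / (tp + fn) if tp + fn else 1.0
    pass_rate = sum(1 for exp, act in results if exp == act) / len(results)
    return {"precision": precision, "recall": recall, "pass_rate": pass_rate}
```

Low precision means false positives (hijacked conversations); low recall means false negatives (missed prompts).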
4. Regression & Lifecycle (Snapshot, Diff & Version Tracking)
The fourth gate: does this skill stay good over time?
Skills change. Dependencies update. Agent frameworks evolve. What worked last sprint might not work this sprint. Regression testing captures a baseline snapshot of your skill’s evaluation results and compares future runs against it.
A regression is detected when:
- New critical findings appear that weren’t in the baseline
- The audit score drops by more than 5 points
Beyond snapshots, lifecycle management tracks skill versions through SHA-256 fingerprinting. Every file in the skill directory gets hashed, and changes are detected at file-level granularity — added, modified, or deleted. This answers the question: “what changed since the last time I evaluated this skill?”
# Save a version checkpoint
skill-eval lifecycle ./my-skill --save --label v1.0
# Later: detect what changed
skill-eval lifecycle ./my-skill
# → Changes detected: SKILL.md modified, scripts/helper.py added
# Auto-trigger regression when changes are detected
skill-eval lifecycle ./my-skill --auto-regression
This turns evaluation from a one-time check into a continuous quality gate — with full version history so you can trace exactly when and what changed.
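The fingerprinting itself amounts to hashing every file and diffing manifests. A minimal sketch of the idea (not the tool's actual code):

```python
import hashlib
from pathlib import Path

def fingerprint(skill_dir):
    """Map each file's relative path to its SHA-256 hex digest."""
    root = Path(skill_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def diff_manifests(old, new):
    """File-level change detection between two saved fingerprints."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    modified = sorted(f for f in old.keys() & new.keys() if old[f] != new[f])
    return {"added": added, "modified": modified, "deleted": deleted}
```

Because the manifest is keyed by relative path, any rename shows up as a delete plus an add, and any content change shows up as a modification, regardless of timestamps.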
The Unified Grade
Three dimensions combine into a single weighted score. The fourth — regression and lifecycle — acts as an independent pass/fail gate that can block deployment regardless of score:
| Dimension | Weight | What It Measures |
|---|---|---|
| Audit (Safety) | 40% | Security, structure, permissions |
| Functional (Quality) | 40% | Correctness with-skill vs without-skill |
| Trigger (Reliability) | 20% | Activation precision and recall |
The result is a 0–100 score with an A–F letter grade. If a phase is skipped (e.g., no eval cases defined), its weight redistributes to the remaining phases. Regression and lifecycle checks are tracked separately as pass/fail gates — they don’t affect the weighted score but can block deployment independently.
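The weighting and redistribution logic can be sketched in a few lines (an illustration of the scheme above, not the tool's implementation):

```python
# Weighted combination with redistribution: skipped phases simply aren't
# present in the input, and their weight flows to the phases that ran.
WEIGHTS = {"audit": 0.40, "functional": 0.40, "trigger": 0.20}

def unified_score(phase_scores):
    """phase_scores: dict of phase name -> normalized 0-1 score."""
    ran = {p: WEIGHTS[p] for p in phase_scores}
    total = sum(ran.values())
    return sum(phase_scores[p] * w / total for p, w in ran.items())

# All three phases ran: 0.4*0.96 + 0.4*0.91 + 0.2*0.88 = 0.924
assert abs(unified_score({"audit": 0.96, "functional": 0.91, "trigger": 0.88}) - 0.924) < 1e-9
```

With trigger evaluation skipped, audit and functional each carry 50% of the weight, so a 0.9/0.8 pair yields 0.85.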
Cost Efficiency: The Hidden Fifth Dimension
There’s a dimension that doesn’t appear in the weighted score but matters enormously in practice: cost efficiency. A skill that improves quality by 5% but triples token usage may not be worth deploying. Conversely, a skill that maintains quality while reducing token consumption is a clear operational win.
When running functional evaluation (with-skill vs without-skill), track both the quality delta and the token delta. The combination reveals the true value of a skill:
| Classification | Quality | Cost | Verdict |
|---|---|---|---|
| 🟢 Pareto Better | Same or better | Same or lower | Install — pure upside |
| 🟡 Tradeoff | Better | Higher | Evaluate ROI — is the quality gain worth the cost? |
| 🟠 Cheaper but Weaker | Worse | Lower | Not recommended unless cost is the primary constraint |
| 🔴 Pareto Worse | Same or worse | Same or higher | Do not install — no upside |
Quality remains the gate: if a skill degrades output quality beyond a threshold, it should be flagged regardless of cost savings. But when quality is maintained, cost efficiency becomes the tiebreaker. The best skills improve both — they make agents smarter and more efficient, because well-designed knowledge injection helps the agent reason more directly instead of exploring from first principles.
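The classification table reduces to a small decision function. A sketch, where the deltas are with-skill minus without-skill and the quality_floor threshold is an illustrative assumption of ours:

```python
# Pareto classification of a skill from its quality and cost deltas.
# quality_delta: positive = better output; cost_delta: positive = more tokens.
def classify_skill(quality_delta, cost_delta, quality_floor=-0.05):
    if quality_delta < quality_floor:
        return "flagged"             # quality is the gate, regardless of cost
    if quality_delta >= 0 and cost_delta <= 0:
        return "pareto_better"       # install: pure upside
    if quality_delta > 0 and cost_delta > 0:
        return "tradeoff"            # evaluate ROI
    if quality_delta < 0 and cost_delta < 0:
        return "cheaper_but_weaker"  # only if cost is the primary constraint
    return "pareto_worse"            # do not install
```

The first branch encodes the gate: a large enough quality regression is flagged even when it saves tokens.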
For Skill Users: Evaluating Third-Party Skills
You found a skill on a marketplace. Before you install it, run one command:
skill-eval report /path/to/skill
This produces a unified report across all dimensions:
═══════════════════════════════════════════
Unified Skill Report
═══════════════════════════════════════════
Skill: weather-skill
Overall Grade: A (0.92)
───────────────────────────────────────────
Audit: 96/100 (A) ████████████████░░
Functional: 0.91 (A) ████████████████░░
Trigger: 0.88 (B) ███████████████░░░
───────────────────────────────────────────
Result: PASSED
═══════════════════════════════════════════
Now compare with a suspicious skill:
═══════════════════════════════════════════
Unified Skill Report
═══════════════════════════════════════════
Skill: Bad_Skill
Overall Grade: F (0.12)
───────────────────────────────────────────
Audit: 0/100 (F) ░░░░░░░░░░░░░░░░░░
Functional: 0.35 (F) ██████░░░░░░░░░░░░
Trigger: 0.25 (F) ████░░░░░░░░░░░░░░
───────────────────────────────────────────
Result: FAILED
═══════════════════════════════════════════
🔴 [SEC-001] Secret detected: Generic Password
File: SKILL.md:20
Fix: Remove the secret. Use environment variables
or a secrets manager instead.
🔴 [SEC-004] Unsafe install: curl | sh
File: SKILL.md:29
Fix: Pin dependencies in a requirements file.
Never pipe curl output to shell.
The grade difference tells the story immediately: A vs F. But you can also drill into any dimension.
Reading the Report
- Audit score < 60 (F): Don’t install. Critical security findings present.
- Functional score < 0.5: The skill doesn’t improve agent behavior — or makes it worse.
- Trigger rate < 0.5: The skill fires on the wrong prompts or misses the right ones.
- Overall grade < C: The skill has significant issues in at least one dimension.
Setting Team Policies
For teams managing shared agent configurations, skill-eval’s exit codes enable policy enforcement:
# Exit code 0 = passed, 1 = warnings/regressions, 2 = critical findings
skill-eval audit /path/to/skill --fail-on-warning
# Enforce a minimum grade: check the report output, or parse --format json
skill-eval report /path/to/skill
A simple team policy: “No skill with an overall grade below B gets installed in production.” The --format json flag makes scores parseable by scripts and dashboards.
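A minimal grade gate might look like this in Python. The overall_grade field is the same one the jq-based CI check reads; the ordering logic here is illustrative:

```python
# Hypothetical policy check over a parsed `skill-eval report --format json`
# result. Grades later in GRADE_ORDER are better.
GRADE_ORDER = "FDCBA"

def meets_policy(report, minimum="B"):
    """True if the report's overall grade meets the team minimum."""
    return GRADE_ORDER.index(report["overall_grade"]) >= GRADE_ORDER.index(minimum)

assert meets_policy({"overall_grade": "A"})
assert not meets_policy({"overall_grade": "C"})
```

A dashboard or pre-install hook can call this on every candidate skill and refuse anything below the team's bar.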
For Skill Authors: Proving Your Skill Works
Building a skill that earns an A requires more than avoiding security anti-patterns. You need to prove it works, prove it triggers correctly, and prove it stays working.
Start with Anthropic’s skill-creator
Anthropic’s skill-creator provides a structured workflow for building skills. It includes its own evaluation tooling — run_eval.py for running evals, aggregate_benchmark.py for scoring, and improve_description.py for iterating on your skill description based on eval results.
skill-eval complements skill-creator. Where skill-creator helps you build and iterate on a skill, skill-eval helps you evaluate and gate it — adding security audit, unified grading, regression testing, and CI integration. The data schemas are compatible: skill-eval reads the same evals.json format that skill-creator produces.
The workflow: create with skill-creator → evaluate with skill-eval → iterate → deploy with CI gates.
Scaffold Your Eval Files
Once your skill exists, generate the evaluation structure:
skill-eval init /path/to/your-skill
This creates two files:
- evals/evals.json — functional test cases with prompts, expected outputs, and assertions
- evals/eval_queries.json — trigger queries tagged as positive (should activate) or negative (should not)
The generated templates use your skill’s name and description from its frontmatter. They’re starting points — you’ll want to replace them with meaningful test cases.
Write Meaningful Eval Cases
The default scaffolds are placeholders. Good eval cases test the substance of your skill, not just “does it respond.” For a weather skill:
{
"id": "weather-current-city",
"prompt": "What is the current weather in Seattle?",
"expected_output": "Current weather conditions for Seattle",
"assertions": [
{"type": "contains", "value": "Seattle"},
{"type": "contains", "value": "temperature"},
{"type": "regex", "value": "\\d+°[CF]"},
{"type": "min_lines", "value": 3}
]
}
Assertions can be deterministic (contains, regex, JSON structure checks) or semantic (LLM-graded for subjective quality). See examples/golden-evals/ in the repo for templates and patterns.
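For the deterministic kinds, a checker is only a few lines. A sketch covering the assertion types in the example above (the real tool also handles semantic, LLM-graded assertions):

```python
import re

# Minimal checker for the deterministic assertion shapes shown above.
def check_assertion(output, assertion):
    kind, value = assertion["type"], assertion["value"]
    if kind == "contains":
        return value in output
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "min_lines":
        return len(output.splitlines()) >= value
    raise ValueError(f"unsupported assertion type: {kind}")

sample = "Weather for Seattle\nTemperature: 54°F\nConditions: light rain"
assert all(check_assertion(sample, a) for a in [
    {"type": "contains", "value": "Seattle"},
    {"type": "regex", "value": r"\d+°[CF]"},
    {"type": "min_lines", "value": 3},
])
```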
Run Functional Evaluation
skill-eval functional /path/to/your-skill
This runs each eval case twice — once with the skill installed, once without — and grades both. The comparison reveals whether your skill is actually adding value. If the without-skill baseline scores nearly as high, your skill isn’t contributing enough to justify installation.
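The comparison logic reduces to an average per-case delta. A sketch, where the min_lift threshold is an illustrative assumption:

```python
# With-vs-without comparison: the skill must beat the no-skill baseline
# by a meaningful margin to justify installation.
def skill_lift(with_scores, without_scores):
    """Mean per-case improvement; near zero means the skill adds little."""
    deltas = [w - wo for w, wo in zip(with_scores, without_scores)]
    return sum(deltas) / len(deltas)

def adds_value(with_scores, without_scores, min_lift=0.05):
    return skill_lift(with_scores, without_scores) >= min_lift

# Agent does nearly as well without the skill: not worth installing.
assert not adds_value([0.80, 0.82], [0.79, 0.81])
```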
Run Trigger Evaluation
skill-eval trigger /path/to/your-skill
This tests each query in eval_queries.json and checks whether the skill activates correctly. The pass rate tells you how precise your skill’s trigger conditions are.
Common issues trigger evaluation catches:
- Over-broad triggers: skill activates on generic prompts like “help me with something”
- Under-specific triggers: skill misses prompts that clearly fall within its domain
- Keyword collision: skill triggers on prompts that contain related words but aren’t actually relevant
The Create → Evaluate → Improve Loop
The development workflow looks like this:
1. Build the skill with skill-creator
2. Scaffold evals: skill-eval init your-skill
3. Write meaningful eval cases and trigger queries
4. Audit: skill-eval audit your-skill — fix all critical findings
5. Test: skill-eval report your-skill — verify functional quality and trigger reliability
6. Snapshot: skill-eval snapshot your-skill — save your audit baseline
7. Track: skill-eval lifecycle your-skill --save --label v1.0 — record version fingerprint
8. Iterate: use skill-creator’s improve_description.py to refine, re-run skill-eval report to verify
9. Ship: add CI workflow and merge with confidence
Each iteration should improve your unified grade. The snapshot baseline ensures you never silently regress.
Continuous Quality with CI/CD
Evaluation isn’t a one-time activity. Skills change, dependencies update, and agent frameworks evolve. CI integration turns skill evaluation into a continuous quality gate.
Regression & Lifecycle Gate
After establishing a baseline with skill-eval snapshot and a version checkpoint with skill-eval lifecycle --save, every subsequent change can be checked:
# Detect file-level changes since last version
skill-eval lifecycle /path/to/skill
# Run regression against audit baseline
skill-eval regression /path/to/skill
══════════════════════════════════════════════════════════
Regression Check Report
══════════════════════════════════════════════════════════
Baseline: v1.2.0 (96/A)
Current: 88/B
Delta: -8 points
──────────────────────────────────────────────────────────
Result: ❌ FAILED — Regression detected: 1 new critical
finding, score 96 → 88
══════════════════════════════════════════════════════════
🔴 New findings (1):
[CRITICAL] SEC-003: Shell execution via subprocess
File: SKILL.md:45
✅ Resolved findings (0):
Summary: 1 new | 0 resolved | 4 unchanged
New critical findings or significant score drops fail the gate. The 5-point tolerance for score drops avoids false alarms from minor info-level changes.
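That gate logic is easy to state precisely. A sketch of the decision rule (the field names are illustrative):

```python
# Regression gate: fail on any new critical finding, or on a score drop
# beyond the 5-point tolerance.
def regression_detected(baseline, current, tolerance=5):
    """baseline/current: {"score": int, "criticals": set of finding IDs}."""
    new_criticals = current["criticals"] - baseline["criticals"]
    score_drop = baseline["score"] - current["score"]
    return bool(new_criticals) or score_drop > tolerance

# The failed report above: 96 -> 88 with one new SEC-003 critical.
assert regression_detected(
    {"score": 96, "criticals": set()},
    {"score": 88, "criticals": {"SEC-003"}},
)
```

Note that the two conditions are independent: a single new critical fails the gate even if the overall score barely moves.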
GitHub Actions Integration
Add skill evaluation as a reusable workflow:
# .github/workflows/skill-eval.yml
name: Skill Evaluation
on:
push:
paths: ['skills/**']
pull_request:
paths: ['skills/**']
jobs:
evaluate:
uses: aws-samples/sample-agent-skill-eval/.github/workflows/skill-eval.yml@main
with:
skill_path: "path/to/your-skill"
run_functional: true
run_trigger: true
The reusable workflow outputs passed, grade, and score, which you can use in downstream jobs. Exit codes make integration straightforward: 0 means passed, 1 means warnings or regressions, 2 means critical findings.
Team Governance
Combine the tools into an automated quality gate:
# In your PR workflow
- name: Evaluate skill
run: skill-eval report skills/my-skill --format json > eval-result.json
- name: Check regression
run: skill-eval regression skills/my-skill
- name: Enforce minimum grade
run: |
grade=$(jq -r '.overall_grade' eval-result.json)
if [[ "$grade" == "F" || "$grade" == "D" ]]; then
echo "Skill grade $grade is below minimum. Blocking merge."
exit 1
fi
This gives teams a repeatable, automated standard: no skill merges to production without passing evaluation.
Real-World Proof: The PR Naming Convention Story
Theory is nice. Does it actually work? We tested skill-eval end-to-end using a realistic scenario: building a company-internal PR naming convention validator.
Every company has its own PR rules — [TEAM-42] feat: add auth or platform/TEAM-42-add-auth. These are exactly the kind of domain-specific knowledge that AI models don’t know, making them an ideal test case.
We built three versions:
| Version | How It Was Built | Score | Key Issues |
|---|---|---|---|
| v1 | Developer wrote it themselves | 39/F | Hardcoded token, eval(), shell=True |
| v2 | Rebuilt using Anthropic Skill Creator | 98/A | Pure regex, clean structure |
| v3 | Someone added a feature | 61/D | pickle.load + shell=True regression |
The most compelling result was functional evaluation. Without the skill, the agent applied generic conventions and approved titles that violated company rules. With the skill, it correctly enforced the format — +17% to +33% functional improvement.
The v2→v3 transition shows regression detection in action: security degraded while features appeared to improve. Without automated evaluation, this would slip through code review.
Full demo: examples/lifecycle-demo/ in the repository.
How We Know It Works: Meta-Evaluation
We validated skill-eval against ground truth across all dimensions.
| Dimension | Accuracy | Notes |
|---|---|---|
| Audit accuracy | 100% | Deterministic (regex, AST, YAML). 582 unit tests. |
| Functional grading | 100% / ~90% | 100% for deterministic assertions. ~90% for LLM-judged. |
| Trigger specificity | 100% | Zero false positives. |
| Trigger recall | 25–100% | Varies. Known framework limitation for CLI-based skills. |
Full methodology in examples/self-eval/.
The Bigger Picture
This is eval-driven development for agents.
The concept isn’t new. Test-driven development transformed how we write code: define expected behavior first, then implement until the tests pass. Eval-driven development applies the same discipline to agent capabilities: define what “good” looks like across safety, quality, reliability, and regression — then build until you meet that bar.
The tooling ecosystem is maturing. Anthropic’s skill-creator helps you build skills with structure and best practices. skill-eval helps you evaluate and gate them with data. Together they form a complete lifecycle:
skill-creator (build) → skill-eval (evaluate) → iterate → deploy with CI → lifecycle (monitor)
The agent ecosystem is where the web ecosystem was fifteen years ago: moving fast, building trust gradually, learning from incidents. The difference is we don’t have to repeat the same mistakes. We can build evaluation infrastructure now, before the first major supply chain incident in an agent marketplace.
Get Started
- Repository: github.com/aws-samples/sample-agent-skill-eval
- Tutorial: See docs/tutorial.md for guided walkthroughs (consumer and author paths)
- Core Concepts: See docs/concepts.md for architecture, scoring, and rule details
- Golden Evals: See examples/golden-evals/ for eval case templates
- Lifecycle Demo: See examples/lifecycle-demo/ for the PR naming convention end-to-end test
- Meta-Evaluation: See examples/self-eval/ for validation methodology
- Anthropic skill-creator: github.com/anthropics/skills/tree/main/skills/skill-creator
- Agent Skills Spec: agentskills.io
Install it. Evaluate your skills. Get data, not vibes.
pip install -e .
skill-eval report /path/to/skill
Related Posts
- Agent Skills: The Quiet Revolution
- Deep Dive: Agent Skills & Skill Evaluation
- Enhance, Don’t Replace: Building Domain Expert Agents
- MCP vs Skills