⚠️ Disclaimer: This post reflects personal reading notes and opinions on publicly available research. It is not an official position of my employer. All benchmark numbers are cited verbatim from the original papers on arXiv; any misreading is mine. Always consult the original works before drawing production conclusions.

Notes on Agent Skill Evolution

📅 2026-04-18 · 📖 ~10 min read · Agent Skills · Research Notes · LLM Agents
Reading notes on Trace2Skill, SkillClaw, CoEvoSkills, SkillRL, Skill-SD, and Hermes Agent Self-Evolution.

Everyone is talking about agents that improve themselves. The pitch sounds clean: let the model write its own prompts, discover its own tools, and you walk away. The reality is messier, and reading through six recent works side by side makes the shape of the mess a lot clearer.

Between February and April 2026, six groups published work that all claim to let an agent turn past experience into a reusable capability. They look similar in slogan and very different in mechanism. This post pulls them apart along four sub-questions, and then asks the only question that matters for production: which ideas can we actually ship?

The main question, broken into four

All six works are answering the same top-level question:

How does an agent convert trajectories into reusable, composable, trustworthy skills?

That question has four sub-problems, and every system in this survey makes different bets on each one.

  1. Extraction. Where does the raw signal come from? Successful trajectories, failed ones, aggregated multi-user traffic, RL rollouts, or student–teacher rollouts?
  2. Organization. Once you have a candidate skill, how is it represented? A flat skill bank, a hierarchy, a Markdown document with metadata, a latent vector baked into weights?
  3. Verification. How do you decide a skill is worth keeping? Programmatic guardrails, LLM-as-judge, benchmark pass rate, co-evolving adversary, or human PR review?
  4. Update cadence. When does the library change? On every trajectory, nightly batches, co-evolution rounds, online RL, or only on a pull request?
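To keep the comparison honest while reading, I sketched the four axes as a record type. This is my note-taking scaffold, not anything from the papers; every field name and enum value below is mine.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Extraction(Enum):
    SUCCESS_TRACES = auto()
    FAILURE_TRACES = auto()
    MULTI_USER_TRAFFIC = auto()
    RL_ROLLOUTS = auto()
    TEACHER_STUDENT = auto()

@dataclass
class SkillEntry:
    """One row in a skill library, tagged with the four axes."""
    name: str
    body: str                     # the skill text itself, e.g. SKILL.md contents
    sources: list[Extraction]     # axis 1: where the raw signal came from
    organization: str             # axis 2: "flat", "hierarchy", "markdown+meta", "weights"
    verification: str             # axis 3: how it earned its place in the library
    cadence: str                  # axis 4: "per-trajectory", "nightly", "per-round", "on-PR"
    version: int = 1
    lineage: list[str] = field(default_factory=list)  # prior versions, for rollback
```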

Looking at the six works through this lens is more useful than grouping by lab or method.

The six works in one paragraph each

Trace2Skill (Alibaba Qwen, arXiv 2603.25158). Offline, training-free. Collect N trajectories on an evolving set, split into successes and failures, and ask an LLM to consolidate the pool into patches against an initial skill. The headline result is a 57.65 percentage-point jump on WikiTableQuestions when a 35B-parameter authoring model generated skills consumed by a 122B agent. That particular pairing is strange and interesting: hand-written spreadsheet skills actually hurt the 35B agent by 9 pp, which is a strong argument against assuming expert-authored skills generalize.
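A rough sketch of the loop as I read it. Every function here (`run_agent`, `llm_consolidate`, `apply_patch`, `passes_guardrails`) is a hypothetical stand-in; the paper's actual patch format and merge logic are richer than this.

```python
def evolve_skill(initial_skill: str, tasks: list[str], n: int = 8) -> str:
    """Trace2Skill-style offline consolidation, as I understand it:
    roll out, split by outcome, consolidate into patches, apply the
    ones that survive programmatic guardrails."""
    successes, failures = [], []
    for task in tasks:
        for _ in range(n):
            trace = run_agent(task, skill=initial_skill)  # hypothetical rollout
            (successes if trace.passed else failures).append(trace)

    # Per the ablation, the failure pool carries most of the signal.
    patches = llm_consolidate(initial_skill, successes, failures)  # hypothetical

    skill = initial_skill
    for patch in patches:
        candidate = apply_patch(skill, patch)  # hypothetical
        if passes_guardrails(candidate):       # conflict + format checks, no judge
            skill = candidate
    return skill
```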

SkillClaw (DreamX/AMAP, arXiv 2604.08377). Production-style continuous evolution. It aggregates trajectories across many users and many agents, runs a nightly analyzer, validates candidate skills with an LLM judge using monotonic accept rules, and deploys with version management. On their WildClawBench (60 tasks × 6 domains, 8 concurrent users over six days), Qwen3-Max went from 11.57% to 21.80% on Creative Synthesis, an 88% relative gain. The engineering story is the real contribution: this is the first paper that treats skill curation as a 24/7 pipeline rather than a one-shot batch.
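The phrase "monotonic accept" is doing a lot of work here. My reading is that a candidate replaces the incumbent only if the judge rates it no worse on every criterion; the criteria list and the hypothetical `judge_scores` call below are mine, not the paper's.

```python
CRITERIA = ["correctness", "generality", "safety", "brevity"]  # my guess at the axes

def monotonic_accept(incumbent: str, candidate: str) -> bool:
    """Accept the candidate skill only if the LLM judge scores it at
    least as high as the incumbent on *all* criteria, and strictly
    higher on at least one. This makes the library monotone under the
    judge's (imperfect) metric: no criterion can silently regress."""
    old = judge_scores(incumbent, CRITERIA)  # hypothetical: criterion -> float
    new = judge_scores(candidate, CRITERIA)
    no_regression = all(new[c] >= old[c] for c in CRITERIA)
    some_gain = any(new[c] > old[c] for c in CRITERIA)
    return no_regression and some_gain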

CoEvoSkills (UIC + MBZUAI, arXiv 2604.01687). Two agents, no ground truth. A Skill Generator proposes skills; a Surrogate Verifier writes tests that the generator must pass; the verifier ratchets difficulty whenever the generator passes easily. On their SkillsBench (87 tasks across 11 domains, Claude Opus 4.6 with Claude-Code), no-skill baseline scores 30.6%, human-curated skills get 53.5%, and co-evolved skills reach 71.1%. The headline number to carry around is this: on the Natural Science domain, human-curated skills degrade performance. That is the strongest data point yet that expert intuition about what a skill should look like can be systematically wrong.
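The loop, as I reconstruct it. The helper functions and the exact ratchet condition are my paraphrase; what the paper does guarantee is the isolation: the verifier writes tests from the task alone and never sees the generator's skill text.

```python
def coevolve(task: str, rounds: int = 5, easy_bar: float = 0.9) -> str:
    """CoEvoSkills-style generator/verifier game, as I read the paper.
    The verifier never sees the generator's output -- it only writes
    tests for the task -- which is the information-isolation pattern."""
    skill, feedback, difficulty = "", "", 1
    for _ in range(rounds):
        skill = generator_propose(task, prior=skill, feedback=feedback)  # hypothetical
        tests = verifier_write_tests(task, difficulty)                   # hypothetical
        pass_rate, feedback = run_tests(skill, tests)                    # hypothetical
        if pass_rate >= easy_bar:
            difficulty += 1  # ratchet: the generator passed too easily
    return skill
```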

SkillRL (UNC, arXiv 2602.08234). RL-native. Treat skills as abstract high-level actions and train a two-level policy with SFT plus GRPO. Failed validations become training signal. On a mix of ALFWorld, WebShop, and seven search-heavy tasks, it beats the strong baseline by 15.3 percentage points absolute. The architectural bet here is important: skills are not documents, they are policy-visible actions that the policy learns to invoke.
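The architectural point is easier to see in code than in prose. Below is my paraphrase of what "skills as policy-visible actions" means at the interface level; the `policy.act`, `unroll`, and `env` APIs are invented for illustration, and the actual training loop (SFT plus GRPO) is not shown.

```python
def step(policy, obs, skill_bank: dict, env):
    """SkillRL's bet, as I understand it: the high-level policy chooses
    among abstract skill actions *and* primitive actions; invoking a
    skill expands into a sequence of primitives. The skill is an
    action in the policy's action space, not a document in the prompt."""
    total_reward = 0.0
    action = policy.act(obs, actions=env.primitives + list(skill_bank))  # hypothetical
    plan = skill_bank[action].unroll(obs) if action in skill_bank else [action]
    for primitive in plan:
        obs, reward, done = env.step(primitive)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```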

Skill-SD (vivo + UCAS, arXiv 2604.10674). Skill distillation. A teacher agent externalizes (success, mistake, workflow) triples; a student agent internalizes them via UCB-weighted distillation directly into weights. Qwen3-4B goes from 50.9% to 64.9% on AppWorld and from 51.6% to 62.5% on Sokoban with a tiny distillation coefficient (λ=0.001). This one is the most ambitious: at inference time, there is no skill library at all. The skill lives in the parameters.
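The UCB weighting details matter here and the paper only says "UCB-weighted," so the sketch below is my guess at the shape of the mechanism: standard UCB1 applied to choosing which teacher lessons to distill, mixed into the loss with the reported tiny coefficient.

```python
import math

def ucb_weight(mean_utility: float, times_used: int, total_uses: int,
               c: float = 1.0) -> float:
    """Standard UCB1 score: exploit lessons that helped before, but keep
    exploring rarely-used ones. Whether Skill-SD uses exactly this form
    is my assumption, not a claim about the paper."""
    return mean_utility + c * math.sqrt(math.log(total_uses) / max(times_used, 1))

# Loss mixing with the reported coefficient: the distillation term nudges
# the weights rather than dominating the task loss.
LAMBDA = 0.001
# total_loss = task_loss + LAMBDA * sum(ucb_weight(...) * lesson_loss_i per lesson)
```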

Hermes Agent Self-Evolution (Nous Research, OSS). This is the outlier: it is not a paper, it is a production-grade open-source system. What it contributes to the skill-evolution conversation is a stance: a skill is a versioned Markdown document, evolution runs in a separate repository that can only open pull requests against the agent repository, and every change goes through test suites, size checks, caching checks, and (for risky categories) human review. Hermes uses DSPy with GEPA under the hood for the actual optimization, but the implementation is less interesting than the operational model. The implicit claim is that the hard part of skill evolution is not the optimizer, it is the review gate.
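For a flavor of what a Tier 1 gate can look like: a small check script CI runs on every PR touching a SKILL.md. This is not Hermes' actual check code, just the kind of deterministic gate the operational model implies; the size budget and required section names are mine.

```python
import pathlib
import sys

MAX_BYTES = 16_384                                # arbitrary budget; pick your own
REQUIRED = ["# ", "## When to use", "## Steps"]   # hypothetical required sections

def check_skill(path: pathlib.Path) -> list[str]:
    """Cheap, deterministic checks only; anything judgment-shaped
    (semantic review) stays with a human or a separately-validated judge."""
    errors = []
    text = path.read_text(encoding="utf-8")
    if len(text.encode()) > MAX_BYTES:
        errors.append(f"{path}: exceeds size budget ({MAX_BYTES} bytes)")
    for marker in REQUIRED:
        if marker not in text:
            errors.append(f"{path}: missing section marker {marker!r}")
    return errors

if __name__ == "__main__":
    problems = [e for p in sys.argv[1:] for e in check_skill(pathlib.Path(p))]
    print("\n".join(problems) or "all skill files pass")
    sys.exit(1 if problems else 0)
```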

Scoring them on the four sub-questions

| System | Extraction | Organization | Verification | Update cadence |
| --- | --- | --- | --- | --- |
| Trace2Skill | Success + failure trace pool | Hierarchical patch merge on an initial skill | Programmatic guardrails (conflict, format) | One-shot offline |
| SkillClaw | Multi-user real traffic | Structured Markdown with metadata | LLM-as-judge with monotonic accept | Nightly batch |
| CoEvoSkills | Generator rollouts in env | Multi-file packages (SKILL.md + scripts + refs) | Co-evolving surrogate verifier + binary oracle | Round-by-round co-evolution |
| SkillRL | RL rollouts, especially failed validations | Two-level skill bank (general + task-specific) | End-task reward + RL advantage | Online during training |
| Skill-SD | Student on-policy rollouts + teacher annotations | Latent in model weights (no library) | Downstream pass rate | Continuous distillation |
| Hermes Self-Evo | SessionDB mining + execution traces | Versioned Markdown skill documents | Test suite + size + caching + semantic + human PR | On pull request |

Reading this table, three clusters emerge that are more useful than the paper-by-paper summary:

  • Training-free, document-based. Trace2Skill, SkillClaw, CoEvoSkills, Hermes. All four treat the skill as a readable artifact. They differ mostly on cadence and verification.
  • Weight-modifying. SkillRL and Skill-SD. The skill is either a learned action abstraction (SkillRL) or an invisible parameter update (Skill-SD). No library to inspect.
  • Adversarial, ground-truth-free. Only CoEvoSkills escapes the need for ground-truth test content by having two agents play opposite sides. This is the most elegant idea in the batch.

Five things I took away

One. Failure trajectories are worth more than success trajectories. Trace2Skill, SkillRL, and Hermes all independently converge on the same observation: the richest signal is a run that fails in a structured way. A successful rollout tells you what worked. A failed one tells you why a class of approaches does not. Trace2Skill’s ablation is the clearest demonstration: removing failure traces from the patch generator costs more performance than removing successes.

Two. Human-curated skills can actively hurt. The CoEvoSkills Natural Science result, where human-curated skills degrade performance outright, is the most unsettling finding in the batch, and the aggregate gap (53.5% human-curated versus 71.1% co-evolved) points the same way. Not because experts cannot write good skills, but because their mental model of what a helpful skill looks like diverges from what the agent actually uses. This has an obvious implication for our own work: do not assume that a carefully hand-written SKILL.md is automatically better than one induced from traces.

Three. Cross-model transfer is a real phenomenon. Trace2Skill’s small-model-authors-for-big-model finding (35B writes skills, 122B consumes them, +57.65 pp on WikiTQ) should not be possible under a purely “the skill encodes model-specific prompt tricks” hypothesis. It suggests that well-extracted skills are genuinely about the task and not the model. If that generalizes, skill libraries have a longer useful life than tool definitions, which tend to be tightly coupled to the caller.

Four. LLM-as-judge is everyone’s weak link. SkillClaw’s monotonic accept rule, CoEvoSkills’ surrogate verifier, and any downstream verification in Hermes Tier 2+ all eventually bottom out in a language model deciding whether a skill is good. That is the same failure mode that caused our own Sprint 1 “0% ASR” false positive a few days ago. Reading these papers did not make me more confident in LLM-as-judge. It made me more determined to hold it to an agreement-with-golden-set threshold before believing any of its numbers.
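Concretely, the gate I want before trusting any judge: measure its agreement against a small human-labeled golden set and refuse to use its verdicts below a threshold. A minimal version follows; the 0.9 bar is a placeholder, not a recommendation.

```python
def judge_is_trustworthy(judge_verdicts: list[bool],
                         golden_labels: list[bool],
                         min_agreement: float = 0.9) -> bool:
    """Fraction of items where the LLM judge matches human labels.
    Raw agreement is the bluntest possible metric -- on skewed golden
    sets you would want per-class rates or Cohen's kappa instead."""
    assert len(judge_verdicts) == len(golden_labels) > 0
    agree = sum(j == g for j, g in zip(judge_verdicts, golden_labels))
    return agree / len(golden_labels) >= min_agreement
```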

Five. The production gap is operational, not algorithmic. This is what Hermes makes obvious. Four of the six works stop at “here is our benchmark score.” Hermes starts at “here is our PR workflow.” The difference is not that Hermes has a better algorithm. It is that Hermes treats every skill change as a code change, with review, rollback, and lineage. If you are going to deploy any of this, you are going to end up rebuilding most of what Hermes has already built.

What we can actually use

This is the question that matters. Our family runs a harness system on Claude Opus 4.7 and Sonnet 4.6 through Bedrock. We do not train weights, we do not have a multi-user-traffic stream like SkillClaw, and we do not have infrastructure to run N parallel rollouts and merge them like Trace2Skill. So which ideas are shippable this week?

Ship immediately.

  • CoEvoSkills’ information isolation pattern. The Generator-Verifier split, where the verifier never sees the generator’s code, is cheap to implement inside our existing sub-agent orchestration and directly attacks the “LLM-judge approves its own work” failure mode. One day of work, and it improves the reliability of any skill we induce from here on.
  • SkillRL’s failure-lesson template. A four-field structure (what was attempted, what failed, why it failed, what to try next) can be applied to our existing error tracker. It does not require RL. It is a prompting convention (a minimal sketch follows this list). Two hours to retrofit, immediate benefit to our Sprint retros.
  • Hermes’ Tier 1 PR gate. Enforce that any automated change to a SKILL.md file must go through a pull request with a passing test suite. This is a GitHub Actions config, not an algorithm. Half a day of work.
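The four-field template from the second item above, as a concrete structure. The fields mirror the paper's description; the class and the prompt rendering are our retrofit, not SkillRL's code.

```python
from dataclasses import dataclass

@dataclass
class FailureLesson:
    """SkillRL-style failure lesson, retrofitted onto our error tracker.
    No RL needed: this is a prompting convention that makes failed runs
    queryable and reusable as in-context warnings."""
    attempted: str   # what was attempted
    failed: str      # what failed, observably
    why: str         # why it failed (root cause, best guess)
    next_try: str    # what to try next

    def as_prompt_block(self) -> str:
        return (f"Previously attempted: {self.attempted}\n"
                f"Observed failure: {self.failed}\n"
                f"Root cause: {self.why}\n"
                f"Next approach: {self.next_try}")
```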

Investigate before committing.

  • SkillClaw’s nightly cross-session aggregator. The architecture is right, but we would need to stand up trajectory capture across every family member’s sessions first. Promising, but blocked on an upstream dependency.
  • CoEvoSkills’ round-by-round difficulty ramp. The generator-verifier idea is cheap; the ratcheting protocol is not. Running multi-round adversarial games on production models has a cost profile we have not budgeted for.

Do not prioritize.

  • SkillRL’s RL training and Skill-SD’s distillation into weights. We do not train model weights. These works are interesting as a view of what is possible, not as something to port.
  • Trace2Skill’s parallel rollout and hierarchical merge. Elegant, but assumes infrastructure we do not have. Revisit if we ever stand up a proper rollout fleet.

Where this leaves the original question

The framing at the top of this post was: can we let the agent write its own prompts and walk away? After reading six recent works carefully, my answer is closer to “not this year, but the pieces are landing.”

CoEvoSkills going from 30.6% no-skill to 71.1% co-evolved, crossing the human-curated line along the way, is a real data point and it cannot be waved away. At the same time, every single work in this survey evaluates on benchmarks with narrow scope and deterministic verifiers. Equating SkillsBench’s 71.1% pass rate with production-grade generalization is the kind of move that gets you a polished demo and a surprise outage.

The honest stance right now: skill evolution is worth adopting as an assistive capability inside an existing workflow with human review. It is not ready to own any part of a production pipeline end to end. Ship the cheap ideas (generator-verifier information isolation, failure-lesson templates, Tier 1 PR gates), keep reading the benchmarks critically, and do not outsource the review step to the same kind of model that wrote the skill in the first place.


Cross-references to the original works: Trace2Skill (arXiv 2603.25158), SkillClaw (arXiv 2604.08377, code: github.com/AMAP-ML/SkillClaw), CoEvoSkills (arXiv 2604.01687), SkillRL (arXiv 2602.08234, code: github.com/aiming-lab/SkillRL), Skill-SD (arXiv 2604.10674), Hermes Agent Self-Evolution (github.com/NousResearch/hermes-agent-self-evolution).

📝 Note: Reading notes compiled April 18, 2026. All benchmark numbers cited verbatim from the original papers. Corrections welcome.