SKILL0 Puts Skills in the Weights
⚠️ Disclaimer: This article represents my personal analysis and interpretation of the SKILL0 research paper. The opinions expressed here are my own and do not represent the views of my employer. Always refer to the original paper for authoritative details.
Have you ever watched someone cook by reading every single step of a recipe, every single time, even after making the same dish a hundred times? That is essentially how most AI agent systems work today.
When an agent needs to perform a task (navigating a website, searching for information, or executing a multi-step workflow), it retrieves a skill description from a library, injects it into its context window, and follows the instructions. This approach, known as inference-time skill augmentation, has become the standard architecture for extending agent capabilities. And it works. But it has a fundamental problem that the research community has largely accepted rather than solved.
A new paper from Zhejiang University, Meituan, and Tsinghua University, titled “SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization” (arXiv:2604.02268), takes a different approach entirely. Instead of asking how to better retrieve and inject skills, it asks: can skills be internalized into model parameters, making retrieval unnecessary at inference time?
The Problem with “Read and Follow” Agents
To understand why SKILL0 matters, let’s look at what happens under the hood of current skill-augmented agents.
How Skill Augmentation Works Today
Modern agent frameworks organize reusable knowledge into skill libraries: structured collections of procedural instructions, tool invocation patterns, and domain-specific strategies stored as markdown files or similar formats. When the agent encounters a task, a retrieval system selects relevant skills and injects them into the model’s context alongside the task description and interaction history.
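To make that concrete, here is a minimal sketch of the retrieve-and-inject loop. Everything in it (the `SkillLibrary` class, the toy `embed` stand-in) is illustrative of the general pattern, not code from the paper or any particular framework.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class SkillLibrary:
    """Skill descriptions (e.g., loaded from markdown files) plus embeddings."""

    def __init__(self, skills: dict[str, str]):
        self.texts = list(skills.values())
        self.vecs = np.stack([embed(t) for t in self.texts])

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        # Cosine similarity over unit vectors: an imperfect proxy for
        # procedural relevance, which is exactly the retrieval-noise problem.
        scores = self.vecs @ embed(task)
        return [self.texts[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(task: str, library: SkillLibrary) -> str:
    # Retrieved skills are injected verbatim, paying their token cost
    # on every step of every episode.
    skills = "\n\n".join(library.retrieve(task))
    return f"## Skills\n{skills}\n\n## Task\n{task}"
```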
This retrieve-and-inject pattern creates three compounding problems:
1. Retrieval noise. Semantic similarity is an imperfect proxy for task relevance. The retriever sometimes pulls in skills that are topically related but procedurally wrong, and the model dutifully follows them anyway. When the wrong skill gets injected, it does not just fail to help; it actively corrupts the agent’s reasoning by introducing irrelevant or misleading procedural steps.
2. Token overhead. Each skill consumes tokens. In multi-turn interactions, this overhead compounds: the agent carries skill descriptions alongside its growing interaction history, steadily consuming its context budget. The paper notes that skill-augmented methods like SkillRL cost 2.2k+ tokens per step, compared to SKILL0’s under 0.5k. That is more than a 4x difference in context efficiency.
3. No genuine learning. This is the most fundamental issue. A model that follows skill descriptions in its prompt is executing skills, not learning them. The competence resides in the context, not in the model. Remove the skill descriptions, and the capability vanishes completely. The model has processed thousands of task completions guided by the same skills, yet it has not internalized any of that experience.
The Human Analogy
The paper draws an apt comparison to how humans learn. When you first learn to cook a new dish, you follow the recipe step by step. But after making it twenty times, something shifts. You no longer need the recipe. The knowledge has moved from external reference to internal competence. You have internalized the skill.
Psychologist John Anderson described this as the transition from the declarative stage (following explicit instructions) to the procedural stage (executing from memory). Current skill-augmented agents are permanently stuck in the declarative stage. SKILL0 proposes a training methodology to push them through to the procedural stage.
How SKILL0 Works
SKILL0 introduces In-Context Reinforcement Learning (ICRL), a framework that uses skills as training scaffolding that is progressively removed. The key insight: provide full skill guidance during training, but systematically withdraw it so that the model must learn to operate without external help.
The Three-Part Architecture
1. Relevance-Driven Skill Grouping
Before training begins, SKILL0 organizes the skill library and creates a validation infrastructure. Skills are structured in a directory hierarchy (e.g., skills/search/entity_attribute_lookup.md), and each skill file is associated with a dedicated validation sub-task. This offline grouping ensures there is a principled way to measure whether each specific skill is still helping the current policy.
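As a rough sketch, that offline pairing might be represented like this. The `SkillGroup` structure and keying validation tasks by file stem are my assumptions; the paper specifies only that each skill file is paired with a dedicated validation sub-task.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkillGroup:
    skill_path: Path             # e.g., skills/search/entity_attribute_lookup.md
    validation_tasks: list[str]  # sub-tasks used to A/B test this skill later

def group_skills(root: Path, task_index: dict[str, list[str]]) -> list[SkillGroup]:
    # Walk the skill hierarchy and pair each skill file with its
    # dedicated validation sub-tasks (here keyed by the file's stem).
    return [
        SkillGroup(path, task_index.get(path.stem, []))
        for path in sorted(root.rglob("*.md"))
    ]
```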
2. In-Context Reinforcement Learning
During training rollouts, the agent receives task instructions along with selected skills and its interaction history. But here is where SKILL0 introduces a clever compression trick: instead of feeding all this text directly into the model’s context window (which would be expensive), it renders the text (both interaction history and skill descriptions) as an RGB image, then passes that image through a vision encoder to produce compressed visual representations.
This text → image → vision token pipeline dramatically reduces token overhead. Think of it as taking a full page of instructions and photographing it. The visual representation preserves the structural information while consuming far fewer tokens than the raw text.
What makes this even more interesting is that the compression is self-adaptive. At each step, the model does not just generate a task action. It also generates a compression ratio c_t that controls how aggressively the context is compressed. The model learns to decide “I need high detail right now” versus “I can work with a compressed overview.” This means the agent actively manages its own context budget, allocating more attention to steps that need it and compressing routine steps more aggressively.
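A toy illustration of the rendering step, using PIL, is below. The paper describes the renderer and vision encoder at a higher level, so everything here (the font, the layout, the `compression` parameter standing in for the model-chosen c_t) is my own simplification of the text → image → vision-token idea.

```python
from PIL import Image, ImageDraw

def render_context(text: str, compression: float, width: int = 768) -> Image.Image:
    """Render interaction history + skill text as an RGB image, then downscale."""
    lines = text.splitlines() or [""]
    img = Image.new("RGB", (width, 14 * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + 14 * i), line, fill="black")  # default bitmap font
    # Downscale by the chosen ratio; a ViT-style encoder would then turn the
    # result into roughly (height * width / patch_area) visual tokens, so a
    # smaller image directly means a smaller context footprint.
    w, h = img.size
    return img.resize((max(1, int(w * compression)), max(1, int(h * compression))))
```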
The training objective combines two signals:
- Task reward: Did the agent complete the task correctly?
- Compression reward: How efficiently did it manage its context? This uses a logarithmic formulation, ln(c_t), reflecting diminishing returns from higher compression: going from 50% to 25% compression is worth more than going from 25% to 12.5%.
This joint optimization pushes the model toward both competent and efficient behavior.
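In code, the per-step objective might look like the sketch below. Only the ln(c_t) shape comes from the paper's description; the additive combination and the weight `lam` are assumptions of mine.

```python
import math

def joint_reward(task_reward: float, c_t: float, lam: float = 0.1) -> float:
    """Combine task success with context efficiency for one step.

    Assumes c_t >= 1 is the compression factor the policy chose at step t,
    so more aggressive compression earns a larger, but logarithmically
    saturating, bonus on top of the task reward.
    """
    assert c_t > 0, "compression ratio must be positive"
    return task_reward + lam * math.log(c_t)
```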
3. Dynamic Curriculum Learning
This is the core innovation. Training is divided into stages with a linearly decaying skill budget. At each stage, fewer skills are available. The selection of which skills to keep uses a helpfulness metric: for each skill file, SKILL0 compares the agent’s performance with and without that skill on matched validation tasks, essentially running an A/B test for every skill against the current policy. Only skills where the “with skill” group measurably outperforms the “without skill” group are retained. Once a skill’s A/B test shows no significant difference, it means the policy has already absorbed that knowledge, and the skill is dropped from the budget.
The algorithm is elegantly simple:
```
Stage 1: All N skills available                → Train with RL
Stage 2: Top M skills (by helpfulness) remain  → Continue training
Stage 3: Budget = 0, no skills                 → Agent operates autonomously
```
Skills are not removed on a fixed schedule. A skill is dropped only when the current policy demonstrates it can handle the corresponding tasks without it. This adaptive approach prevents the abrupt distribution shifts that would destabilize training if skills were removed arbitrarily.
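Here is a compressed sketch of that selection loop. It assumes a `run_episode` helper that returns 1 on success, and it uses a plain mean difference where the paper would apply a proper significance test; both are my simplifications.

```python
def helpfulness(policy, skill, tasks, run_episode) -> float:
    """A/B test one skill: mean success with it minus mean success without it."""
    with_skill = sum(run_episode(policy, t, skills=[skill]) for t in tasks)
    without = sum(run_episode(policy, t, skills=[]) for t in tasks)
    return (with_skill - without) / len(tasks)

def select_skills(policy, skills, tasks_for, run_episode, budget: int) -> list:
    """Keep only the `budget` most helpful skills for the next stage.

    With a decaying budget like [6, 3, 0], the final stage trains the
    agent entirely skill-free.
    """
    scored = sorted(
        skills,
        key=lambda s: helpfulness(policy, s, tasks_for[s], run_episode),
        reverse=True,
    )
    return scored[:budget]
```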
The Helpfulness Dynamics
One of the paper’s most revealing findings is the characteristic trajectory of skill helpfulness over training. For each sub-task, helpfulness follows a rise-then-fall pattern:
- Early training: Helpfulness is low. The policy has not yet learned to use the skill prompts effectively.
- Mid training: Helpfulness peaks. The policy is actively grounding its actions in the skill context.
- Late training: Helpfulness drops toward zero. The knowledge has been internalized into the model’s parameters, making the external skill redundant.
This trajectory validates the core thesis: skills serve as effective yet transient scaffolding during policy optimization.
Results: Strong Performance, Tiny Context
The experimental results are compelling across two distinct evaluation environments.
ALFWorld (Embodied Task Completion)
ALFWorld is a text-based game simulating household activities: pick and place objects, clean items, heat food, and so on. It requires multi-step planning and tool interaction across six task categories.
| Method | Success Rate | Tokens/Step |
|---|---|---|
| Zero-Shot (3B) | 48.4% | 0.28k |
| Few-Shot Skills (3B) | 58.2% | 2.28k |
| AgentOCR (3B) | 78.2% | 0.36k |
| SkillRL (3B) | 86.3%† | 2.21k |
| SKILL0 (3B) | 87.9% | 0.38k |
| SKILL0 (7B) | 89.8% | 0.43k |
† Evaluated with skill augmentation at inference time
The headline result: SKILL0 (3B) achieves 87.9% without any skills at inference, a 9.7-point improvement over AgentOCR (78.2%), the standard RL baseline with visual context compression. It also edges past SkillRL (86.3%), which requires skill augmentation at inference, while using 5.8x fewer tokens per step. The AgentOCR comparison is the more meaningful one: both methods operate skill-free at inference, making it an apples-to-apples measurement of what skill internalization adds over standard RL.
Search-QA (Information Retrieval)
On search-augmented question answering across seven benchmarks (including NQ, TriviaQA, HotpotQA, and multi-hop datasets), SKILL0 showed similar strengths:
- SKILL0 (3B): 40.8% average accuracy, 0.18k tokens/step
- SKILL0 (7B): 44.4% average accuracy, 0.22k tokens/step
- These results surpass Search-R1 (38.5%) and ZeroSearch (39.1%); the 3B model is competitive with EvolveR (43.1%), and the 7B model edges past it
Ablation: The Curriculum Matters
The ablation studies reveal how critical the adaptive curriculum is:
- Fixed full skills throughout: When trained with all skills always present but then tested without skills at inference, performance drops by 12.3%. This training-inference gap reveals severe skill dependency: the model learned to rely on the crutch rather than internalize the knowledge.
- Static budget [3,3,3]: A constant low skill budget throughout training limits early exploration, leading to unstable learning and lower peaks.
- Random skill selection (without helpfulness ranking): Performance collapses by 13.7%. Retaining unhelpful skills actively poisons the learning signal.
- SKILL0’s adaptive [6,3,0]: The only setting that shows positive transfer (+1.6%) when skills are removed at inference. The model actually performs slightly better without skills than with them.
That last point is remarkable. It suggests that once internalized, the skill knowledge is accessed more reliably from parameters than from context.
A Different Kind of Context Efficiency
SKILL0 is not the only recent work tackling context efficiency in LLMs. Methods like TriAttention address the same memory bottleneck from a completely different angle, compressing the KV cache at the attention mechanism level by exploiting mathematical properties of positional embeddings.
The distinction matters: TriAttention and similar KV cache compression methods make the infrastructure more efficient (fitting more tokens into the same memory budget), while SKILL0 makes the agent more efficient (eliminating unnecessary tokens from the context in the first place). These are complementary strategies. An agent running SKILL0 on a model with TriAttention-style KV compression would benefit from both: fewer tokens going in and more efficient processing of those that remain.
Why This Matters for the AI Agent Ecosystem
The Broader Context
The agent skills ecosystem has been growing rapidly. Open standards like AgentSkills have emerged, skill libraries and retrieval pipelines are becoming more sophisticated, and frameworks like Strands and OpenClaw have adopted structured skills as first-class primitives.
But all of this assumes a retrieve-and-prompt paradigm. SKILL0 suggests a complementary path: once a skill has been used enough times across enough contexts, it can potentially be baked into the model itself.
Practical Implications
For agent framework designers: Consider instrumenting skill usage telemetry. Knowing which skills are used most frequently, in which contexts, and with what success rates would be exactly the data needed to identify internalization candidates.
For model providers: The SKILL0 methodology requires weight access for RL training. This is straightforward for open-weight models, but the implications for API-based models are interesting. Could a model provider offer “skill compilation” as a service, taking a customer’s most-used skill patterns and fine-tuning a model variant?
For production deployments: SKILL0’s token efficiency advantage is not just about cost. Lower context usage means more headroom for complex tasks, longer interaction histories, and parallel tool invocations. At 0.18k tokens per step versus 0.87k for SkillRL, you are looking at meaningfully different scaling economics.
The Hybrid Future
I think the most likely trajectory is a hybrid approach:
- Internalize your stable, high-frequency skills: the ones used thousands of times across diverse contexts, where the procedural knowledge is well-established and rarely changes.
- Keep runtime retrieval for the long tail: domain-specific skills that change frequently, newly created skills that have not been validated at scale, and niche capabilities that do not justify training compute.
This maps naturally to how organizations already think about optimization: the hot path gets compiled and cached; the cold path stays interpreted.
Open Questions
Several challenges remain before skill internalization becomes practical at scale:
- Skill library quality: SKILL0’s performance depends on the initial skill bank. Poor skills lead to poor internalization. How do you quality-gate skills before investing training compute?
- Skill drift: Enterprise workflows change. A skill internalized last month might be outdated today. What is the refresh cadence, and can incremental updates work, or do you need full retraining?
- Scale: The paper evaluates on 6 task categories (ALFWorld) and 7 QA benchmarks. Real-world agent deployments might have hundreds of skills. Does the curriculum approach scale, and how does the validation overhead grow?
- Proprietary models: Most production agents use proprietary models through APIs. The RL training loop requires weight access. Can the core ideas be adapted for distillation or other approaches that work with API-only models?
Conclusion
SKILL0 represents a meaningful conceptual shift in how we think about AI agent capabilities. The dominant paradigm (retrieve skills at runtime, inject them into context) works, but it treats the model as a perpetual novice that can never learn from its own experience. SKILL0 demonstrates that with the right training curriculum, models can genuinely internalize procedural knowledge, operating autonomously while being both more capable and more efficient.
The cooking analogy is apt: we are moving from agents that need the recipe every time to agents that have actually learned to cook.
📝 Note: This analysis is based on my reading of the original paper. Technical details, experimental setups, and results are sourced directly from the paper. For the full methodology and additional experiments, I recommend reading the original paper on arXiv. The code is available at github.com/ZJU-REAL/SkillZero.
What are your thoughts on skill internalization vs. runtime augmentation? I’d love to hear your perspective.
⚠️ Disclaimer: The views and opinions expressed in this article are my own and do not represent those of my employer. This content is for educational purposes only.