Graph of Skills and PageRank for Agent Skill Retrieval
Agent skill libraries are growing quickly, and that is not a hypothetical problem. Platforms like ClawHub already host hundreds of reusable skill packages, and academic projects like SkillNet are building repository-scale collections. As the ecosystem matures, a single agent might need access to thousands of specialized skills covering everything from data preprocessing to cloud deployment to financial modeling.
So the question becomes: when a user asks the agent to do something, how does it find the right set of skills?
A new paper called “Graph of Skills” (GoS) proposes an answer that caught my attention, partly because it uses a class of algorithms I spent years working with during my PhD research on trust propagation in social networks. Let me walk through what GoS does, why it matters, and what the results look like.
The Problem: Two Bad Options
Today, there are two common approaches to handling large skill libraries.
Option A: Load everything. Prepend the entire skill set to the context window. This maximizes recall, since the right skill is always in there somewhere. But the costs are real. Token usage grows linearly with library size. At 2,000 skills, you might be pushing 5–6 million input tokens per task. The model drowns in irrelevant instructions, and critical constraints get buried. Research has shown that language models struggle to use information effectively when it sits in the middle of very long contexts.
Option B: Vector retrieval. Use embeddings to find the most semantically similar skills. This is more efficient because you only pull in the top-k matches. But it has a blind spot that turns out to be critical.
Consider this scenario: a user asks “deploy my model to SageMaker.” Semantic search finds the deployment skill easily. But that deployment skill depends on “build a Docker image,” which depends on “write a Dockerfile,” which depends on “configure an ECR registry.” None of these prerequisites look anything like “deploy to SageMaker” in embedding space. They get missed.
The Prerequisite Gap: The disconnect between what is semantically similar and what is functionally necessary. In the paper’s experiments, pure vector retrieval actually performed worse than loading everything, because retrieving an incomplete skill set left the agent unable to execute the full task pipeline.
The Solution: Build a Graph, Walk It Backward
GoS takes a different approach entirely. Instead of treating skills as isolated documents to be searched, it treats them as nodes in a dependency graph.
The system works in two phases.
Phase 1: Offline Graph Construction
Each skill package gets parsed into a standardized node containing its name, capabilities, inputs, outputs, domain tags, scripts, and entry points. Then GoS builds a directed graph with four types of edges:
- Dependency edges (dep): Skill A produces something that Skill B needs, so B can't run without A. These are the most important edges: if a skill's prerequisites aren't retrieved, it probably won't work.
- Workflow edges (wf): Skills that commonly execute in sequence.
- Semantic edges (sem): Skills that are topically related.
- Alternative edges (alt): Skills that can substitute for each other.
The key insight is that dependency edges carry the most information about what you actually need to execute a task, but they’re the ones that semantic search misses most often.
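To make the offline phase concrete, here is a minimal sketch of what a parsed skill node and the edge-building pass might look like. The field names, the output/input-overlap heuristic for dependency edges, and the tag-overlap stand-in for semantic edges are my own illustration of the idea, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    # Standardized skill record (field names are illustrative, not the paper's schema).
    name: str
    capabilities: str
    inputs: set                      # artifacts this skill consumes
    outputs: set                     # artifacts this skill produces
    domain_tags: set = field(default_factory=set)
    entry_point: str = ""

def build_edges(skills):
    """Return a list of (src, dst, edge_type) tuples.

    Dependency edges point from producer to consumer: if A's outputs overlap
    B's inputs, A is a prerequisite of B and we add (A, B, "dep").
    Semantic edges here come from shared domain tags, a crude stand-in for
    the paper's embedding-based topical linking.
    """
    edges = []
    for a in skills:
        for b in skills:
            if a.name == b.name:
                continue
            if a.outputs & b.inputs:
                edges.append((a.name, b.name, "dep"))
            elif a.domain_tags & b.domain_tags:
                edges.append((a.name, b.name, "sem"))
    return edges
```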
Phase 2: Online Retrieval
When a task query arrives, GoS runs three steps:
Step 1: Hybrid Seeding. Find initial “seed” skills using both semantic search (embeddings) and lexical search (BM25 keyword matching). Combining both signals gives better entry points than either alone. The paper’s ablation study shows that removing the lexical component drops performance by 7.7 points.
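Here is a rough sketch of the seeding step, assuming you already have an embedding function and the skills' text descriptions. I'm using the rank_bm25 package for the lexical signal; the fusion weights and min-max normalization are illustrative choices, not the paper's.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_seeds(query, skill_texts, embed, k=5, w_sem=0.5, w_lex=0.5):
    """Pick seed skills by fusing dense (embedding) and lexical (BM25) scores.

    `embed` maps a string to a unit-normalized vector; `skill_texts` are the
    skills' name + capability descriptions, in node order.
    """
    # Dense scores: cosine similarity between query and each skill description.
    q = embed(query)
    dense = np.array([float(np.dot(q, embed(t))) for t in skill_texts])

    # Lexical scores: BM25 over whitespace-tokenized descriptions.
    bm25 = BM25Okapi([t.lower().split() for t in skill_texts])
    lexical = bm25.get_scores(query.lower().split())

    # Min-max normalize each signal so the fusion weights are comparable.
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    fused = w_sem * norm(dense) + w_lex * norm(lexical)
    return np.argsort(fused)[::-1][:k]      # indices of the top-k seed skills
```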
Step 2: Reverse-Weighted Personalized PageRank. This is the core mechanism. Starting from the seed skills, GoS propagates importance scores through the graph, but with a twist. It walks backward along dependency edges.
Why backward? Because dependency edges record which skill supplies another: if A depends on B, the raw edge runs from the prerequisite to the skill that consumes it, B→A. When you've matched A as a seed, the prerequisites you're missing sit upstream of it. So you reverse the direction: from A, walk against the dependency edges to find everything A needs.
The algorithm works like this: imagine a random walker starting at your seed skills. At each step, it either follows an edge to a neighbor (with 85% probability) or teleports back to the seeds (with 15% probability). After enough steps, the frequency of visiting each node converges to a stable score. Nodes that are tightly connected to the seeds through dependency chains get high scores, even if they’re semantically distant from the original query.
Different edge types get different weights. Dependency edges get the highest weight because missing a dependency is the most likely cause of task failure. Semantic edges get lower weight. This means the walker preferentially follows dependency paths, spending the retrieval budget on things the agent actually needs rather than things that merely look related.
Under the Hood: The Math
If you want to understand what’s actually being computed, here’s the core equation:
s(l+1) = α * p + (1-α) * Tᵀ * s(l)
Let’s break each piece down:
- p is the seed vector. If your hybrid seeding found skills A and B as relevant, p has non-zero values at those positions and zeros everywhere else. It represents “where do I start looking?”
- α (alpha) is the teleport probability, typically around 0.15. It controls how much the algorithm stays anchored to the seeds vs. exploring outward. Higher alpha = more focused on seeds. Lower alpha = more willing to walk far along dependency chains.
- T is the unified transition operator. This is where the interesting engineering happens. GoS builds T by combining forward and reverse adjacency matrices for all four edge types:
T = RowNorm( λ_dep * (T_dep_forward + γ_dep * T_dep_reverse)
+ λ_wf * (T_wf_forward + γ_wf * T_wf_reverse)
+ λ_sem * (T_sem_forward + γ_sem * T_sem_reverse)
+ λ_alt * (T_alt_forward + γ_alt * T_alt_reverse) )
The λ weights control which edge types matter most (λ_dep is highest). The γ values control how strongly the algorithm walks backward along each edge type (γ_dep is set high so that reverse dependency traversal is aggressive). RowNorm makes each row sum to 1 so we get valid transition probabilities. (A code sketch of this assembly follows the list below.)
- Tᵀ is the transpose of T. Why transpose? Because the PPR update Tᵀ * s computes, for each node, the total score flowing into it from all neighbors. Without the transpose, you'd be computing what flows out of each node, which is not what we want.
- s(l) is the score vector at iteration l. Start at s(0) = p, then iterate.
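To see how those pieces fit together, here is a minimal numpy sketch of assembling the unified transition operator from per-type adjacency matrices. The specific λ and γ values are placeholder choices of mine (with dependencies weighted highest and their reverse traversal aggressive), not the paper's tuned numbers.

```python
import numpy as np

def unified_transition(adj, lam, gamma):
    """Combine per-type adjacency matrices into one row-stochastic operator.

    adj[t] is the N x N forward adjacency matrix for edge type t in
    {"dep", "wf", "sem", "alt"}; adj[t].T gives the reverse direction.
    lam[t] weights the edge type, gamma[t] weights its reverse traversal.
    """
    n = next(iter(adj.values())).shape[0]
    T = np.zeros((n, n))
    for t, A in adj.items():
        T += lam[t] * (A + gamma[t] * A.T)
    # Row-normalize so each row is a probability distribution (RowNorm).
    row_sums = T.sum(axis=1, keepdims=True)
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)

# Illustrative weights: dependencies dominate, and their reverse
# traversal (toward prerequisites) is the most aggressive.
lam   = {"dep": 1.0, "wf": 0.6, "sem": 0.3, "alt": 0.3}
gamma = {"dep": 2.0, "wf": 0.5, "sem": 1.0, "alt": 1.0}
```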
A concrete example with 4 skills makes this tangible. Say the seed is skill A ("Deploy to SageMaker"); A depends on B ("Build Docker Image"), B depends on C ("Write Dockerfile"), and A has a weak semantic link to D ("Configure CloudWatch").
Iteration 0: scores = [A:1.00, B:0.00, C:0.00, D:0.00]
Iteration 1: scores ≈ [A:0.15, B:0.68, C:0.00, D:0.17]
Iteration 2: scores ≈ [A:0.15, B:0.10, C:0.58, D:0.03]
After just two iterations, C (“Write Dockerfile”) has the second-highest score despite being semantically irrelevant to “Deploy to SageMaker.” That’s the whole point: the graph discovered a critical prerequisite that embedding similarity would have completely missed.
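You can reproduce those numbers with a few lines of numpy. The effective transition weights below (A splits 0.8/0.2 between its dependency on B and its semantic link to D; B passes everything to C) are my own choices, picked to be consistent with the walkthrough above rather than taken from the paper.

```python
import numpy as np

# Effective transition matrix over [A, B, C, D], i.e. what you get after
# edge weighting (including the reverse dependency terms) and row
# normalization in this toy graph.
T = np.array([
    [0.0, 0.8, 0.0, 0.2],   # A: mostly toward its prerequisite B, a little to D
    [0.0, 0.0, 1.0, 0.0],   # B: toward its prerequisite C
    [0.0, 0.0, 0.0, 0.0],   # C: no outgoing edges in this toy graph
    [0.0, 0.0, 0.0, 0.0],   # D: no outgoing edges
])
p = np.array([1.0, 0.0, 0.0, 0.0])   # seed vector: only A was matched
alpha = 0.15                          # teleport probability

s = p.copy()
for i in range(1, 3):
    s = alpha * p + (1 - alpha) * T.T @ s
    print(f"Iteration {i}:", np.round(s, 2))
# prints roughly:
# Iteration 1: [0.15 0.68 0.   0.17]
# Iteration 2: [0.15 0.1  0.58 0.03]
```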
Step 3: Budgeted Hydration. Once the scores converge, GoS doesn’t just take the top-k by graph score. It combines the converged diffusion score with direct field-level evidence:
ρ(q) = s* + μ * m(q)
Here s* is the converged PPR score and m(q) aggregates how well the skill’s name, capability summary, artifacts, and entry points match the original query. The μ parameter controls how much to preserve local grounding after graph expansion. Skills are then hydrated (materialized into agent-consumable format) in descending order of ρ until you hit the token budget.
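A sketch of that final selection step, assuming each skill has a precomputed token cost and a field-level match score m(q); the greedy loop is my reading of "hydrate in descending order of ρ until you hit the token budget," not code from the paper.

```python
import numpy as np

def budgeted_hydration(s_star, field_match, token_costs, mu=0.3, budget=50_000):
    """Rank skills by rho = s* + mu * m(q), then pack the token budget greedily.

    s_star:      converged PPR scores, shape (N,)
    field_match: per-skill match of name/capabilities/artifacts/entry points
                 against the query, shape (N,)
    token_costs: tokens needed to materialize each skill into the prompt, shape (N,)
    """
    rho = s_star + mu * field_match
    selected, used = [], 0
    for idx in np.argsort(rho)[::-1]:        # highest rho first
        cost = int(token_costs[idx])
        if used + cost > budget:
            break                            # stop once the budget is hit
        selected.append(int(idx))
        used += cost
    return selected                          # indices to hydrate, in rank order
```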
Why This Matters: The Numbers
The paper evaluates GoS on two benchmarks across three model families.
SkillsBench contains diverse real-world technical tasks (macroeconomic analysis, power-grid feasibility, 3D scan processing, financial modeling) with curated skill libraries. ALFWorld is an interactive household environment where agents navigate rooms and manipulate objects.
Here are the headline results:
| Model | Method | Reward (SkillsBench) | Reward (ALFWorld) | Token Reduction vs. Vanilla |
|---|---|---|---|---|
| Claude Sonnet 4.5 | Vanilla Skills | 25.0% | 89.3% | baseline |
| Claude Sonnet 4.5 | Vector Skills | 19.3% | 93.6% | ~7% |
| Claude Sonnet 4.5 | GoS | 31.0% | 97.9% | ~11% |
| GPT-5.2 Codex | Vanilla Skills | 27.4% | 89.3% | baseline |
| GPT-5.2 Codex | Vector Skills | 21.5% | 92.9% | ~61% |
| GPT-5.2 Codex | GoS | 34.4% | 93.6% | ~57% |
A few things stand out:
Vector retrieval can actually hurt performance. On SkillsBench, Vector Skills scored lower than loading everything, across all three models. Under Claude Sonnet, it dropped from 25.0 to 19.3. The reason: SkillsBench tasks are long-horizon and require combining multiple skills, including prerequisites. Retrieving only the semantically closest skills gives you an incomplete toolkit.
GoS beats both baselines everywhere. It improves over full loading (better performance with fewer tokens) and over vector retrieval (better performance because it captures dependencies). The +43.6% average reward improvement over Vanilla Skills, combined with 37.8% token reduction, is a strong result.
The advantage grows with library size. In the ablation study scaling from 200 to 2,000 skills, GoS held up at every scale. At 200 skills, Vanilla Skills slightly edged it out (32.5 vs 32.1), but from 500 skills onward GoS consistently outperformed. Meanwhile, Vanilla Skills token usage jumped from 1.85M to 5.84M tokens, while GoS stayed near 1.1–1.4M.
What About Latency?
A natural question: doesn’t adding a graph lookup step slow things down?
The short answer: the net effect is probably faster, not slower.
The graph walk itself runs in milliseconds. It’s a few iterations of matrix multiplication on a precomputed graph. The expensive step in any agent pipeline is the LLM inference, and that gets faster when you send fewer tokens. With 37.8% fewer input tokens, the model’s time-to-first-token and overall generation time both improve.
There’s also a second-order benefit: when the agent gets a complete skill bundle on the first try, it doesn’t need multi-round retries to discover missing prerequisites. Each retry means another full LLM call, so avoiding them saves real time.
The paper’s data confirms this. Under Claude Sonnet and MiniMax, GoS actually had lower agent runtime than Vanilla Skills. Only GPT-5.2 Codex showed slightly higher runtime for GoS, likely due to model-specific caching optimizations for fixed-prompt patterns.
The Bigger Picture: From Flat Search to Structural Reasoning
What I find most interesting about this work isn’t the specific numbers. It’s the pattern.
During my PhD, I worked on a problem that sounds unrelated at first: how do you evaluate whether users and tweets on Twitter are trustworthy? The intuition was that you can’t judge a tweet in isolation, and you can’t judge a user in isolation either. They’re connected. A trustworthy user tends to post trustworthy content, and trustworthy content tends to come from trustworthy users. So rather than building a single network, I developed CoRank, a Coupled Dual Networks Trust Ranking method that models users and tweets as two separate but interconnected graphs and lets trust scores propagate between them. The approach outperformed standard PageRank, TURank, and Weighted PageRank on real Twitter data (published in World Wide Web journal, 2020).
The core math was the same family as what GoS uses: define a transition operator on the graph, pick your seed nodes, and iterate until convergence. The difference was that CoRank propagated trust scores across two coupled networks (users and tweets), while GoS propagates importance scores across a single skill dependency graph. But the underlying principle is identical: relationships between nodes carry information that looking at individual nodes cannot capture.
GoS applies this insight to a completely different domain. You can’t determine whether a skill is the right one by looking at it in isolation. You need to understand what it depends on, what depends on it, and how execution flows through the skill graph.
The underlying principle: When you need to find things that work together, not just things that look alike, flat retrieval is not enough. You need structure.
This matters beyond skill retrieval. The same kind of structural reasoning could apply to tool selection, API composition, workflow planning—really any setting where the right answer is not a single item but a group of items that function as a unit.
What’s Next
The paper acknowledges some limitations. The graph quality depends on how well-documented the skill packages are. Poorly described inputs and outputs mean weak dependency edges, which means the graph walk is less effective. The graph is also static and doesn’t yet learn from execution traces or user feedback.
Both of these seem solvable. Skill documentation quality is improving as standards like the Agent Skills spec mature. And incorporating execution feedback to refine edge weights is a natural extension that the authors flag as future work.
For practitioners building agent systems with growing skill libraries, the takeaway is clear: as your skill count grows past what fits comfortably in a context window, you need a retrieval strategy that understands dependencies, not just semantics. GoS shows one principled way to do that.
Paper: “Dependency-Aware Structural Retrieval for Massive Agent Skills” by Dawei Liu, Zongxia Li et al.
Code: github.com/davidliuk/graph-of-skills