Hermes Agent: The Self-Improving AI Agent That Learns From Its Own Mistakes | Dr. Melanie Li


Hermes and the Closed Learning Loop

📅 April 13, 2026
📖 12 min read
Self-Improving Agents
GEPA
Agent Memory
⚠️ Disclaimer: The views and opinions expressed in this article are my own and do not represent the views of my employer. This post reflects my personal experience and analysis as a practitioner working with AI agents.
Most AI agents forget everything between sessions. You tell them your preferences, walk them through a workflow, watch them solve a problem. Next time, they start from scratch. Hermes Agent, built by Nous Research, takes a fundamentally different approach: it treats every interaction as training data for becoming a better agent.

I spent time studying Hermes after it caught my attention on GitHub (57K+ stars and growing fast, MIT license). What makes it stand out isn’t any single feature. It’s how the pieces fit together into a closed loop: the agent works, reflects on what worked and what didn’t, extracts reusable skills from experience, and improves those skills over time. Most agent frameworks stop at “the agent can use tools.” Hermes asks: “what if the agent could also get better at using them?”

Here’s what I found.

The Closed Learning Loop

The core idea behind Hermes is what Nous Research calls a “closed learning loop.” After completing a complex task, Hermes doesn’t just move on. It looks back at the execution trace and asks: what did I just do that could be generalized?

The answer takes the form of a skill: a reusable procedure stored as a markdown file (following the AgentSkills open standard from agentskills.io). But unlike manually authored skills, these are extracted automatically from the agent’s own experience. If Hermes figures out a multi-step process for, say, setting up a CI pipeline with custom test stages, it can distill that into a skill that it (or other agents) can reuse later.

This is different from simply saving conversation history or stuffing facts into a vector database. A skill is procedural knowledge: steps, conditions, dependencies, expected inputs and outputs. It’s the difference between remembering that you once set up a CI pipeline and knowing exactly how to do it again.
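To make the distinction concrete, here is a minimal sketch of what a procedural skill record might look like. The field names and the SKILL.md-style rendering are my own illustration for this post, not the actual AgentSkills schema:

```python
from dataclasses import dataclass

# Illustrative skill record: procedural knowledge as steps, conditions,
# and expected inputs/outputs. Field names are invented for this sketch.
@dataclass
class Skill:
    name: str
    description: str
    inputs: list[str]          # what the skill expects to receive
    outputs: list[str]         # what it should produce
    preconditions: list[str]   # conditions that must hold before running
    steps: list[str]           # the ordered procedure distilled from a trace

    def to_markdown(self) -> str:
        """Render the skill as a SKILL.md-style file."""
        lines = [f"# {self.name}", "", self.description, "", "## Steps"]
        lines += [f"{i}. {s}" for i, s in enumerate(self.steps, 1)]
        return "\n".join(lines)

ci_skill = Skill(
    name="Set up CI pipeline with custom test stages",
    description="Distilled from a successful execution trace.",
    inputs=["repo_url"],
    outputs=["committed CI config"],
    preconditions=["repo has a test suite"],
    steps=[
        "Detect test framework",
        "Write pipeline config",
        "Add custom test stages",
        "Commit and verify a run",
    ],
)
print(ci_skill.to_markdown())
```

The point of the structure is that every field is actionable on a re-run, which is exactly what a raw conversation transcript is not.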

GEPA: Evolving Prompts Through Reflection, Not Rewards

The most technically interesting part of Hermes is how it improves skills over time. The mechanism is called GEPA (Genetic-Pareto Prompt Evolution), published at ICLR 2026 (arxiv.org/abs/2507.19457).

The problem GEPA solves: once you have a skill (or any prompt-based procedure), how do you make it better without expensive RL training?

Traditional approaches use scalar reward signals. Run the prompt, get a score, update toward higher scores. This works, but it’s expensive (hundreds or thousands of rollouts), opaque (you know the score went up but not why), and brittle (optimizing for one metric often degrades others).

GEPA replaces scalar rewards with natural language reflection. After running a skill, instead of producing a number, the system produces a text critique: “The deployment step failed because the Dockerfile didn’t include the health check endpoint. The skill should add a health check configuration step before the build stage.”

This reflection becomes the mutation signal for the next generation. GEPA maintains a population of skill variants and evolves them using three mechanisms:

1. Genetic prompt evolution: Treat each skill version as an individual in a population. Create new variants by combining successful elements from different versions. Concretely, if Version A has a good error handling section and Version B has better parameter validation, the crossover operation can produce a Version C that combines both strengths. This is fundamentally different from simply editing a single prompt. You’re exploring a space of possible skill formulations in parallel.
2. Natural language reflection: Instead of “score: 0.73”, the system produces written analysis of what went wrong and what could improve. For example, after a failed deployment skill execution, the reflection might be: “Step 3 assumed the Docker image was already built, but the skill was invoked on a fresh environment. The skill needs an explicit image-build step before the push.” This gives the optimizer semantic information about how to change the prompt, not just whether to change it. The reflection also captures higher-order patterns. After seeing the same type of failure across multiple runs, it might note: “Skills that assume environment state tend to fail on first invocation. Always include a setup/validation phase.”
3. Pareto-based selection: Skills are evaluated on multiple criteria simultaneously (accuracy, token efficiency, reliability, generality). GEPA uses Pareto dominance to select candidates, avoiding the problem of optimizing one dimension at the expense of others. A skill that’s 95% accurate but uses 4K tokens isn’t strictly better than one that’s 90% accurate at 1K tokens. Pareto selection keeps both on the frontier, letting the system explore different trade-offs rather than collapsing toward a single solution.
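The Pareto selection step is easy to sketch. The following is a minimal illustration of dominance-based filtering over skill variants; the metric names and values are invented for the example, not taken from GEPA's implementation:

```python
# Minimal sketch of Pareto selection over skill variants.
def dominates(a: dict, b: dict, higher_better: set) -> bool:
    """True if candidate a is at least as good as b on every metric
    and strictly better on at least one."""
    at_least_as_good = all(
        (a[m] >= b[m]) if m in higher_better else (a[m] <= b[m])
        for m in a
    )
    strictly_better = any(
        (a[m] > b[m]) if m in higher_better else (a[m] < b[m])
        for m in a
    )
    return at_least_as_good and strictly_better

def pareto_frontier(candidates: list[dict], higher_better: set) -> list[dict]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c, higher_better)
                       for o in candidates if o is not c)]

variants = [
    {"accuracy": 0.95, "tokens": 4000},  # accurate but expensive
    {"accuracy": 0.90, "tokens": 1000},  # cheaper trade-off
    {"accuracy": 0.85, "tokens": 2000},  # worse on both axes than the second
]
frontier = pareto_frontier(variants, higher_better={"accuracy"})
# The first two variants survive as incomparable trade-offs;
# the third is dominated and dropped.
```

Note that the 95%/4K and 90%/1K variants both stay on the frontier: neither dominates the other, which is exactly the behavior the post describes.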

The results from the ICLR paper: GEPA matches or beats RL-based prompt optimization while using up to 35x fewer rollouts. A typical optimization run costs $2-10 in API calls. No GPU needed. The key insight is that natural language carries far more information per evaluation than a scalar score. One good reflection can tell the optimizer exactly what to change and why, while a scalar reward just says "try something different."

What this means practically: Hermes can take a rough, auto-extracted skill and refine it into something reliable through a few rounds of reflection and evolution. The skill gets better not because someone manually edited it, but because the system identified what was failing and evolved the instructions to address those failures. And because reflections are human-readable, you can inspect the optimization trace and understand why the skill changed, which is something black-box RL optimizers can’t offer.
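The overall reflect-and-evolve loop can be sketched as a skeleton. Here, run_skill, reflect, and mutate are deterministic stand-ins for what would be real executions and LLM calls; none of this is Hermes's actual API:

```python
# Skeleton of the reflect-and-evolve loop. The three functions are
# deterministic stand-ins: a real system would execute the skill,
# ask an LLM to critique the trace, and ask an LLM to rewrite the skill.
def run_skill(skill: str) -> str:
    # Stand-in for executing the skill and capturing a trace.
    return "ok" if "health check" in skill else "deploy failed: no health check"

def reflect(trace: str) -> str:
    # Stand-in for an LLM critique of the execution trace.
    return "add a health check step" if "failed" in trace else ""

def mutate(skill: str, critique: str) -> str:
    # Stand-in for an LLM rewriting the skill from the critique.
    return skill + "\n- configure health check endpoint" if critique else skill

skill = "1. build image\n2. push\n3. deploy"
for generation in range(3):
    trace = run_skill(skill)
    critique = reflect(trace)
    if not critique:
        break  # no remaining criticism: keep this variant
    skill = mutate(skill, critique)
```

Even in this toy form, the shape of the argument is visible: the critique is a targeted edit instruction, so one generation fixes the failure that a scalar reward would only have hinted at.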

Four-Layer Memory With Hard Limits

Hermes implements a layered memory system that solves a problem most agent developers have encountered: how do you keep useful context without letting memory files grow until they overflow the context window?

The four layers:

MEMORY.md: The curated long-term memory file. Hard-capped at around 800 tokens. This forces aggressive summarization: only the most important facts, preferences, and patterns survive. The agent periodically reviews and consolidates this file, removing stale entries and compressing redundant ones. The hard cap is deliberate. Without it, memory files tend to grow monotonically. Information gets added but rarely removed, and eventually the context window fills with facts about projects that ended months ago, preferences that were superseded, and decisions that are no longer relevant. By forcing a cap, Hermes makes memory management an active process rather than a passive one.
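A consolidation pass under a hard cap can be sketched like this. The 800-token budget is from the article; the importance scores and the crude characters-to-tokens estimate are illustrative assumptions:

```python
# Sketch of a hard-capped memory consolidation pass. Importance scores
# and the ~4-characters-per-token estimate are illustrative assumptions.
def consolidate(entries: list[tuple[float, str]], cap: int = 800) -> list[str]:
    """Keep the most important entries that fit under the token cap.
    `entries` are (importance, text) pairs."""
    kept, budget = [], cap
    for importance, text in sorted(entries, reverse=True):
        cost = max(1, len(text) // 4)  # crude token estimate
        if cost <= budget:
            kept.append(text)
            budget -= cost
    return kept

memory = [
    (0.9, "User prefers concise answers with code examples."),
    (0.8, "Main project: hermes-eval, Python 3.12, uses uv."),
    (0.2, "Once asked about a 2024 conference schedule."),  # pruned first
]
print(consolidate(memory, cap=30))
```

The hard cap turns forgetting into a ranking problem: when the budget runs out, the lowest-importance entries simply fail to make the cut.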

Session search (FTS5): Full-text search over past conversation transcripts. When the agent needs to recall something specific from weeks ago, it searches its own session history rather than trying to keep everything in the persistent memory file. This is backed by SQLite FTS5, giving fast keyword and phrase matching with LLM-powered summarization of search results. The key design choice here is separation of concerns: MEMORY.md holds the compressed essentials for every conversation, while FTS5 provides on-demand access to the full uncompressed history. This means the agent’s working context stays small, but it can still dig into specific past interactions when needed.
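The FTS5 layer is easy to reproduce in miniature with Python's bundled SQLite, assuming (as is true for most CPython builds) that it was compiled with the FTS5 extension. The table and column names here are my own, not Hermes's schema:

```python
import sqlite3

# Toy FTS5 session index. Assumes the bundled SQLite includes FTS5.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE sessions USING fts5(session_id, transcript)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("s1", "Set up the CI pipeline with custom test stages for repo foo"),
        ("s2", "Discussed quarterly report formatting preferences"),
    ],
)
# MATCH does keyword/phrase search; `rank` orders by BM25 relevance.
rows = conn.execute(
    "SELECT session_id FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ('"CI pipeline"',),
).fetchall()
print(rows)
```

In a real deployment the result rows would be handed to an LLM for summarization, as the post describes; the point of the sketch is only how cheap the retrieval layer itself is.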

Dialectic user modeling (Honcho): Integration with Plastic Labs’ Honcho framework for building evolving models of who the user is. This goes beyond storing preferences. It models how the user thinks, what they care about, and how those things change over time. Honcho uses a dialectic approach: each new interaction either reinforces or challenges the current user model, and contradictions are resolved through synthesis rather than simple overwriting.

Skills as procedural memory: The skill library itself is a form of memory. When the agent extracts a skill from experience, it’s converting episodic memory (what happened in one session) into procedural memory (how to do something in general). This is the same transition that happens in human learning. You do something for the first time by thinking through each step. After enough repetitions, the procedure becomes automatic. Hermes makes this transition explicit and inspectable: you can read the skill file and see exactly what the agent “learned.”

The design philosophy here is explicit: memory should be curated, not accumulated. Most agent memory systems grow without bound and eventually hurt the agent’s performance by filling the context window with irrelevant history. Hermes takes the opposite approach. It actively prunes, summarizes, and compresses, keeping only what actually helps.

This is worth contrasting with the more common RAG-based memory approach, where past interactions are embedded into a vector store and retrieved by semantic similarity. RAG retrieval is useful for factual recall (“what was the API endpoint for that service?”), but it struggles with procedural knowledge (“what’s the correct sequence of steps to deploy that service?”). Hermes addresses both: semantic recall through FTS5 search, and procedural recall through skills.

Not Just a Chatbot: Multi-Platform, Multi-Backend

Hermes runs as a persistent process with a messaging gateway. It connects to Telegram, Discord, Slack, WhatsApp, Signal, and CLI, all from a single gateway process. You can talk to it from your phone while it runs tasks on a cloud VM.

The terminal backend system supports six different environments: local execution, Docker containers, SSH remote hosts, Daytona (serverless dev environments), Singularity (HPC containers), and Modal (serverless GPU). This means Hermes can run on a $5 VPS, wake on demand through serverless infrastructure, or scale up to GPU clusters for heavy computation.

It also has a built-in cron scheduler for unattended work. You can set up daily reports, scheduled data processing, or recurring checks, all described in natural language and delivered to whichever messaging platform you prefer.

What Caught My Attention (and What I Borrowed)

I studied Hermes as part of evaluating self-improvement patterns for production agent systems. Three aspects stood out as particularly worth learning from:

Auto skill extraction is underrated. Most skill systems need someone to manually write and maintain skills. Hermes shows that agents can extract their own skills from execution traces, and the quality is surprisingly good, especially after a round or two of GEPA optimization. This is a capability more agent frameworks should adopt.

Hard memory limits force quality. The 800-token cap on MEMORY.md seems restrictive until you see what it produces: only the truly important information survives. This is better than the alternative (an ever-growing memory file that eventually drowns the agent in stale context).

Reflection beats rewards for prompt improvement. The GEPA paper's result (matching RL performance with up to 35x fewer rollouts using natural language reflection) has implications beyond Hermes. Any system that uses prompts (which is every LLM-based system) could benefit from this optimization approach.

After this research, I applied some of these patterns to agent systems I work with, including an auto skill extraction pipeline, a memory consolidation system with soft signal limits, and GEPA-style optimization for skill libraries. More on that below.

How Hermes Compares to OpenClaw

Hermes doesn’t exist in a vacuum. The other major open-source AI agent platform right now is OpenClaw, and since I use both, the comparison is worth laying out honestly.

The core difference is in what each project optimizes for. OpenClaw is built around orchestration: connecting an LLM to your messaging apps, browsers, shell, calendar, files, and workflows. It’s an operator. Hermes is built around learning: making the agent better at its job over time through memory management and skill evolution. It’s a practitioner.

Here’s how they stack up across the dimensions that matter:

Channels and integrations. OpenClaw supports 22+ messaging platforms and has deep integrations with browsers, cameras, phones, and system-level automation. Hermes supports 14 platforms through its messaging gateway. If your use case involves routing work across many surfaces (Slack, WhatsApp, email, browser automation), OpenClaw has a clear edge.

Skills ecosystem. OpenClaw’s ClawHub has thousands of community-contributed skills, all manually authored and maintained. Hermes takes the opposite approach: skills are auto-extracted from the agent’s execution traces and refined through GEPA. OpenClaw’s approach gives you breadth and reliability (someone tested the skill before publishing it). Hermes’s approach gives you personalization (the skills are tuned to what you actually do). The ideal is probably both: a curated community library plus automatic extraction for your unique workflows.

Memory. This is where Hermes pulls ahead. Its four-layer memory system with hard token limits produces genuinely better long-term context than OpenClaw’s unbounded workspace file approach. OpenClaw relies on the user (or the agent itself) to manually maintain memory files, which works well in practice but requires discipline. Hermes makes memory management a first-class system concern rather than leaving it to convention.

Self-improvement. Hermes has this built in: GEPA optimization, automatic skill extraction, and periodic memory consolidation. OpenClaw doesn’t have a native self-improvement loop, though its skill system supports manual iteration and you can build improvement workflows on top of its cron and automation primitives.

Execution environments. Hermes supports six terminal backends (local, Docker, SSH, Daytona, Singularity, Modal), letting you run workloads on anything from a $5 VPS to a GPU cluster. OpenClaw runs locally or on a server, with less flexibility for remote execution environments.

Model support. Both are model-agnostic. Hermes leans into open-weight models through Nous Research’s own model lineup plus OpenRouter’s 200+ models. OpenClaw works with any provider through its converse-stream interface.

The community consensus, based on recent comparison articles from Hongkiat, Medium, and several technical deep-dives, is that these aren’t competing projects. They’re complementary. A common setup among power users: OpenClaw for daily operations and multi-channel automation, Hermes for personal learning and memory-intensive work.

Building Specialist Agents That Learn: Where Agent Greenhouse Fits In

Studying Hermes’s architecture raised a question I’ve been working on in a different context: what does it look like to build domain-specific agents that have both Hermes-style learning capabilities and production-grade infrastructure?

This is the problem that Agent Greenhouse (github.com/aws-samples/sample-agent-greenhouse) addresses. It's an open-source framework I've been working on, built on Amazon Bedrock, that takes an opinionated approach to specialist agent construction.

The core idea is a separation between a Foundation Agent (shared infrastructure: memory pipelines, hook middleware, guardrails, observability, deployment) and Domain Harnesses (pure-data configurations that define what makes each specialist agent unique: its skills, policies, persona, and memory layout). You write a Domain Harness as a frozen dataclass. The Foundation Agent reads it and assembles everything at construction time. No boilerplate duplication across agents.
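The frozen-dataclass pattern looks roughly like this. The field names are illustrative, not the actual Agent Greenhouse API:

```python
from dataclasses import dataclass

# Sketch of a Domain Harness: a frozen, pure-data config that a shared
# Foundation Agent reads at construction time. Field names are
# illustrative, not the actual Agent Greenhouse API.
@dataclass(frozen=True)
class DomainHarness:
    name: str
    persona: str
    skills: tuple[str, ...] = ()    # paths to SKILL.md files
    policies: tuple[str, ...] = ()  # guardrail policy identifiers
    memory_layout: str = "session+ltm+workspace"

billing_agent = DomainHarness(
    name="billing-specialist",
    persona="Precise; cites invoice line items.",
    skills=("skills/refund-workflow/SKILL.md",),
    policies=("no-pii-egress",),
)
# frozen=True makes the harness immutable: any attribute assignment
# raises dataclasses.FrozenInstanceError, so configs can't drift at runtime.
```

Using tuples rather than lists keeps the config hashable and genuinely immutable, which matters when many specialist agents share one Foundation Agent.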

Where this connects to Hermes: after studying Hermes’s auto skill extraction and GEPA optimization, I integrated similar patterns into the Greenhouse pipeline. The Foundation Agent now supports AgentSkills (the same open standard Hermes uses, from agentskills.io) with progressive loading, and we added a skill extraction step to the harness workflow that can generate SKILL.md files from execution traces. The memory system uses a three-layer approach (session history, STM-to-LTM pipeline via Amazon Bedrock AgentCore Memory, and workspace files) that draws from the same design principles as Hermes’s layered memory.

The key architectural difference: Greenhouse is designed for multi-tenant environments where you need to grow multiple specialist agents that share infrastructure but have distinct expertise. Hermes is a single-agent system optimized for one user. Both are valid patterns for different scales of deployment.

If you’re building production agents on AWS, Agent Greenhouse gives you the infrastructure layer (hooks, memory, guardrails, deployment to AgentCore Runtime) so you can focus on domain expertise rather than plumbing. If you’re building a personal AI assistant that should get smarter over time, Hermes is hard to beat.

The Bigger Picture: Agents That Get Better Over Time

The most compelling thing about Hermes isn’t any single technical choice. It’s the thesis: an AI agent should be a system that improves with use. Not just through model updates or manual prompt engineering, but through its own experience of succeeding and failing at real tasks.

Most current agent frameworks treat the agent as a static system. You configure it, deploy it, and it performs at roughly the same level forever (minus model drift). Hermes treats the agent as something more like a practitioner who develops expertise: each task teaches it something, and over time the accumulated skills and refined memory make it meaningfully more capable at the types of work it actually does.

The self-improvement direction is clearly where the field is heading. Whether through Hermes’s GEPA-based skill evolution, Agent Greenhouse’s harness-driven skill extraction on AWS, or some future approach we haven’t seen yet, agents that learn from their own work will outperform agents that don’t. If your agent does the same type of task 50 times and isn’t better at it by the 50th time, something is missing.


