Recursive Self-Evolving Agents via Held-Out Selection

Summary


A new paper by Michael Nguyen, Quoc Nguyen, and Paul Vuong introduces RSEA (Recursive Self-Evolving Agent), a method that lets large language model (LLM) agents improve without weight updates by evolving a compact, three-layer natural-language state: an imperative strategy, reusable skills, and a procedural playbook. Across generations, RSEA rewrites all three layers from its own trajectories and commits a candidate only if it does not regress on a disjoint held-out split (a strict keep-better gate). The paper compares RSEA against six faithful baselines on four diverse benchmarks, with every method evaluated on the same local backbone so the comparison is apples-to-apples.
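
The commit rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AgentState`, `evolve`, `propose`, and `evaluate` are hypothetical, the number of generations is arbitrary, and "does not regress" is read here as a held-out score that is greater than or equal to the current best.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class AgentState:
    """Three-layer natural-language state (field names are assumptions)."""
    strategy: str = ""
    skills: Tuple[str, ...] = ()
    playbook: Tuple[str, ...] = ()


def evolve(
    state: AgentState,
    propose: Callable[[AgentState, Sequence[str]], AgentState],
    evaluate: Callable[[AgentState, Sequence[str]], float],
    train_tasks: Sequence[str],
    heldout_tasks: Sequence[str],
    generations: int = 3,
) -> AgentState:
    """Run a gated evolution loop and return the last committed state."""
    # The gate is only meaningful if the two splits share no tasks.
    if set(train_tasks) & set(heldout_tasks):
        raise ValueError("train and held-out splits must be disjoint")

    # Score the starting state (e.g. the base agent) on the held-out split.
    best_score = evaluate(state, heldout_tasks)
    for _ in range(generations):
        # Hypothetical step: rewrite the state from training trajectories.
        candidate = propose(state, train_tasks)
        score = evaluate(candidate, heldout_tasks)
        # Commit only if the candidate does not regress on held-out tasks.
        if score >= best_score:
            state, best_score = candidate, score
    return state
```

If every candidate scores below the starting state, the loop returns that starting state unchanged, which mirrors the fallback to the base agent. A stricter reading of "keep-better" would use `>` in place of `>=`.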


Key Results


  • No single artifact universally wins. RSEA is the strongest single-pass method on ALFWorld, reaching 69.3% (vs. 64.6% for ReAct, McNemar p=0.015), and 79.4% with retry (best overall). However, concrete-workflow induction (AWM) is best on strong-backbone tool-use tasks.
  • Unguarded context evolution is high-variance and unsafe. Dynamic Cheatsheet, which curates context online without a held-out gate, is near-best on ALFWorld (70.7%), yet collapses on WebShop (score 0.14 vs. 0.43 for ReAct).
  • RSEA’s strict held-out selection makes recursive self-evolution monotone-safe: it never significantly underperforms the base agent on any benchmark and falls back to vanilla ReAct when evolved context would hurt.
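
The McNemar p-value quoted above comes from a paired comparison of two methods on the same tasks. As a rough sketch of how such a test works, the code below computes a two-sided exact (binomial) McNemar p-value from per-task success flags. The summary does not say which variant of the test the paper uses, so the exact form is an assumption, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    a_success: Sequence[bool], b_success: Sequence[bool]
) -> Tuple[int, int]:
    """Count tasks where exactly one of the two methods succeeded."""
    if len(a_success) != len(b_success):
        raise ValueError("both methods must be scored on the same tasks")
    only_a = sum(1 for x, y in zip(a_success, b_success) if x and not y)
    only_b = sum(1 for x, y in zip(a_success, b_success) if y and not x)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant task is equally likely
    to favor either method, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only tasks on which the two methods disagree carry information in this test, so tasks that both methods solve or both fail drop out of the calculation.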

2026 Context


As of early 2026, the AI community is increasingly focused on agentic LLMs that can adapt to specific tasks without expensive fine-tuning. The RSEA framework addresses a critical gap: how to safely and reliably evolve an agent’s context (prompts, strategies, skills) without catastrophic regression. With the rise of agentic workflows in production, methods like RSEA that offer monotone-safe improvement are especially valuable.


Implications


RSEA provides a principled approach to recursive self-evolution: because every candidate must pass a held-out gate, the evolved agent does not significantly underperform the base agent. The held-out selection mechanism could become a standard safeguard for any system that modifies its own context online. The paper's broad evaluation, which spans benchmarks such as ALFWorld, GAIA, τ-bench, and WebShop, shows both the strengths and the limitations of current context-evolution techniques. It also suggests that future work may need to combine multiple artifact types (e.g., strategies + workflows) to achieve universal gains.

via ArXiv AI
