The path to recursive self-improvement (RSI) is to let AI take over how humans build AI.
A-Evo Lab, led by Henry Lu, studies self-evolving agents under one thesis β AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. Today humans build AI in three critical stages β pre-training β post-training β harness building. We are building an autonomous AI researcher for each, have reached SOTA results where we've shipped, and develop everything on one shared stack, A-Evolve, so we can iterate fast.
| Human stage of building AI | Our program | What the AI researcher does | Status |
|---|---|---|---|
| Harness building | AI-Harness | Evolves prompts / skills / memory / tools around a frozen model | β SOTA across benchmarks |
| β³ long-running deployment | AI-Harness Β· Adaptive | Sustains performance on open-ended task streams | β Leads every reported stream metric |
| Post-training | AI-Training | Designs data mixtures, schedules, HPs & ablations end-to-end | π Human-team parity @ 30B β report in prep |
| Pre-training | AI-Pretraining | β | π§ The open frontier |
With zero manual harness engineering, A-Evolve's reference algorithms push a single Claude Opus-4.6 base model to top-tier performance across diverse agentic benchmarks:
Single Claude Opus-4.6 base model, evolved with A-Evolve's reference algorithms. 0 hours of human harness engineering. CL-Bench, SWE-bench Lite, Ο-bench & WebArena-Infinity show before β after on the same base model. Data checked March 2026.
Key finding β evolver capability decouples from harness quality. A 9B model (Qwen3.5) writes harness updates as good as Claude Opus 4.6 (best-vs-worst evolver β€ 3.1pp); benefit is non-monotonic β mid-tier agents gain most, weak agents fail to even load the harness. Implication: put your capability budget on the agent, not the evolver.
π Evolver-Solver-Bench β Harness Updating Is Not Harness Benefit. arXiv 2605.30621 Β· HF Daily π Evo-Harness β Context-to-Harness Skill Compilation (online evolution: feedback grounding, abstraction level, solverβevolver alignment). Releasing soon.
Naive self-evolving agents peak early and then decline β a single dense harness overfits to early evidence. Adaptive Auto-Harness fixes this with a stateful multi-agent evolver, a harness tree with solve-time routing, and scoped human-steering hooks β leading every reported metric against five auto-harness baselines plus the human-designed OctoTools:
| Stream | Domain | A-Evolve-Adaptive | Next best |
|---|---|---|---|
| PolyBench | Prediction markets | 80.9% Accuracy | 50.8% |
| CTF-Dojo | Security competitions | 50.2% Pass | 45.2% |
| FutureX | Event forecasting | 49.5% Pass | 47.5% |
π Adaptive Auto-Harness β Sustained Self-Improvement on Open-Ended Task Streams. Releasing soon.
The same loop, carried all the way into model weights: an evolver autonomously runs end-to-end 30B post-training β designing data mixtures, training schedules, hyperparameter regimes, and ablation protocols β reaching parity with a human post-training team. To our knowledge, the first time an autonomous system has done so at this scale.
Tech report in preparation β full results and methodology on release.
The largest and most expensive stage of building AI β and the one we have not automated yet. It is where this thesis goes next.
Every result above was developed on A-Evolve, our open-source infrastructure for self-improving agents β "the PyTorch for Agentic AI." It evolves any agent, in any domain, with any evolution algorithm, and is what makes fast iteration across all three programs possible.
import agent_evolve as ae
evolver = ae.Evolver(agent="./my_agent", benchmark="swe-verified")
results = evolver.run(cycles=10) # SOTA agent. 3 lines. 0 hours of manual harness engineering.Adopted & integrated by: OpenRLHF Β· DeepSpeed Β· SGLang Β· GEPA Β· AutoResearch
β Star the repo β github.com/A-EVO-Lab/a-evolve
Building in this direction, or want to collaborate? Reach out β X / Twitter Β· LinkedIn.
- 5/30 New Paper β Harness Updating Is Not Harness Benefit (arXiv 2605.30621). 7 evolver models Γ 6 solver agents Γ 3 benchmarks: counterintuitive answers on who produces good harness updates and who benefits.
- 05/04 New Benchmark Results β A-Evolve results on ARC-AGI-3, evolving a multi-agent system from 10% β 12%.
- 04/20 New Algorithm β GEPA, submitted by the GEPA team.
- 04/10 Integration β into Orch-Research Skills Library, alongside AutoResearch, OpenRLHF, DeepSpeed, SGLang.
- 04/07 New Agent β transplanted our Terminal-Bench 2.0 harness onto ClawCode: 67.8% β 72.9% (+5.1pp).
- 04/03 New Algorithm β Meta-Harness.
- 03/25 π Open-sourced A-Evolve + 4 reference algorithms achieving SOTA (#1, ~#5, ~#7, #2) on MCP-Atlas, SWE-bench Verified, Terminal-Bench 2.0, SkillsBench.
- 02/17 π Position paper: Agentic Evolution is the Path to Evolving LLMs (arXiv 2602.00359).
We are evolving fast β support our research by leaving a β on A-Evolve.



