A-Evo Lab (Agentic Evolution Laboratory) 🧬

An AI researcher for every stage of building AI

The path to recursive self-improvement (RSI) is to let AI take over how humans build AI.

A-Evo Lab, led by Henry Lu, studies self-evolving agents under one thesis, AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. Today humans build AI in three critical stages: pre-training → post-training → harness building. We are building an autonomous AI researcher for each, have reached SOTA results where we've shipped, and develop everything on one shared stack, A-Evolve, so we can iterate fast.


🗺 The Map

| Human stage of building AI | Our program | What the AI researcher does | Status |
|---|---|---|---|
| Harness building | AI-Harness | Evolves prompts / skills / memory / tools around a frozen model | ✅ SOTA across benchmarks |
| ↳ long-running deployment | AI-Harness · Adaptive | Sustains performance on open-ended task streams | ✅ Leads every reported stream metric |
| Post-training | AI-Training | Designs data mixtures, schedules, HPs & ablations end-to-end | 🔜 Human-team parity @ 30B, report in prep |
| Pre-training | AI-Pretraining | - | 🧭 The open frontier |

🛠 AI-Harness: replacing human harness engineering

With zero manual harness engineering, A-Evolve's reference algorithms push a single Claude Opus-4.6 base model to top-tier performance across diverse agentic benchmarks:

| Benchmark | Rank / status | Result |
|---|---|---|
| 🟢 MCP-Atlas | 🥇 #1 | Baseline → 79.4% (+3.4pp) |
| 🔵 SWE-bench Verified | ~#5 | Baseline → 76.8% (+2.6pp) |
| 🟣 Terminal-Bench 2.0 | ~#7 | Baseline → 76.5% (+13.0pp) |
| 🟡 SkillsBench | #2 | Baseline → 34.9% (+15.2pp) |
| 🟢 ARC-AGI | 🥇 #2 Community Leaderboard | Baseline → 12.3% (+2.2pp) |
| 🔵 OSWorld | - | Baseline → 69.6% (+3.9pp) |
| 🟣 SWE-bench Lite | Evolved | 63.7 → 67.0% (+3.3pp) |
| 🟡 τ-bench | Evolved | 72.7 → 77.0% (+4.3pp) |
| 🟢 CL-Bench | Evolved | 29.5 → 34.0% (+4.5pp) |
| 🔵 WebArena-Infinity | Evolved | 72.5 → 76.3% (+3.8pp) |

Single Claude Opus-4.6 base model, evolved with A-Evolve's reference algorithms. 0 hours of human harness engineering. CL-Bench, SWE-bench Lite, τ-bench & WebArena-Infinity show before → after on the same base model. Data checked March 2026.
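The results above report gains in percentage points (pp), which are easy to confuse with relative improvements. As a small worked example, the sketch below recovers the implied baseline and the relative gain from each "Baseline → X% (+Ypp)" entry. It assumes each "+pp" figure is a plain difference against the same base model on the same metric; the function names are illustrative and not part of A-Evolve.

```python
# Evolved score (%) and gain (percentage points), copied from the results above.
RESULTS = {
    "MCP-Atlas": (79.4, 3.4),
    "SWE-bench Verified": (76.8, 2.6),
    "Terminal-Bench 2.0": (76.5, 13.0),
    "SkillsBench": (34.9, 15.2),
    "ARC-AGI": (12.3, 2.2),
    "OSWorld": (69.6, 3.9),
}


def implied_baseline(evolved: float, gain_pp: float) -> float:
    """Baseline score implied by an evolved score and its percentage-point gain."""
    return round(evolved - gain_pp, 1)


def relative_gain(evolved: float, gain_pp: float) -> float:
    """Improvement relative to the implied baseline, in percent."""
    baseline = evolved - gain_pp
    return round(100 * gain_pp / baseline, 1)


for name, (evolved, gain) in RESULTS.items():
    print(
        f"{name}: {implied_baseline(evolved, gain)}% -> {evolved}% "
        f"(+{gain}pp, {relative_gain(evolved, gain)}% relative)"
    )
```

For example, +13.0pp on Terminal-Bench 2.0 implies a baseline of 63.5% and a relative improvement of about 20.5%.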

Key finding: evolver capability decouples from harness quality. A 9B model (Qwen3.5) writes harness updates as good as those from Claude Opus 4.6 (best-vs-worst evolver gap ≤ 3.1pp). The benefit is non-monotonic: mid-tier agents gain most, while weak agents fail to even load the harness. Implication: put your capability budget on the agent, not the evolver.

Evolver capability barely matters: a 9B model matches Opus 4.6

📄 Evolver-Solver-Bench: Harness Updating Is Not Harness Benefit. arXiv 2605.30621 · HF Daily

📄 Evo-Harness: Context-to-Harness Skill Compilation (online evolution: feedback grounding, abstraction level, solver-evolver alignment). Releasing soon.

↳ Adaptive: sustaining agents on long-running streams

Naive self-evolving agents peak early and then decline: a single dense harness overfits to early evidence. Adaptive Auto-Harness fixes this with a stateful multi-agent evolver, a harness tree with solve-time routing, and scoped human-steering hooks, leading every reported metric against five auto-harness baselines plus the human-designed OctoTools:

| Stream | Domain | A-Evolve-Adaptive | Next best |
|---|---|---|---|
| PolyBench | Prediction markets | 80.9% Accuracy | 50.8% |
| CTF-Dojo | Security competitions | 50.2% Pass | 45.2% |
| FutureX | Event forecasting | 49.5% Pass | 47.5% |

Self-evolving agents peak early then decline; Adaptive sustains the gains

📄 Adaptive Auto-Harness: Sustained Self-Improvement on Open-Ended Task Streams. Releasing soon.
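The Adaptive paper is not yet released, so the following is only a guess at what "a harness tree with solve-time routing" could look like in its simplest form. It is not the authors' implementation. `HarnessNode`, `route`, and the keyword predicates are all hypothetical. The idea sketched here is that specialised harnesses live in child nodes, and each incoming task is routed to the deepest node whose predicates accept it, instead of every task sharing one dense harness.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


def _always(task: str) -> bool:
    """Default predicate: the node accepts every task."""
    return True


@dataclass
class HarnessNode:
    """One harness variant plus the predicate that decides when it applies."""
    name: str
    prompt: str
    matches: Callable[[str], bool] = _always
    children: list[HarnessNode] = field(default_factory=list)


def route(node: HarnessNode, task: str) -> HarnessNode:
    """Descend into the first matching child; stop at the deepest match."""
    for child in node.children:
        if child.matches(task):
            return route(child, task)
    return node


# Toy tree with keyword predicates (assumption: real routing would be learned or model-driven).
root = HarnessNode(
    "general",
    "Solve the task step by step.",
    children=[
        HarnessNode(
            "security",
            "Enumerate the attack surface first.",
            matches=lambda t: "ctf" in t.lower(),
            children=[
                HarnessNode(
                    "web-exploit",
                    "Inspect requests and responses before exploiting.",
                    matches=lambda t: "http" in t.lower(),
                )
            ],
        ),
        HarnessNode(
            "forecasting",
            "State base rates before predicting.",
            matches=lambda t: "forecast" in t.lower(),
        ),
    ],
)

for task in ["CTF: exploit the HTTP login form", "Forecast the election outcome", "Summarize this paper"]:
    print(task, "->", route(root, task).name)
```

Tasks that match no child fall back to the parent harness, so evidence from one domain only updates the branch that handled it.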


🧪 AI-Training: replacing human post-training

The same loop, carried all the way into model weights: an evolver autonomously runs end-to-end 30B post-training (designing data mixtures, training schedules, hyperparameter regimes, and ablation protocols), reaching parity with a human post-training team. To our knowledge, this is the first time an autonomous system has done so at this scale.

Tech report in preparation; full results and methodology on release.


🧭 AI-Pretraining: the open frontier

The largest and most expensive stage of building AI, and the one we have not automated yet. It is where this thesis goes next.


⚙️ One Shared Stack: A-Evolve

Every result above was developed on A-Evolve, our open-source infrastructure for self-improving agents: "the PyTorch for Agentic AI." It evolves any agent, in any domain, with any evolution algorithm, and is what makes fast iteration across all three programs possible.

import agent_evolve as ae

evolver = ae.Evolver(agent="./my_agent", benchmark="swe-verified")
results = evolver.run(cycles=10)        # SOTA agent. 3 lines. 0 hours of manual harness engineering.
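
For intuition about what one evolution cycle involves, here is a minimal, framework-independent sketch of a propose-evaluate-keep loop. It does not use or mirror the `agent_evolve` API. `evolve`, `make_toy_propose`, and `toy_evaluate` are hypothetical names, and the greedy acceptance rule is an assumption, not one of A-Evolve's reference algorithms. In a real setting, `propose` would be an evolver model editing prompts, skills, memory, or tools, and `evaluate` would be a benchmark run.

```python
import random
from typing import Callable

Harness = dict[str, str]  # hypothetical representation: named text components

HINTS = ("Run the tests.", "Read the error log.", "Check the file paths.")


def evolve(
    harness: Harness,
    propose: Callable[[Harness], Harness],
    evaluate: Callable[[Harness], float],
    cycles: int = 10,
) -> tuple[Harness, list[float]]:
    """Greedy loop: keep a proposed harness only if it scores strictly higher."""
    best, best_score = harness, evaluate(harness)
    history = [best_score]
    for _ in range(cycles):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


def toy_evaluate(harness: Harness) -> float:
    """Stand-in for a benchmark: fraction of useful hints present in the prompt."""
    return sum(hint in harness["prompt"] for hint in HINTS) / len(HINTS)


def make_toy_propose(seed: int = 0) -> Callable[[Harness], Harness]:
    """Stand-in for an evolver: appends one randomly chosen hint to the prompt."""
    rng = random.Random(seed)

    def propose(harness: Harness) -> Harness:
        candidate = dict(harness)
        candidate["prompt"] = harness["prompt"] + " " + rng.choice(HINTS)
        return candidate

    return propose


best, history = evolve(
    {"prompt": "You are a coding agent."}, make_toy_propose(), toy_evaluate, cycles=10
)
print(history)
```

Because a candidate is accepted only when it improves the score, the recorded history never decreases. That is exactly the property a single greedy harness can overfit on, which is the failure mode the Adaptive work targets.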

Adopted & integrated by: OpenRLHF · DeepSpeed · SGLang · GEPA · AutoResearch

⭐ Star the repo → github.com/A-EVO-Lab/a-evolve

A-Evolve framework


📫 Contact

Building in this direction, or want to collaborate? Reach out: X / Twitter · LinkedIn.


📢 News

  • 05/30 New Paper: Harness Updating Is Not Harness Benefit (arXiv 2605.30621). 7 evolver models × 6 solver agents × 3 benchmarks: counterintuitive answers on who produces good harness updates and who benefits.
  • 05/04 New Benchmark Results: A-Evolve results on ARC-AGI-3, evolving a multi-agent system from 10% to 12%.
  • 04/20 New Algorithm: GEPA, submitted by the GEPA team.
  • 04/10 Integration: A-Evolve integrated into the Orch-Research Skills Library, alongside AutoResearch, OpenRLHF, DeepSpeed, SGLang.
  • 04/07 New Agent: transplanted our Terminal-Bench 2.0 harness onto ClawCode, 67.8% → 72.9% (+5.1pp).
  • 04/03 New Algorithm: Meta-Harness.
  • 03/25 🚀 Open-sourced A-Evolve + 4 reference algorithms achieving SOTA (#1, ~#5, ~#7, #2) on MCP-Atlas, SWE-bench Verified, Terminal-Bench 2.0, SkillsBench.
  • 02/17 📄 Position paper: Agentic Evolution is the Path to Evolving LLMs (arXiv 2602.00359).

We are evolving fast. Support our research by leaving a ⭐ on A-Evolve.

LinkedIn | Twitter/X
