Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

📣 What's New

[2026.05.26] Our NP reasoning task has been integrated into verl. Get started quickly with NP Task!
[2026.5.25] We release our training data in OliverLee/NP. 🎉🎉🎉
[2026.5.8] We updated NP-Engine by training it from a reasoning model and incorporating more RL algorithms. Check it out at 📃 arXiv: Forge!
[2026.4.6] Forge (NP-Engine) has been accepted to ACL 2026! See you in San Diego! 🎉🎉🎉
[2025.10.14] We have released NP-Bench in OliverLeeXZ/NP-Engine. 🎉🎉🎉
[2025.10.13] We have released the model checkpoint in OliverLee/Qwen2.5-7B-NP. 🎉🎉🎉
[2025.6.10] Our NP-Engine paper is released! Check it out at 📃 arXiv: NP-Engine! Our dataset will be open-sourced soon! 🎉🎉🎉

👨‍💻 Todo List

✅ Paper Release
✅ Checkpoint Release
✅ NP-Bench Release
✅ RLVR Training Code Release
✅ Training Data Release

🌟 Highlights

NP-ENGINE evaluates and trains reasoning models on 10 NP-hard optimization tasks across five categories, including subset selection and path planning. Its automated pipeline—composed of a Data Generator, Solution Validator, and Heuristic Solver—enables controllable data synthesis, rigorous evaluation, and scalable training. A case study on Hamiltonian Circuit demonstrates the model’s ability to find optimal solutions, while OOD evaluations show improved general reasoning capabilities beyond the training tasks.

We introduce NP-ENGINE, a scalable framework that generates a near-unlimited supply of NP-hard problems at graded difficulty levels within the RLVR (reinforcement learning with verifiable rewards) paradigm, strengthening LLMs' optimization reasoning abilities. NP-ENGINE enables Qwen2.5-7B-Instruct to significantly outperform GPT-4o on optimization reasoning tasks using only 5K training examples.
We propose NP-BENCH, a benchmark consisting of 10 NP-level tasks spanning five categories: Graph Clustering, Resource Scheduling, Graph Partitioning, Subset Selection, and Path Planning. NP-BENCH provides instances with varying difficulty levels and evaluates both the feasibility and quality of solutions.
Through extensive experiments, we demonstrate that training on NP-ENGINE-DATA enables QWEN2.5-7B-NP to generalize to both reasoning and non-reasoning OOD tasks. We also observe a positive correlation between task diversity and cross-task generalization performance, offering new insights into the scaling behavior of RLVR-based training.
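The three pipeline components named above (Data Generator, Solution Validator, Heuristic Solver) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: 0/1 knapsack is only a familiar stand-in for a subset-selection task and is not claimed to be one of the ten tasks, and the names `generate_instance`, `is_feasible`, `heuristic_solve`, and `quality_aware_score` are hypothetical, not the repository's API. The score combines feasibility with quality relative to a heuristic reference, which is one plausible reading of "evaluates both the feasibility and quality of solutions".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class KnapsackInstance:
    weights: tuple[int, ...]
    values: tuple[int, ...]
    capacity: int


def generate_instance(n_items: int, seed: int) -> KnapsackInstance:
    """Data generator (hypothetical): n_items is the difficulty knob, seed gives reproducibility."""
    rng = random.Random(seed)
    weights = tuple(rng.randint(1, 20) for _ in range(n_items))
    values = tuple(rng.randint(1, 50) for _ in range(n_items))
    capacity = max(1, sum(weights) // 2)
    return KnapsackInstance(weights, values, capacity)


def is_feasible(inst: KnapsackInstance, chosen: list[int]) -> bool:
    """Solution validator: indices must be distinct, in range, and fit the capacity."""
    if len(set(chosen)) != len(chosen):
        return False
    if any(i < 0 or i >= len(inst.weights) for i in chosen):
        return False
    return sum(inst.weights[i] for i in chosen) <= inst.capacity


def heuristic_solve(inst: KnapsackInstance) -> list[int]:
    """Heuristic solver: greedy by value/weight ratio, used as a reference solution."""
    order = sorted(
        range(len(inst.weights)),
        key=lambda i: inst.values[i] / inst.weights[i],
        reverse=True,
    )
    chosen, load = [], 0
    for i in order:
        if load + inst.weights[i] <= inst.capacity:
            chosen.append(i)
            load += inst.weights[i]
    return chosen


def quality_aware_score(inst: KnapsackInstance, chosen: list[int]) -> float:
    """Assumed scoring rule: 0.0 if infeasible, else value relative to the reference, capped at 1.0."""
    if not is_feasible(inst, chosen):
        return 0.0
    reference = sum(inst.values[i] for i in heuristic_solve(inst))
    achieved = sum(inst.values[i] for i in chosen)
    return min(1.0, achieved / reference) if reference > 0 else 1.0


inst = generate_instance(n_items=8, seed=0)
print(quality_aware_score(inst, heuristic_solve(inst)))  # reference solution scores 1.0
print(quality_aware_score(inst, [0, 0]))                 # duplicate index is infeasible: 0.0
```

In a training loop of this shape, the score could serve as the verifiable reward and the same validator could be reused for benchmark evaluation; the actual reward design used by the authors is described in the paper, not here.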

📚 Dataset Statistics

🏆 NP-Bench Leaderboard

Performance of reasoning LLMs, general LLMs, and our trained LLMs on NP-Bench.

Model	Graph SR	Graph AR	Schedule SR	Schedule AR	Partition SR	Partition AR	Selection SR	Selection AR	Planning SR	Planning AR	Overall SR	Overall AR
Proprietary LLMs
DS-V3.1-Thinking	86.0	78.2	99.0	91.4	98.0	77.1	99.3	98.5	61.9	54.3	88.8	79.9
gpt-o3	97.0	86.4	99.0	94.5	100.0	51.4	87.4	87.3	74.0	65.1	91.5	76.9
Qwen3-235B-Thinking	66.7	62.9	95.0	93.0	100.0	55.8	98.0	97.1	52.0	44.8	82.3	70.7
gpt-4o-2024-08-06	64.7	29.3	79.0	59.8	100.0	53.0	14.7	9.3	52.0	29.6	62.1	36.2
Open-Source LLMs
Qwen3-32B	44.7	39.3	94.0	93.9	99.0	52.6	94.1	91.4	21.6	11.2	70.7	57.6
Qwen3-8B	22.7	16.8	78.0	75.3	98.0	51.0	86.0	82.6	3.0	1.2	57.5	45.4
DS-R1-Qwen-32B	23.3	18.1	49.0	45.1	96.0	48.6	85.7	79.7	15.4	7.9	53.9	39.9
DS-R1-Qwen-14B	18.0	13.4	52.0	51.7	32.0	16.2	67.3	63.4	4.5	1.4	34.8	29.2
Qwen2.5-72B	34.7	15.2	59.0	58.5	90.0	39.5	27.0	17.4	6.5	2.1	43.4	26.5
Qwen2.5-32B	35.3	15.2	15.0	12.8	100.0	51.7	32.0	22.4	23.5	6.7	41.2	21.8
Qwen2.5-14B	30.0	11.5	21.0	15.8	89.0	44.3	23.3	12.7	17.5	4.9	36.2	17.8
InternLM3-8b	15.0	3.6	20.0	9.5	86.0	43.3	41.3	23.7	16.0	4.1	35.7	16.8
LLama3.1-8B	23.0	8.0	9.0	7.8	0.0	0.0	28.7	11.4	15.0	1.8	15.1	5.8
Qwen2.5-3B	7.7	2.9	17.0	5.0	6.0	2.2	23.0	10.7	15.5	3.7	13.8	4.9
DS-R1-Qwen-7B	6.3	1.9	1.0	0.9	2.0	1.0	13.7	8.9	0.5	0.1	4.7	2.5
Qwen2.5-7B	11.0	3.1	40.0	19.8	67.0	34.0	26.7	15.2	3.5	1.0	29.6	14.6
Qwen2.5-7B-NP (ours)	89.7	27.8	85.0	43.5	99.0	53.8	93.7	79.1	98.2	28.9	93.1	46.6
Delta (vs. Qwen2.5-7B)	+78.7	+24.7	+45.0	+23.7	+32.0	+19.8	+67.0	+63.9	+94.7	+27.9	+63.5	+32.0
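As a reading aid, the short script below recomputes the Overall columns and the Delta row from the per-category numbers in the table. It assumes two things that the numbers support but the table does not state: that Overall is the unweighted mean of the five category scores, and that Delta is the trained-model row (the row directly above Delta) minus the Qwen2.5-7B row. The variable and function names are illustrative only.

```python
CATEGORIES = ["Graph", "Schedule", "Partition", "Selection", "Planning"]

# (SR, AR) per category, copied from the leaderboard table.
base = {  # Qwen2.5-7B row
    "Graph": (11.0, 3.1), "Schedule": (40.0, 19.8), "Partition": (67.0, 34.0),
    "Selection": (26.7, 15.2), "Planning": (3.5, 1.0),
}
trained = {  # row directly above Delta
    "Graph": (89.7, 27.8), "Schedule": (85.0, 43.5), "Partition": (99.0, 53.8),
    "Selection": (93.7, 79.1), "Planning": (98.2, 28.9),
}


def overall(row):
    """Unweighted mean over the five categories, rounded to one decimal (assumed aggregation)."""
    sr = sum(row[c][0] for c in CATEGORIES) / len(CATEGORIES)
    ar = sum(row[c][1] for c in CATEGORIES) / len(CATEGORIES)
    return round(sr, 1), round(ar, 1)


def delta(after, before):
    """Per-category (SR, AR) improvement of `after` over `before`."""
    return {
        c: (round(after[c][0] - before[c][0], 1), round(after[c][1] - before[c][1], 1))
        for c in CATEGORIES
    }


print(overall(base))                      # (29.6, 14.6)
print(overall(trained))                   # (93.1, 46.6)
print(delta(trained, base)["Planning"])   # (94.7, 27.9)
```

The recomputed values match the Overall and Delta entries in the table, which suggests each category carries equal weight in the aggregate regardless of how many tasks it contains.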

Setup

🖊️ Citation

If you find this work helpful, please consider starring 🌟 this repo and citing this paper. Thanks for your support!

@misc{li2026forgequalityawarereinforcementlearning,
      title={Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs}, 
      author={Xiaozhe Li and Xinyu Fang and Shengyuan Ding and Yang Li and Linyang Li and Haodong Duan and Qingwen Liu and Kai Chen},
      year={2026},
      eprint={2605.08905},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.08905}, 
}

