Add ArxivRollBench tasks by liangzid · Pull Request #1245 · huggingface/lighteval

liangzid · 2026-05-24T05:34:54Z

Summary

Adds built-in ArxivRollBench task configs for the 2024b, 2025a, and 2026a benchmark releases.
Uses the compact 50-sample split as the default task family and exposes full-split variants with the _full suffix.
Covers eight arXiv domains: cs, q_fin, math, physics, stat, q_bio, econ, and eess.
Covers the three SCP task types: sequencing (s), cloze (c), and prediction (p).
Adds a small custom generative exact-match metric that extracts the chosen option from the model output: Selection 1 through Selection 4 for sequencing and cloze tasks, and A through D for prediction tasks.
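
As a rough illustration of the metric described in the last bullet, here is a minimal sketch of label extraction followed by exact match. The helper names (extract_choice, exact_match) and the regular expressions are assumptions made for this example, not the code added in the PR; in particular, taking the first match in the output is a simplification.

```python
import re

# Hypothetical patterns; the PR's actual extraction rules may differ.
_SELECTION_RE = re.compile(r"Selection\s*([1-4])", re.IGNORECASE)
_LETTER_RE = re.compile(r"\b([A-D])\b")


def extract_choice(text, task_type):
    """Return a normalized label from free-form text, or None if none is found.

    task_type is "s" (sequencing), "c" (cloze), or "p" (prediction).
    Only the first match is used, which is a simplifying assumption.
    """
    if task_type in ("s", "c"):
        match = _SELECTION_RE.search(text)
        return f"Selection {match.group(1)}" if match else None
    match = _LETTER_RE.search(text)
    return match.group(1) if match else None


def exact_match(prediction, gold, task_type):
    """Score 1.0 when the extracted prediction equals the extracted gold label."""
    pred_label = extract_choice(prediction, task_type)
    gold_label = extract_choice(gold, task_type)
    if pred_label is None or gold_label is None:
        return 0.0
    return 1.0 if pred_label == gold_label else 0.0


if __name__ == "__main__":
    print(exact_match("Selection 1", "Selection 1", "s"))
    print(exact_match("Selection 2", "Selection 1", "s"))
```

A standalone capital letter earlier in the output (for example the article "A") would be picked up first by this sketch, so a real metric would likely need a stricter answer pattern.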

What is ArxivRollBench?

ArxivRollBench is a dynamic private benchmark generated by ArxivRoll, a one-time-pad-inspired framework for constructing fresh LLM evaluation items from recent arXiv papers. The benchmark is designed to study benchmark overestimation: the gap between model performance on public benchmarks and performance on newly generated private items that are less likely to have appeared in training data or benchmark-aware tuning.

The benchmark periodically rolls forward with new arXiv papers and evaluates models through three SCP construction strategies:

sequencing: recover the correct order of shuffled text fragments
cloze: select the correct ordering around masked or removed sentences
prediction: choose the most plausible next sentence from candidate continuations

Project links:

Paper: https://doi.org/10.1609/aaai.v40i44.41098
Leaderboard: https://arxivroll.moreoverai.com
Project code: https://github.com/liangzid/ArxivRoll/

Task names

Default compact 50-sample tasks:

aggregate alias: arxivrollbench
release alias: arxivrollbench2026a
leaf task: arxivrollbench2026a:cs:s

Full-split tasks:

aggregate alias: arxivrollbench_full
release alias: arxivrollbench2026a_full
leaf task: arxivrollbench2026a_full:cs:s

The implementation also exposes arxivrollbench:<release>:<domain>:<task_type> and arxivrollbench_full:<release>:<domain>:<task_type> aliases so users can run all releases through LightEval superset expansion.
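
The naming scheme above can be enumerated with a short sketch. The constants and function names below are hypothetical, and the form of the <release> segment in the superset aliases (for example 2026a) is an assumption. Under those assumptions the four families add up to 288 names, which agrees with the registry check in the Validation section, although the PR does not spell out how that total is composed.

```python
from itertools import product

# Values taken from the PR summary; the constant names are hypothetical.
RELEASES = ("2024b", "2025a", "2026a")
DOMAINS = ("cs", "q_fin", "math", "physics", "stat", "q_bio", "econ", "eess")
TASK_TYPES = ("s", "c", "p")


def leaf_task_names(full=False):
    """Names like arxivrollbench2026a:cs:s or arxivrollbench2026a_full:cs:s."""
    suffix = "_full" if full else ""
    return [
        f"arxivrollbench{release}{suffix}:{domain}:{task_type}"
        for release, domain, task_type in product(RELEASES, DOMAINS, TASK_TYPES)
    ]


def superset_alias_names(full=False):
    """Names like arxivrollbench:<release>:<domain>:<task_type>.

    Assumes the <release> segment is written as "2026a"; the PR text does
    not show a concrete example of this alias form.
    """
    prefix = "arxivrollbench_full" if full else "arxivrollbench"
    return [
        f"{prefix}:{release}:{domain}:{task_type}"
        for release, domain, task_type in product(RELEASES, DOMAINS, TASK_TYPES)
    ]


def all_names():
    """All four families: compact and full, leaf tasks and superset aliases."""
    return (
        leaf_task_names(full=False)
        + leaf_task_names(full=True)
        + superset_alias_names(full=False)
        + superset_alias_names(full=True)
    )


if __name__ == "__main__":
    print(len(leaf_task_names()), len(all_names()))
```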

Validation

python -m compileall -q src/lighteval/tasks/tasks/arxivrollbench.py
python -m ruff check src/lighteval/tasks/tasks/arxivrollbench.py
python -m ruff format --check src/lighteval/tasks/tasks/arxivrollbench.py
git diff --check
Registry check: loaded 288 ArxivRollBench task aliases.
Split check: arxivrollbench2026a:cs:s maps to liangzid/robench2026a_test_all_category_setcsSCP-s-50; arxivrollbench2026a_full:cs:s maps to liangzid/robench2026a_test_all_category_setcsSCP-s.
Metric unit check: a prediction of Selection 1 scores 1.0 against a gold answer of Selection 1; a prediction of Selection 2 scores 0.0 against the same gold answer.
Smoke evaluation: python -m lighteval accelerate "model_name=hf-internal-testing/tiny-random-gpt2,dtype=float32,device=cpu" "arxivrollbench2026a:cs:s|0" --max-samples 2 --output-dir /tmp/arxivrollbench_lighteval_smoke completed successfully.
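
The split check above suggests a simple naming pattern for the dataset repositories. The sketch below infers that pattern from the two examples given for the cs domain only. The function name is hypothetical, and other domains or releases may be spelled differently on the Hub, so treat this as an illustration rather than the mapping used in the PR.

```python
def dataset_repo(release, domain, task_type, full=False):
    """Build a dataset repository id following the pattern in the split check.

    Inferred from two examples only (release 2026a, domain cs, task type s);
    the compact variant appends "-50" and the full variant does not.
    """
    base = f"liangzid/robench{release}_test_all_category_set{domain}SCP-{task_type}"
    return base if full else f"{base}-50"


if __name__ == "__main__":
    print(dataset_repo("2026a", "cs", "s"))
    print(dataset_repo("2026a", "cs", "s", full=True))
```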

Commit: Add ArxivRollBench tasks (91530dc)

liangzid marked this pull request as ready for review May 24, 2026 05:37

liangzid wants to merge 1 commit into huggingface:main from liangzid:add-arxivrollbench-tasks