
Add ArxivRollBench tasks #1245

Open
liangzid wants to merge 1 commit into huggingface:main from liangzid:add-arxivrollbench-tasks

@liangzid

Summary

  • Adds built-in ArxivRollBench task configs for the 2024b, 2025a, and 2026a benchmark releases.
  • Uses the compact 50-sample split as the default task family and exposes full-split variants with the _full suffix.
  • Covers eight arXiv domains: cs, q_fin, math, physics, stat, q_bio, econ, and eess.
  • Covers the three SCP task types: sequencing (s), cloze (c), and prediction (p).
  • Adds a small custom generative exact-match metric that extracts Selection 1 through Selection 4 for sequencing/cloze tasks and A through D for prediction tasks.
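
The extraction step in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the metric code from this PR: the function names `extract_answer` and `exact_match` and the regular expressions are assumptions chosen for the example.

```python
import re
from typing import Optional

# Hypothetical patterns; the PR's actual extraction rules are not shown here.
SELECTION_RE = re.compile(r"Selection\s*([1-4])\b", re.IGNORECASE)
LETTER_RE = re.compile(r"\b([A-D])\b")  # naive: also matches the article "A"


def extract_answer(text: str, task_type: str) -> Optional[str]:
    """Return the first answer label found in `text`, or None.

    task_type "s"/"c" (sequencing/cloze) expects "Selection 1".."Selection 4";
    any other task_type (here "p", prediction) expects a letter A..D.
    """
    if task_type in ("s", "c"):
        match = SELECTION_RE.search(text)
        return f"Selection {match.group(1)}" if match else None
    match = LETTER_RE.search(text)
    return match.group(1) if match else None


def exact_match(prediction: str, gold: str, task_type: str) -> float:
    """Score 1.0 when the extracted labels agree, otherwise 0.0."""
    pred_label = extract_answer(prediction, task_type)
    gold_label = extract_answer(gold, task_type)
    return 1.0 if pred_label is not None and pred_label == gold_label else 0.0
```

Taking the first match keeps the sketch short; a real metric would likely need stricter handling of free-form generations that mention several labels.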

What is ArxivRollBench?

ArxivRollBench is a dynamic private benchmark generated by ArxivRoll, a one-time-pad-inspired framework for constructing fresh LLM evaluation items from recent arXiv papers. The benchmark is designed to study benchmark overestimation: the gap between model performance on public benchmarks and performance on newly generated private items that are less likely to have appeared in training data or benchmark-aware tuning.

The benchmark periodically rolls forward with new arXiv papers and evaluates models through three SCP construction strategies:

  • sequencing: recover the correct order of shuffled text fragments
  • cloze: select the correct ordering around masked or removed sentences
  • prediction: choose the most plausible next sentence from candidate continuations

Project links:

Task names

Default compact 50-sample tasks:

  • aggregate alias: arxivrollbench
  • release alias: arxivrollbench2026a
  • leaf task: arxivrollbench2026a:cs:s

Full-split tasks:

  • aggregate alias: arxivrollbench_full
  • release alias: arxivrollbench2026a_full
  • leaf task: arxivrollbench2026a_full:cs:s

The implementation also exposes arxivrollbench:<release>:<domain>:<task_type> and arxivrollbench_full:<release>:<domain>:<task_type> aliases so users can run all releases through LightEval superset expansion.
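
As a rough cross-check of the naming scheme, the sketch below enumerates the leaf names and the `arxivrollbench:<release>:...` style aliases from the releases, domains, and task types listed in the Summary. The helper `task_names` is hypothetical and is not the registration code from this PR; that the 288 aliases reported under Validation are exactly these names is an assumption, although the counts agree.

```python
from itertools import product

# Values taken from the PR summary.
RELEASES = ("2024b", "2025a", "2026a")
DOMAINS = ("cs", "q_fin", "math", "physics", "stat", "q_bio", "econ", "eess")
TASK_TYPES = ("s", "c", "p")


def task_names() -> list:
    """Enumerate leaf tasks and release-qualified aliases (hypothetical helper)."""
    names = []
    for suffix in ("", "_full"):  # compact 50-sample split, then full split
        for release, domain, task_type in product(RELEASES, DOMAINS, TASK_TYPES):
            # Leaf task, e.g. arxivrollbench2026a:cs:s
            names.append(f"arxivrollbench{release}{suffix}:{domain}:{task_type}")
            # Alias, e.g. arxivrollbench:2026a:cs:s
            names.append(f"arxivrollbench{suffix}:{release}:{domain}:{task_type}")
    return names


names = task_names()
print(len(names))  # 3 releases * 8 domains * 3 types * 2 splits * 2 forms
```

The product works out to 3 × 8 × 3 = 72 leaf tasks per split, 144 across both splits, and 288 once the alias form is included.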

Validation

  • python -m compileall -q src/lighteval/tasks/tasks/arxivrollbench.py
  • python -m ruff check src/lighteval/tasks/tasks/arxivrollbench.py
  • python -m ruff format --check src/lighteval/tasks/tasks/arxivrollbench.py
  • git diff --check
  • Registry check: loaded 288 ArxivRollBench task aliases.
  • Split check: arxivrollbench2026a:cs:s maps to liangzid/robench2026a_test_all_category_setcsSCP-s-50; arxivrollbench2026a_full:cs:s maps to liangzid/robench2026a_test_all_category_setcsSCP-s.
  • Metric unit check: Selection 1 scores 1.0 on a gold Selection 1; Selection 2 scores 0.0.
  • Smoke evaluation: python -m lighteval accelerate "model_name=hf-internal-testing/tiny-random-gpt2,dtype=float32,device=cpu" "arxivrollbench2026a:cs:s|0" --max-samples 2 --output-dir /tmp/arxivrollbench_lighteval_smoke completed successfully.
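
The split check above suggests a simple mapping from leaf task name to dataset repository. The sketch below reproduces only the two mappings quoted in that bullet; the helper `repo_for_task` is hypothetical, and the assumption that other domains and releases follow the same pattern (for example, how `q_fin` is spelled in the repository name) is not confirmed by this description.

```python
PREFIX = "arxivrollbench"
FULL_SUFFIX = "_full"


def repo_for_task(task_name: str) -> str:
    """Map a leaf task name to a dataset repo id (pattern inferred, unverified)."""
    head, domain, task_type = task_name.split(":")
    if not head.startswith(PREFIX):
        raise ValueError(f"not an ArxivRollBench leaf task: {task_name!r}")
    rest = head[len(PREFIX):]
    full = rest.endswith(FULL_SUFFIX)
    release = rest[: -len(FULL_SUFFIX)] if full else rest
    # Compact tasks point at the "-50" repo; full-split tasks drop the suffix.
    size_suffix = "" if full else "-50"
    return (
        f"liangzid/robench{release}_test_all_category_set"
        f"{domain}SCP-{task_type}{size_suffix}"
    )


print(repo_for_task("arxivrollbench2026a:cs:s"))
print(repo_for_task("arxivrollbench2026a_full:cs:s"))
```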

@liangzid marked this pull request as ready for review on May 24, 2026, 05:37