ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children's Storybooks
This repository contains all data, code, and supplementary materials for the paper:
ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children's Storybooks Marcel Mroczek, Chiara Albarello, Paul-Emmanuel Floch, Maciej Gawinecki Proceedings of the 5th Workshop on Generation, Evaluation and Metrics (GEM 2026), July 2026, San Diego, CA, USA
This paper presents a third independent replication of the human evaluation originally conducted by Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children's storybooks. The QAG system was evaluated against a baseline and human-authored ground truth across three criteria — Readability, Question Relevance, and Answer Relevance — using five Natural Language Processing (NLP)-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and the Ground Truth (GT) ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. We additionally document several methodological concerns, including data quality issues and evaluation design limitations identified during our pilot study, some of which were unreported in prior replications.
This work is a submission to the ReproNLP 2026 shared task, part of the ongoing ReproHum research programme.
.
├── annotations/ # Raw annotation data from all four studies
├── images/ # Figures generated for the paper and poster
├── qra/ # Input/output files for the QRA3 reproducibility tool
├── datacard.json # Human Evaluation Datasheet (HEDS) for this study
├── replication_study.ipynb # Main analysis notebook
└── requirements.txt # Python dependencies
The annotations/ folder contains item-level human annotations from all four studies analysed in this paper. Each CSV file provides one row per annotated question-answer (QA) pair, with columns for labeller_id, story_name, qa_id, section_id, System (one of QAG, PAQ, GT), and the three quality criteria scores.
| File | Description |
|---|---|
mroczek2026.csv |
Annotations collected for this replication study |
yao2022.csv |
Annotations from the original study by Yao et al. (2022) |
florescu2024.csv |
Annotations from the first replication by Florescu et al. (2024) |
braun2025.csv |
Annotations from the second replication by Braun (2025) |
Note: The notebook downloads the
yao2022,florescu2024, andbraun2025annotation files automatically from their original public repositories on demand if they are not already present locally.
The evaluation was conducted on a subset of the FairytaleQA dataset (Xu et al., 2022), which comprises approximately 10,000 expert-annotated QA pairs from 278 children's storybooks spanning kindergarten to eighth-grade reading level.
replication_study.ipynb is a Jupyter Notebook that reproduces all tables and in-text statistics reported in the paper, as well as additional visualisations for the conference poster.
| Section | Content |
|---|---|
| Table 1 | Mean scores and standard deviations across all studies |
| Table 2 | Inter-annotator agreement (Krippendorff's α) |
| Table 3 | Pairwise system comparisons (this study) |
| Table 4 | QAG vs. PAQ t-test comparison across all studies |
| Table 5 | Pairwise CV* reproducibility across studies |
| Tables 6 & 7 | QRA3 input files (consumed by the external QRA3 tool) |
§Discussion: val impact |
Pairwise comparisons on val-placeholder-filtered data |
| §Discussion: Item-level correlation | Pearson/Spearman/Kendall correlations between the original study and this replication |
| Poster | Additional visualisations (QAG output examples, IAA collapse heatmap, full vs. cleaned data comparison) |
Requires Python 3.12. Install all dependencies via:
pip install -r requirements.txtThen launch the notebook:
jupyter notebook replication_study.ipynbWe report reproducibility using the QRA3 framework (Belz & Thomson, 2026), which provides quantitative reproducibility assessments at three levels of granularity using measures comparable across studies.
The notebook computes aggregated system-level means and saves them as QRA3-format input files in the qra/ folder. These are then consumed by the external QRA3 command-line tool.
For n = 3 studies (all three criteria — Readability, Question Relevance, Answer Relevance):
qra \
--input "qra/metrics-3-studies.csv" \
--config "qra/config-3-studies.json" \
--krippendorff-alpha-mode ordinal \
--out-dir "qra/output-3-studies" \
--forceFor n = 4 studies (Readability only, including Braun, 2025 who evaluated only this criterion):
qra \
--input "qra/metrics-4-studies.csv" \
--config "qra/config-4-studies.json" \
--krippendorff-alpha-mode ordinal \
--out-dir "qra/output-4-studies" \
--forcedatacard.json is the Human Evaluation Datasheet (HEDS) (Belz & Thomson, 2022) for this study. A separate HEDS record is also available at: https://github.com/nlp-heds/repronlp2026
The images/ directory contains all figures generated by the notebook for use in the paper and poster, including:
- Inter-annotator agreement (IAA) heatmaps
- Answer Relevance comparison: full data vs.
val-artifact-cleaned data
If you use the data, code, or findings from this paper, please cite:
@inproceedings{mroczek-etal-2026-repronlp,
author = {Mroczek, Marcel and Albarello, Chiara and Floch, Paul-Emmanuel and Gawinecki, Maciej},
title = {ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children's Storybooks},
booktitle = {Proceedings of the 5th Workshop on Generation, Evaluation and Metrics (GEM 2026)},
month = {July},
year = {2026},
address = {San Diego, CA, USA},
publisher = {Association for Computational Linguistics}
}Please also consider citing the original study and prior replications that this work builds upon:
- Original study: Yao et al. (2022) — It is AI's Turn to Ask Humans a Question: Question-Answer Pair Generation for Children's Story Books
- First replication: Florescu et al. (2024) — Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
- Second replication: Braun (2025) — ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question”
- FairytaleQA dataset: Xu et al. (2022)
- QRA3 framework: Belz & Thomson (2026)
- ReproNLP shared task series: Belz & Thomson (2023), Belz & Thomson (2024), Belz et al. (2025), Belz et al. (2026)
We thank Katarzyna Basista for her participation in the pilot study, the anonymous annotators who participated in the human evaluation, the ReproNLP shared task organisers (especially Craig Thomson), and the original authors for providing additional information to the ReproHum team.