[Paper Review] Embracing data abundance: BookTest Dataset for Reading Comprehension
This paper introduces BookTest, a reading comprehension dataset more than 60 times larger than the Children's Book Test (CBT), enabling models to be trained on far more data. Training an Attention-Sum Reader on BookTest yields a 14.8% accuracy gain on the CBT test set, a much larger margin than recent model architecture improvements delivered. An ensemble exceeds Facebook's human baseline on the named-entity CBT subset, while the authors' own human study shows there is still room for improvement.
There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article is making a case for the community to move to larger data and as a step in that direction it is proposing the BookTest, a new dataset similar to the popular Children's Book Test (CBT), however more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still space for further improvement.
Motivation & Objective
- To address the underutilization of large-scale data in text comprehension research, despite the availability of vast natural language corpora.
- To propose a new, significantly larger dataset—BookTest—that enables training on abundant data, mimicking real-world data abundance.
- To demonstrate that data scale alone can yield larger performance gains than architectural innovations on smaller datasets.
- To evaluate whether models trained on larger, related data can generalize effectively to standard benchmarks like CBT.
- To investigate the remaining gap between state-of-the-art models and human performance through a targeted human study.
Proposed method
- The BookTest dataset is constructed using a method similar to CBT, generating cloze-style questions from a large corpus of children's books.
- The dataset contains over 14 million examples, making it more than 60 times larger than the original CBT dataset.
- An Attention-Sum Reader model is trained on BookTest data and evaluated on the standard CBT test split.
- The model uses attention mechanisms to attend to relevant parts of the context document when predicting answers.
- An ensemble of models is created to improve generalization and robustness, particularly on challenging examples.
- A human study evaluates 100 previously misclassified CBT questions (50 named entities, 50 common nouns) to assess the remaining performance gap.
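The attention step in the list above can be illustrated with a small sketch. It assumes the per-token attention scores have already been produced by some encoder (not shown here), and that the answer is chosen by summing the attention probability over every position where a candidate word occurs in the context. The function name `attention_sum`, the toy tokens, and the scores are illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict


def attention_sum(tokens, scores, candidates):
    """Return the candidate with the highest summed attention, plus all sums.

    tokens     -- context document as a list of words
    scores     -- one raw attention score per token (assumed given)
    candidates -- candidate answer words for the cloze question
    """
    # Softmax over the per-token scores, shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Aggregate attention mass over every occurrence of each candidate.
    mass = defaultdict(float)
    for tok, p in zip(tokens, probs):
        if tok in candidates:
            mass[tok] += p

    best = max(candidates, key=lambda c: mass[c])
    return best, dict(mass)


# Toy example (hypothetical data): "bob" has the single highest score,
# but "alice" occurs twice, so her summed attention is larger.
tokens = ["alice", "met", "bob", "and", "alice", "smiled"]
scores = [1.0, 0.0, 1.2, 0.0, 1.0, 0.0]
answer, mass = attention_sum(tokens, scores, ["alice", "bob"])
print(answer, mass)
```

Summing over occurrences means a word mentioned several times in the context can win even when no single mention receives the highest attention, which is the behavior the toy example is built to show.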
Experimental results
Research questions
- RQ1: Can training on a dataset 60x larger than CBT lead to significantly larger performance gains than architectural improvements on the original CBT data?
- RQ2: Does training on a larger, related dataset (BookTest) improve generalization to the standard CBT benchmark, despite domain shift?
- RQ3: Can a model trained on BookTest surpass the human performance baseline reported by Facebook on the CBT named-entity subset?
- RQ4: What is the remaining gap between state-of-the-art models and human performance on the CBT dataset?
- RQ5: Are there still examples that humans can answer correctly but current models cannot, indicating room for further improvement?
Key findings
- Training on BookTest improved the Attention-Sum Reader's accuracy on the CBT test set by 14.8%, far exceeding the 2.1% gain achieved by architectural improvements on the original CBT data.
- The ensemble of models trained on BookTest exceeded the human baseline reported by Facebook on the named-entity version of the CBT dataset.
- On the common noun version of CBT, the model achieved over 96% accuracy, indicating strong performance on this subset.
- A human study showed that 66% of named entity questions and 82% of common noun questions previously misclassified by the model were correctly answered by humans, indicating a remaining performance gap.
- A system combining model and human predictions could achieve over 92% accuracy on the named-entity validation and test sets, and over 96% on the common noun sets, suggesting continued potential for improvement.
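The arithmetic behind a combined model-plus-human figure can be sketched as follows. This is a rough illustration, assuming the human success rates on the sampled misclassified questions (66% for named entities, 82% for common nouns) carry over to all of the model's errors. The model accuracy of 0.80 used below is a hypothetical input chosen for the example, not a number reported in this review.

```python
def combined_accuracy(model_acc, human_recovery):
    """Accuracy of a system that is right whenever the model is right,
    or whenever a human correctly answers a question the model missed.

    model_acc      -- fraction of questions the model answers correctly
    human_recovery -- fraction of the model's errors that humans get right
    """
    return model_acc + (1.0 - model_acc) * human_recovery


# Human recovery rates from the human study described above.
NE_RECOVERY = 0.66  # named entities
CN_RECOVERY = 0.82  # common nouns

# Hypothetical model accuracy, for illustration only.
hypothetical_model_acc = 0.80

ne_bound = combined_accuracy(hypothetical_model_acc, NE_RECOVERY)
cn_bound = combined_accuracy(hypothetical_model_acc, CN_RECOVERY)
print(f"NE: {ne_bound:.3f}  CN: {cn_bound:.3f}")
```

Under these assumptions, the same formula shows why a higher human recovery rate on common nouns translates into a higher combined figure for that subset.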
This review was created by AI and reviewed by human editors.