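[Paper Walkthrough] FigureQA: An Annotated Figure Dataset for Visual Reasoning
FigureQA is a synthetic visual reasoning corpus containing over one million question-answer pairs grounded in more than 100,000 figure images, along with bounding boxes and the underlying numerical data for auxiliary tasks. Baseline results show that relational reasoning performs best but still falls short of human-level accuracy.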
We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.
Motivation and Goals
- Create a large-scale, annotated dataset of figure-based visual questions to study reasoning over plotted data.
- Provide ground-truth numerical data and bounding boxes for all figure elements to enable auxiliary supervision.
- Assess baseline neural models, including relational reasoning, on figure-based questions.
- Allow curriculum-like extensions through synthetic generation: new question templates, new data types, and higher task complexity.
Proposed Method
- Generate synthetic figures of five types (vertical/horizontal bar, line, dot-line, pie) from sampled numerical data.
- Create 15 binary-question templates addressing extrema, area-under-curve, smoothness, and element relations.
- Ensure balanced yes/no answers across templates and figures to avoid bias.
- Annotate each figure with bounding boxes for all plot elements and provide underlying data and color metadata.
- Use Bokeh to render figures and output bounding boxes; modify backend to export annotations.
- Evaluate four baselines: a text-only LSTM, a CNN+LSTM with learned visual features, a CNN+LSTM with VGG-16 features, and a Relation Network (RN) for relational reasoning.
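The template-based question generation and answer balancing described above can be illustrated with a minimal sketch. It assumes a figure is represented only by a mapping from element names to values, and it uses two illustrative extremum templates. The function name `make_questions`, the template wording, and the one-yes-one-no balancing strategy are assumptions made for illustration, not the authors' actual generation procedure.

```python
import random


def make_questions(data, rng):
    """Build balanced yes/no questions from a figure's numerical data.

    data: dict mapping element name -> value (assumed distinct, >= 2 entries).
    Returns a list of (question_text, answer) pairs, answer 1 = yes, 0 = no.
    """
    names = list(data)
    top = max(names, key=data.get)
    bottom = min(names, key=data.get)
    questions = []
    # Hypothetical templates; each one yields exactly one "yes" and one "no".
    for template, target in (("Is {} the maximum?", top),
                             ("Is {} the minimum?", bottom)):
        questions.append((template.format(target), 1))
        # Pick a different element so the answer is guaranteed to be "no".
        other = rng.choice([n for n in names if n != target])
        questions.append((template.format(other), 0))
    return questions


rng = random.Random(0)
figure_data = {"Red": 3.0, "Blue": 7.5, "Green": 1.2}
qa = make_questions(figure_data, rng)
for text, answer in qa:
    print(text, "yes" if answer else "no")
```

Because every template contributes one positive and one negative question per figure, a model that ignores the image cannot do better than chance on these questions, which is the point of the balancing step.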
Experimental Results
Research Questions
- RQ1: Can a neural model perform accurate visual reasoning over synthetic figure data using only image and question input?
- RQ2: Does relational reasoning (RN) outperform standard CNN+LSTM baselines on figure-based questions?
- RQ3: How closely can models approach human performance on a synthetic figure-reading task?
- RQ4: What is the impact of color alternation schemes on model performance and bias resistance?
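To make the relational reasoning in RQ2 concrete, here is a minimal NumPy sketch of a Relation Network forward pass in its general form: a shared function g scores every ordered pair of visual "objects" together with the question embedding, the pair outputs are summed, and a second function f maps the pooled vector to a yes/no probability. The dimensions, random weights, omission of bias terms, and helper names (`mlp`, `relation_network`) are illustrative assumptions and do not reproduce the configuration used in the paper.

```python
import numpy as np


def mlp(x, weights):
    """Bias-free MLP: ReLU after every layer except the last."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ weights[-1]


def relation_network(objects, question, g_weights, f_weights):
    """objects: (n, d) array; question: (q,) array. Returns P(answer = yes)."""
    n = objects.shape[0]
    left = np.repeat(objects, n, axis=0)      # object i for each pair (i, j)
    right = np.tile(objects, (n, 1))          # object j for each pair (i, j)
    q = np.tile(question, (n * n, 1))         # question appended to every pair
    pairs = np.concatenate([left, right, q], axis=1)
    relations = mlp(pairs, g_weights)         # g applied to all n*n pairs
    pooled = relations.sum(axis=0)            # order-invariant aggregation
    logit = mlp(pooled[None, :], f_weights)[0, 0]
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
d, q_dim, h = 8, 6, 16                        # assumed toy sizes
g_weights = [rng.normal(0, 0.1, (2 * d + q_dim, h)), rng.normal(0, 0.1, (h, h))]
f_weights = [rng.normal(0, 0.1, (h, h)), rng.normal(0, 0.1, (h, 1))]
objects = rng.normal(size=(5, d))             # stand-in for CNN feature cells
question = rng.normal(size=q_dim)             # stand-in for an LSTM encoding
prob_yes = relation_network(objects, question, g_weights, f_weights)
print(prob_yes)
```

Summing over all pairs makes the output independent of the order in which objects are listed, so the network has to compare elements with each other rather than rely on where they appear in the input.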
Key Findings
| Model | Validation accuracy (%) | Test accuracy (%) |
|---|---|---|
| Text only | 50.01 | 50.01 |
| CNN+LSTM | 56.16 | 56.00 |
| CNN+LSTM on VGG-16 features | 52.31 | 52.47 |
| RN | 72.54 | 72.40 |
- RN substantially outperforms CNN+LSTM baselines on the FigureQA test set.
- RN achieves 72.54% validation and 72.40% test accuracy with alternated colors, and 76.52% in a non-alternated setup.
- Human annotators achieve 91.21% on a subset of the test set, highlighting the remaining gap to human-level reasoning.
- Text-only and CNN+LSTM baselines lag behind RN, indicating the importance of relational reasoning for this task.
- The dataset includes 100k training figures (1.3M questions) and 20k validation/test figures (≈250k questions each).
- The corpus provides underlying numerical data and bounding boxes to support auxiliary supervision and analysis.
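The margins implied by the results table can be checked with a few lines of arithmetic. The sketch below takes the test accuracies reported above and computes each model's gain over the text-only baseline, which sits at roughly chance level for balanced yes/no questions. The variable names are illustrative.

```python
# Test accuracies (%) copied from the results table above.
test_accuracy = {
    "Text only": 50.01,
    "CNN+LSTM": 56.00,
    "CNN+LSTM on VGG-16 features": 52.47,
    "RN": 72.40,
}

baseline = test_accuracy["Text only"]
# Gain in percentage points over the text-only baseline.
margins = {
    model: round(acc - baseline, 2)
    for model, acc in test_accuracy.items()
    if model != "Text only"
}

for model, margin in margins.items():
    print(f"{model}: +{margin:.2f} points over text-only")
```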
This walkthrough was generated by AI and reviewed by a human editor.