QUICK REVIEW

[Paper Review] SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs

Shengzhi Li, Nima Tajbakhsh | arXiv (Cornell University) | Aug 7, 2023

Topic Modeling | Cited by 26

One-Sentence Summary

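TL;DR: SciGraphQA introduces a dataset of 295k open-vocabulary, multi-turn question-answering dialogues about real-world scientific graphs, built from 290k ArXiv papers and generated with Palm-2, and supports zero-shot and fine-tuned evaluation of multimodal large language models (MLLMs).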

ABSTRACT

In this work, we present SciGraphQA, a synthetic multi-turn question-answer dataset related to academic graphs. SciGraphQA is 13 times larger than ChartVQA, the previously largest chart-visual question-answering dataset. It is also the largest open-sourced chart VQA dataset with non-synthetic charts. To build our dataset, we selected 290,000 Computer Science or Machine Learning ArXiv papers published between 2010 and 2020, and then used Palm-2 to generate 295K samples of open-vocabulary multi-turn question-answering dialogues about the graphs. As context, we provided the text-only Palm-2 with paper title, abstract, paragraph mentioning the graph, and rich text contextual data from the graph itself, obtaining dialogues with an average 2.23 question-answer turns for each graph. We asked GPT-4 to assess the matching quality of our question-answer turns given the paper's context, obtaining an average rating of 8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most popular MLLM models such as LLaVa, mPLUGowl, BLIP-2, and openFlamingo's on our dataset, finding LLaVA-13B being the most performant with a CIDEr score of 0.08. We further enriched the question prompts for LLAVA by including the serialized data tables extracted from the graphs using the DePlot model, boosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset, we also fine-tuned LLaVa using our dataset, reaching a substantially higher CIDEr score of 0.26. We anticipate further accuracy improvement by including segmentation mask tokens and leveraging larger LLM backbones coupled with emergent prompting techniques. Our code and data are open-sourced.

Motivation and Goals

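Scale up and diversify a multi-turn question-answering benchmark focused on scientific graphs so that it reflects real scenarios in the scientific literature.
Provide rich context (title, abstract, figure caption, and the paragraph that references the figure) to generate natural dialogues.
Enable zero-shot and fine-tuned evaluation of multimodal large language models on graph-understanding tasks.
Assess whether augmenting prompts with data tables derived from the graphs improves model performance.
Provide a large-scale open-source dataset to support instruction tuning and pre-training of MLLMs in the scientific domain.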

Proposed Method

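Build SciGraphQA by extending SciCap+ with figure captions, OCR text, paper titles, abstracts, and the first paragraph that references each figure.
Use Palm-2 with in-context examples (in prompts validated with GPT-4) to generate 295k QA dialogues.
Apply a keyword-based heuristic filter to keep only questions that relate to the graph, yielding 295k high-quality dialogues (2.23 turns per graph on average).
Evaluate the zero-shot performance of popular MLLMs (such as LLaVA, mPLUGowl, BLIP-2, and OpenFlamingo) on the dataset, using CIDEr, BLEU-4, and ROUGE.
Augment the prompts with data tables extracted by DePlot to improve zero-shot performance.
Fine-tune LLaVA-13B (with LoRA adapters) on SciGraphQA, and also on a DePlot-augmented subset, to measure the improvement.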
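
The keyword-based filtering step can be illustrated with a minimal sketch. The keyword list, the function names, and the sample dialogue are all hypothetical, since this review does not give the paper's actual heuristic; the code only shows the general shape of such a filter.

```python
import re

# Hypothetical keyword list; the paper's actual heuristic terms are not given in this review.
GRAPH_KEYWORDS = ("graph", "figure", "plot", "chart", "axis", "curve", "legend")

# Match any keyword as a whole word, optionally followed by a plural "s".
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(GRAPH_KEYWORDS) + r")s?\b", re.IGNORECASE
)


def is_graph_related(question: str) -> bool:
    """Return True if the question mentions at least one graph keyword."""
    return _KEYWORD_RE.search(question) is not None


def filter_dialogue(turns):
    """Keep only the (question, answer) turns whose question refers to the graph."""
    return [(q, a) for q, a in turns if is_graph_related(q)]


# Made-up dialogue, for illustration only.
dialogue = [
    ("What does the graph show about training loss?", "It decreases steadily."),
    ("Who are the authors of the paper?", "Not stated."),
    ("Why does the blue curve plateau?", "The model converges."),
]
kept = filter_dialogue(dialogue)
```

A whole-word match is used so that a keyword embedded in a longer word (for example "graph" in "paragraph") does not count as a reference to the graph.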

Figure 1: Illustration of multi-turn dialogue generation process. For higher quality dialogues, we use comprehensive textual context together with in-context learning when prompting Palm-2.
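
As a rough sketch of the prompting setup in Figure 1, the code below assembles a text-only prompt from the context fields plus in-context examples. The instruction wording, the field names, the `FigureContext` class, and the sample values are assumptions made for illustration; the paper's actual prompt is not reproduced here, and no call to Palm-2 is shown.

```python
from dataclasses import dataclass

# Illustrative instruction wording, not the paper's actual prompt.
INSTRUCTION = (
    "You are given text describing a figure from a paper. "
    "Write a multi-turn question-answer dialogue about the figure."
)


@dataclass
class FigureContext:
    """Textual context for one figure (hypothetical field names)."""
    title: str
    abstract: str
    caption: str
    ocr_text: str
    first_paragraph: str  # first paragraph that mentions the figure


def build_prompt(ctx: FigureContext, examples: list[str]) -> str:
    """Assemble instruction, in-context examples, then the new figure's context."""
    parts = [INSTRUCTION]
    for i, example in enumerate(examples, start=1):
        parts.append(f"### Example {i}\n{example}")
    parts.append(
        "### New figure\n"
        f"Title: {ctx.title}\n"
        f"Abstract: {ctx.abstract}\n"
        f"Caption: {ctx.caption}\n"
        f"OCR text: {ctx.ocr_text}\n"
        f"Paragraph: {ctx.first_paragraph}\n"
        "Dialogue:"
    )
    return "\n\n".join(parts)


# Made-up context, for illustration only.
ctx = FigureContext(
    title="A Toy Paper",
    abstract="We study toy models.",
    caption="Figure 3: Loss versus epoch.",
    ocr_text="loss epoch 0 10 20",
    first_paragraph="Figure 3 shows that the loss falls quickly.",
)
prompt = build_prompt(ctx, ["Caption: Accuracy versus depth.\nQ: What is plotted?\nA: Accuracy."])
print(prompt)
```

Because the generating model is text-only, everything it knows about the figure has to arrive through these fields, which is why the caption, OCR text, and referencing paragraph are all included.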

Experimental Results

Research Questions

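RQ1: In the zero-shot setting, how well do current multimodal large language models (MLLMs) understand and answer questions about real-world scientific graphs?
RQ2: Does augmenting prompts with data tables extracted by DePlot improve the image-text question-answering metrics?
RQ3: How do fine-tuning on SciGraphQA and dataset size affect model performance on graph-based question answering?
RQ4: By how much does the SciGraphQA-baseline model (LLaVA-13B fine-tuned on SciGraphQA) outperform the zero-shot baselines on CIDEr, BLEU-4, and ROUGE?
RQ5: Which practical considerations (training setup, adapters, and data augmentation) affect VQA performance on scientific graphs?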

Key Findings

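SciGraphQA is 13 times larger than ChartVQA and is the largest open-source chart VQA dataset with real-world graphs (295K multi-turn QA dialogues).
GPT-4 gave an average rating of 8.7/10 on the 3K test subset, indicating that the generated and filtered dialogues match their context well.
Zero-shot evaluation shows that performance improves with backbone size; without augmentation, LLaVA-13B performs best among the tested models (CIDEr ~0.08, BLEU-4 ~0.07, ROUGE ~0.23).
Augmenting prompts with DePlot-extracted data tables raises CIDEr from 0.08 (LLaVA-13B) to 0.153 (DePlot+LLaVA-13B); fine-tuning (SciGraphQA-baseline) raises it further to 0.268.
Fine-tuning on SciGraphQA (SciGraphQA-baseline) reaches a CIDEr of 0.268 and a ROUGE of 0.31, well above the zero-shot baselines.
Fine-tuning performance correlates positively with dataset size; the largest gains come from the first half of the data, and data augmentation and larger backbones bring additional gains.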
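
The DePlot-based prompt augmentation can be sketched as follows, assuming the table has already been extracted from the graph. The serialization format, the prompt wording, and both function names are hypothetical; DePlot's own output format and the paper's exact prompt may differ.

```python
def serialize_table(header, rows):
    """Flatten a table into text, one row per line, cells joined by ' | ' (assumed format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def augment_question(question, table_text):
    """Prepend the serialized data table to the question (illustrative wording)."""
    return (
        "The following data table was extracted from the graph:\n"
        f"{table_text}\n\n"
        f"Question: {question}"
    )


# Made-up table values, for illustration only.
table_text = serialize_table(["epoch", "loss"], [[1, 0.9], [2, 0.5]])
augmented = augment_question("How does the loss change over epochs?", table_text)
print(augmented)
```

The idea is that the model receives the underlying numbers as text in addition to the image, so it does not have to read every value off the plot.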

Figure 2: (left) distribution of the number of question-answer turns in our SciGraphQA dataset. (right) distribution of GPT-4 ratings (0–10) when GPT-4 was used as a judge to measure the matching of questions and answers from a 3k subset of the SciGraphQA dataset.

This review was generated by AI and reviewed by a human editor.