QUICK REVIEW

[Paper Review] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra|arXiv (Cornell University)|Sep 20, 2022

Topic Modeling | Cited by 214

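One-Sentence Summary

This paper introduces ScienceQA, a large multimodal science question answering dataset annotated with lectures and explanations, and shows that generating a chain of thought (CoT) improves language model accuracy in both few-shot and fine-tuned settings: GPT-3 and UnifiedQA both benefit from CoT explanations.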

ABSTRACT

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.

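Motivation and Objectives

Build a large-scale multimodal science question answering dataset annotated with lectures and explanations that expose the reasoning path.
Measure how chain-of-thought (CoT) generation in language models affects question answering accuracy on ScienceQA.
Show how explanations can improve learning efficiency and data efficiency in few-shot and fine-tuned settings.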

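Proposed Method

Assemble ScienceQA: 21,208 multimodal multiple-choice questions spanning natural science, social science, and language science, each linked to a lecture and an explanation.
Benchmark baselines on QCM-format inputs (question, context, multiple options), including VQA models and large language models (UnifiedQA and GPT-3).
Modify UnifiedQA so that, during both training and evaluation, it generates the answer together with a lecture and an explanation (the CoT).
Prompt GPT-3 with chain-of-thought examples so that it generates the answer, lecture, and explanation, and compare against standard prompting.
Evaluate the generated explanations for relevance, correctness, and completeness using automatic metrics (BLEU-1/4, ROUGE-L, Similarity) and human evaluation.
Explore an upper-bound scenario by feeding the gold explanations into the prompt input to measure the potential gain.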
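The few-shot CoT prompting step above can be illustrated with a minimal sketch. It assumes a simple text layout ("Question / Context / Options / Answer") and a hypothetical `Example` record; the field names, the `build_prompt` helper, and the exact wording of the prompt are illustrative assumptions, not the authors' released code. The `fmt` argument selects which output parts follow the input: `"A"` for answer only, `"AE"` for answer plus explanation, `"ALE"` for answer, lecture, and explanation.

```python
from dataclasses import dataclass
from typing import List

OPTION_LABELS = "ABCDE"  # assumption: at most five options per question


@dataclass
class Example:
    """Hypothetical record for one ScienceQA-style question."""
    question: str
    context: str
    choices: List[str]
    answer: int            # index into `choices`
    lecture: str = ""
    explanation: str = ""


def format_input(ex: Example) -> str:
    """Render the QCM part: question, context, multiple options."""
    options = " ".join(
        f"({OPTION_LABELS[i]}) {choice}" for i, choice in enumerate(ex.choices)
    )
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {options}\n"


def format_output(ex: Example, fmt: str = "ALE") -> str:
    """Render the output part in the order given by `fmt` (e.g. A, AE, ALE)."""
    parts = {
        "A": f"The answer is {OPTION_LABELS[ex.answer]}.",
        "L": ex.lecture,
        "E": ex.explanation,
    }
    text = " ".join(parts[key] for key in fmt if parts[key])
    return f"Answer: {text}\n"


def build_prompt(shots: List[Example], test: Example, fmt: str = "ALE") -> str:
    """Concatenate solved in-context examples, then the unsolved test question."""
    blocks = [format_input(s) + format_output(s, fmt) for s in shots]
    blocks.append(format_input(test) + "Answer:")
    return "\n".join(blocks)


if __name__ == "__main__":
    shots = [
        Example("Which object is a solid?", "N/A", ["water", "rock"], 1,
                "Solids keep their shape.", "A rock keeps its shape."),
        Example("Which animal is a mammal?", "N/A", ["dog", "trout"], 0,
                "Mammals have hair or fur.", "A dog has fur."),
    ]
    test_question = Example("Which object is a liquid?", "N/A", ["juice", "brick"], 0)
    print(build_prompt(shots, test_question, fmt="ALE"))
```

Switching `fmt` from `"A"` to `"ALE"` changes only the in-context demonstrations; the test question always ends at `Answer:` so the model produces the remainder itself.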

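Experimental Results

Research Questions

RQ1: Can a multimodal science question answering dataset grounded in lectures and explanations support multi-hop reasoning evaluation and model interpretability?
RQ2: Do chain-of-thought explanations improve question answering accuracy on multimodal science questions in few-shot and fine-tuned settings?
RQ3: To what extent do explanations improve the learning efficiency and data efficiency of language models on ScienceQA?
RQ4: How large is the gap between machine and human performance on multimodal science questions with explanations?

Key Findings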

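Accuracy (%) by question class. NAT, SOC, LAN: natural, social, and language science; TXT, IMG, NO: text, image, or no context; G1-6, G7-12: grade levels.
Model	Learning	Format	NAT	SOC	LAN	TXT	IMG	NO	G1-6	G7-12	Avg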
Random chance	-	M→A	40.28	46.13	29.25	47.45	40.08	33.66	39.35	40.67	39.83
Q only	train set	Q→A	41.34	27.22	47.00	41.79	35.15	44.60	39.28	40.87	39.85
C_I only	train set	CI→A	41.34	29.25	45.45	42.33	36.09	42.93	39.21	41.07	39.87
Q+M only	train set	QM→A	52.66	51.86	60.18	55.57	50.37	57.42	52.53	57.88	54.44
Q+C_T+M only	train set	QCTM→A	57.28	49.04	61.36	60.46	52.80	58.82	54.44	60.51	56.61
Q+C_I+M only	train set	QCIM→A	58.97	53.77	60.45	62.85	54.49	57.63	56.72	61.04	58.26
MCAN	train set	QCM→A	56.08	46.23	58.09	59.43	51.17	55.40	51.65	59.72	54.54
Top-Down	train set	QCM→A	59.50	54.33	61.82	62.90	54.88	59.79	57.27	62.16	59.02
BAN	train set	QCM→A	60.88	46.57	66.64	62.61	52.60	65.51	56.83	63.94	59.37
DFAF	train set	QCM→A	64.03	48.82	63.55	65.88	54.49	64.11	57.12	67.17	60.72
ViLT	train set	QCM→A	60.48	63.89	60.27	63.20	61.38	57.00	60.72	61.90	61.14
Patch-TRM	train set	QCM→A	65.19	46.79	65.55	66.96	55.28	64.95	58.04	67.50	61.42
VisualBERT	train set	QCM→A	59.33	69.18	61.18	62.71	62.17	58.54	62.96	59.92	61.87
UnifiedQA_BASE	zero-shot	QCM→A	47.78	40.49	46.00	50.24	44.12	44.39	45.56	46.21	45.79
UnifiedQA_BASE	train set	QCM→A	68.16	69.18	74.91	63.78	61.38	77.84	72.98	65.00	70.12
UnifiedQA_BASE (CoT)	train set	QCM→ALE	71.00	76.04	78.91	66.42	66.53	81.81	77.06	68.82	74.11
GPT-3	zero-shot	QCM→A	75.04	66.59	78.00	74.24	65.74	79.58	76.36	69.87	74.04
GPT-3	2-shot	QCM→A	74.64	69.74	76.00	74.44	67.28	77.42	76.80	68.89	73.97
GPT-3 (CoT)	2-shot	QCM→AE	76.60	65.92	77.55	75.51	66.09	79.58	78.49	67.63	74.61
GPT-3 (CoT)	2-shot	QCM→ALE	75.44	70.87	78.09	74.68	67.43	79.93	78.23	69.68	75.17
Human	-	QCM→A	90.23	84.97	87.48	89.60	87.50	88.10	91.59	82.42	88.40

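ScienceQA contains 21,208 multimodal questions spanning natural, social, and language science, with rich context (text and/or images) and annotated lectures and explanations.
Fine-tuned UnifiedQA with CoT (QCM→ALE) improves average accuracy by 3.99% over UnifiedQA without CoT.
GPT-3 with CoT prompting reaches 75.17% average accuracy on ScienceQA in the 2-shot setting, outperforming prompting without CoT.
Including the correct explanation in the prompt input improves GPT-3 few-shot performance by 18.96% (upper-bound analysis).
Explanations help models learn from less data: UnifiedQA with CoT reaches similar accuracy with only 40% of the training data.
In human evaluation, 65.2% of the explanations generated by GPT-3 (CoT) meet the gold standard (relevant, correct, and complete).
Humans outperform all models by a wide margin, with a gap of roughly 20 points on image-context questions.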
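The headline gaps in these findings can be rechecked directly from the results table. The short sketch below copies the relevant table entries by hand (the dictionary keys and the `gap` helper are illustrative names, not part of the paper) and computes the 2-shot CoT gain for GPT-3 and the human-versus-model gaps.

```python
# Accuracy values (%) copied from the results table above.
AVG_ACCURACY = {
    "gpt3_2shot": 73.97,       # GPT-3, 2-shot, QCM→A
    "gpt3_cot_2shot": 75.17,   # GPT-3 (CoT), 2-shot, QCM→ALE
    "human": 88.40,
}
IMG_ACCURACY = {
    "gpt3_cot_2shot": 67.43,   # GPT-3 (CoT), 2-shot, QCM→ALE, image context
    "human": 87.50,
}


def gap(higher: float, lower: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(higher - lower, 2)


cot_gain = gap(AVG_ACCURACY["gpt3_cot_2shot"], AVG_ACCURACY["gpt3_2shot"])
human_gap_avg = gap(AVG_ACCURACY["human"], AVG_ACCURACY["gpt3_cot_2shot"])
human_gap_img = gap(IMG_ACCURACY["human"], IMG_ACCURACY["gpt3_cot_2shot"])

print(f"GPT-3 CoT gain over 2-shot baseline: {cot_gain:.2f} points")
print(f"Human vs. GPT-3 (CoT), average:      {human_gap_avg:.2f} points")
print(f"Human vs. GPT-3 (CoT), image context: {human_gap_img:.2f} points")
```

The CoT gain matches the 1.20% figure quoted in the abstract, and the image-context gap is the "roughly 20 points" noted above; the average gap to human performance is smaller, at about 13 points.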

This review was generated by AI and reviewed by human editors.