QUICK REVIEW

[Paper Review] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra|arXiv (Cornell University)|Sep 20, 2022

Topic Modeling | Cited by 214

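One-Line Summary

This paper introduces ScienceQA, a large-scale multimodal science QA dataset annotated with lectures and explanations, and shows that chain-of-thought (CoT) generation in language models improves performance in both few-shot and fine-tuning settings, with GPT-3 and UnifiedQA each benefiting from CoT explanations.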

ABSTRACT

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.

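Motivation and Objectives

Build a large-scale multimodal science QA dataset annotated with lectures and explanations, so that the reasoning path behind each answer is made explicit.
Evaluate how chain-of-thought (CoT) generation in language models affects question-answering accuracy on ScienceQA.
Show the potential of explanations to improve learning efficiency and data efficiency in few-shot and fine-tuning regimes.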

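Proposed Method

Build ScienceQA, a set of 21,208 multimodal multiple-choice questions spanning natural science, social science, and language science, annotated with lectures and explanations.
Evaluate VQA models and large language models (UnifiedQA and GPT-3) as baselines on QCM-style prompts (question, context, multiple options).
Modify UnifiedQA so that, during both training and evaluation, it generates the answer together with the lecture and explanation (CoT).
Prompt GPT-3 with chain-of-thought to generate the answer, lecture, and explanation, and compare the result against standard prompting.
Analyze the generated explanations with automatic metrics (BLEU-1/4, ROUGE-L, Similarity) and with human evaluation (relevance, correctness, completeness).
Explore an upper-bound scenario that feeds gold explanations into the prompt to measure the potential gains.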
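
The QCM→ALE setup described above can be sketched as a small prompt builder. This is a minimal illustration and not the paper's exact template: the field labels (`Question:`, `Context:`, `Options:`, `Answer:`), the `Example` dataclass, and the functions `build_prompt` and `build_few_shot_prompt` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One multiple-choice item (hypothetical container for this sketch)."""
    question: str
    context: str              # text context; empty string if there is none
    options: List[str]
    answer_index: int = 0
    lecture: str = ""
    explanation: str = ""


def format_options(options: List[str]) -> str:
    # Render options on one line as "(A) ... (B) ...".
    return " ".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))


def build_prompt(ex: Example, output_format: str = "ALE", is_test: bool = False) -> str:
    """Build a QCM -> output_format prompt string.

    output_format is a string over {A, L, E}: A = answer, L = lecture,
    E = explanation. For a test item the prompt stops at "Answer:" so
    that the model completes the rest.
    """
    lines = [
        f"Question: {ex.question}",
        f"Context: {ex.context or 'N/A'}",
        f"Options: {format_options(ex.options)}",
    ]
    if is_test:
        lines.append("Answer:")
        return "\n".join(lines)

    parts = {
        "A": f"The answer is {chr(ord('A') + ex.answer_index)}.",
        "L": ex.lecture,
        "E": ex.explanation,
    }
    # Concatenate the requested parts in the requested order, skipping empty ones.
    target = " ".join(parts[k] for k in output_format if parts[k])
    lines.append(f"Answer: {target}")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[Example], test_ex: Example,
                          output_format: str = "ALE") -> str:
    # In-context examples come first, followed by the unanswered test question.
    blocks = [build_prompt(s, output_format) for s in shots]
    blocks.append(build_prompt(test_ex, output_format, is_test=True))
    return "\n\n".join(blocks)
```

Calling `build_few_shot_prompt` with two solved examples gives a 2-shot prompt; passing `output_format="A"` gives the standard (non-CoT) variant for comparison.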

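Experimental Results

Research Questions

RQ1 Does a large-scale multimodal science QA dataset with grounded lectures and explanations support the evaluation of multi-hop reasoning and model interpretability?
RQ2 Do chain-of-thought explanations improve QA accuracy on multimodal science questions in few-shot and fine-tuning regimes?
RQ3 To what extent can explanations be used to improve the learning efficiency and data efficiency of language models on ScienceQA?
RQ4 How large is the performance gap between machines and humans on multimodal science questions with explanations?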

Key Findings

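Accuracy (%) by question class. NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. Format: Q = question, C = context (C_T text, C_I image), M = multiple options, A = answer, L = lecture, E = explanation.
Model	Learning	Format	NAT	SOC	LAN	TXT	IMG	NO	G1-6	G7-12	Avg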
Random chance	-	M→A	40.28	46.13	29.25	47.45	40.08	33.66	39.35	40.67	39.83
Q only	train set	Q→A	41.34	27.22	47.00	41.79	35.15	44.60	39.28	40.87	39.85
C_I only	train set	C_I→A	41.34	29.25	45.45	42.33	36.09	42.93	39.21	41.07	39.87
Q+M only	train set	QM→A	52.66	51.86	60.18	55.57	50.37	57.42	52.53	57.88	54.44
Q+C_T+M only	train set	QC_TM→A	57.28	49.04	61.36	60.46	52.80	58.82	54.44	60.51	56.61
Q+C_I+M only	train set	QC_IM→A	58.97	53.77	60.45	62.85	54.49	57.63	56.72	61.04	58.26
MCAN	train set	QCM→A	56.08	46.23	58.09	59.43	51.17	55.40	51.65	59.72	54.54
Top-Down	train set	QCM→A	59.50	54.33	61.82	62.90	54.88	59.79	57.27	62.16	59.02
BAN	train set	QCM→A	60.88	46.57	66.64	62.61	52.60	65.51	56.83	63.94	59.37
DFAF	train set	QCM→A	64.03	48.82	63.55	65.88	54.49	64.11	57.12	67.17	60.72
ViLT	train set	QCM→A	60.48	63.89	60.27	63.20	61.38	57.00	60.72	61.90	61.14
Patch-TRM	train set	QCM→A	65.19	46.79	65.55	66.96	55.28	64.95	58.04	67.50	61.42
VisualBERT	train set	QCM→A	59.33	69.18	61.18	62.71	62.17	58.54	62.96	59.92	61.87
UnifiedQA_BASE	zero-shot	QCM→A	47.78	40.49	46.00	50.24	44.12	44.39	45.56	46.21	45.79
UnifiedQA_BASE	train set	QCM→A	68.16	69.18	74.91	63.78	61.38	77.84	72.98	65.00	70.12
UnifiedQA_BASE (CoT)	train set	QCM→ALE	71.00	76.04	78.91	66.42	66.53	81.81	77.06	68.82	74.11
GPT-3	zero-shot	QCM→A	75.04	66.59	78.00	74.24	65.74	79.58	76.36	69.87	74.04
GPT-3	2-shot	QCM→A	74.64	69.74	76.00	74.44	67.28	77.42	76.80	68.89	73.97
GPT-3 (CoT)	2-shot	QCM→AE	76.60	65.92	77.55	75.51	66.09	79.58	78.49	67.63	74.61
GPT-3 (CoT)	2-shot	QCM→ALE	75.44	70.87	78.09	74.68	67.43	79.93	78.23	69.68	75.17
Human	-	QCM→A	90.23	84.97	87.48	89.60	87.50	88.10	91.59	82.42	88.40
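
To read the effect of CoT prompting off the table, the two 2-shot GPT-3 rows (QCM→A and QCM→ALE) can be subtracted column by column. The sketch below copies those two rows from the table; the helper name `gains` is an assumption made for illustration.

```python
COLUMNS = ["NAT", "SOC", "LAN", "TXT", "IMG", "NO", "G1-6", "G7-12", "Avg"]

# Rows copied from the table above (accuracy, %).
GPT3_2SHOT = [74.64, 69.74, 76.00, 74.44, 67.28, 77.42, 76.80, 68.89, 73.97]    # QCM→A
GPT3_COT_ALE = [75.44, 70.87, 78.09, 74.68, 67.43, 79.93, 78.23, 69.68, 75.17]  # QCM→ALE


def gains(baseline, improved):
    """Absolute accuracy gain in percentage points per column, rounded to 2 places."""
    return {c: round(i - b, 2) for c, b, i in zip(COLUMNS, baseline, improved)}


cot_gain = gains(GPT3_2SHOT, GPT3_COT_ALE)
for col, delta in cot_gain.items():
    print(f"{col:>5}: {delta:+.2f}")
```

With these numbers the gain is positive in every column, and the Avg entry comes out at +1.20 points, which matches the few-shot GPT-3 improvement quoted in the abstract.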

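ScienceQA contains 21,208 multimodal questions spanning natural, social, and language sciences, with rich context (text and/or images) and lecture and explanation annotations.
Fine-tuned UnifiedQA with CoT (QCM→ALE) improves average accuracy by 3.99% over the same model without CoT.
GPT-3 with CoT prompting reaches 75.17% average accuracy on ScienceQA in the 2-shot setting, outperforming non-CoT prompting (73.97%).
Including ground-truth explanations in the prompt improves GPT-3's few-shot performance by up to 18.96% absolute (upper-bound analysis).
Explanations help models learn from less data: UnifiedQA with CoT, trained on 40% of the training data, matches the accuracy of UnifiedQA without CoT trained on the full set.
In human evaluation, 65.2% of the explanations generated by GPT-3 (CoT) meet the gold standard (relevant, correct, and complete).
Humans outperform all models by a wide margin, with a gap of about 20 points on image-context questions.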
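
The human-versus-model gap can be checked against the table in the same way. The sketch below compares the Human row with the GPT-3 (CoT) QCM→ALE row, which has the highest IMG accuracy among the models listed; the function name `gap_to_human` is hypothetical.

```python
COLUMNS = ["NAT", "SOC", "LAN", "TXT", "IMG", "NO", "G1-6", "G7-12", "Avg"]

# Rows copied from the results table (accuracy, %).
HUMAN = [90.23, 84.97, 87.48, 89.60, 87.50, 88.10, 91.59, 82.42, 88.40]
GPT3_COT_ALE = [75.44, 70.87, 78.09, 74.68, 67.43, 79.93, 78.23, 69.68, 75.17]


def gap_to_human(model_row):
    """Human accuracy minus model accuracy, in percentage points, per column."""
    return {c: round(h - m, 2) for c, h, m in zip(COLUMNS, HUMAN, model_row)}


gap = gap_to_human(GPT3_COT_ALE)
widest = max(gap, key=gap.get)  # column where the model trails humans the most
print(f"IMG gap: {gap['IMG']:.2f} points; widest gap is on {widest}")
```

For this row the IMG gap is 20.07 points, and IMG is also the column with the widest gap, while the Avg gap is 13.23 points.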

This review was generated by AI and checked by a human editor.