QUICK REVIEW

[Paper Review] Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang|arXiv (Cornell University)|Feb 2, 2023
Topic Modeling | Cited by 96
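One-Sentence Summary

This paper proposes Multimodal-CoT, a two-stage fine-tuning framework that first generates rationales from text and vision inputs and then uses those multimodal rationales to infer the answer, achieving state-of-the-art results on ScienceQA with a model under 1 billion parameters.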

ABSTRACT

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

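Motivation and Objectives

  • Motivate multimodal (text + vision) chain-of-thought (CoT) reasoning as a way to improve answer inference.
  • Investigate why models under 1B parameters struggle with CoT reasoning, and how vision input can mitigate reasoning pitfalls.
  • Propose a two-stage fine-tuning framework that separates rationale generation from answer inference.
  • Evaluate the approach on the ScienceQA benchmark and compare it with language-only baselines and larger models.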

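Proposed Method

  • Fine-tune a T5-based text-to-text Transformer in two stages: rationale generation and answer inference.
  • Extract image features with a vision encoder (DETR) and fuse them with the language representations through a gated fusion mechanism.
  • In stage one, generate a rationale R from the language + vision input; in stage two, infer the answer conditioned on the original input and R.
  • Train both stages with supervised learning on the rationales and answers annotated in ScienceQA.
  • Inject the vision features through attention-based interaction between the language and vision representations, which improves both rationale quality and answer accuracy.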
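
The two-stage flow in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `rationale_model` and `answer_model` are hypothetical callables standing in for the two fine-tuned models, the `Example` fields and the prompt layout are invented for illustration, and passing the same vision features to both stages is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps (text input, vision features) to generated text.
Model = Callable[[str, Sequence[float]], str]


@dataclass
class Example:
    question: str
    context: str
    options: Sequence[str]
    vision_features: Sequence[float]  # placeholder for features from a vision encoder


def format_input(ex: Example) -> str:
    """Flatten question, context, and options into one text input (assumed layout)."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def multimodal_cot(ex: Example, rationale_model: Model, answer_model: Model) -> Tuple[str, str]:
    """Stage 1 generates a rationale R; stage 2 infers the answer from the input plus R."""
    x = format_input(ex)
    # Stage 1: rationale generation from language + vision input.
    rationale = rationale_model(x, ex.vision_features)
    # Stage 2: answer inference conditioned on the original input and R.
    x_with_r = f"{x}\nRationale: {rationale}"
    answer = answer_model(x_with_r, ex.vision_features)
    return rationale, answer
```

Keeping the two stages as separate callables mirrors the separation of rationale generation from answer inference: the second model only ever sees the rationale as extra text appended to the original input.
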
Figure 1: Example of the multimodal CoT task.

Experimental Results

Research Questions

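  • RQ1: Can multimodal (text + vision) CoT reasoning outperform language-only CoT on multimodal question-answering benchmarks?
  • RQ2: Does a model under 1B parameters benefit from the two-stage rationale-generation and answer-inference framework when given multimodal input?
  • RQ3: How do vision features (DETR) and image captions differ in their effect on rationale quality and final answers?
  • RQ4: How does multimodal fusion (gated fusion with attention) affect reasoning and accuracy compared with a text-only baseline?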
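
To make the fusion in RQ4 concrete, here is a minimal NumPy sketch of attention followed by a gated mix. It assumes single-head dot-product attention in which language tokens query the vision features, and a sigmoid gate; the function name `gated_fusion` and the weight matrices `w_l` and `w_v` are hypothetical, and the model's actual layer may differ in detail.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_fusion(h_lang: np.ndarray, h_vis: np.ndarray,
                 w_l: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Fuse language tokens (n, d) with vision features (m, d).

    w_l and w_v are hypothetical (d, d) gate weights.
    """
    d = h_lang.shape[-1]
    # Attention: each language token attends over the vision features.
    attn = softmax(h_lang @ h_vis.T / np.sqrt(d), axis=-1)  # (n, m)
    h_attn = attn @ h_vis                                    # (n, d)
    # Sigmoid gate decides, per element, how much vision information to mix in.
    gate = 1.0 / (1.0 + np.exp(-(h_lang @ w_l + h_attn @ w_v)))
    return (1.0 - gate) * h_lang + gate * h_attn
```

Because the output is a convex combination of the language representation and the attended vision representation, a gate near zero recovers the text-only representation, which is one way to think about the text-only baseline in RQ4.
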

Key Findings

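| Model | Size | NAT | SOC | LAN | TXT | IMG | NO | G1-6 | G7-12 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | 90.23 | 84.97 | 87.48 | 89.60 | 87.50 | 88.10 | 91.59 | 82.42 | 88.40 |
| MCAN (2019) | 95M | 56.08 | 46.23 | 58.09 | 59.43 | 51.17 | 55.40 | 51.65 | 59.72 | 54.54 |
| Top-Down (2018) | 70M | 59.50 | 54.33 | 61.82 | 62.90 | 54.88 | 59.79 | 57.27 | 62.16 | 59.02 |
| BAN (2018) | 112M | 60.88 | 46.57 | 66.64 | 62.61 | 52.60 | 65.51 | 56.83 | 63.94 | 59.37 |
| DFAF (2019) | 74M | 64.03 | 48.82 | 63.55 | 65.88 | 54.49 | 64.11 | 57.12 | 67.17 | 60.72 |
| ViLT (2021) | 113M | 60.48 | 63.89 | 60.27 | 63.20 | 61.38 | 57.00 | 60.72 | 61.90 | 61.14 |
| Patch-TRM (2021) | 90M | 65.19 | 46.79 | 65.55 | 66.96 | 55.28 | 64.95 | 58.04 | 67.50 | 61.42 |
| VisualBERT (2019) | 111M | 59.33 | 69.18 | 61.18 | 62.71 | 62.17 | 58.54 | 62.96 | 59.92 | 61.87 |
| UnifiedQA Base (2020) | 223M | 68.16 | 69.18 | 74.91 | 63.78 | 61.38 | 77.84 | 72.98 | 65.00 | 70.12 |
| UnifiedQA Base + CoT | 223M | 71.00 | 76.04 | 78.91 | 66.42 | 66.53 | 81.81 | 77.06 | 68.82 | 74.11 |
| GPT-3.5 (2020) | 175B | 74.64 | 69.74 | 76.00 | 74.44 | 67.28 | 77.42 | 76.80 | 68.89 | 73.97 |
| GPT-3.5 + CoT | 175B | 75.44 | 70.87 | 78.09 | 74.68 | 67.43 | 79.93 | 78.23 | 69.68 | 75.17 |
| Multimodal-CoT Base | 223M | 87.52 | 77.17 | 85.82 | 87.88 | 82.90 | 86.83 | 84.65 | 85.37 | 84.91 |
| Multimodal-CoT Large | 738M | 95.91 | 82.00 | 90.82 | 95.26 | 88.80 | 92.89 | 92.44 | 90.31 | 91.68 |

Accuracy (%) on ScienceQA. NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 and G7-12 = grade ranges.
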
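  • On ScienceQA, Multimodal-CoT Large with vision features outperforms GPT-3.5 + CoT by about 16.5 percentage points in average accuracy (91.68% vs. 75.17%).
  • The two-stage Multimodal-CoT achieves higher accuracy than a one-stage baseline that predicts the answer directly.
  • Adding vision features (DETR) substantially improves rationale quality (RougeL) and final answer accuracy (84.91% average for the Base model), and reduces errors caused by hallucination.
  • The choice of vision features affects performance: DETR yields clear gains, while CLIP and ResNet perform worse in this setting.
  • The approach generalizes across backbone models (UnifiedQA Base/Large, FLAN-T5 Base/Large) and remains effective at sub-1B parameter scales, up to roughly 0.7B.
  • Ablations show that removing either the two-stage design or the vision features hurts performance: the average drops to 82.57 without the two-stage framework and to 70.53 without vision features.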
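
The headline gaps in the findings can be checked directly from the Avg column of the results table. The snippet below is only an arithmetic check on numbers copied from that table; the `gain` helper is a hypothetical name introduced for illustration.

```python
# Average accuracies (%) copied from the Avg column of the results table.
avg = {
    "Human": 88.40,
    "UnifiedQA Base + CoT": 74.11,
    "GPT-3.5 + CoT": 75.17,
    "Multimodal-CoT Base": 84.91,
    "Multimodal-CoT Large": 91.68,
}


def gain(model: str, baseline: str) -> float:
    """Percentage-point difference in average accuracy between two table rows."""
    return round(avg[model] - avg[baseline], 2)


print(gain("Multimodal-CoT Large", "GPT-3.5 + CoT"))        # 16.51
print(gain("Multimodal-CoT Base", "UnifiedQA Base + CoT"))  # 10.8
print(gain("Multimodal-CoT Large", "Human"))                # 3.28
```

The second line compares two models of the same size (223M), so that 10.8-point gap is not explained by parameter count.
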
Figure 2: Example of the two-stage framework without vision features (baseline) and with vision features (ours) for generating rationales and predicting answers. The upper part presents the problem details with a gold rationale, and the lower part shows the outputs of the baseline and our method inc

This review was generated by AI and reviewed by a human editor.