QUICK REVIEW

[Paper Review] DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Zheng Ge, Bin Yang|arXiv (Cornell University)|Oct 25, 2023

Topic Modeling | Cited by: 15

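One-sentence summary

Introduces DDCoT, a duty-distinct chain-of-thought prompting method that improves multimodal reasoning by separating reasoning from recognition and applying negative-space prompting. It improves both zero-shot and fine-tuned multimodal reasoning, reports state-of-the-art results on ScienceQA, and improves generalizability and explainability.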

ABSTRACT

A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: "keeping critical thinking" and "letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.

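Motivation and objectives

Analyze the challenges of multimodal chain-of-thought reasoning that motivate the work: annotation cost, flexibility, generalizability, and explainability.
Propose a zero-shot approach to generating multimodal rationales that maintains critical thinking while delegating recognition to visual models.
Develop mechanisms, built on new components, that use the generated rationales to guide LLMs in zero-shot prompting and in fine-tuning.
Show that DDCoT outperforms state-of-the-art methods on multimodal reasoning benchmarks and exhibits generalizability and explainability.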

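Proposed method

DDCoT prompting is built from three elements: (i) using language-only LLMs to generate multimodal rationales, (ii) explicitly dividing responsibility into reasoning and recognition, and (iii) applying negative-space prompting, which marks uncertainty, to reduce hallucination.
A visual question answering (VQA) model supplies visual recognition results for the sub-questions, and these are combined with the LLM's reasoning for joint reasoning.
In the zero-shot setting, the generated rationale is combined with the question to guide the LLM; in fine-tuning, Deep-Layer Prompting (DLP) and Rational-Compressed Visual Embedding (RCVE) are used to better align the multimodal inputs.
The method introduces Deep-Layer Prompting (learnable prompts integrated with the rationale at each encoder layer) and Rational-Compressed Visual Embedding (rationale-guided, attention-based filtering of visual features).
Negative-space prompting makes uncertainty explicit, which reduces language-model hallucination and lets joint reasoning draw on the complementary visual information.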
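
The zero-shot flow described above (decompose the question, mark what cannot be answered without the image, let a VQA model fill those gaps, then reason jointly) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the `uncertain` marker, all function names, and the `vqa` callable are hypothetical, and the step that calls the LLM and parses its sub-question answers is left out.

```python
from typing import Callable, Dict

# Hypothetical marker the LLM is asked to emit when a sub-question
# cannot be answered without seeing the image.
UNCERTAIN = "uncertain"


def build_decomposition_prompt(question: str, context: str) -> str:
    """Prompt for a language-only LLM: split the question into
    sub-questions and flag the ones that need visual information
    (negative-space prompting). The wording is an assumption."""
    return (
        "You cannot see the image.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Break the question into sub-questions and answer each one. "
        f"If a sub-question needs visual information, answer '{UNCERTAIN}'."
    )


def fill_visual_gaps(
    sub_answers: Dict[str, str], vqa: Callable[[str], str]
) -> Dict[str, str]:
    """Recognition duty: send only the sub-questions the LLM marked as
    uncertain to a VQA model; keep the LLM's other answers unchanged."""
    filled = {}
    for sub_question, answer in sub_answers.items():
        if answer.strip().lower() == UNCERTAIN:
            filled[sub_question] = vqa(sub_question)
        else:
            filled[sub_question] = answer
    return filled


def build_joint_reasoning_prompt(question: str, filled: Dict[str, str]) -> str:
    """Reasoning duty: hand the LLM every sub-answer and ask it to stay
    critical of the visual ones before producing the final rationale."""
    lines = [f"- {q} -> {a}" for q, a in filled.items()]
    return (
        f"Question: {question}\n"
        "Sub-questions and answers (visual answers may be wrong, "
        "so check them critically):\n"
        + "\n".join(lines)
        + "\nReason step by step and give the final answer."
    )


if __name__ == "__main__":
    # Stand-in for a real VQA model; always returns the same string.
    def stub_vqa(sub_question: str) -> str:
        return "a bird"

    # Pretend these sub-answers were parsed from the LLM's reply.
    parsed = {
        "What animal is shown in the image?": "uncertain",
        "Do birds lay eggs?": "Yes",
    }
    filled = fill_visual_gaps(parsed, stub_vqa)
    print(build_joint_reasoning_prompt("Does this animal lay eggs?", filled))
```

Passing the VQA model in as a plain callable keeps the sketch independent of any particular vision library; a real system would replace `stub_vqa` with an actual model call.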

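Experimental results

Research questions

RQ1: Can DDCoT generate general and reliable multimodal rationales without labor-intensive ground-truth annotation?
RQ2: Does separating the roles of reasoning and recognition improve multimodal reasoning in zero-shot and fine-tuning settings?
RQ3: Does negative-space prompting reduce hallucination in multimodal CoT and improve explainability?
RQ4: How do Deep-Layer Prompting and RCVE affect fine-tuning performance on multimodal tasks?
RQ5: How does DDCoT perform and generalize across the natural science, social science, and language science domains of ScienceQA?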

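Key findings

DDCoT achieves state-of-the-art performance on the ScienceQA multimodal benchmark in both zero-shot prompting and fine-tuning.
Zero-shot gains: GPT-3 and ChatGPT improve by +2.53% and +8.23%, respectively, over their baselines on questions with image context.
Fine-tuning gains: DDCoT improves UnifiedQA by up to +21.96% on the IMG split and up to +17.22% on the avg split, outperforming MMCoT and other baselines.
Ablation studies show that negative-space prompting and the emphasis on uncertainty markedly reduce hallucination and improve robustness, especially on questions with images.
Human evaluation shows that rationales generated by DDCoT are more relevant, correct, complete, coherent, and explainable than those from prior methods.
Generalization experiments show strong out-of-domain gains over MMCoT across the NAT, SOC, and LAN domains.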


This review was generated by AI and checked by a human editor.