QUICK REVIEW

[Paper Review] AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Ziniu Hu, Ahmet İşcen | arXiv (Cornell University) | Jun 13, 2023

Multimodal Machine Learning Applications | Cited by 15

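One-line summary

AVIS performs dynamic, tree-search-based visual information seeking with an LLM-driven planner and reasoner whose tool use is guided by a transition graph, and it achieves state-of-the-art performance on Infoseek and OK-VQA.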

ABSTRACT

In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
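
The abstract says the transition graph is built by analyzing the sequences of decisions users made, and that it limits which actions are available in each state. A minimal sketch of that idea follows. The session data and tool names are hypothetical placeholders, and treating "the last tool used" as the state is a simplifying assumption made here, not the paper's state definition.

```python
from collections import defaultdict

# Hypothetical user sessions: the ordered tools one user called per question.
# Tool names are illustrative only, not the paper's actual action set.
sessions = [
    ["object_detection", "image_search", "web_search", "answer"],
    ["caption", "web_search", "answer"],
    ["object_detection", "web_search", "answer"],
]

def build_transition_graph(sessions, start="START"):
    """Map each state to the sorted list of actions users took from it."""
    graph = defaultdict(set)
    for session in sessions:
        state = start
        for action in session:
            graph[state].add(action)
            state = action  # assumption: the state is the last action taken
    return {state: sorted(actions) for state, actions in graph.items()}

graph = build_transition_graph(sessions)
print(graph["START"])             # actions observed as a first step
print(graph["object_detection"])  # actions observed after object detection
```

A planner restricted to `graph[state]` can only propose actions that some user actually took from that state, which is one way to shrink the combinatorial search space the abstract describes.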

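Motivation and objectives

Answer visual questions that require external knowledge beyond what the image shows.
Develop an autonomous LLM-based planner, reasoner, and memory that orchestrate tool use.
Use human decision-making data, via a transition graph, to constrain and guide tool selection.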

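Proposed method

Introduces a three-component system: an LLM-powered planner, an LLM-powered reasoner, and a working memory.
Uses a transition graph derived from a user study to constrain the actions available in each state.
Reaches an answer through dynamic, step-by-step tool use (vision APIs, web search, image search) combined with backtracking and memory.
Prompts the planner with in-context examples and the current memory so that it selects the next tool and its query.
Uses a separate reasoner to extract useful information from each tool output and to decide whether to continue or finalize the answer.
Evaluates AVIS on Infoseek Wikidata and OK-VQA, comparing against baselines and ablations to show the benefit of dynamic decision-making.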
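
The planner, reasoner, memory, and backtracking steps listed above can be sketched as a small depth-first loop. This is an illustration under stated assumptions, not the authors' implementation: the transition graph, the tool names, the canned tool outputs, and the keyword-based `planner` and `reasoner` are all hypothetical stand-ins for what would be LLM calls and real APIs.

```python
# Hypothetical transition graph: state -> tools the planner may call next.
TRANSITIONS = {
    "START": ["caption", "object_detection"],
    "caption": ["web_search"],
    "object_detection": ["web_search"],
    "web_search": [],
}

def run_tool(tool, memory):
    """Stub tools with canned outputs (placeholders, not real API responses)."""
    if tool == "caption":
        return "A photo of a large stone building."
    if tool == "object_detection":
        return "Entity: Example Memorial Hall"
    # web_search only finds something once a named entity is in memory.
    if any("Example Memorial Hall" in fact for fact in memory):
        return "Example Memorial Hall commemorates the Example Treaty."
    return ""

def planner(state, tried):
    """Stand-in for the LLM planner: first allowed tool not yet tried here."""
    for tool in TRANSITIONS[state]:
        if tool not in tried:
            return tool
    return None

def reasoner(output):
    """Stand-in for the LLM reasoner: classify a tool output."""
    if "commemorates" in output:
        return "answer"
    return "informative" if output else "uninformative"

def search(max_steps=10):
    memory = []                 # working memory of retained tool outputs
    stack = [("START", set())]  # (state, tools already tried from that state)
    trace = []                  # every tool call, in order
    for _ in range(max_steps):
        if not stack:
            break
        state, tried = stack[-1]
        tool = planner(state, tried)
        if tool is None:        # dead end: backtrack to the previous state
            stack.pop()
            continue
        tried.add(tool)
        trace.append(tool)
        output = run_tool(tool, memory)
        verdict = reasoner(output)
        if verdict == "answer":
            return output, trace
        if verdict == "informative":
            memory.append(output)
            stack.append((tool, set()))
        # uninformative: stay in the same state and try another tool
    return None, trace

answer, trace = search()
print(trace)
print(answer)
```

In this toy run the caption is too generic for the search to succeed, so the loop backtracks to the start state, tries object detection, and only then finds an answer. For brevity the sketch keeps everything in memory after backtracking and the planner ignores both memory and in-context examples, which a real planner prompt would include.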

Figure 1: An example of AVIS’s generated workflow for answering a challenging visual question using LLM with tree search to use tools. The input image is taken from the Infoseek dataset.

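Experimental results

Research questions

RQ1: How can external tools be chosen dynamically at each step to answer knowledge-intensive visual questions?
RQ2: Does grounding the planner in human decision-making data (through the transition graph and in-context examples) improve the accuracy of tool selection and reasoning?
RQ3: How does adding dynamic decision-making and backtracking affect performance compared with fixed, sequential tool use?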

Key findings

Dataset	Model	Unseen entity (%)	Unseen question (%)
Infoseek Wikidata	PALM (Q-only, few-shot)	3.7	5.1
Infoseek Wikidata	OFA (fine-tune)	9.7	14.8
Infoseek Wikidata	PALI (VQA, zero-shot)	1.8	2.2
Infoseek Wikidata	PALI (fine-tune)	16.0	20.7
Infoseek Wikidata	PALM w/ CLIP	21.9	18.6
Infoseek Wikidata	FiD w/ CLIP	20.7	18.1
Infoseek Wikidata	baseline-PALM w/ PALI*	12.8	14.9
Infoseek Wikidata	baseline-PALM w/ PALI* + Object	31.3	36.1
Infoseek Wikidata	baseline-PALM w/ PALI* + Object + Search	36.1	38.2
Infoseek Wikidata	AVIS (ours, few-shot)	50.7	56.4
Infoseek Wikidata	AVIS w/o PALI*	47.9	54.2
Infoseek Wikidata	AVIS w/o Object	41.2	48.4
Infoseek Wikidata	AVIS w/o Search	42.5	49.6

Dataset	Model	Accuracy (%)
OK-VQA	Supervised KRISP	38.4
OK-VQA	KAT	54.4
OK-VQA	ReVIVE	58.0
OK-VQA	REVEAL	59.1
OK-VQA	PALI (OK-VQA, finetune)	64.5
OK-VQA	Zero-shot PALI (VQA)	41.6
OK-VQA	PICa-Full	48.0
OK-VQA	Flamingo zero-shot	50.6
OK-VQA	ViperGPT few-shot	51.9
OK-VQA	Flamingo few-shot	57.8
OK-VQA	baseline-PALM w/ PALI	44.3
OK-VQA	baseline-PALM w/ PALI + Object	38.2
OK-VQA	baseline-PALM w/ PALI + Object + Search	47.9
OK-VQA	AVIS (ours)	60.2
OK-VQA	AVIS w/o PALI	47.1
OK-VQA	AVIS w/o Object	58.3
OK-VQA	AVIS w/o Search	55.0
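
The margin between AVIS and the strongest sequential baseline (the one with the full toolset) can be recomputed from the rows above. The short script below does only that arithmetic. The numbers are copied from the table, while the dictionary keys and the `gain` helper are labels chosen here for illustration.

```python
# Accuracy (%) copied from the results table in this review.
infoseek = {
    "baseline": {"unseen_entity": 36.1, "unseen_question": 38.2},  # PALI* + Object + Search
    "avis": {"unseen_entity": 50.7, "unseen_question": 56.4},
}
okvqa = {"baseline": 47.9, "avis": 60.2}  # baseline: PALI + Object + Search

def gain(avis, baseline):
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(avis - baseline, 1)

infoseek_gains = {
    split: gain(infoseek["avis"][split], infoseek["baseline"][split])
    for split in infoseek["avis"]
}
okvqa_gain = gain(okvqa["avis"], okvqa["baseline"])

print(infoseek_gains)
print(okvqa_gain)
```

With these table values the gains come out to 14.6 points (Infoseek, unseen entity), 18.2 points (Infoseek, unseen question), and 12.3 points (OK-VQA).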

AVIS reaches 50.7% accuracy on the Infoseek unseen-entity split and 56.4% on the unseen-question split using few-shot prompting.
AVIS reaches 60.2% accuracy on OK-VQA, ahead of every other model in the table except fine-tuned PALI (64.5%).
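Dynamic decision-making outperforms the sequential baseline that uses the same toolset (PALI* + Object + Search): per the table, the Infoseek gain is 14.6 points on unseen entities (50.7 vs. 36.1) and 18.2 points on unseen questions (56.4 vs. 38.2).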
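The ablations show that removing any tool lowers accuracy; Object and Search matter most on Infoseek, while PALI contributes most on OK-VQA.
Case studies show that the reasoner, combined with backtracking, can recover from an incorrect initial tool choice.
The system uses a human-derived transition graph to constrain the action space, and prompts enriched with in-context examples to guide planning and reasoning.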

Figure 4: We conduct a user study to gather examples of user decision-making when responding to visual information-seeking questions. Given a visual question as depicted in (a), the user makes a series of tool calls using the available APIs shown in (b). Each tool call yields an output which the use

This review was generated by AI and checked by a human editor.