QUICK REVIEW

[Paper Review] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Chen Lin, Jisong Li|arXiv (Cornell University)|Nov 21, 2023

Multimodal Machine Learning Applications | Cited by 11

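One-Line Summary

This paper presents ShareGPT4V, a large-scale image caption dataset built from captions generated by GPT-4 Vision and by a captioner trained on them. It shows that using these captions improves modality alignment and LMM performance, and it introduces ShareGPT4V-7B, a 7B-scale model that achieves competitive results on 11 benchmarks.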

ABSTRACT

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.

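Motivation and Objectives

Highlight how caption quality affects vision-language modality alignment in large multi-modal models (LMMs).
Create a large-scale, high-quality image caption dataset (ShareGPT4V) that combines captions from GPT-4 Vision with captions from a captioner trained on them.
Demonstrate that incorporating ShareGPT4V data into pre-training and SFT improves LMM performance even with a simple architecture.
Present a 7B-scale model (ShareGPT4V-7B) that delivers strong results across diverse multimodal benchmarks.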

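Proposed Method

Build ShareGPT4V from 100K captions collected from GPT-4 Vision, then expand it to 1.2M captions with a captioner trained on that subset.
Train ShareGPT4V-7B, a simple architecture consisting of a vision encoder, an MLP projector, and a Vicuna-based LLM.
Pre-train with ShareGPT4V-PT captions and jointly fine-tune the vision and language components.
Replace part of the existing SFT data with ShareGPT4V captions and measure the effect on performance.
Run ablations to evaluate the contributions of pre-training and SFT and the effect of caption quality.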
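
The caption-substitution experiment listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the sample layout (dicts with `image`, `caption`, and `is_detailed_caption` keys), the function name `substitute_captions`, and the random selection of which samples to swap are all assumptions made for the example.

```python
import random


def substitute_captions(sft_samples, new_captions, seed=0):
    """Swap detailed-caption samples in an SFT set for higher-quality ones.

    Hypothetical data layout: each SFT sample is a dict with "image",
    "caption", and "is_detailed_caption" keys; each new caption is a dict
    with "image" and "caption" keys. The dataset size is unchanged, so a
    score difference reflects caption quality rather than data volume.
    """
    rng = random.Random(seed)

    # Only detailed-caption samples are candidates for replacement.
    detailed_idx = [i for i, s in enumerate(sft_samples) if s["is_detailed_caption"]]

    # Replace an equal quantity: never more than either side can supply.
    n_swap = min(len(detailed_idx), len(new_captions))
    chosen = rng.sample(detailed_idx, n_swap)
    replacements = rng.sample(new_captions, n_swap)

    mixed = list(sft_samples)  # shallow copy, the input list is left intact
    for idx, new in zip(chosen, replacements):
        mixed[idx] = {**new, "is_detailed_caption": True}
    return mixed
```

Keeping the total sample count fixed is the point of the design: the model sees the same amount of SFT data before and after the swap, so the comparison isolates the captions themselves.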

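Experimental Results

Research Questions

RQ1: How do high-quality image captions affect modality alignment and performance on downstream multimodal tasks?
RQ2: How much does incorporating ShareGPT4V data into pre-training and SFT help a 7B-scale LMM?
RQ3: How much does the caption quality of ShareGPT4V drive improvement across benchmarks, compared with other captioners and datasets?

Key Findings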

Method | LLM	LLaVA-W	MME-P	MME-C	MMB	MMB-CN	SEED-I	MM-Vet	QBench	SQA-I	VQA-V2	VizWiz
BLIP-2 | FLAN-T5	38.1	1293.8	290.0	-	-	46.4	22.4	-	61.0	41.0	19.6
InstructBLIP | Vicuna-7B	60.9	-	-	36.0	23.7	53.4	26.2	56.7	60.5	-	34.5
InstructBLIP | FLAN-T5	58.2	1212.8	291.8	-	-	-	25.6	-	63.1	-	33.4
Shikra | Vicuna-13B	-	-	-	58.8	-	-	-	54.7	-	77.4	-
IDEFICS-80B | LLaMA-65B	-	-	-	54.5	38.1	-	-	-	-	60.0	36.0
Qwen-VL | Qwen-7B	-	-	-	38.2	7.4	56.3	-	59.4	67.1	78.8	35.2
Qwen-VL-Chat | Qwen-7B	-	1487.5	360.7	60.6	56.7	58.2	-	-	68.2	78.2	38.9
LLaVA | Vicuna-7B	63.0*	807.0*	247.9*	34.1*	14.1*	25.5*	26.7*	-	38.5*	79.0*	9.3*
LLaVA-1.5 | Vicuna-7B	63.4	1510.7	316.1*	64.3	58.3	66.2*	30.5	58.7	66.8	78.5	50.0
LLaVA-1.5 | Vicuna-13B	70.7	1531.3	295.4*	67.7	63.6	68.2	35.4	62.1	71.6	80.0	53.6
ShareGPT4V-7B | Vicuna-7B	72.6	1567.4	376.4	68.8	62.2	69.7	37.6	63.4	68.4	80.6	57.2
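
To make the comparison in the table easier to read, the per-benchmark gaps between ShareGPT4V-7B and the two LLaVA-1.5 rows can be computed directly. The sketch below copies the three rows from the table (asterisks dropped); the helper names `score_deltas` and `count_wins` and the hyphenated benchmark labels are illustrative choices, not something defined by the paper.

```python
# Benchmark columns in table order.
BENCHMARKS = ["LLaVA-W", "MME-P", "MME-C", "MMB", "MMB-CN", "SEED-I",
              "MM-Vet", "QBench", "SQA-I", "VQA-V2", "VizWiz"]

# Scores copied from the table above; asterisk markers are omitted.
SCORES = {
    "LLaVA-1.5 | Vicuna-7B": [63.4, 1510.7, 316.1, 64.3, 58.3, 66.2,
                              30.5, 58.7, 66.8, 78.5, 50.0],
    "LLaVA-1.5 | Vicuna-13B": [70.7, 1531.3, 295.4, 67.7, 63.6, 68.2,
                               35.4, 62.1, 71.6, 80.0, 53.6],
    "ShareGPT4V-7B | Vicuna-7B": [72.6, 1567.4, 376.4, 68.8, 62.2, 69.7,
                                  37.6, 63.4, 68.4, 80.6, 57.2],
}


def score_deltas(model, baseline):
    """Per-benchmark difference (model minus baseline), rounded to 0.1."""
    pairs = zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    return {name: round(m - b, 1) for name, m, b in pairs}


def count_wins(model, baseline):
    """Number of benchmarks on which the model scores above the baseline."""
    return sum(delta > 0 for delta in score_deltas(model, baseline).values())


if __name__ == "__main__":
    ours = "ShareGPT4V-7B | Vicuna-7B"
    for baseline in ("LLaVA-1.5 | Vicuna-7B", "LLaVA-1.5 | Vicuna-13B"):
        print(baseline, count_wins(ours, baseline), score_deltas(ours, baseline))
```

On these rows, ShareGPT4V-7B is ahead of LLaVA-1.5 with Vicuna-7B on all 11 columns and ahead of the Vicuna-13B variant on 9 of 11; MMB-CN and SQA-I are the two exceptions.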

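Replacing part of the SFT captions with ShareGPT4V captions yields significant improvements across multiple LMMs and benchmarks (for LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B, gains of 222.8/22.0/22.3 on MME and 2.7/1.3/1.5 on MMBench).
Pre-training on ShareGPT4V-PT captions followed by fine-tuning with ShareGPT4V gives the best overall performance and surpasses several baselines.
ShareGPT4V-7B achieves strong results on 11 benchmarks, often outperforming larger or more data-intensive models; for example, it beats LLaVA-1.5 with Vicuna-13B on 9 of the 11 benchmarks in the table.
Fine-tuning only the latter half of the vision encoder during pre-training yields a notable performance gain.
Ablations show that high-quality captions substantially improve both perception and cognition metrics.
ShareGPT4V-PT data alone yields notable improvements, and scaling to 1.2M captions with a general-purpose captioner improves results further.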
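
The finding about tuning only the latter half of the vision encoder amounts to a freezing rule over the encoder's blocks. A minimal sketch, assuming the encoder is a stack of `num_layers` blocks indexed from 0: the function name `trainable_flags`, the optional `unfreeze_from` parameter, and the midpoint split are illustrative assumptions, not details given in this review.

```python
def trainable_flags(num_layers, unfreeze_from=None):
    """Return one bool per encoder block: True = fine-tuned, False = frozen.

    By default the split point is the midpoint, so the first half of the
    blocks stays frozen and the latter half is updated during pre-training.
    """
    if unfreeze_from is None:
        unfreeze_from = num_layers // 2
    if not 0 <= unfreeze_from <= num_layers:
        raise ValueError("unfreeze_from must lie between 0 and num_layers")
    return [layer_idx >= unfreeze_from for layer_idx in range(num_layers)]


# Hypothetical 24-block encoder: blocks 0-11 frozen, blocks 12-23 trainable.
flags = trainable_flags(24)
```

In a PyTorch-style framework, flags like these would typically be applied by setting `requires_grad` on each block's parameters.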


This review was generated by AI and checked by a human editor.