QUICK REVIEW

[Paper Review] Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Haoran Wei, Lingyu Kong|arXiv (Cornell University)|Dec 11, 2023

Multimodal Machine Learning Applications | Cited by 8

One-sentence summary

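Vary expands the vision vocabulary of LVLMs in two stages: a tiny autoregressive model generates a new vision vocabulary, which is then merged with CLIP-VIT. This improves fine-grained perception (OCR, document and chart understanding) while preserving existing capabilities.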

ABSTRACT

Modern Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary -- CLIP, which can cover most common vision tasks. However, for some special vision task that needs dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may encounter low efficiency in tokenizing the vision knowledge and even suffer out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedures of Vary are naturally divided into two folds: the generation and integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the next, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs can quickly garner new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while enjoying more excellent fine-grained perception and understanding ability. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS in DocVQA and 36.2% in MMVet. Our code will be publicly available on the homepage.

Motivation and objectives

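Identify and address the vision-vocabulary bottleneck of LVLMs on dense or non-English perception tasks.
Propose a two-stage approach that generates a new vision vocabulary and integrates it with the CLIP-based vocabulary.
Show that scaling the vocabulary improves fine-grained perception while preserving core LVLM capabilities.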

Proposed method

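Two-stage pipeline: (1) generate a new vision vocabulary with a vocabulary network paired with a tiny decoder-only transformer trained autoregressively; (2) merge the new vocabulary with the original CLIP-VIT vocabulary, keeping both vocabularies frozen during LVLM training.
Build the new vocabulary network by adding convolution layers on top of SAM-ViTDet features so that the output matches the CLIP-VIT shape, yielding 256×1024 flattened tokens.
Train Vary-tiny by autoregressive image-to-text generation, using document and chart data (dense OCR and rendered charts) as positive samples and natural images as negative samples.
Integrate the new vocabulary into Vary-base in parallel with the original CLIP-VIT vocabulary, then train the LVLM with both vocabularies frozen, updating the input embedding layers and the LLM.
Enrich Vary-base training with synthetic data (LaTeX-rendered documents, rendered charts) and high-quality chart data produced with GPT-4.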
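The shape matching and parallel integration described above can be sketched with plain arithmetic. This is a minimal illustration and not the authors' implementation: the 64×64×256 backbone feature map, the two stride-2 convolutions with kernel 3 and padding 1, the channel doubling, and the channel-wise concatenation are all assumptions chosen so that the result reproduces the 256×1024 token shape stated in the review. The function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vocab_token_shape(h=64, w=64, c=256, num_convs=2):
    """Trace an (h, w, c) feature map through stride-2 convolutions.

    Assumption: each convolution halves the spatial size and doubles the
    channel count. The spatial grid is then flattened into a token sequence.
    Returns (number_of_tokens, channels_per_token).
    """
    for _ in range(num_convs):
        h, w, c = conv_out(h), conv_out(w), c * 2
    return h * w, c


def fuse_tokens(new_tokens, clip_tokens):
    """Concatenate per-token features from the two frozen vocabularies.

    Assumption: both branches emit the same number of tokens, and fusion is
    a channel-wise concatenation of the i-th token from each branch.
    """
    if len(new_tokens) != len(clip_tokens):
        raise ValueError("both branches must emit the same number of tokens")
    return [a + b for a, b in zip(new_tokens, clip_tokens)]


if __name__ == "__main__":
    print(vocab_token_shape())
    print(fuse_tokens([[0.1, 0.2]], [[0.3, 0.4]]))
```

Under these assumptions the new branch emits the same token count as the CLIP branch, which is what makes a token-by-token merge possible without resampling either sequence.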

Experimental results

Research questions

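RQ1: Does scaling up the vision vocabulary improve the fine-grained perception of LVLMs beyond the limits of CLIP-VIT?
RQ2: How can a new vision vocabulary be generated and integrated effectively without overwriting existing knowledge?
RQ3: Can a vocabulary-scaled LVLM perform better on document OCR, markdown conversion, and chart understanding while retaining its general capabilities?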

Key findings

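Vary-tiny achieves dense OCR in both Chinese and English, with edit distances of 0.266 (Chinese) and 0.197 (English).
Vary-base matches Nougat on pure English document OCR and supports Markdown/LaTeX-style conversion when prompted.
With 80k SFT samples, Vary-base reaches 78.2 ANLS on DocVQA and 76.3 on the validation split; with 665k SFT samples, it reaches an average of 66.1 on ChartQA.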
Combined with Qwen-7B, Vary-base reaches a headline MMVet score of 36.2%; other reported MMVet figures range from 38.7% to 38.9% depending on the configuration.
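Vary improves general MMVet performance by about 2.4 points over LLaVA-1.5 in a comparable setting.
Overall, expanding the vision vocabulary improves fine-grained perception while preserving core LVLM capabilities.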
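The edit-distance and ANLS figures above are both built on Levenshtein distance. The sketch below shows one common way to compute them; it is illustrative and not the evaluation code behind the reported numbers. Normalizing by the longer string, lowercasing and stripping answers, and the 0.5 ANLS threshold are assumptions based on the usual definitions of these metrics.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (0 = identical, lower is better)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity (higher is better).

    predictions: one answer string per question.
    references: one list of accepted answers per question.
    A question scores 1 - distance against its best-matching reference,
    or 0 if that distance is at or above the threshold.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            nl = normalized_edit_distance(pred.strip().lower(), ref.strip().lower())
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(normalized_edit_distance("kitten", "sitting"))
    print(anls(["helo", "xyz"], [["hello"], ["abc"]]))
```

Note the opposite directions of the two metrics: the OCR edit distances (0.266, 0.197) improve as they approach 0, while ANLS improves as it approaches 1 (or 100 when reported as a percentage).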

This review was generated by AI and checked by a human editor.