QUICK REVIEW

[Paper Review] From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information

Qirui Jiao, Daoyuan Chen|arXiv (Cornell University)|Jan 31, 2024

Multimodal Machine Learning Applications | Cited by 5

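One-Sentence Summary

This paper empirically examines how to infuse detection information from object detectors and OCR into Multimodal LLMs (based on LLaVA-1.5) under training-free, retraining, and fine-tuning strategies, and shows that LoRA Augmented Fine-tuning with detection information yields significant gains on most benchmarks.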

ABSTRACT

Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. Vision detection models excel at recognizing fine-grained image details, prompting researchers to use them to enhance MLLMs. One effective strategy is to infuse detection information in text format, which has proven simple and effective. However, most studies utilize this method without training, leaving the potential of adaptive training largely unexplored. Adaptive training could significantly enhance MLLMs' comprehension of unique inputs while filtering out irrelevant information. This paper addresses the crucial question: How does training impact MLLMs' understanding of infused textual detection information? We systematically experiment with various representative models to evaluate the effects of training-free, retraining, and fine-tuning strategies. We also examine the influence of training on MLLMs' original abilities and the interchangeability of detection models. Our findings indicate that fine-tuning a pre-trained MLLM to incorporate textual detection information delivers superior results compared to training-free and retraining methods, improving performance by 6.71% across 10 widely recognized benchmarks. Furthermore, fine-tuning enables MLLMs to retain performance enhancements even when detection models are swapped, indicating improved understanding of formatted textual data. We release our codes to support further exploration of fusion strategies for vision detection models and the enhancement of MLLMs' fine-grained multimodal capabilities.

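Motivation and Objectives

Address the gap in fine-grained visual understanding in MLLMs and reduce hallucination.
Explore how to fuse off-the-shelf detection outputs (object labels, coordinates, OCR) into MLLMs.
Evaluate how infusing detection information affects the original MLLM's capabilities.
Evaluate modularity by swapping detection models (closed-set to open-set) and observe robustness.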

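Proposed Method

Convert detection outputs to text, embed them as text embeddings, and concatenate them with the ViT image features before the LLM.
Compare three fusion strategies: Training-free Infusion (TFI), LoRA Augmented Retraining (LAR), and LoRA Augmented Fine-tuning (LAF).
Generate textual detection prompts using DINO for object detection (with GroundingDINO as an open-set alternative) and PaddleOCRv2 for OCR.
Use a two-layer MLP to map image features (CLIP-ViT-L-336px) into the LLM's semantic space.
Evaluate on 10 multimodal benchmarks and compute a normalized aggregate score (mean s_norm).
Examine model behavior with and without detection information to understand the trade-off between ViT features and detection cues.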
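
To make the "detection information in text format" step concrete, the following is a minimal Python sketch of turning detector and OCR outputs into a textual prompt. It is illustrative only: the Detection and OcrResult types, the build_detection_prompt function, the normalized (x1, y1, x2, y2) box convention, and the sentence templates are all assumptions made for this example, not the paper's actual prompt format or released code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """One object-detector output (e.g. from DINO); illustrative type."""
    label: str
    box: Box


@dataclass
class OcrResult:
    """One OCR output (e.g. from PaddleOCRv2); illustrative type."""
    text: str
    box: Box


def _fmt_box(box: Box) -> str:
    # Two decimals keep the prompt short while preserving rough location.
    return "[" + ", ".join(f"{v:.2f}" for v in box) + "]"


def build_detection_prompt(objects: List[Detection], texts: List[OcrResult]) -> str:
    """Serialize detections into plain text (assumed template, not the paper's)."""
    lines: List[str] = []
    if objects:
        # Group boxes by class so the per-class count is explicit in the text.
        grouped: Dict[str, List[Box]] = {}
        for det in objects:
            grouped.setdefault(det.label, []).append(det.box)
        parts = [
            f"{len(boxes)} {label}: " + ", ".join(_fmt_box(b) for b in boxes)
            for label, boxes in grouped.items()
        ]
        lines.append("Objects detected in the image: " + "; ".join(parts))
    if texts:
        parts = [f'"{t.text}" {_fmt_box(t.box)}' for t in texts]
        lines.append("Text detected in the image: " + "; ".join(parts))
    # An empty string means there is nothing to infuse for this image.
    return "\n".join(lines)
```

The returned string would then be tokenized and embedded like any other text and placed alongside the projected image features. Grouping by class is one design choice that exposes object counts directly in the prompt; other templates are equally plausible.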

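Experimental Results

Research Questions

RQ1: Can feeding detection information directly into MLLMs improve performance without retraining?
RQ2: How do the training methods, retraining and fine-tuning, interact with detection information and affect MLLMs' capabilities?
RQ3: How does replacing the detection model DINO with GroundingDINO affect performance and robustness?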

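Key Findings

Training-free infusion yields mixed results: there are some gains on POPE and MME-Cognition, but overall it is unstable across benchmarks.
LoRA Augmented Retraining (LAR) improves several benchmarks, but may degrade on image-level tasks because of its reliance on detection information and ViT features.
LoRA Augmented Fine-tuning (LAF) achieves the best overall performance, outperforming LLaVA-1.5 and many SOTA models on 9/10 benchmarks, with improvements on VQA and OCR-related tasks.
Replacing DINO with GroundingDINO maintains or improves performance, confirming modularity and the benefits of open-set detection.
OCR and object detection cues enable accurate counting, localization, and text extraction, reducing hallucination on object-centric and text-related queries.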

This review was generated by AI and checked by a human editor.