QUICK REVIEW

[Paper Review] When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Soumya Jahagirdar, Walid Bousselham|arXiv (Cornell University)|Feb 4, 2026

Multimodal Machine Learning Applications | Citations: 0

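One-line summary

Mask-LLaVA uses three levels of visual tokens (CLS, pooled patch, and mask-based object tokens), supports flexible token reduction without retraining, and achieves competitive performance on eight VLM benchmarks with far fewer visual tokens.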

ABSTRACT

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, it shows that the resulting model can flexibly drop especially the number of mask-based object-tokens at test time, allowing to adapt the number of tokens during inference without the need to retrain the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks showing results competitive to current token efficient methods and comparable to the original LLaVA baseline using only a fraction of visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.

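Motivation and objectives

Motivated by the need to reduce the number of visual tokens in autoregressive vision-language models without retraining.
Proposes Mask-LLaVA, which combines global, local, and object-level visual features into a compact input for LLMs.
Demonstrates that norm scaling and test-time token pruning yield robust performance across benchmarks.
Oversamples object-based tokens during training to increase flexibility at inference time.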

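Proposed method

Extracts three types of tokens using a pretrained vision encoder: CLS (global), pooled patch (local), and mask-based object tokens.
Object masks are generated with an objectness detector (bounding boxes) followed by SAM segmentation, and object embeddings are obtained with MaskInversion.
Using the mean and standard deviation of the patch tokens, the norms of the CLS and object tokens are normalized to match the patch-token statistics.
The three token streams are fused through a multimodal projector and fed to the LLM, following the LLaVA training pipeline (vision-language pretraining, then instruction tuning).
Supports dynamic test-time token reduction without retraining, via IoU-based mask pruning and optional patch-token pruning/pooling.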
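
The normalization and composition steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the review does not give the exact formula, so the sketch assumes one plausible reading in which each CLS or object token is re-standardized so that its elementwise mean and standard deviation match the pooled statistics of the patch tokens of the same image. The function names (`patch_stats`, `match_to_patch_stats`, `compose_visual_tokens`) are hypothetical, tokens are plain Python lists, and the multimodal projector that follows is omitted.

```python
from statistics import fmean, pstdev


def patch_stats(patch_tokens):
    """Pooled mean and standard deviation over every value in the patch tokens."""
    values = [v for token in patch_tokens for v in token]
    return fmean(values), pstdev(values)


def match_to_patch_stats(token, mu_p, sigma_p, eps=1e-6):
    """Re-standardize one token so its mean/std match the patch statistics.

    Assumed formula (not taken from the paper): z-score the token, then
    scale by sigma_p and shift by mu_p.
    """
    mu_t, sigma_t = fmean(token), pstdev(token)
    return [(v - mu_t) / (sigma_t + eps) * sigma_p + mu_p for v in token]


def compose_visual_tokens(cls_token, patch_tokens, object_tokens):
    """Build [CLS, patches..., objects...] with CLS and object tokens rescaled."""
    mu_p, sigma_p = patch_stats(patch_tokens)
    scaled_cls = match_to_patch_stats(cls_token, mu_p, sigma_p)
    scaled_objs = [match_to_patch_stats(t, mu_p, sigma_p) for t in object_tokens]
    return [scaled_cls] + [list(p) for p in patch_tokens] + scaled_objs


if __name__ == "__main__":
    # Toy 4-dimensional tokens; real encoder features are much wider.
    patches = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
    cls = [10.0, 20.0, 30.0, 40.0]
    objects = [[-5.0, 0.0, 5.0, 10.0]]
    sequence = compose_visual_tokens(cls, patches, objects)
    print(len(sequence), "tokens in the composed sequence")
```

Because the statistics are computed from the patch tokens of a single image, this corresponds to a per-image variant of the normalization; a dataset-level variant would pass fixed values for `mu_p` and `sigma_p` instead.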

Figure 1: Overview of Mask-LLaVA Architecture. Given an input image, the local feature extraction module pools patch tokens from the Vision Transformer (ViT; Radford et al., 2021) based on 2D grid structure to obtain local context features. Simultaneously, the SAM (Kirillov et al., 2023) generates

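Experimental results

Research questions

RQ1: Can combining multi-level visual tokens (global CLS, local patch, and object mask) reduce the token count without degrading VLM performance?
RQ2: Does norm scaling across token types improve cross-token fusion and overall performance?
RQ3: How does Mask-LLaVA perform on standard VLM benchmarks with a reduced number of tokens at test time?
RQ4: Does oversampling object-based representations during training benefit flexible token pruning at inference?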

Key findings

Methods	RR (reduction ratio)	# Vis. tokens	VQAv2	GQA	POPE	MME	MMBench	SciQA	Vizwiz	MM-Vet
LLaVA-1.5-7B	0%	576	78.5	62.0	85.9	1510.7	64.3	66.8	50.0	30.5
LLaVA-1.5-7B†	90%	58	-	54.2	74.6	1246.8	53.4	67.1	-	27.0
FitPrune	90%	58	62.7	49.9	53.8	1147.4	56.2	68.2	50.8	21.8
SparseVLM	90%	58	62.9	48.8	65.8	1030.6	49.0	67.2	49.3	18.6
FasterVLM	90%	58	71.9	54.9	75.8	1348.6	60.5	68.9	53.0	30.1
MQT	90%	64	75.3	60.0	83.6	1464.3	63.5	67.0	51.5	28.9
Voco-LLaMa	88%	64	75.4	60.4	-	60.5	-	-	-	-
Mask-LLaVA (ours)	90%	57	74.8	60.6	83.7	1415.0	63.1	68.8	51.8	24.9
LLaVA-1.5-7B†	95%	29	-	51.0	65.9	1141.1	45.7	67.1	-	23.5
FitPrune	95%	29	52.3	43.6	31.1	855.2	39.6	68.3	48.6	18.0
FasterVLM	95%	29	66.7	51.5	67.2	1254.8	58.5	69.5	52.6	27.5
MQT	95%	36	73.7	58.8	81.9	1416.3	63.4	66.8	51.0	27.8
M3	95%	36	76.9	60.3	85.5	1417.2	64.8	68.2	52.8	25.4
Voco-LLaMa	95%	32	75.3	60.2	-	59.4	-	-	-	-
Mask-LLaVA (ours)	95%	15	71.5	58.5	82.1	1395.8	62.1	68.4	52.8	21.9
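
As a quick check on how the token counts in the table relate to the RR column, the short sketch below computes a reduction ratio relative to the 576 visual tokens of the unpruned LLaVA-1.5-7B row. Reading RR as "reduction ratio" and using 576 as the baseline are assumptions based on the table itself; `reduction_ratio` is a hypothetical helper, not something defined in the paper.

```python
BASELINE_TOKENS = 576  # visual tokens in the unpruned LLaVA-1.5-7B row


def reduction_ratio(num_tokens, baseline=BASELINE_TOKENS):
    """Fraction of visual tokens removed relative to the baseline."""
    if not 0 <= num_tokens <= baseline:
        raise ValueError("num_tokens must lie between 0 and the baseline")
    return 1.0 - num_tokens / baseline


if __name__ == "__main__":
    # Token counts taken from the table rows above.
    for count in (58, 57, 29, 15):
        print(f"{count:>3} tokens -> {reduction_ratio(count):.1%} reduction")
```

Under these assumptions, 58 and 57 tokens both come out at roughly 90%, and 29 tokens at roughly 95%, while 15 tokens corresponds to about 97%, so the RR column seems to group methods into nominal buckets rather than report exact ratios.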

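Mask-LLaVA achieves competitive performance on eight benchmarks using only a small fraction of the visual tokens, and often outperforms other token-efficient methods even at high token reduction ratios.
It maintains strong results with the 57-token configuration (1 CLS + 36 patch + 20 object tokens) as well as with 42- and 29-token configurations, indicating robust token reduction.
Combining CLS, patch, and object tokens gives better results than using patch tokens alone; the CLS token contributes on some tasks and the object tokens on others.
Normalizing the norms of the CLS and object tokens to match the patch-token statistics improves overall performance, with per-image token normalization giving the best results.
Mask-based token pruning (IoU-based) and patch-token pruning/pooling allow the number of tokens to be adjusted dynamically at inference time without retraining.
On several datasets (notably POPE and MME) it shows gains close to the state of the art at high token reduction, while remaining competitive on VQAv2, GQA, VizWiz, MM-Vet, and others.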
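
The IoU-based mask pruning mentioned in the findings can be illustrated with a small sketch. The review does not describe the exact selection rule, so this assumes a greedy, NMS-style procedure: masks are visited from largest to smallest, a mask is skipped when its IoU with an already kept mask exceeds a threshold, and selection stops at a token budget. The names `mask_iou`, `prune_masks`, and `rect`, the ordering by area, and the 0.5 threshold are all hypothetical choices for illustration.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prune_masks(masks, max_tokens, iou_threshold=0.5):
    """Greedily keep up to max_tokens masks, skipping near-duplicates.

    Returns indices into `masks`. Larger masks are visited first, which is
    an assumed ordering rather than one stated in the source.
    """
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]), reverse=True)
    kept = []
    for i in order:
        if len(kept) >= max_tokens:
            break
        # Keep the mask only if it does not overlap too much with any kept mask.
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


def rect(r0, r1, c0, c1):
    """Helper for the demo: a rectangular mask as a set of pixel coordinates."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


if __name__ == "__main__":
    masks = [rect(0, 4, 0, 4), rect(0, 4, 0, 3), rect(5, 8, 5, 8)]
    # The second mask overlaps the first heavily, so it is dropped first.
    print(prune_masks(masks, max_tokens=2))
```

Each kept index would correspond to one object token passed to the LLM, so lowering `max_tokens` at inference time reduces the number of object tokens without touching the trained weights.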

Figure 2: Mask-Token Computation. This figure illustrates the process of obtaining segmentation masks. First, an objectness detector (Zhu et al., 2020) identifies bounding boxes in the image. These bounding boxes, along with the image, are then passed to the SAM (Kirillov et al., 2023) model to ge


This review was generated by AI and checked by a human editor.