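[Paper Review] ImageBind-LLM: Multi-modality Instruction Tuning
ImageBind-LLM fine-tunes LLaMA to follow multi-modality instructions by aligning ImageBind embeddings through a learnable bind network, injecting visual cues without attention, and using a training-free visual cache to strengthen inference. It applies to image, audio, video, and 3D inputs.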
We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
Research Motivation and Objectives
- Enable large language models to follow instructions conditioned on modalities beyond images (e.g., audio, video, 3D point clouds).
- Develop an efficient training approach by aligning ImageBind embeddings with LLaMA and injecting visual cues in an attention-free manner.
- Leverage a simple image-text training setup to achieve cross-modality instruction-following without full multi-modal LLM retraining.
- Mitigate training-inference modality gaps with a training-free visual cache model during inference.
Proposed Method
- Freeze the ImageBind image encoder and train a learnable bind network to transform its global image features into the LLaMA embedding space.
- Inject the transformed image feature into every LLaMA word token across all transformer layers using an attention-free, zero-initialized gating mechanism.
- Perform vision-language pre-training with image-caption data, followed by multi-modality instruction tuning using language and visual instruction data, while keeping encoders frozen.
- Fine-tune LLaMA with parameter-efficient methods (LoRA and bias-norm tuning) and a secondary high-quality instruction-tuning stage (MiniGPT-4 data).
- Introduce a training-free visual cache retrieval to enhance multi-modality embeddings during inference by retrieving similar ImageBind features and aggregating them with a residual connection.
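The attention-free, zero-initialized injection described in the list above can be illustrated with a minimal sketch. This is not the authors' implementation: the bind network is reduced to a single linear projection, the dimensions are toy values, and the names `bind_network`, `inject`, `run_layers`, and `make_toy_inputs` are hypothetical. It assumes the gate is a scalar per layer that scales the projected image feature before it is added to every token.

```python
import random


def bind_network(image_feature, weight, bias):
    """Project one global image feature into the language model's token dimension.

    Simplified stand-in (assumption): a single linear layer, y = W x + b.
    """
    return [
        sum(w * x for w, x in zip(row, image_feature)) + b
        for row, b in zip(weight, bias)
    ]


def inject(tokens, visual, gate):
    """Add gate * visual to every token. No attention is computed."""
    return [[t + gate * v for t, v in zip(token, visual)] for token in tokens]


def run_layers(tokens, visual, gates):
    """Inject the same visual feature at every layer, one gate per layer."""
    hidden = tokens
    for gate in gates:
        hidden = inject(hidden, visual, gate)
        # A real model would apply the transformer layer to `hidden` here.
    return hidden


def make_toy_inputs(seed=0, imagebind_dim=8, llama_dim=4, num_tokens=3):
    """Build small random inputs; all sizes are hypothetical toy values."""
    rng = random.Random(seed)
    image_feature = [rng.gauss(0, 1) for _ in range(imagebind_dim)]
    weight = [[rng.gauss(0, 0.1) for _ in range(imagebind_dim)]
              for _ in range(llama_dim)]
    bias = [0.0] * llama_dim
    tokens = [[rng.gauss(0, 1) for _ in range(llama_dim)]
              for _ in range(num_tokens)]
    return image_feature, weight, bias, tokens


image_feature, weight, bias, tokens = make_toy_inputs()
visual = bind_network(image_feature, weight, bias)

# Gates start at zero, so the injected signal is initially absent.
zero_gates = [0.0, 0.0]
unchanged = run_layers(tokens, visual, zero_gates)

# After training, gates would be non-zero and the tokens would shift.
shifted = run_layers(tokens, visual, [0.5, 0.5])
```

With every gate at zero, `unchanged` equals `tokens`, which is the point of the zero initialization: at the start of training the language model behaves exactly as it did before any visual signal was added, and the visual contribution grows only as the gates are learned.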
Experimental Results
Research Questions
- RQ1: Can a single joint embedding space (ImageBind) be leveraged to support instruction-following across image, audio, video, and 3D modalities when tuned with a language model?
- RQ2: Does an attention-free, zero-initialized visual injection scheme effectively inject visual instructions into LLaMA without disturbing existing language knowledge?
- RQ3: How does a cache-based enhancement during inference mitigate modality discrepancies between training (image-only) and inference (multi-modality inputs)?
- RQ4: Is parameter-efficient fine-tuning sufficient to achieve strong multi-modality instruction-following capabilities?
Key Findings
- ImageBind-LLM demonstrates strong zero-shot performance across OCR, KIE, image captioning, VQA, and KGID benchmarks compared with other vision-language models and PandaGPT.
- Compared to PandaGPT, ImageBind-LLM benefits from a bind network for better alignment and uses LLaMA instead of Vicuna, contributing to language generation quality and alignment.
- OCR performance lags behind some baselines, possibly due to using a single modality feature token, unlike other models that use multiple tokens for visual information.
- The training-free visual cache further improves inference by enriching multi-modality embeddings with retrieved similar visual features, reducing modality gaps.
- Extensions include bilingual instruction tuning and any-to-any generation, enabling outputs beyond text when integrated with suitable generative backends.
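The training-free visual cache mentioned in the findings above can be sketched as a nearest-neighbor lookup followed by a residual update. This is an illustration under stated assumptions, not the paper's exact formulation: the softmax weighting over the top-k similarities, the residual weight `alpha`, and the function names `cosine` and `cache_enhance` are all hypothetical, and the three-vector cache stands in for the three million ImageBind image features the abstract describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cache_enhance(query, cache, top_k=2, alpha=0.5):
    """Enrich a query embedding with similar cached image features.

    query: embedding of the input (any modality), as a list of floats.
    cache: list of stored image feature vectors (same dimension).
    top_k: how many nearest cached features to retrieve (assumed value).
    alpha: weight of the retrieved feature in the residual sum (assumed value).
    """
    # Rank cached features by similarity to the query and keep the top k.
    scored = sorted(
        ((cosine(query, key), key) for key in cache),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]

    # Turn similarities into weights with a numerically stable softmax.
    sims = [sim for sim, _ in scored]
    max_sim = max(sims)
    exps = [math.exp(sim - max_sim) for sim in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the retrieved features.
    retrieved = [
        sum(w * key[i] for w, (_, key) in zip(weights, scored))
        for i in range(len(query))
    ]

    # Residual connection: keep the original query and add the retrieved signal.
    return [q + alpha * r for q, r in zip(query, retrieved)]


toy_cache = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_like_query = [0.9, 0.1]  # stands in for a non-image embedding
enhanced = cache_enhance(audio_like_query, toy_cache, top_k=2, alpha=0.5)
```

No parameters are trained here, which matches the "training-free" description: the cache only pulls a non-image embedding toward the image features seen during training. At the scale described in the source, the linear scan over the cache would normally be replaced by an approximate nearest-neighbor index.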
This review was generated by AI and checked by a human editor.