QUICK REVIEW

[Paper Review] Visual Instruction Tuning

Haotian Liu, Chunyuan Li | arXiv (Cornell University) | Apr 17, 2023
Multimodal Machine Learning Applications | 670 citations
TL;DR

This paper introduces LLaVA, a large multimodal model that connects a vision encoder to an LLM and is instruction-tuned on GPT-4-generated vision-language data. It shows strong multimodal chat ability and, when combined with GPT-4, reaches state-of-the-art accuracy on ScienceQA.

ABSTRACT

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

Motivation & Objective

  • Motivate the extension of instruction tuning to vision-language models to enable general-purpose visual assistants.
  • Provide a scalable pipeline to generate multimodal instruction-following data using language models.
  • Develop and evaluate LLaVA, a large multimodal model combining a vision encoder with a language model.
  • Create and release benchmarks (LLaVA-Bench) for multimodal instruction-following in chat and reasoning tasks.

Proposed method

  • Connect the CLIP visual encoder to the Vicuna language model via a trainable projection W to produce visual tokens.
  • Generate 158K multimodal instruction-following samples using GPT-4 (and earlier ChatGPT) from image-text data, in three formats: conversation, detailed description, and complex reasoning.
  • Two-stage training: Stage 1 pre-trains the projection W on a sub-sample of CC3M so that image features align with the LLM's word embeddings (in effect, a visual tokenizer for the LLM); Stage 2 fine-tunes end-to-end, updating both W and the LLM parameters φ on the generated data.
  • Train with multimodal chat data and evaluate on multimodal chat and ScienceQA; ensemble with GPT-4 for improved results.
Figure 1: LLaVA network architecture.
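
The connector and training schedule from the method list can be illustrated with a minimal, dependency-free sketch. It assumes the projection is a single linear map H_v = Z_v · W from vision-feature width to LLM embedding width, that visual tokens are placed before the text tokens, and that only W is updated in Stage 1. The function names and the tiny dimensions are hypothetical and chosen for illustration; this is not the authors' code.

```python
def project(features, W):
    """Map N x d_v vision features to N x d_lm visual tokens (H_v = Z_v @ W).

    features: list of N rows, each of length d_v (vision encoder output).
    W:        list of d_v rows, each of length d_lm (trainable projection).
    """
    columns = list(zip(*W))  # d_lm columns, each of length d_v
    return [[sum(z * w for z, w in zip(row, col)) for col in columns]
            for row in features]


def build_input_sequence(visual_tokens, text_embeddings):
    """Concatenate visual tokens and text embeddings into one LLM input.

    Assumption: visual tokens come first; both share width d_lm.
    """
    return visual_tokens + text_embeddings


def trainable_parameters(stage):
    """Return which parameter groups are updated in each training stage.

    Assumption: the vision encoder stays frozen in both stages.
    """
    if stage == 1:
        return {"W"}          # feature alignment: projection only
    if stage == 2:
        return {"W", "phi"}   # end-to-end: projection and LLM weights
    raise ValueError("stage must be 1 or 2")


# Toy example: 2 image patches with d_v = 4, projected to d_lm = 3.
Z_v = [[1, 2, 3, 4], [5, 6, 7, 8]]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
H_v = project(Z_v, W)
sequence = build_input_sequence(H_v, [[0, 0, 0]])  # one text token
```

In a real model the same idea would typically be a learned linear layer in a deep learning framework, with the stage controlling which parameters receive gradients.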

Experimental results

Research questions

  • RQ1: Can GPT-4-generated vision-language data enable effective visual instruction tuning of a multimodal model?
  • RQ2: How well can a CLIP-Vicuna architecture coupled with a GPT-4-generated data pipeline perform on open-ended multimodal tasks?
  • RQ3: Does combining LLaVA with GPT-4 yield state-of-the-art results on multimodal reasoning benchmarks?
  • RQ4: What is the value of different types of instruction-following data (conversation, detailed description, complex reasoning) for multimodal alignment?

Key findings

  • LLaVA shows strong multimodal chat capabilities, sometimes exhibiting behavior similar to multimodal GPT-4 on unseen images and instructions.
  • On a synthetic multimodal instruction-following dataset, LLaVA reaches an 85.1% relative score compared with GPT-4.
  • Fine-tuning on ScienceQA and ensembling with GPT-4 yields a new state-of-the-art accuracy of 92.53%.
  • The data-type ablation on LLaVA-Bench (COCO) shows substantial gains from instruction tuning; training on all three data types gives the best overall relative score, 85.1%.
  • Ablations indicate that pre-training and model scale materially affect results: the 13B LLaVA model reaches 90.92% on ScienceQA on its own and sets a new state of the art when combined with GPT-4.
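
The review does not spell out how LLaVA and GPT-4 are combined on ScienceQA. One plausible scheme, sketched below under that caveat, keeps the answer when both models agree, falls back to LLaVA when the text-only model gives no answer, and otherwise asks a judge to choose. The function `ensemble_answer` and the `judge` callable are hypothetical names for illustration, not an API from the paper's code base.

```python
from typing import Callable, Optional


def ensemble_answer(
    llava_answer: str,
    gpt4_answer: Optional[str],
    judge: Callable[[str, str], str],
) -> str:
    """Combine two model predictions for one multiple-choice question.

    llava_answer: prediction from the multimodal model.
    gpt4_answer:  prediction from the text-only model, or None if it declined.
    judge:        hypothetical callable that picks a final answer when the
                  two predictions disagree (e.g. a second text-only query).
    """
    if gpt4_answer is None:
        return llava_answer        # fallback: only one prediction exists
    if gpt4_answer == llava_answer:
        return llava_answer        # agreement: nothing to arbitrate
    return judge(llava_answer, gpt4_answer)


# Usage with a stand-in judge that always prefers the multimodal answer.
prefer_first = lambda a, b: a
final = ensemble_answer("B", "C", prefer_first)
```

The judge is passed in as a parameter so the arbitration step can be swapped out or mocked without touching the combination logic.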
Table 3: Example prompt from the GPT-4 paper [36] to compare visual reasoning and chat capabilities. Compared to BLIP-2 [28] and OpenFlamingo [5], LLaVA accurately follows the user’s instructions, instead of simply describing the scene. LLaVA offers a more comprehensive response than GPT-4.
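
The "relative score" quoted in the findings can be read as a ratio of judge ratings. The sketch below assumes a judge assigns each answer a numeric rating per question and that the relative score is the candidate's mean rating divided by the reference's mean rating, times 100. The function name and the ratio-of-means aggregation are assumptions made for illustration.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's mean rating as a percentage of the reference's.

    candidate_ratings: per-question judge ratings for the evaluated model.
    reference_ratings: per-question judge ratings for the reference answers
                       (e.g. those produced by GPT-4).
    """
    if not candidate_ratings or len(candidate_ratings) != len(reference_ratings):
        raise ValueError("ratings must be non-empty and of equal length")
    candidate_mean = sum(candidate_ratings) / len(candidate_ratings)
    reference_mean = sum(reference_ratings) / len(reference_ratings)
    if reference_mean == 0:
        raise ValueError("reference mean rating must be non-zero")
    return 100.0 * candidate_mean / reference_mean


# Toy ratings on a 1-10 scale for two questions (made-up numbers).
score = relative_score([8, 7], [9, 9])
```

Under this reading, a relative score of 85.1% means the model's answers were rated, on average, at about 85% of the reference answers' ratings.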

This review was created by AI and reviewed by human editors.