[Paper Walkthrough] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
TL;DR: VITA-1.5 introduces a three-stage training pipeline that integrates vision and speech into a multimodal LLM. It achieves near real-time vision-speech interaction without external ASR/TTS modules, with competitive results on image and video benchmarks and strong ASR performance.
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.
Research Motivation and Objectives
- Objective 1: Advance multimodal interaction by integrating vision and speech into a single LLM-based framework.
- Objective 2: Mitigate modality conflicts through staged training that progressively introduces vision and audio data.
- Objective 3: Eliminate dependence on separate ASR and TTS modules to reduce latency in end-to-end interactions.
- Objective 4: Demonstrate competitive performance on image, video, and speech benchmarks against open-source and proprietary models.
Proposed Method
- Method 1: Three-stage training pipeline that progressively incorporates vision and audio into a large language model (LLM).
- Method 2: Stage 1: Vision-Language training with vision alignment, vision understanding, and vision SFT using caption and QA data.
- Method 3: Stage 2: Audio Input tuning with audio alignment via an ASR-style encoder (CTC loss) and audio SFT for speech QA with mixed caption/QA data.
- Method 4: Stage 3: Audio Output tuning with an end-to-end speech generator composed of a codec plus non-autoregressive and autoregressive decoders that produce speech tokens and the waveform.
- Method 5: Input modalities use an InternViT visual encoder and a dedicated audio encoder with adapters; output relies on an end-to-end speech module rather than a separate TTS system.
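The staged schedule listed above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' training code: the stage names and their parts are taken from the list above, while `Stage`, `PIPELINE`, `flatten`, and `modalities_after` are hypothetical names introduced here only to show the ordering.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One stage of the staged schedule (hypothetical container)."""
    name: str
    parts: Tuple[str, ...]   # sub-steps or components named in the summary
    new_modality: str        # capability introduced at this stage


# Names follow the summary above; the structure itself is illustrative.
PIPELINE: Tuple[Stage, ...] = (
    Stage("Stage 1: Vision-Language Training",
          ("vision alignment", "vision understanding", "vision SFT"),
          "vision"),
    Stage("Stage 2: Audio Input Tuning",
          ("audio alignment", "audio SFT"),
          "audio input"),
    Stage("Stage 3: Audio Output Tuning",
          ("codec", "non-autoregressive decoder", "autoregressive decoder"),
          "audio output"),
)


def flatten(pipeline: Tuple[Stage, ...]) -> List[Tuple[str, str]]:
    """Return (stage name, part) pairs in the order they would be visited."""
    return [(stage.name, part) for stage in pipeline for part in stage.parts]


def modalities_after(pipeline: Tuple[Stage, ...], k: int) -> List[str]:
    """Capabilities the model is expected to have after the first k stages."""
    return [stage.new_modality for stage in pipeline[:k]]


if __name__ == "__main__":
    for stage_name, part in flatten(PIPELINE):
        print(f"{stage_name} -> {part}")
    print("After Stage 2:", modalities_after(PIPELINE, 2))
```

The point of the ordering is that each stage adds one capability on top of the previous ones, which is how the summary describes the pipeline limiting interference between modalities.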
Experimental Results
Research Questions
- RQ1: Can a single LLM be effectively trained to process and reason with vision, language, and audio inputs without modular ASR/TTS pipelines?
- RQ2: Does a staged training strategy sufficiently relieve cross-modality conflicts to preserve vision-language performance while enabling robust speech understanding and generation?
- RQ3: How does VITA-1.5 perform on image, video, and speech benchmarks compared to open-source and proprietary multimodal models?
- RQ4: What are the trade-offs in end-to-end speech generation quality and latency for real-time multimodal interaction?
Key Findings
- Finding 1: VITA-1.5 achieves vision-language performance competitive with leading open-source models and comparable to some closed-source systems on image benchmarks.
- Finding 2: After Stage 2 (Audio Input Tuning) and Stage 3 (Audio Output Tuning), the model retains most of its vision-language capabilities.
- Finding 3: The model exhibits strong ASR performance in both Mandarin and English benchmarks, surpassing several specialized speech models.
- Finding 4: Video understanding benchmarks show VITA-1.5 approaching open-source peers, with a larger gap to proprietary systems.
- Finding 5: An end-to-end speech generation module enables speech-to-speech interaction without external TTS, reducing latency.
- Finding 6: Training data covers diverse modalities (image, video, text, audio) and languages (Chinese and English), with 110k hours of ASR data and 3k hours of text-speech data.
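ASR results such as those in Finding 3 are commonly scored with an edit-distance-based error rate: character error rate (CER) is the usual choice for Mandarin and word error rate (WER) for English. The summary above does not say which metric or normalization was used, so the following is only a minimal sketch of the standard definitions; the function names and the example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER: character-level edit distance (whitespace ignored) over reference length."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Made-up transcripts, not taken from any benchmark.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(char_error_rate("今天天气很好", "今天天汽很好"))
```

Lower is better for both metrics. Published numbers usually also depend on text normalization (casing, punctuation), which this sketch leaves out.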
This walkthrough was generated by AI and reviewed by a human editor.