QUICK REVIEW

[Paper Review] Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil | arXiv (Cornell University) | Dec 19, 2023

Multimodal Machine Learning Applications | Cited by 790

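One-Sentence Summary

Gemini introduces a family of multimodal models (Ultra, Pro, Nano) trained on image, audio, video, and text data that achieves state-of-the-art results across a wide range of benchmarks and supports on-device use. Gemini Ultra reaches human-expert performance on MMLU and leads on 30 of 32 benchmarks.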

ABSTRACT

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Research Motivation and Objectives

Develop a single multimodal model family trained across text, image, audio, and video with strong cross-domain capabilities.
Enable variants for different deployment needs: Ultra (high capability), Pro (balanced performance and deployability), Nano (on-device).
Post-train models for improved quality, alignment, and safety; provide chat-focused and developer-focused variants.
Evaluate performance across broad internal and external benchmarks spanning language, coding, reasoning, and multimodal tasks.
Discuss responsible deployment, policies, and implications for real-world applications.

Proposed Method

Train Transformer decoder-based models with 32k context length and efficient attention (e.g., multi-query attention).
Jointly train on multimodal data (text, images, audio, video) with native text and image output capabilities.
Ingest audio signals at 16 kHz via Universal Speech Model (USM) features to capture nuanced audio information.
Use post-training to improve domain capabilities and safety alignment.
Distill Nano models from larger Gemini models for on-device deployment (1.8B and 3.25B variants).
Evaluate pre- and post-trained models on extensive language, coding, reasoning, and multimodal benchmarks.
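
The multi-query attention mentioned in the method list can be illustrated with a small sketch. This is a minimal pure-Python version for intuition only, not the Gemini implementation: the function names, tensor layout, and the example head count and head dimension are all assumptions made for illustration. The idea shown is that every query head attends over one shared set of keys and values, which shrinks the key/value cache by a factor equal to the number of heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_query_attention(queries, keys, values):
    """Causal multi-query attention (illustrative layout, not from the paper).

    queries: [num_heads][seq_len][head_dim], one query set per head
    keys:    [seq_len][head_dim], a single key set shared by all heads
    values:  [seq_len][head_dim], a single value set shared by all heads
    Returns  [num_heads][seq_len][head_dim].
    """
    head_dim = len(keys[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t attends only to positions 0..t.
            weights = softmax([dot(q, keys[s]) * scale for s in range(t + 1)])
            mixed = [
                sum(w * values[s][d] for s, w in enumerate(weights))
                for d in range(head_dim)
            ]
            head_out.append(mixed)
        outputs.append(head_out)
    return outputs


def kv_cache_entries(seq_len, head_dim, kv_heads):
    """Number of cached floats (keys plus values) for one layer."""
    return 2 * seq_len * head_dim * kv_heads


# Hypothetical toy inputs: 2 query heads, 2 positions, head_dim 2.
queries = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = multi_query_attention(queries, keys, values)
```

With an assumed 16 heads of dimension 128 at a 32k context, `kv_cache_entries(32768, 128, 16)` versus `kv_cache_entries(32768, 128, 1)` shows the 16x cache reduction that motivates sharing keys and values across heads.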

Experimental Results

Research Questions

RQ1: Can a single, jointly trained multimodal model family achieve state-of-the-art performance across text, image, audio, and video benchmarks?
RQ2: What are the trade-offs between Ultra, Pro, and Nano variants for accuracy, efficiency, and deployment?
RQ3: How does post-training affect factuality, attribution, and hedging in multimodal models?
RQ4: To what extent can multimodal models exhibit cross-modal reasoning and long-context capabilities?
RQ5: What are the multilingual and on-device capabilities of the Gemini family across languages and tasks?

Key Findings

Gemini Ultra achieves state-of-the-art results on 30 of 32 benchmarks and surpasses human expert performance on MMLU with 90.04% accuracy.
Gemini Ultra also sets a new state of the art on MMMU (62.4%) and improves the state of the art on every one of the 20 multimodal benchmarks examined.
Gemini Nano models (1.8B and 3.25B) are distilled from larger Gemini models and deliver strong on-device performance, especially on factuality, reasoning, and multilingual tasks.
Post-training mitigations halve the factuality inaccuracy rate, raise the attribution (AIS) score to 60.0%, and bring hedging accuracy to 69.3%.
With a 32k context length, the model retrieves the correct value with 98% accuracy when queried across the full context in synthetic retrieval tests.
Gemini enables complex systems like AlphaCode 2 by integrating Gemini Pro with search and tool-use for competitive programming tasks.
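
To make the long-context retrieval finding concrete, here is a minimal sketch of how a synthetic retrieval test of this kind might be set up. It is one plausible design and not the harness used in the report: the prompt format, the function names, and the `lookup_baseline` stand-in (which parses the prompt instead of calling a model) are all hypothetical. A real evaluation would replace `lookup_baseline` with a call to the model under test.

```python
import random

QUESTION_PREFIX = "What is the value of "


def build_retrieval_prompt(num_pairs, filler_words, rng):
    """Put key-value pairs at the start of the context, then pad with filler."""
    pairs = {f"key-{i}": f"value-{rng.randrange(10**6)}" for i in range(num_pairs)}
    header = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    filler = " ".join("filler" for _ in range(filler_words))
    target = rng.choice(sorted(pairs))
    prompt = f"{header}\n{filler}\n{QUESTION_PREFIX}{target}?"
    return prompt, pairs[target]


def lookup_baseline(prompt):
    """Hypothetical stand-in for a model: finds the queried key by parsing."""
    lines = prompt.split("\n")
    # The last line is the question; strip the prefix and the trailing "?".
    target = lines[-1][len(QUESTION_PREFIX):-1]
    for line in lines:
        key, sep, value = line.partition(": ")
        if sep and key == target:
            return value
    return ""


def retrieval_accuracy(answer_fn, trials, num_pairs, filler_words, seed=0):
    """Fraction of trials in which answer_fn returns the expected value."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prompt, expected = build_retrieval_prompt(num_pairs, filler_words, rng)
        correct += answer_fn(prompt) == expected
    return correct / trials
```

Increasing `filler_words` pushes the question further from the key-value pairs, so accuracy as a function of filler length gives a rough picture of how well retrieval holds up across the context window.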

This review was generated by AI and reviewed by human editors.