QUICK REVIEW

[Paper Review] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev | arXiv (Cornell University) | Mar 8, 2024

Semantic Web and Ontologies | Cited by 276

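One-sentence summary

Gemini 1.5 introduces two long-context multimodal models, Gemini 1.5 Pro and Gemini 1.5 Flash, that can recall and reason over millions of tokens of context, achieving near-perfect long-context retrieval and state-of-the-art results in long-document QA, long-video QA, and long-context automatic speech recognition (ASR).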

ABSTRACT

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

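Research motivation and goals

Advance multimodal understanding over extremely long context windows (millions of tokens).
Provide efficient variants that preserve quality (Gemini 1.5 Pro and Gemini 1.5 Flash).
Demonstrate improvements in long-context retrieval, long-document QA, long-video QA, and long-context ASR.
Demonstrate practical real-world impact and surprising capabilities on low-resource language tasks.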

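Proposed approach

Develop two models: Gemini 1.5 Pro (improved over the February version on benchmarks) and Gemini 1.5 Flash (more efficient, with minimal loss in quality).
Demonstrate near-perfect retrieval (>99%) across all modalities at context lengths of up to 10 million tokens.
Evaluate on long-document QA, long-video QA, and long-context ASR benchmarks, comparing against prior models such as Gemini 1.0 Ultra.
Analyze next-token prediction performance as context length grows to probe the limits of long-context ability.
Present real-world use cases that illustrate time savings and cross-domain capabilities.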
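The next-token prediction analysis listed above can be made concrete with a small sketch. One common way to study it is to average the negative log-likelihood (NLL) of tokens by their position in the sequence: if a model really uses long context, later positions should show lower NLL. The function name, bucket layout, and the log-probability values below are illustrative assumptions, not the report's actual evaluation code or data.

```python
import math
from typing import Dict, Sequence


def mean_nll_by_position(
    token_log_probs: Sequence[float],
    bucket_edges: Sequence[int],
) -> Dict[int, float]:
    """Average negative log-likelihood of the tokens in each position bucket.

    token_log_probs[i] is the (natural) log-probability the model assigned
    to the true token at position i. bucket_edges are ascending exclusive
    upper bounds, so [4, 8] means buckets [0, 4) and [4, 8). The result is
    keyed by each bucket's upper edge; empty buckets are skipped.
    """
    result: Dict[int, float] = {}
    start = 0
    for edge in bucket_edges:
        chunk = token_log_probs[start:edge]
        if chunk:
            result[edge] = -sum(chunk) / len(chunk)
        start = edge
    return result


# Hypothetical log-probabilities: the model is less certain early in the
# sequence (p = 0.25) and more certain later (p = 0.5).
example_log_probs = [math.log(0.25)] * 4 + [math.log(0.5)] * 4
nll_curve = mean_nll_by_position(example_log_probs, bucket_edges=[4, 8])
```

In this toy input the second bucket has a lower mean NLL than the first, which is the shape of curve one would look for when checking whether additional context keeps helping.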

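Experimental results

Research questions

RQ1: How well can Gemini 1.5 recall and reason over millions of tokens of multimodal data such as text, video, and audio?
RQ2: What is the trade-off between accuracy and efficiency for Gemini 1.5 Pro versus Gemini 1.5 Flash?
RQ3: Do long-context models reach state-of-the-art performance on long-document QA, long-video QA, and long-context ASR?
RQ4: What are the practical real-world impact and limitations of deploying Gemini 1.5 across diverse tasks, including low-resource languages?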

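Key findings

Gemini 1.5 achieves near-perfect retrieval (>99%) at context lengths of up to 10M tokens.
Gemini 1.5 Pro exceeds the February version on the great majority of capabilities and benchmarks.
Gemini 1.5 Flash delivers efficiency with minimal regression in quality relative to Pro.
The models set new state-of-the-art results in long-document QA, long-video QA, and long-context ASR.
In real-world scenarios, Gemini 1.5 achieves 26–75% time savings across 10 job categories.
The models show surprising capabilities, such as learning to translate Kalamang from grammar materials at a level comparable to a person who learned from the same content.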
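The retrieval figure in the findings above is a recall rate: the fraction of trials in which the model returns a fact that was planted somewhere in a long context. A minimal "needle in a haystack" harness, written under assumptions, might look like the sketch below. The names `build_haystack`, `retrieval_recall`, and `toy_model` are hypothetical, and `toy_model` is a regex stand-in rather than a real language model; it only sees a fixed-size prefix so that the effect of a limited context window is visible.

```python
import re
from typing import Callable, List, Sequence


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert the needle sentence at a relative depth in [0, 1] of the filler."""
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def retrieval_recall(
    answer_fn: Callable[[str, str], str],
    filler: List[str],
    secret: str,
    depths: Sequence[float],
) -> float:
    """Fraction of needle placements for which the answer contains the secret."""
    needle = f"The secret number is {secret}."
    question = "What is the secret number?"
    hits = 0
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        if secret in answer_fn(context, question):
            hits += 1
    return hits / len(depths)


def toy_model(context: str, question: str, window: int = 200) -> str:
    """Stand-in for a model: it can only 'read' the first `window` characters."""
    match = re.search(r"secret number is (\d+)", context[:window])
    return match.group(1) if match else "unknown"


filler_sentences = ["filler sentence."] * 100
depths = [0.0, 0.5, 1.0]

# A short window misses needles placed deep in the context.
short_window_recall = retrieval_recall(toy_model, filler_sentences, "7421", depths)

# A window larger than the whole context finds every needle.
full_window_recall = retrieval_recall(
    lambda ctx, q: toy_model(ctx, q, window=10**9),
    filler_sentences,
    "7421",
    depths,
)
```

In a real evaluation `answer_fn` would wrap a call to the model under test, and the recall would be reported per context length and per needle depth rather than as a single number.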


This review was generated by AI and reviewed by human editors.