QUICK REVIEW

[Paper Review] PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto|arXiv (Cornell University)|Dec 4, 2024

Geophysics and Sensor Technology | Cited by 10

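One-Sentence Summary

PaliGemma 2 upgrades the PaliGemma VLM by integrating Gemma 2 language models at three sizes and three image resolutions, enabling transfer to a broad set of tasks, including new ones, and reaching state-of-the-art results on OCR, table, chemistry, music, and medical imaging tasks.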

ABSTRACT

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

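Motivation and Goals

Study how model size and image resolution affect transfer performance under fine-tuning.
Extend the transfer tasks to OCR, table structure recognition, molecular structure recognition, music score recognition, long captioning, spatial reasoning, and radiography report generation.
Provide an open-weight VLM that can serve as a drop-in replacement for PaliGemma, to support broad transfer research and practical deployment.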

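Proposed Method

Combine the same vision encoder as PaliGemma (SigLIP-So400m) with Gemma 2 language models, giving PaliGemma 2 models of 3B, 10B, and 28B parameters (built on Gemma 2 2B, 9B, and 27B, respectively).
Train in three stages: unimodal pretraining of the components, joint multimodal pretraining at progressively higher resolutions (224px², 448px², 896px²), and finally task-specific fine-tuning.
As in prior work, apply soft-capping to the attention and output logits in Stages 1 and 2 to stabilize training.
Run large-scale pretraining on Cloud TPUv5e pod slices using a fully-sharded data-parallel (FSDP) setup.
Pretrain on a broad task mixture covering captioning, grounded captioning, OCR, VQA, detection, and instance segmentation.
Evaluate on 30+ transfer tasks and analyze the effects of model size, resolution, and transfer learning rate.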
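
The logit soft-capping listed among the method steps can be illustrated in a few lines. This is a minimal sketch, not the authors' implementation: it assumes the tanh-based form described in the Gemma 2 report, cap * tanh(x / cap), and the cap values (50.0 for attention logits, 30.0 for output logits) come from that report rather than from this summary. The function name `soft_cap` is hypothetical.

```python
import math

# Assumed cap values, as reported for Gemma 2 (not stated in this summary).
ATTN_LOGIT_CAP = 50.0
FINAL_LOGIT_CAP = 30.0


def soft_cap(logits, cap):
    """Smoothly squash each logit toward the range (-cap, cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


raw = [-200.0, -5.0, 0.0, 5.0, 200.0]
capped = soft_cap(raw, FINAL_LOGIT_CAP)

# Small logits pass through almost unchanged; large ones saturate near +/- cap.
print([round(v, 2) for v in capped])
```

Unlike hard clipping, the tanh form stays differentiable everywhere, so gradients shrink smoothly for extreme logits instead of dropping to zero at a threshold.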

Figure 1: PaliGemma 2 processes a 224px² / 448px² / 896px² image with a SigLIP-400m encoder with patch size 14px², yielding 256 / 1024 / 4096 tokens. After a linear projection, the image tokens are concatenated with the input text tokens and Gemma 2 autoregressively completes this prefix with an answer.
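
The token counts in the Figure 1 caption follow from the patch grid: a square image of side R pixels cut into 14px patches yields (R / 14)² image tokens. A small sketch of that arithmetic is below; the function name `num_image_tokens` and the example prompt length are illustrative assumptions.

```python
PATCH_SIZE = 14  # patch side in pixels, from the Figure 1 caption


def num_image_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of patch tokens for a square image of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    per_side = resolution // patch_size
    return per_side * per_side


# Prints one "resolution tokens" pair per line: 224 256, 448 1024, 896 4096.
for res in (224, 448, 896):
    print(res, num_image_tokens(res))

# The prefix that the language model completes is image tokens plus text tokens.
text_tokens = 12  # hypothetical prompt length
prefix_len = num_image_tokens(448) + text_tokens
```

Each doubling of resolution quadruples the number of image tokens, which is one reason the higher-resolution models are more expensive to train and run.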

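Experimental Results

Research Questions

RQ1: How do image resolution and language model size interact to affect transfer performance across tasks?
RQ2: Which transfer tasks benefit more from higher resolution than from a stronger language model?
RQ3: How does the optimal transfer learning rate vary with model size and resolution?
RQ4: Do the larger PaliGemma 2 models reach state-of-the-art results in new domains such as OCR, molecular structure recognition, and medical imaging?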

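Key Findings

Increasing image resolution and language model size generally improves transfer performance, but both dimensions raise compute cost.
The largest model (28B) still improves results on many tasks, but the gains tend to be smaller than those from the 3B→10B step.
The optimal transfer learning rate tends to be lower for larger models, so smaller learning rates should be explored as model size increases.
PaliGemma 2 3B at 896px² achieves state-of-the-art OCR results on ICDAR'15 Incidental and Total-Text, evaluated with the HierText competition protocol.
At suitable resolutions, PaliGemma 2 achieves state-of-the-art results on table structure recognition (PubTabNet, FinTabNet) and on molecular structure recognition (relative to MolScribe).
For radiography report generation, PaliGemma 2 achieves a state-of-the-art RadGraph F1 score, with further gains from higher resolution and larger models.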

Figure 2: Referring segmentation example from our PaliGemma demo. The model is pretrained with a vocabulary that includes localization tokens (for detection) and segmentation tokens (to define a binary mask inside a bounding box).
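
To make the localization tokens more concrete, here is a hedged sketch of turning them into pixel boxes. It assumes the format described in the public PaliGemma documentation, which this summary does not spell out: four tokens of the form `<locNNNN>` per box, ordered y_min, x_min, y_max, x_max, each a bin index from 0 to 1023 over the normalized image extent. The function name `parse_boxes` and the sample output string are hypothetical. Segmentation tokens are not decoded here, since turning them into a mask needs a separate learned decoder.

```python
import re

# Assumed format: four <locNNNN> tokens per box, ordered
# y_min, x_min, y_max, x_max, each a bin index in [0, 1023].
BOX_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]*)"
)


def parse_boxes(text, width, height):
    """Convert localization tokens in model output into pixel-space boxes."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        y0, x0, y1, x1 = (int(g) / 1024 for g in match.groups()[:4])
        boxes.append({
            "label": match.group(5).strip(),
            "box_xyxy": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return boxes


# Hypothetical model output for a detection prompt on a 448x448 image.
output = (
    "<loc0256><loc0128><loc0768><loc0512> cat ; "
    "<loc0000><loc0512><loc0512><loc1023> dog"
)
for box in parse_boxes(output, width=448, height=448):
    print(box)
```

Expressing coordinates as vocabulary tokens lets detection and segmentation share the same autoregressive text interface as captioning and VQA.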

This review was generated by AI and reviewed by human editors.