QUICK REVIEW

[Paper Review] PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto|arXiv (Cornell University)|Dec 4, 2024

Geophysics and Sensor Technology | Cited by: 10

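One-Sentence Summary

PaliGemma 2 upgrades the PaliGemma VLM by integrating Gemma 2 language models at three sizes and three image resolutions, enabling broad transfer and reaching state-of-the-art results on new tasks spanning OCR, table structure, chemistry, music scores, and medical imaging.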

ABSTRACT

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

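Motivation and Objectives

Investigate how model size and image resolution affect transfer performance under fine-tuning.
Extend the set of transfer tasks to OCR, table structure recognition, molecular structure recognition, music score recognition, long captioning, spatial reasoning, and radiography report generation.
Provide open-weight VLMs as a drop-in replacement for PaliGemma, enabling broad transfer studies and practical deployment.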

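Proposed Method

Combine one fixed vision encoder (SigLIP-So400m) with Gemma 2 language models at 2B, 9B, and 27B, giving PaliGemma 2 variants of 3B, 10B, and 28B total parameters.
Train in three stages: unimodal pretraining of the individual components, multimodal joint training at progressively higher resolution (224px^2, then 448px^2 and 896px^2), and finally task-specific fine-tuning.
To stabilize training, apply logit soft-capping to the attention and output logits in Stage 1 and Stage 2, following prior work.
Use Fully-Sharded Data-Parallel (FSDP) training on Cloud TPUv5e pods for large-scale pretraining.
Train on a broad, task-rich data mixture covering captioning, grounded captioning, OCR, VQA, detection, and instance segmentation.
Evaluate on 30+ transfer tasks and analyze the effect of model size, resolution, and transfer learning rate.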
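
The logit soft-capping step mentioned above can be illustrated with a minimal sketch. It assumes the usual formulation, cap * tanh(logit / cap), which squashes values smoothly into the open interval (-cap, cap). The cap values 50.0 (attention) and 30.0 (output) are assumptions based on commonly published Gemma 2 configurations; this review does not state them, and the function names are hypothetical.

```python
import math

# Assumed cap values (commonly published Gemma 2 settings; not given in this review).
ATTENTION_CAP = 50.0
OUTPUT_CAP = 30.0


def soft_cap(logit: float, cap: float) -> float:
    """Smoothly bound a logit: close to the identity near 0, saturating at +/- cap."""
    return cap * math.tanh(logit / cap)


def soft_cap_all(logits: list[float], cap: float) -> list[float]:
    """Apply soft-capping element-wise to a list of logits."""
    return [soft_cap(x, cap) for x in logits]


if __name__ == "__main__":
    raw = [-200.0, -5.0, 0.0, 5.0, 200.0]
    # Small logits pass through almost unchanged; extreme ones are bounded by the cap.
    print(soft_cap_all(raw, OUTPUT_CAP))
```

Because tanh is nearly linear around zero, typical logits are left almost untouched, while outliers that could destabilize training are bounded.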

Figure 1: PaliGemma 2 processes a 224px^2 / 448px^2 / 896px^2 image with a SigLIP-400m encoder with patch size 14px^2, yielding 256 / 1024 / 4096 tokens. After a linear projection, the image tokens are concatenated with the input text tokens and Gemma 2 autoregressively completes this prefix with an an
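
The token counts in the Figure 1 caption follow directly from the resolution and the patch size: a square image of side R split into 14px patches gives (R / 14)^2 image tokens. The sketch below reproduces that arithmetic and the length of the concatenated prefix; the function names are hypothetical and the text-token count is an arbitrary illustrative input.

```python
PATCH_SIZE = 14  # patch side length in pixels, from the Figure 1 caption


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens for a square image of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


def prefix_length(resolution: int, num_text_tokens: int) -> int:
    """Length of the prefix formed by image tokens followed by input text tokens."""
    return num_image_tokens(resolution) + num_text_tokens


if __name__ == "__main__":
    for res in (224, 448, 896):
        print(f"{res}px^2 -> {num_image_tokens(res)} image tokens")
```

Doubling the resolution quadruples the number of image tokens, which is one reason compute cost grows quickly with resolution.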

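Experimental Results

Research Questions

RQ1: How do image resolution and language model size interact in determining transfer performance across diverse tasks?
RQ2: Which benefits more transfer tasks: higher resolution or a more capable language model?
RQ3: How does the optimal transfer learning rate change with model size and resolution?
RQ4: Do the larger PaliGemma 2 variants achieve state-of-the-art results in new domains such as OCR, molecules, and medical imaging?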

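Key Findings

Increasing image resolution and language model size generally improves transfer performance, but raises compute cost along both dimensions.
Larger models (e.g., 28B) bring substantial gains on many tasks, although returns can level off compared with the 3B→10B step.
The optimal transfer learning rate tends to decrease for larger models, so the sweep should extend to smaller learning rates as model size grows.
PaliGemma 2 3B at 896px^2 achieves state-of-the-art OCR results on ICDAR'15 Incidental and Total-Text under the HierText evaluation protocol.
PaliGemma 2 achieves state-of-the-art results in table structure recognition on PubTabNet and FinTabNet and in molecular structure recognition in the MolScribe setup.
In radiography, PaliGemma 2 achieves a state-of-the-art RadGraph F1 score, improving with higher resolution and larger models.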

Figure 2: Referring segmentation example from our PaliGemma demo. The model is pretrained with a vocabulary that includes localization tokens (for detection) and segmentation tokens (to define a binary mask inside a bounding box).
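
To make the localization tokens in Figure 2 concrete, here is a minimal decoding sketch. It assumes the convention used in public PaliGemma examples: four tokens of the form `<locNNNN>` per box, ordered y_min, x_min, y_max, x_max, with each index in 0..1023 mapped to a pixel coordinate as index / 1024 * image size. This review states none of these details, so treat the token format, the ordering, and the scaling as assumptions; `decode_box` is a hypothetical helper, and segmentation tokens are not handled.

```python
import re

NUM_BINS = 1024  # assumed number of coordinate bins per axis
_LOC_PATTERN = re.compile(r"<loc(\d{4})>")


def decode_box(text: str, image_height: int, image_width: int) -> tuple[float, float, float, float]:
    """Convert four <locNNNN> tokens into (y_min, x_min, y_max, x_max) in pixels."""
    bins = [int(match) for match in _LOC_PATTERN.findall(text)]
    if len(bins) != 4:
        raise ValueError(f"expected 4 localization tokens, found {len(bins)}")
    y_min, x_min, y_max, x_max = bins
    return (
        y_min / NUM_BINS * image_height,
        x_min / NUM_BINS * image_width,
        y_max / NUM_BINS * image_height,
        x_max / NUM_BINS * image_width,
    )


if __name__ == "__main__":
    output = "<loc0256><loc0128><loc0768><loc0896> cat"
    print(decode_box(output, image_height=448, image_width=448))
```

Expressing boxes as ordinary vocabulary tokens lets the same autoregressive decoder emit detections without a dedicated detection head.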

This review was generated by AI and checked by a human editor.