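[Paper Review] Are we pretraining it right? Digging deeper into visio-linguistic pretraining
This paper systematically studies how the domain (visual and textual) of the pretraining data affects visio-linguistic transfer. It shows that in-domain or synthetic data can outperform standard, larger but mismatched datasets, and that simple pretraining choices can yield close to state-of-the-art results without architectural changes.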
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data but of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-art results on downstream tasks without any architectural changes.
Research Motivation and Objectives
- Investigate how the domain similarity between pretraining data and downstream tasks impacts performance in visio-linguistic models.
- Assess whether larger pretraining datasets always help, and how data quality and domain alignment influence transfer.
- Explore when pretraining helps low-resource downstream tasks and when it may be unnecessary or detrimental.
- Examine the transferability of pretrained representations by freezing model bases and comparing to full fine-tuning.
- Propose synthetic in-domain data generation as a scalable alternative to improve pretraining when labelled data are scarce.
Proposed Method
- Compare two major visio-linguistic architectures (VisualBERT and ViLBERT) trained with MLM and MMM objectives on three pretraining datasets (COCO Captions, VQA 2.0, Conceptual Captions).
- Evaluate on four downstream tasks (VQA-D, VizWiz, SNLI-VE, MM-IMDB) with different domain matches to pretraining data.
- Systematically vary pretraining dataset size (including CC-Small, CC-full) and measure impact on downstream performance.
- Freeze the pretraining base to quantify transferability of learned representations.
- Introduce a generated in-domain dataset (CCG) by captioning in-domain images and compare its effectiveness to natural in-domain/out-of-domain data.
- Analyze representation drift via L1 and angular distances between pretrained and finetuned self-attention weights.
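The drift analysis in the last item can be illustrated with a minimal sketch. The review does not give the paper's exact definitions, so the following are assumptions: weights are flattened into plain lists of floats, the L1 distance is averaged over elements, and the angular distance is the angle between the two vectors divided by pi so that it falls in [0, 1]. The function names `l1_distance` and `angular_distance` are hypothetical.

```python
import math
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two flattened weight vectors.

    Assumption: averaged over elements; the paper may use a plain sum.
    """
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def angular_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle between two weight vectors, scaled to [0, 1] by dividing by pi.

    Assumption: this normalization is one common convention, not
    necessarily the one used in the paper.
    """
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine) / math.pi


if __name__ == "__main__":
    # Toy stand-ins for flattened self-attention weights of one layer.
    pretrained = [0.5, -0.2, 0.1, 0.0]
    finetuned = [0.4, -0.1, 0.3, 0.0]
    print("L1 drift:", l1_distance(pretrained, finetuned))
    print("Angular drift:", angular_distance(pretrained, finetuned))
```

Under these assumptions, a small value of either metric for a layer means finetuning left that layer close to its pretrained state, while a large value means the layer was substantially rewritten for the downstream task.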
Experimental Results
Research Questions
- RQ1: How does the visual and textual domain similarity between pretraining data and downstream tasks affect visio-linguistic transfer?
- RQ2: Is simply using the largest pretraining dataset always beneficial, or do domain and quality matter more?
- RQ3: Do pretrained representations transfer better when the downstream task is low-resource, and what factors influence this?
- RQ4: Can synthetic in-domain data close the domain gap and improve pretraining effectiveness without architectural changes?
- RQ5: Which architecture (VisualBERT vs ViLBERT) yields more transferable representations under various pretraining conditions?
Key Findings
- Pretraining domains aligned with downstream tasks yield better results than out-of-domain pretraining across VisualBERT and ViLBERT.
- VQA-P and COCO pretraining often outperform CC-Small and CC-full, especially when both visual and textual domains match the downstream task; MM-IMDB rarely benefits from pretraining.
- Pretraining improves transferability vs random initialization, but benefits vary by task; some low-resource tasks show little or negative gains from pretraining.
- A generated in-domain dataset (CCG) can outperform out-of-domain CC and approaches in-domain performance, suggesting a scalable path to improve pretraining when labelled data are scarce.
- VisualBERT consistently outperforms ViLBERT in the reported settings; freezing the base model reveals limited transfer for ViLBERT in some tasks.
- The best pretrained model for finetuning does not always equal the most transferable one when freezing the base; finetuning can recover the best downstream performance.
This review was generated by AI and checked by a human editor.