QUICK REVIEW

[Paper Review] Pythia v0.1: the Winning Entry to the VQA Challenge 2018

Yu Jiang, Vivek Natarajan|arXiv (Cornell University)|Jul 26, 2018

Multimodal Machine Learning Applications | References: 16 | Citations: 165

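ONE-LINE SUMMARY

Pythia v0.1 is a modular VQA framework that improves the up-down attention model through architecture adjustments, a revised learning schedule, fine-tuned features, data augmentation, and a diverse ensemble, reaching state-of-the-art results on VQA v2.0 at the time of the 2018 challenge.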

ABSTRACT

This document describes Pythia v0.1, the winning entry from Facebook AI Research (FAIR)'s A-STAR team to the VQA Challenge 2018. Our starting point is a modular re-implementation of the bottom-up top-down (up-down) model. We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on VQA v2.0 dataset -- from 65.67% to 70.22%. Furthermore, by using a diverse ensemble of models trained with different features and on different datasets, we are able to significantly improve over the 'standard' way of ensembling (i.e. same model with different random seeds) by 1.31%. Overall, we achieve 72.27% on the test-std split of the VQA v2.0 dataset. Our code in its entirety (training, evaluation, data-augmentation, ensembling) and pre-trained models are publicly available at: https://github.com/facebookresearch/pythia

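MOTIVATION AND OBJECTIVES

Motivate the development of Pythia, a modular platform for VQA research.
Show that targeted changes to the architecture and training procedure improve VQA accuracy.
Show that data augmentation and fine-tuned features improve performance.
Explore the benefits of grid features and of diverse ensembles over the standard same-model, different-seed ensemble.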

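PROPOSED METHOD

Re-implement the bottom-up top-down (up-down) attention model as a modular framework.
Replace the gated tanh activation with weight normalization and ReLU; fuse features with the Hadamard (element-wise) product and use a sigmoid classifier.
Use 300D GloVe embeddings, GRU-based question encoding, and a question attention module.
Train with Adamax, using a learning-rate schedule that has a warm-up period followed by step-wise decay.
Fine-tune the bottom-up features using a Detectron FPN-based detector and 2048D fc6/fc7 features.
Augment the training data with Visual Genome and Visual Dialog (VisDial), create mirrored images with the "left" and "right" tokens swapped, and add grid features and 100 bounding-box proposals.
Build two ensembles: (i) the same model trained with different random seeds; (ii) diverse models trained with different features and data sources.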
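
Two of the training-side ideas listed above, the warm-up plus step-decay learning-rate schedule and the left/right token swap for mirrored images, can be sketched in plain Python. This is a minimal illustration, not Pythia's actual code: the function names `warmup_step_lr` and `mirror_text` are hypothetical, and every numeric value (starting rate, peak rate, warm-up length, decay milestones, decay factor) is a placeholder, since this review does not state the schedule's hyperparameters.

```python
import re


def warmup_step_lr(step, base_lr=0.01, warmup_start=0.002, warmup_steps=1000,
                   milestones=(5000, 7000, 9000), decay_factor=0.1):
    """Learning rate at iteration `step`: linear warm-up, then step decay.

    All default values are illustrative placeholders, not the paper's settings.
    """
    if step < warmup_steps:
        # Linearly interpolate from warmup_start up to base_lr.
        frac = step / warmup_steps
        return warmup_start + frac * (base_lr - warmup_start)
    # After warm-up, multiply by decay_factor once per milestone already passed.
    n_decays = sum(step >= m for m in milestones)
    return base_lr * decay_factor ** n_decays


_SWAP = {"left": "right", "right": "left"}


def mirror_text(text):
    """Swap the words 'left' and 'right' so the text matches a flipped image."""
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        # Preserve a leading capital letter, e.g. "Left" -> "Right".
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    for s in (0, 500, 1000, 5000, 9000):
        print(s, warmup_step_lr(s))
    print(mirror_text("What is on the left of the dog?"))
```

Note that a word-level swap like this also flips "right" when it means "correct", so a real augmentation pipeline would likely need extra filtering.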

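EXPERIMENTAL RESULTS

Research Questions

RQ1: Does modularizing VQA research into interchangeable components improve reusability and performance?
RQ2: How do architectural changes (activation function, fusion), the learning-rate schedule, and feature fine-tuning affect VQA accuracy?
RQ3: Do data augmentation and additional grid-based image features improve performance beyond bottom-up features alone?
RQ4: Does a diverse ensemble of different models outperform an ensemble of the same architecture trained with different seeds?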

Key Findings

Model	test-dev (%)	test-std (%)
up-down [1]	65.32	65.67
up-down Model Adaptation (§2.1)	66.91	-
+ Learning Schedule (§2.2)	68.05	-
+ Detectron & Fine-tuning (§2.3)	68.49	-
+ Data Augmentation * (§2.4)	69.24	-
+ Grid Feature * (§2.5)	69.81	-
+ 100 bboxes * (§2.5)	70.01	70.24
Ensemble, 30 × same model (§2.6)	70.96	-
Ensemble, 30 × diverse model (§2.6)	72.18	72.27
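
The per-step contributions implied by the table can be tallied with a few lines of Python. The sketch below only subtracts consecutive test-dev numbers copied from the table. The helper name `incremental_gains` is an illustrative choice, and the subtraction assumes each row is cumulative on the one before it, as the "+" prefixes suggest.

```python
# test-dev accuracy (%) after each cumulative change, copied from the table.
STEPS = [
    ("up-down baseline", 65.32),
    ("model adaptation", 66.91),
    ("learning schedule", 68.05),
    ("Detectron + fine-tuning", 68.49),
    ("data augmentation", 69.24),
    ("grid feature", 69.81),
    ("100 bboxes", 70.01),
]


def incremental_gains(rows):
    """Return (step name, gain over the previous row) in percentage points."""
    return [
        (name, round(acc - prev_acc, 2))
        for (_, prev_acc), (name, acc) in zip(rows, rows[1:])
    ]


gains = incremental_gains(STEPS)
total_gain = round(STEPS[-1][1] - STEPS[0][1], 2)

if __name__ == "__main__":
    for name, gain in gains:
        print(f"{name:<26} +{gain:.2f}")
    print(f"{'total (single model)':<26} +{total_gain:.2f}")
```

Under that assumption, model adaptation (+1.59) and the learning schedule (+1.14) are the two largest single steps in the 4.69-point test-dev gain from the baseline to the 100-bbox model.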

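The up-down baseline reaches 65.32% on test-dev and 65.67% on test-std.
Model adaptation raises test-dev to 66.91% (test-std not reported).
The improved learning schedule raises test-dev to 68.05%.
Detectron-based fine-tuning of the bottom-up features raises test-dev to 68.49%.
Data augmentation raises test-dev to 69.24%.
Grid features raise test-dev to 69.81%.
Using 100 object proposals raises test-dev to 70.01% and test-std to 70.24%.
The ensemble of 30 diverse models reaches 72.18% on test-dev and 72.27% on test-std (state of the art), compared with 70.96% on test-dev for the ensemble of 30 copies of the same model.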
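
On test-dev, the diverse ensemble's margin over the same-model ensemble in the table is 72.18 - 70.96 = 1.22 points. This review does not say how the individual models' predictions are combined, so the sketch below assumes one common choice: average each candidate answer's score across models and pick the highest. The function `ensemble_predict` and the toy scores are hypothetical and are not taken from the Pythia code.

```python
def ensemble_predict(per_model_scores):
    """Average per-answer scores across models and return the top answer.

    `per_model_scores` is a list with one dict per model, mapping a candidate
    answer to that model's score. An answer missing from a model's dict is
    treated as having score 0 for that model. Score averaging is an assumed
    combination rule, not one confirmed by the source.
    """
    if not per_model_scores:
        raise ValueError("need at least one model's scores")
    totals = {}
    for scores in per_model_scores:
        for answer, score in scores.items():
            totals[answer] = totals.get(answer, 0.0) + score
    n_models = len(per_model_scores)
    averaged = {answer: total / n_models for answer, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


if __name__ == "__main__":
    # Toy scores from three hypothetical models for one question.
    toy = [
        {"yes": 0.6, "no": 0.4},
        {"yes": 0.3, "no": 0.7},
        {"yes": 0.4, "no": 0.6},
    ]
    answer, avg = ensemble_predict(toy)
    print(answer, avg)
```

The same function covers both ensemble types described in the method section: only the source of the score dicts differs (same model with different seeds versus models trained on different features and datasets).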

This review was generated by AI and checked by a human editor.