QUICK REVIEW

[Paper Review] Attention on Attention for Image Captioning

Lun Huang, Wenmin Wang | arXiv (Cornell University) | Aug 19, 2019

Multimodal Machine Learning Applications | References: 48 | Citations: 51

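One-Line Summary

This paper introduces Attention on Attention (AoA), a module that measures how relevant the attention result is to the query, and applies it in both the encoder and the decoder to build AoANet, which achieves a state-of-the-art CIDEr-D score on MS COCO.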

ABSTRACT

Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an Attention on Attention (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an information vector and an attention gate using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the attended information, the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-of-the-art performance of 129.8 CIDEr-D score on MS COCO Karpathy offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.

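Motivation and Objectives

Identify and address a weakness of standard attention in image captioning: the attended vector may not be relevant to the query.
Propose AoA as an extension of conventional attention that evaluates and exploits the relevance of the attended information.
Build AoANet by integrating AoA into both the image-encoder refinement stage and the caption decoder.
Evaluate performance on MS COCO and demonstrate state-of-the-art results.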

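Proposed Method

Define Attention on Attention (AoA): generate an information vector and an attention gate from the attention result and the current context, then apply the gate by element-wise multiplication to obtain the attended information.
Apply AoA to both the encoder and the decoder. In the encoder, a self-attention refinement module with AoA models relationships among objects. In the decoder, AoA filters and reinforces the useful attention output during generation.
The encoder refinement uses multi-head self-attention and applies AoA together with a residual connection and layer normalization; the decoder is an LSTM combined with an AoA-based context vector.
Train with cross-entropy (XE) loss, then optimize CIDEr-D with Self-Critical Sequence Training (SCST); the inputs are bottom-up Faster R-CNN features projected to 1024 dimensions.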
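
The gating step described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (their code is at the repository linked in the abstract). It assumes single-head scaled dot-product attention as the underlying attention function, a toy dimension of 4, and random untrained weights. All names (`attend`, `aoa`, `W_info`, `W_gate`, `random_params`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attend(query, keys, values):
    """Conventional attention (assumed here: scaled dot-product, one head)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    # Weighted average of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def aoa(query, keys, values, params):
    """AoA sketch: gate an information vector built from query and attention result."""
    attended = attend(query, keys, values)
    # Concatenating [query; attended] and using one matrix is equivalent to
    # applying separate linear maps to each part and summing them.
    context = query + attended
    info = [a + b for a, b in
            zip(matvec(params["W_info"], context), params["b_info"])]
    gate = [sigmoid(a + b) for a, b in
            zip(matvec(params["W_gate"], context), params["b_gate"])]
    # Element-wise multiplication: the "second attention" over channels.
    return [g * i for g, i in zip(gate, info)], gate


def random_params(dim, rng):
    """Hypothetical untrained weights, only to make the sketch runnable."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)]
                for _ in range(dim)]
    return {"W_info": mat(), "b_info": [0.0] * dim,
            "W_gate": mat(), "b_gate": [0.0] * dim}


rng = random.Random(0)
dim = 4
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
values = keys  # toy setup: keys and values share the same vectors
query = [rng.uniform(-1, 1) for _ in range(dim)]
params = random_params(dim, rng)
attended_info, gate = aoa(query, keys, values, params)
```

Each gate entry lies strictly between 0 and 1, so a channel of the information vector is passed through only to the extent that the gate judges the attention result relevant to the query. Pushing the gate bias strongly negative drives the output toward zero, which is the sense in which an irrelevant attention result is filtered out.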

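Experimental Results

Research Questions

RQ1: Can AoA reliably measure and enforce the relevance between the attention result and the decoding context in image captioning?
RQ2: Does applying AoA in the encoder improve the modeling of relationships among objects, and does applying it in the decoder reduce misleading attention during generation?
RQ3: How does AoANet compare with previous state-of-the-art methods on the standard MS COCO benchmark?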

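Key Findings

A single AoANet model reaches 119.8 CIDEr-D on the offline MS COCO split with XE training, outperforming previous single-model methods.
An ensemble of four AoANet models reaches 132.0 CIDEr-D on the offline MS COCO split after CIDEr-D optimization.
On the official online server, AoANet achieves 129.6 CIDEr-D (C40) and ranks first on most metrics.
Qualitative and ablation analyses indicate that AoA reduces misleading attention and improves object counting and the understanding of interactions (examples: birds on a giraffe, a tennis racket).
A generalization experiment on MSR-VTT indicates that AoA also improves video captioning, with similar gains.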

This review was generated by AI and checked by a human editor.