QUICK REVIEW

[Paper Review] Adaptively Aligned Image Captioning via Adaptive Attention Time

Lun Huang, Wenmin Wang|arXiv (Cornell University)|Sep 19, 2019

Multimodal Machine Learning Applications | Cited by 39

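One-line summary

This paper introduces Adaptive Attention Time (AAT), a differentiable mechanism that adaptively decides how many attention steps to take at each decoding step. It improves over both fixed single-step attention and recurrent attention models.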

ABSTRACT

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming one-to-one mapping from source image regions and target caption words, which is never possible. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and doesn't introduce any noise to the parameter gradients. In this paper, we empirically show that AAT improves over state-of-the-art methods on the task of image captioning. Code is available at https://github.com/husthuaan/AAT.

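Motivation and objectives

Improve image captioning by removing the one-to-one assumption between image regions and words made by standard attention models.
Enable an adaptive number of attention steps for each word so that image regions and captions can be aligned flexibly.
Allow adaptive computation during decoding while preserving differentiability and training stability.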

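Proposed method

Proposes Adaptive Attention Time (AAT), which learns how many attention steps to take at each decoding step.
Embeds AAT in a two-layer LSTM encoder-decoder whose attention module can run multiple attention steps per word.
Uses a confidence network, inspired by Adaptive Computation Time (ACT), to decide when to stop attending and emit a word.
Incorporates multi-head attention to better capture interactions among image regions.
Adds a time-cost penalty during training to balance accuracy against computation.
Shows that base, recurrent, and adaptive attention models can each be viewed as special cases of AAT.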
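
The halting idea in the list above can be sketched in a few lines. This is a minimal illustration in the general style of ACT halting, not the authors' implementation: the names `adaptive_attention_steps` and `total_loss`, the `max_steps` and `eps` parameters, and the exact halting rule and cost term are all assumptions made for the example. The `attend` and `confidence` callables stand in for the attention module and the confidence network.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def adaptive_attention_steps(
    state: Vector,
    attend: Callable[[Vector], Vector],
    confidence: Callable[[Vector], float],
    max_steps: int = 4,   # hypothetical cap on attention steps per word
    eps: float = 1e-2,    # hypothetical halting tolerance
) -> Tuple[Vector, int, float]:
    """ACT-style halting loop (illustrative assumption, not the paper's equations).

    Returns (mixed_state, steps_taken, step_cost).
    """
    mixed = [0.0] * len(state)
    halted = 0.0      # halting probability accumulated so far
    remainder = 1.0
    steps = 0
    while steps < max_steps:
        state = attend(state)      # one more attention step
        steps += 1
        p = confidence(state)      # probability in (0, 1) of stopping now
        is_last = steps == max_steps or halted + p >= 1.0 - eps
        # The last step takes whatever probability mass is left,
        # so the mixing weights always sum to 1.
        weight = (1.0 - halted) if is_last else p
        mixed = [m + weight * s for m, s in zip(mixed, state)]
        if is_last:
            remainder = weight
            break
        halted += p
    # Step count plus remainder, as in ACT-style ponder costs.
    return mixed, steps, float(steps) + remainder


def total_loss(caption_loss: float, step_costs: Sequence[float], lam: float = 1e-4) -> float:
    """Caption loss plus a lambda-weighted time-cost penalty (assumed form)."""
    return caption_loss + lam * sum(step_costs)
```

A confident step halts early, while low confidence uses more steps up to the cap. For instance, calling `adaptive_attention_steps([1.0, 2.0], lambda s: s, lambda s: 0.9)` stops after two steps, and raising `lam` in `total_loss` makes extra steps more expensive during training.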

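Experimental results

Research questions

RQ1: Does an adaptive number of attention steps per decoding step improve caption quality over single-step or fixed-step attention models?
RQ2: What is the trade-off between the computational cost of attention time and caption quality?
RQ3: How do the number of attention heads and the choice between additive and dot-product attention affect results in this framework?
RQ4: Can the adaptive attention mechanism generalize to encoder-decoder tasks other than image captioning?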

Key findings

Model	Cross-entropy BLEU-4	Cross-entropy METEOR	Cross-entropy ROUGE	Cross-entropy CIDEr-D	Cross-entropy SPICE	Self-critical BLEU-4	Self-critical METEOR	Self-critical ROUGE	Self-critical CIDEr-D	Self-critical SPICE
LSTM	29.6	25.2	52.6	94.0	-	31.9	25.5	54.3	106.3	-
ADP-ATT	33.2	26.6	-	108.5	-	-	-	-	-	-
SCST	30.0	25.9	53.4	99.4	-	34.2	26.7	55.7	114.0	-
Up-Down	36.2	27.0	56.4	113.5	20.3	36.3	27.7	56.9	120.1	21.4
RFNet	35.8	27.4	56.8	112.5	20.5	36.5	27.7	57.3	121.9	21.2
GCN-LSTM	36.8	27.9	57.0	116.3	20.9	38.2	28.5	58.3	127.6	22.0
SGAE	-	-	-	-	-	38.4	28.4	58.6	127.8	22.1
AAT (Ours)	37.0	28.1	57.3	117.2	21.2	38.7	28.6	58.5	128.6	22.2

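On MS COCO (Karpathy split), AAT outperforms the base and recurrent attention models across METEOR, CIDEr-D, and SPICE, while taking an average of 2.55 attention steps per decoding step.
With λ = 1e-4, AAT achieves strong performance while keeping the average number of attention steps relatively low in the ablations (2.54–2.84).
Multi-head additive attention (8 heads) gives the best balance, reaching CIDEr-D 128.6 and SPICE 22.2 under self-critical training.
Compared with Up-Down (the earlier state of the art), AAT improves BLEU-4, METEOR, ROUGE-L, CIDEr-D, and SPICE in both training stages, for example CIDEr-D from 113.5 to 117.2 under cross-entropy and from 120.1 to 128.6 under self-critical training.
In the authors' reported results, a single AAT model reaches 128.6 CIDEr-D on the MS COCO test set, which was state-of-the-art performance at the time.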

This review was generated by AI and checked by a human editor.