QUICK REVIEW

[Paper Review] GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation

Taehyeong Kim, Min-Oh Heo|arXiv (Cornell University)|May 28, 2018

Multimodal Machine Learning Applications|References: 31|Citations: 53

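One-line summary

GLAC Net introduces two-level glocal attention and context cascading to generate coherent stories from multiple images. It reaches a competitive METEOR score on VIST without relying on beam search.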

ABSTRACT

The task of multi-image cued story generation, such as visual storytelling dataset (VIST) challenge, is to compose multiple coherent sentences from a given sequence of images. The main difficulty is how to generate image-specific sentences within the context of overall images. Here we propose a deep learning network model, GLAC Net, that generates visual stories by combining global-local (glocal) attention and context cascading mechanisms. The model incorporates two levels of attention, i.e., overall encoding level and image feature level, to construct image-dependent sentences. While standard attention configuration needs a large number of parameters, the GLAC Net implements them in a very simple way via hard connections from the outputs of encoders or image features onto the sentence generators. The coherency of the generated story is further improved by conveying (cascading) the information of the previous sentence to the next sentence serially. We evaluate the performance of the GLAC Net on the visual storytelling dataset (VIST) and achieve very competitive results compared to the state-of-the-art techniques. Our code and pre-trained models are available here.

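Motivation and objectives

Motivate the problem of multi-image cued story generation, stressing that each sentence must be specific to its image yet consistent with the overall context.
Propose GLAC Net, which uses two-level (global and local) attention to ground each sentence in both the context of the whole sequence and the image-specific features.
Incorporate a cascading mechanism that initializes the generation of each sentence with the last hidden state of the previous sentence, improving story coherence.
Achieve competitive performance on the VIST dataset and release code and pre-trained models for reproducibility.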

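Proposed method

Extract image features from each image with ResNet-152.
Encode the image sequence with a bidirectional LSTM to capture the overall context.
Compute a glocal vector by concatenating the image-specific feature with the bi-LSTM output (hard attention via direct connections).
Feed the glocal vector to the decoder to generate a sentence for each image.
Apply the cascading mechanism: initialize each sentence generator with the last hidden state of the previous sentence.
Use a sampling- and count-based post-processing strategy to reduce repeated words and increase diversity.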
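
The data flow in the steps above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `encode_sequence` is a toy stand-in for the bi-LSTM (running means in each direction), `decode_sentence` is a toy recurrent update that emits no words, and all function names, the hidden size, and the step count are hypothetical. It shows just two ideas from the list: the glocal vector as a concatenation of sequence context and the image's own feature, and the cascading of the final hidden state into the next sentence.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode_sequence(features: List[Vector]) -> List[Vector]:
    """Toy stand-in for the bi-LSTM encoder (NOT the real model).

    Position i gets [mean of features 0..i] + [mean of features i..end],
    mimicking a forward and a backward pass over the image sequence.
    """
    return [
        mean_vector(features[: i + 1]) + mean_vector(features[i:])
        for i in range(len(features))
    ]


def glocal_vectors(features: List[Vector]) -> List[Vector]:
    """Concatenate the sequence-level context with each image's own feature."""
    contexts = encode_sequence(features)
    return [ctx + local for ctx, local in zip(contexts, features)]


def decode_sentence(glocal: Vector, hidden: Vector, steps: int = 3) -> Vector:
    """Toy recurrent decoder: returns the hidden state after `steps` updates.

    A real decoder would also emit one word per step; that is omitted here.
    """
    drive = sum(glocal) / len(glocal)
    for _ in range(steps):
        hidden = [math.tanh(0.5 * h + drive) for h in hidden]
    return hidden


def generate_story(
    features: List[Vector], hidden_size: int = 4, cascade: bool = True
) -> List[Vector]:
    """Return the final hidden state of each sentence generator.

    With cascade=True, sentence i+1 starts from the final state of sentence i;
    with cascade=False, every sentence starts from zeros.
    """
    hidden = [0.0] * hidden_size
    finals = []
    for glocal in glocal_vectors(features):
        start = hidden if cascade else [0.0] * hidden_size
        hidden = decode_sentence(glocal, start)
        finals.append(hidden)
    return finals
```

Calling `generate_story(features, cascade=False)` on a sequence of identical image features gives the same final state for every sentence, whereas `cascade=True` lets each sentence start from where the previous one ended. In this toy setting, that is the whole effect of cascading.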

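Experimental results

Research questions

RQ1: Can two-level attention over global context and image-specific features improve image-grounded story generation for image sequences?
RQ2: Does cascading sentence-level context across the sequence increase the overall coherence of the generated story?
RQ3: How do global/local attention, cascading, and post-processing each affect standard metrics (e.g., METEOR, perplexity) on the VIST dataset?

Key findings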

Model	Validation perplexity	Test perplexity	METEOR score
Baselines (Beam=10)	-	-	0.2313
Baselines (Greedy)	-	-	0.2776
Baselines (-Dups)	-	-	0.3011
Baselines (+Grounded)	-	-	0.3142
LSTM Seq2Seq	21.89	22.18	0.2721
GLAC Net (-Cascading)	20.24	20.54	0.3063
GLAC Net (-Global)	18.32	18.47	0.2913
GLAC Net (-Local)	18.21	18.33	0.2996
GLAC Net (-Count)	18.13	18.28	0.2823
GLAC Net	18.13	18.28	0.3014
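
As a reading aid for the ablation rows, the short script below computes how much METEOR the full model gains or loses relative to each ablated variant. The scores are copied from the table; the variable names are arbitrary, and the script is not part of the paper's released code.

```python
# METEOR scores copied from the table above.
meteor = {
    "GLAC Net": 0.3014,
    "GLAC Net (-Cascading)": 0.3063,
    "GLAC Net (-Global)": 0.2913,
    "GLAC Net (-Local)": 0.2996,
    "GLAC Net (-Count)": 0.2823,
}

full = meteor["GLAC Net"]

# Positive delta: the full model scores higher than the ablated variant.
deltas = {
    name: round(full - score, 4)
    for name, score in meteor.items()
    if name != "GLAC Net"
}

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {delta:+.4f}")
```

On these numbers, removing the count heuristic costs the most METEOR (0.0191), followed by removing global attention (0.0101) and local attention (0.0018). Removing cascading raises METEOR by 0.0049, although that variant also has the worst perplexity of the GLAC Net configurations.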

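GLAC Net is competitive with the VIST baselines without using beam search (METEOR 0.3014, versus 0.2313 to 0.3142 for the baselines).
The ablation analysis shows how cascading, global attention, local attention, and post-processing each contribute to performance.
The full GLAC Net attains the lowest validation and test perplexity among the tested configurations (18.13 and 18.28) with a METEOR of 0.3014. The variant without cascading scores a slightly higher METEOR (0.3063) but clearly worse perplexity.
The count-based heuristic for reducing duplicates raises METEOR from 0.2823 to 0.3014 at unchanged perplexity and reduces repetition.
Compared with the LSTM Seq2Seq baseline, GLAC Net has lower test perplexity (18.28 versus 22.18) and higher METEOR (0.3014 versus 0.2721).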
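
One way a count-based repetition penalty of this kind could work is sketched below. The review does not give the exact rule, so the penalty form `1 / (1 + k * count)`, the sensitivity `k`, and both function names are assumptions made for illustration, not the authors' implementation. The per-step word probabilities would come from the decoder; here they are passed in as plain dictionaries.

```python
import random
from collections import Counter
from typing import Dict, List, Optional


def penalized_distribution(
    probs: Dict[str, float], counts: Counter, k: float = 5.0
) -> Dict[str, float]:
    """Scale each word's probability by 1 / (1 + k * count), then renormalize.

    `counts[w]` is how often word w has already been generated (assumed rule).
    """
    scaled = {w: p / (1.0 + k * counts[w]) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}


def sample_words(
    step_probs: List[Dict[str, float]], k: float = 5.0, seed: Optional[int] = None
) -> List[str]:
    """Sample one word per step, penalizing words that were already used."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    words_out = []
    for probs in step_probs:
        dist = penalized_distribution(probs, counts, k)
        vocab = list(dist)
        word = rng.choices(vocab, weights=[dist[w] for w in vocab], k=1)[0]
        counts[word] += 1
        words_out.append(word)
    return words_out
```

For example, with probabilities `{"great": 0.6, "fun": 0.4}`, `k = 5`, and "great" already used once, the penalized probabilities become 0.2 for "great" and 0.8 for "fun", so the sampler is pushed toward the unused word.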


This review was generated by AI and checked by a human editor.