QUICK REVIEW

[Paper Review] Hierarchical Long-term Video Prediction without Supervision

Nevan Wichers, Ruben Villegas|arXiv (Cornell University)|Jun 12, 2018

Advanced Data Compression Techniques|Cited by 64

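One-Sentence Summary

This paper presents EPVA, an unsupervised hierarchical video prediction framework that learns high-level features and predicts long-term future frames without ground-truth high-level supervision. An adversarial loss in feature space improves its predictions on Human3.6M.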

ABSTRACT

Much of recent research has been devoted to video prediction and generation, yet most of the previous works have demonstrated only limited success in generating videos on short-term horizons. The hierarchical video prediction method by Villegas et al. (2017) is an example of a state-of-the-art method for long-term video prediction, but their method is limited because it requires ground truth annotation of high-level structures (e.g., human joint landmarks) at training time. Our network encodes the input frame, predicts a high-level encoding into the future, and then a decoder with access to the first frame produces the predicted image from the predicted encoding. The decoder also produces a mask that outlines the predicted foreground object (e.g., person) as a by-product. Unlike Villegas et al. (2017), we develop a novel training method that jointly trains the encoder, the predictor, and the decoder together without high-level supervision; we further improve upon this by using an adversarial loss in the feature space to train the predictor. Our method can predict about 20 seconds into the future and provides better results compared to Denton and Fergus (2018) and Finn et al. (2016) on the Human 3.6M dataset.

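Research Motivation and Objectives

Enable long-term video prediction that goes beyond the short-term horizons of earlier work.
Eliminate the need for ground-truth annotations of high-level structures at training time.
Separate high-level feature prediction from low-level pixel generation through a hierarchical framework.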

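Proposed Method

Encode the input frames into a feature space and predict future high-level features with an LSTM.
Generate future frames from the first frame and the predicted features using a Visual Analogy Network (VAN) with adaptive masking.
Jointly train the encoder, predictor, and VAN without high-level supervision, optionally with an analogy-based loss.
EPVA minimizes a pixel-level L2 loss, optionally constrains the predicted features to match the encoder outputs, and applies an adversarial loss in feature space to sharpen predictions.
In EPVA with the adversarial loss, an LSTM discriminator trained with a Wasserstein loss learns to distinguish predicted feature sequences from real ones, and its feedback is used to improve the predictions.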
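The loss terms listed above can be made concrete with a small sketch. This is not the authors' implementation: the blending rule in `blend_with_mask`, the function names, and the `feat_weight` and `adv_weight` parameters are assumptions chosen for illustration, and frames and features are flattened into plain lists. The critic objective follows the standard Wasserstein formulation.

```python
from typing import List, Sequence

Vector = Sequence[float]  # a flattened frame or feature vector (illustrative)


def mean(values: Vector) -> float:
    return sum(values) / len(values)


def l2_loss(a: Vector, b: Vector) -> float:
    """Mean squared difference between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return mean([(x - y) ** 2 for x, y in zip(a, b)])


def blend_with_mask(first_frame: Vector, generated: Vector, mask: Vector) -> List[float]:
    """Assumed masking rule: mask=1 takes the generated pixel, mask=0 copies the first frame."""
    return [m * g + (1.0 - m) * f for f, g, m in zip(first_frame, generated, mask)]


def critic_loss(real_scores: Vector, fake_scores: Vector) -> float:
    """Wasserstein critic objective (to be minimized): mean(fake) - mean(real)."""
    return mean(fake_scores) - mean(real_scores)


def predictor_adversarial_loss(fake_scores: Vector) -> float:
    """The predictor tries to raise the critic's score on predicted feature sequences."""
    return -mean(fake_scores)


def total_predictor_loss(pred_frame: Vector, true_frame: Vector,
                         pred_feat: Vector, enc_feat: Vector,
                         fake_scores: Vector,
                         feat_weight: float = 1.0,   # hypothetical weight
                         adv_weight: float = 1.0) -> float:  # hypothetical weight
    """Pixel L2 + optional feature constraint + feature-space adversarial term."""
    return (l2_loss(pred_frame, true_frame)
            + feat_weight * l2_loss(pred_feat, enc_feat)
            + adv_weight * predictor_adversarial_loss(fake_scores))
```

Setting `feat_weight` or `adv_weight` to zero switches off the corresponding optional term, which mirrors the "optionally" wording in the list; the actual weighting used in the paper is not stated here.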

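Experimental Results

Research Questions

RQ1: Can long-term video prediction be achieved without supervision from high-level structure annotations?
RQ2: Does end-to-end joint training of the encoder, predictor, and VAN improve long-term prediction quality without ground-truth landmarks?
RQ3: Does adversarial training in feature space produce sharper, more realistic long-term predictions than an L2-only objective?

Key Findings

On Human3.6M and a toy dataset, EPVA produces sharper long-term predictions than an end-to-end L2 baseline.
On the toy bouncing-shapes dataset, EPVA predicts the correct shape color about 97% of the time, compared with about 25% for the CDNA baseline.
On Human3.6M, EPVA Adversarial significantly outperforms Finn et al. (2016) and Denton and Fergus (2018) in ratings of how human-like the predictions look for frames 64–127.
EPVA can reveal foreground motion segmentation masks, suggesting that the network discovers the structure of moving objects.
Pose regression on the learned encoder features achieves a relative error reduction of about 9% compared with VGG-based features.
The adversarial loss in feature space reduces blur and improves long-term realism compared with L2 alone.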
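For readers unfamiliar with the term, a relative error reduction compares the drop in error with the baseline error. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and the numbers in the usage line are made up for illustration, not values from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error removed by the new method."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical errors: a baseline of 100.0 and a new error of 91.0
# correspond to a relative reduction of 9%.
reduction = relative_error_reduction(100.0, 91.0)
```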

This review was generated by AI and checked by a human editor.