QUICK REVIEW

[Paper Review] Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

William Lotter, Gabriel Kreiman, David Cox|arXiv (Cornell University)|May 25, 2016

Advanced Vision and Imaging|References: 60|Citations: 418

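One-Line Summary

PredNet is a deep recurrent CNN inspired by predictive coding. It learns, without supervision, to predict future video frames, and in doing so develops representations that are useful both for decoding latent object parameters and for downstream tasks such as steering-angle estimation.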

ABSTRACT

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.

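Motivation and Objectives

Use future-frame prediction on unlabeled video as an unsupervised learning signal.
Develop a predictive-coding-inspired architecture (PredNet) with local predictions and error-based communication between layers.
Show that representations learned through prediction help decode latent factors (e.g., pose) and improve downstream tasks.
Demonstrate scalability to natural video sequences (e.g., car-mounted camera footage) and usefulness for steering-angle estimation.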

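Proposed Method

PredNet is a stacked recurrent convolutional network in which each layer l has four components: the input A_l, the representation R_l, the prediction Â_l, and the error E_l.
R_l uses ConvLSTM units; the network is trained by minimizing a weighted sum of the per-layer prediction errors over time (L_train).
A_l is computed bottom-up: A_0 = x_t, and for l > 0, A_l = MaxPool(ReLU(Conv(E_{l-1}))). Â_l is obtained from R_l via Conv and ReLU. E_l is the concatenation of the positive and negative prediction errors, ReLU(A_l - Â_l) and ReLU(Â_l - A_l).
Training uses Adam. Two loss settings are considered: PredNet_L0 (loss on the lowest layer only) and PredNet_Lall (loss on the lowest layer plus the upper layers, with smaller weights on the upper layers).
Two-pass update scheme: a top-down pass first updates the R_l states with the ConvLSTMs, then a bottom-up pass computes the predictions, the errors, and the targets for the upper layers.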
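The two-pass update and the error units described above can be sketched in a few lines of NumPy. This is only an illustrative toy under loud assumptions, not the authors' implementation: it uses dense vectors instead of convolutional feature maps, a plain tanh recurrence in place of ConvLSTM, and no pooling or upsampling. The class name `ToyPredNet`, the layer sizes, and the upper-layer weight of 0.1 are all hypothetical choices made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def error_unit(a, a_hat):
    """E_l: rectified positive and negative errors, concatenated."""
    return np.concatenate([relu(a - a_hat), relu(a_hat - a)])

class ToyPredNet:
    """Dense stand-in for PredNet (assumption: tanh recurrence, not ConvLSTM)."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        n = len(sizes)
        self.sizes = sizes

        def init(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))

        # R_l reads [E_l (2*size), R_l (size), R_{l+1} (size of the layer above, if any)].
        self.W_r = [init(s, 3 * s + (sizes[l + 1] if l + 1 < n else 0))
                    for l, s in enumerate(sizes)]
        self.W_pred = [init(s, s) for s in sizes]             # R_l -> A_hat_l
        self.W_up = [init(sizes[l + 1], 2 * sizes[l])         # E_l -> A_{l+1}
                     for l in range(n - 1)]
        self.R = [np.zeros(s) for s in sizes]
        self.E = [np.zeros(2 * s) for s in sizes]

    def step(self, x):
        n = len(self.sizes)
        # Pass 1 (top-down): update R_l from the previous E_l and R_l,
        # plus the already-updated R_{l+1} of the layer above.
        for l in reversed(range(n)):
            inputs = [self.E[l], self.R[l]]
            if l + 1 < n:
                inputs.append(self.R[l + 1])
            self.R[l] = np.tanh(self.W_r[l] @ np.concatenate(inputs))
        # Pass 2 (bottom-up): predictions, errors, and the next layer's target.
        a = x
        for l in range(n):
            a_hat = relu(self.W_pred[l] @ self.R[l])
            self.E[l] = error_unit(a, a_hat)
            if l + 1 < n:
                a = relu(self.W_up[l] @ self.E[l])
        return self.E

def layer_loss(errors, layer_weights):
    """Weighted sum over layers of the mean error activation at one time step."""
    return sum(w * e.mean() for w, e in zip(layer_weights, errors))

net = ToyPredNet(sizes=[8, 4])
frame = np.random.default_rng(1).random(8)   # stand-in for one input frame x_t
for _ in range(3):
    errors = net.step(frame)

loss_l0 = layer_loss(errors, [1.0, 0.0])     # L0 setting: lowest layer only
loss_lall = layer_loss(errors, [1.0, 0.1])   # L_all setting: 0.1 is an illustrative weight
```

Because the error activations are non-negative, the L_all loss in this sketch is never smaller than the L0 loss for the same errors. A real training loop would also sum the loss over time steps and backpropagate through the recurrence, which is omitted here.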

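Experimental Results

Research Questions

RQ1: Can a predictive-coding-inspired network learn useful unsupervised representations from video by predicting future frames?
RQ2: Do PredNet representations facilitate decoding of latent object parameters (e.g., pose, identity) and improve downstream tasks such as static object recognition?
RQ3: Does the PredNet model scale to natural video (car-mounted camera), capture both egomotion and object motion, and enable useful tasks such as steering-angle estimation?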

Key Findings

Model	MSE	SSIM
PredNet L0 (Rotating Faces)	0.0152	0.937
PredNet L_all (Rotating Faces)	0.0157	0.921
CNN-LSTM Enc.-Dec (Rotating Faces)	0.0180	0.907
Copy Last Frame (Rotating Faces)	0.125	0.631
PredNet L0 (CalTech)	3.13e-3	0.884
PredNet L_all (CalTech)	3.33e-3	0.875
CNN-LSTM Enc.-Dec (CalTech)	3.67e-3	0.865
Copy Last Frame (CalTech)	7.95e-3	0.762
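To put the MSE columns of the table above in relative terms, the short script below computes how much PredNet L0 lowers the error compared with each baseline. The numbers are copied from the table; the helper name `relative_reduction` is just a label chosen for this example.

```python
# MSE values copied from the table above.
mse = {
    "Rotating Faces": {
        "PredNet L0": 0.0152,
        "CNN-LSTM Enc.-Dec": 0.0180,
        "Copy Last Frame": 0.125,
    },
    "CalTech": {
        "PredNet L0": 3.13e-3,
        "CNN-LSTM Enc.-Dec": 3.67e-3,
        "Copy Last Frame": 7.95e-3,
    },
}

def relative_reduction(baseline, model):
    """Fraction by which `model` lowers the baseline MSE."""
    return (baseline - model) / baseline

reductions = {
    (dataset, baseline): relative_reduction(scores[baseline], scores["PredNet L0"])
    for dataset, scores in mse.items()
    for baseline in ("CNN-LSTM Enc.-Dec", "Copy Last Frame")
}

for (dataset, baseline), value in reductions.items():
    print(f"{dataset}: PredNet L0 vs {baseline}: {value:.1%} lower MSE")
```

Against the CNN-LSTM encoder-decoder, the reduction is about 15.6% on Rotating Faces and about 14.7% on CalTech. Against copying the last frame it is about 87.8% and 60.6%, respectively.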

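PredNet outperforms the baselines on the synthetic rotating-faces sequences in both MSE and SSIM (Rotating Faces: L0 MSE 0.0152, SSIM 0.937; L_all MSE 0.0157, SSIM 0.921; CNN-LSTM Enc.-Dec: MSE 0.0180, SSIM 0.907).
On the CalTech Pedestrian data, PredNet L0 reaches MSE 3.13e-3 and SSIM 0.884, PredNet L_all reaches MSE 3.33e-3 and SSIM 0.875, and CNN-LSTM Enc.-Dec reaches MSE 3.67e-3 and SSIM 0.865; Copy Last Frame performs worst (MSE 7.95e-3, SSIM 0.762).
Latent-parameter decoding: compared with a random network, representations from R_l improve linear decoding of latent factors (pan/roll velocity, pan angle, PC1); L_all in particular improves decoding of the first PC.
Static face classification with a linear SVM: PredNet representations outperform the autoencoder and Ladder Network variants regardless of training-set size, and L_all often reaches higher accuracy than L0.
Steering-angle estimation on the Comma.ai data: with 1k labeled examples, a linear readout from PredNet_L0 explains 74% of the variance in steering angle, about 35% more than CNN-LSTM Enc.-Dec; with 25k labels, the PredNet_L0 MSE is about 2.14 (deg^2).
PredNet shows robust frame prediction on natural scenes (KITTI) and generalizes reasonably to the CalTech Pedestrian test sequences. Predicted frames fill in occluded regions and cope with camera motion.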
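The steering-angle result above is stated as "variance explained" by a linear readout. As a rough illustration of what that metric measures, the sketch below fits a linear readout and computes the fraction of variance explained on held-out data. Everything here is an assumption for demonstration purposes: the features and angles are synthetic stand-ins rather than PredNet representations or Comma.ai data, and ordinary least squares is swapped in for whatever training procedure the paper used.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_readout(features, targets):
    """Ordinary least-squares fit of targets from features (with bias)."""
    weights, *_ = np.linalg.lstsq(add_bias(features), targets, rcond=None)
    return weights

def predict(features, weights):
    return add_bias(features) @ weights

def variance_explained(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    residual = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - residual / total

# Synthetic stand-in data (hypothetical): 16-dim features, noisy linear target.
rng = np.random.default_rng(0)
features = rng.normal(size=(1200, 16))
true_w = rng.normal(size=16)
angles = features @ true_w + rng.normal(scale=2.0, size=1200)

train, test = slice(0, 1000), slice(1000, 1200)   # 1k "labeled" examples for fitting
w = fit_linear_readout(features[train], angles[train])
pred = predict(features[test], w)

r2 = variance_explained(angles[test], pred)        # fraction of variance explained
mse = np.mean((angles[test] - pred) ** 2)          # in squared units of the target
print(f"variance explained: {r2:.2f}, MSE: {mse:.2f}")
```

A perfect readout gives a value of 1, while always predicting the mean angle gives 0, so a figure such as 74% sits between those two reference points.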

This review was generated by AI and checked by a human editor.