QUICK REVIEW

[Paper Review] Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

William Lotter, Gabriel Kreiman|arXiv (Cornell University)|2016. 05. 25.

Advanced Vision and Imaging | References: 60 | Citations: 418

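One-Line Summary

PredNet, a deep recurrent CNN inspired by predictive coding, learns to predict future video frames without supervision and develops representations that are useful for downstream tasks such as decoding latent object parameters and estimating steering angle.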

ABSTRACT

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.

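Research Motivation and Goals

Motivate unsupervised learning through the prediction of future frames in unlabeled video.
Develop a predictive-coding-based architecture (PredNet) with local predictions and error-based communication between layers.
Show that representations learned from prediction help decode latent factors (e.g. pose) and improve downstream tasks.
Demonstrate scalability to natural video sequences (car-mounted camera) and usefulness for steering angle estimation.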

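Proposed Method

Proposes PredNet: a stacked recurrent convolutional network with four layers, each with four components: the input A_l, the representation R_l, the prediction Â_l, and the error E_l.
Uses ConvLSTM units for R_l and trains by minimizing a weighted sum of the per-layer prediction errors over time (L_train).
A_l is computed bottom-up (A_0 = x_t; for l > 0, A_l = MaxPool(ReLU(Conv(E_{l-1})))); Â_l is obtained from R_l via Conv and ReLU; E_l concatenates the positive and negative parts of the prediction error, ReLU(A_l - Â_l) and ReLU(Â_l - A_l).
Trained with Adam; two loss settings are explored: PredNet L0 (loss on the lowest layer only) and PredNet L_all (loss on the lowest layer plus the upper layers, with smaller weights on the upper layers).
Two-step update: the R_l states are first computed top-down with ConvLSTM, then a forward pass computes the predictions, the errors, and the higher-layer targets.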
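
The error units and the weighted training loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the channels-first array layout, the toy inputs, and the specific weight values (an upper-layer weight of 0.1 and a zero weight on the first time step) are assumptions chosen for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def error_unit(a, a_hat):
    """E_l: positive and negative prediction errors, concatenated on the channel axis."""
    return np.concatenate([relu(a - a_hat), relu(a_hat - a)], axis=0)

def training_loss(errors, layer_weights, time_weights):
    """Weighted sum over time steps and layers of the mean error activation.

    errors[t][l] holds the E_l array at time step t.
    """
    total = 0.0
    for t, lam_t in enumerate(time_weights):
        for l, lam_l in enumerate(layer_weights):
            total += lam_t * lam_l * errors[t][l].mean()
    return total

# Toy target and prediction for the lowest layer, shape (channels, height, width).
a = np.array([[[0.2, 0.8], [0.5, 0.1]]])
a_hat = np.array([[[0.5, 0.5], [0.5, 0.4]]])
e_low = error_unit(a, a_hat)          # shape (2, 2, 2): channel count doubles

# Hypothetical upper-layer error with a constant activation of 0.5.
e_high = np.full((2, 1, 1), 0.5)

errors = [[e_low, e_high], [e_low, e_high]]   # two time steps, two layers
time_weights = [0.0, 1.0]                     # assumed: ignore the first step

loss_l0 = training_loss(errors, [1.0, 0.0], time_weights)    # lowest layer only
loss_lall = training_loss(errors, [1.0, 0.1], time_weights)  # upper layers down-weighted

print(e_low.shape, loss_l0, loss_lall)
```

Splitting the signed error into two rectified halves keeps every error activation non-negative, which is why the loss can be taken as a plain weighted mean of E_l rather than a squared difference.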

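Experimental Results

Research Questions

RQ1: Can a predictive-coding-based network learn unsupervised representations from video by predicting future frames?
RQ2: Do PredNet representations improve linear decoding of latent object parameters (e.g. pose, identity) and help downstream tasks such as static object recognition?
RQ3: Does the PredNet model scale to natural video (car-mounted camera), capturing egomotion and object motion and enabling useful tasks such as steering angle estimation?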

Key Results

Model	MSE	SSIM
PredNet L0 (Rotating Faces)	0.0152	0.937
PredNet L_all (Rotating Faces)	0.0157	0.921
CNN-LSTM Enc.-Dec (Rotating Faces)	0.0180	0.907
Copy Last Frame (Rotating Faces)	0.125	0.631
PredNet L0 (CalTech)	3.13e-3	0.884
PredNet L_all (CalTech)	3.33e-3	0.875
CNN-LSTM Enc.-Dec (CalTech)	3.67e-3	0.865
Copy Last Frame (CalTech)	7.95e-3	0.762
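
To make the gaps in the table easier to read, the short script below computes the relative MSE reduction of PredNet L0 against each baseline. The MSE values are copied from the table above; the `relative_reduction` helper and the choice of relative reduction as the comparison metric are assumptions of this sketch, not something the paper reports.

```python
# MSE values copied from the table above.
mse = {
    "Rotating Faces": {
        "PredNet L0": 0.0152,
        "PredNet L_all": 0.0157,
        "CNN-LSTM Enc.-Dec": 0.0180,
        "Copy Last Frame": 0.125,
    },
    "CalTech": {
        "PredNet L0": 3.13e-3,
        "PredNet L_all": 3.33e-3,
        "CNN-LSTM Enc.-Dec": 3.67e-3,
        "Copy Last Frame": 7.95e-3,
    },
}

def relative_reduction(model_mse, baseline_mse):
    """Fraction by which the model lowers MSE relative to a baseline."""
    return (baseline_mse - model_mse) / baseline_mse

baselines = ("CNN-LSTM Enc.-Dec", "Copy Last Frame")
reductions = {
    dataset: {
        baseline: relative_reduction(scores["PredNet L0"], scores[baseline])
        for baseline in baselines
    }
    for dataset, scores in mse.items()
}

for dataset, per_baseline in reductions.items():
    for baseline, value in per_baseline.items():
        print(f"{dataset}: PredNet L0 vs {baseline}: {value:.1%} lower MSE")
```

Under this definition, PredNet L0 lowers MSE relative to the CNN-LSTM encoder-decoder by roughly 15 to 16% on both datasets, and relative to Copy Last Frame by about 88% on Rotating Faces and about 61% on CalTech.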

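On the synthetic Rotating Faces sequences, PredNet outperforms the baselines on both MSE and SSIM (L0: MSE 0.0152, SSIM 0.937; L_all: MSE 0.0157, SSIM 0.921; CNN-LSTM Enc.-Dec: MSE 0.0180, SSIM 0.907).
On the CalTech Pedestrian data, PredNet L0 achieves MSE 3.13e-3 and SSIM 0.884; PredNet L_all achieves MSE 3.33e-3 and SSIM 0.875; CNN-LSTM Enc.-Dec achieves MSE 3.67e-3 and SSIM 0.865; Copy Last Frame performs worst (MSE 7.95e-3, SSIM 0.762).
Latent parameter decoding: the R_l representations improve linear decoding of latent factors (pan/roll speed, pan angle, PC1) over a random network, and L_all in particular substantially improves decoding of the first PC.
In static face classification with a linear SVM, PredNet representations outperform autoencoder and Ladder Network variants regardless of training set size, and L_all often reaches higher accuracy than L0.
Steering angle estimation on the Comma.ai data: with 1k labeled samples, a linear readout on PredNet L0 explains 74% of the variance in steering angle, about 35 percentage points ahead of CNN-LSTM Enc.-Dec; with 25k labels, PredNet L0 reaches an MSE of about 2.14 (deg^2).
PredNet predicts frames robustly on natural scenes (KITTI) and generalizes reasonably to the CalTech Pedestrian test sequences; the predicted frames fill in occluded regions and handle camera motion.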
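
The linear-readout evaluation mentioned in the steering angle result can be illustrated as follows. This is a sketch on synthetic data only: the random features stand in for flattened PredNet representations, the synthetic targets stand in for steering angles, and reading "variance explained" as the coefficient of determination (R^2) is an assumption. None of the numbers produced here correspond to results in the paper.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_readout(features, targets):
    """Ordinary least-squares linear readout."""
    weights, *_ = np.linalg.lstsq(add_bias(features), targets, rcond=None)
    return weights

def predict(features, weights):
    return add_bias(features) @ weights

def explained_variance(targets, predictions):
    """R^2: fraction of target variance accounted for by the predictions."""
    residual = np.sum((targets - predictions) ** 2)
    total = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - residual / total

# Synthetic stand-ins (hypothetical): 300 samples of 8-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8))
true_w = np.linspace(-1.0, 1.0, 8)
angles = features @ true_w + 0.1 * rng.normal(size=300)

# Fit on the first 200 samples, evaluate on the held-out 100.
train, test = slice(0, 200), slice(200, 300)
w = fit_linear_readout(features[train], angles[train])
pred = predict(features[test], w)

r2 = explained_variance(angles[test], pred)
mse = float(np.mean((angles[test] - pred) ** 2))
print(f"held-out R^2 = {r2:.3f}, MSE = {mse:.4f}")
```

Because the readout is linear and the representation is frozen, a high R^2 with few labels reflects how accessible the target is in the learned features rather than the capacity of the decoder.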


This review was generated by AI and reviewed by a human editor.