QUICK REVIEW

[Paper Review] ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Zhouyong Liu, Shun Luo | arXiv (Cornell University) | November 20, 2020

Advanced Vision and Imaging | References: 47 | Citations: 60

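One-Line Summary

ConvTransformer introduces a multi-head convolutional self-attention architecture that unifies video frame interpolation and extrapolation in a single model, enabling parallel training while achieving near state-of-the-art performance.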

ABSTRACT

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training samples are available, they work badly on video frame synthesis due to objects deforming and moving, scene lighting changes, and cameras moving in video sequence. In this paper, we present a novel and general end-to-end architecture, called convolutional Transformer or ConvTransformer, for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, i.e., multi-head convolutional self-attention layer, that learns the sequential dependence of video sequence. ConvTransformer uses an encoder, built upon multi-head convolutional self-attention layer, to encode the sequential dependence between the input frames, and then a decoder decodes the long-term dependence between the target synthesized frames and the input frames. Experiments on video future frame extrapolation task show ConvTransformer to be superior in quality while being more parallelizable to recent approaches built upon convolutional LSTM (ConvLSTM). To the best of our knowledge, this is the first time that ConvTransformer architecture is proposed and applied to video frame synthesis.

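Motivation and Objectives

Address the core challenges of video frame synthesis: objects that move and deform, and changes in scene lighting.
Propose a single, unified end-to-end architecture that handles both interpolation and extrapolation.
Develop a multi-head convolutional self-attention mechanism to model long-range dependencies between frames.
Enable parallel training and inference to improve efficiency over recurrent architectures.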

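Proposed Method

Embed each input frame into compact feature maps with a shared four-layer CNN.
Apply 3D positional encoding to preserve frame order.
Encode the frame sequence with stacked encoder layers, each combining multi-head convolutional self-attention with a convolutional feed-forward network.
Decode with a decoder that attends to both the encoded features and the query frames, capturing long-range dependencies between the input frames and the frames to be synthesized.
Synthesize the final frames with a two-stage synthesis feed-forward network (SFFN) that has a U-Net-like structure.
Train with a pixel-wise MSE loss to minimize the reconstruction error between synthesized and ground-truth frames.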
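The attention step in the encoder can be illustrated with a much-simplified sketch. This is not the authors' implementation: it uses a single head, skips the learned convolutional query/key/value projections, and scores frame pairs with a scaled dot product over channels, all of which are assumptions made to keep the example small. The function name `frame_attention` and the nested-list layout `[T][H][W][C]` are hypothetical. What it does show is the core idea of attending across frames independently at each spatial position.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention(frames, query_index):
    """Attend over T frames separately at every pixel.

    frames: list of T feature maps, each a nested list shaped [H][W][C].
    query_index: which frame supplies the query features.
    Returns one attended feature map shaped [H][W][C].

    Assumption: queries, keys, and values are the raw features; a real
    model would produce them with learned convolutions.
    """
    num_frames = len(frames)
    height = len(frames[0])
    width = len(frames[0][0])
    channels = len(frames[0][0][0])
    scale = math.sqrt(channels)

    output = []
    for y in range(height):
        row = []
        for x in range(width):
            query = frames[query_index][y][x]
            # One compatibility score per frame at this pixel.
            scores = [
                sum(q * k for q, k in zip(query, frames[t][y][x])) / scale
                for t in range(num_frames)
            ]
            weights = softmax(scores)
            # Weighted sum of the per-frame values at this pixel.
            row.append([
                sum(weights[t] * frames[t][y][x][c] for t in range(num_frames))
                for c in range(channels)
            ])
        output.append(row)
    return output
```

Because the weights are computed per pixel rather than per frame, each output location can draw on a different mix of frames, which is what lets this style of attention follow local motion.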

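Experimental Results

Research Questions

RQ1: Can ConvTransformer handle both video frame interpolation and extrapolation in a single end-to-end architecture?
RQ2: Does multi-head convolutional self-attention effectively capture long-range temporal and spatial dependencies in video sequences?
RQ3: How does ConvTransformer perform on standard benchmarks relative to specialized interpolation and extrapolation methods?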

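Key Findings

ConvTransformer outperforms ConvLSTM-based extrapolation baselines, most notably on next-frame extrapolation.
On both interpolation and extrapolation tasks, it achieves higher PSNR/SSIM across multiple datasets and compares favorably with several recent methods.
The model shows strong average performance across datasets, supporting the generality of the unified approach.
Qualitative results suggest sharper, more photorealistic frames with fewer artifacts than prior methods.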
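For readers unfamiliar with the PSNR figures cited above, the metric is a log-scaled function of the same pixel-wise MSE used as the training loss. The following is a minimal sketch of the standard definition, assuming single-channel images stored as nested lists and an 8-bit intensity range (`max_value=255.0`); the function names are illustrative and not taken from the paper.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized 2D images."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    squared = [(p - t) ** 2 for p, t in zip(flat_pred, flat_target)]
    return sum(squared) / len(squared)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; higher means closer."""
    error = mse(pred, target)
    if error == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)
```

Since PSNR decreases monotonically as MSE grows, minimizing the MSE loss directly maximizes PSNR, whereas SSIM measures structural similarity and is not optimized directly by this loss.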


This review was generated by AI and reviewed by a human editor.