QUICK REVIEW

[Paper Review] V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Shiwen Zhang, Sheng Guo|arXiv (Cornell University)|2020. 02. 18.

Human Pose and Action Recognition | References: 31 | Citations: 48

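One-line summary

V4D models long-range spatio-temporal evolution for video action recognition with a video-level 4D CNN built from 4D convolutions and residual blocks, and it outperforms clip-based 3D CNNs.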

ABSTRACT

Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into the existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.

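Motivation and objectives

Motivate video-level representation learning that goes beyond clip-based 3D CNNs in order to capture long-range temporal evolution.
Introduce 4D convolutions and residual 4D blocks to model inter-clip interactions within a whole-video representation.
Integrate 4D blocks into existing 3D CNN backbones to enable hierarchical long-range modeling.
Develop training and video-level inference strategies tailored to V4D.
Demonstrate effectiveness on several benchmarks (Mini-Kinetics, Kinetics-400, Something-Something-v1).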

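Proposed method

Introduce a video-level sampling strategy that divides the video into U action units and samples from each section.
Define a 4D convolution that operates on a tensor V of shape (C, U, T, H, W) to capture inter-clip interactions.
Integrate the 4D convolution into a 3D CNN backbone with a residual connection, forming a residual 4D convolution block.
Insert 4D blocks into standard 3D CNNs using a permutation-based mechanism that aligns the dimensions.
Provide a video-level inference procedure that aggregates predictions over multiple sampled representations.
Explore different 4D kernel shapes (e.g., 3x3x3x3, 3x3x1x1) and placements (res3, res4, res5) to balance accuracy against parameter count.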
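The sampling step and the residual idea can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names `sample_action_units` and `residual_unit_mixing` are hypothetical, and the second function is a deliberate simplification: each action unit is collapsed to a single feature vector and mixed only along the U axis, whereas the method described above applies 4D kernels to feature maps of shape (C, U, T, H, W).

```python
import random


def sample_action_units(num_frames, num_units, clip_len, rng=None):
    """Split a video into `num_units` equal sections and pick one clip per section.

    Returns `num_units` lists, each holding `clip_len` consecutive frame
    indices. Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    section_len = num_frames // num_units
    if section_len < clip_len:
        raise ValueError("each section must hold at least one full clip")
    clips = []
    for u in range(num_units):
        section_start = u * section_len
        # Random start chosen so the clip stays inside its own section.
        start = section_start + rng.randint(0, section_len - clip_len)
        clips.append(list(range(start, start + clip_len)))
    return clips


def residual_unit_mixing(unit_features, weights):
    """Toy residual block that mixes features across the U axis only.

    `unit_features` is a list of U feature vectors (one per action unit) and
    `weights` is an odd-length 1D kernel over U, zero-padded at both ends:
    out[u] = in[u] + sum_j weights[j] * in[u + j - half].
    """
    half = len(weights) // 2
    num_units = len(unit_features)
    out = []
    for u in range(num_units):
        mixed = list(unit_features[u])  # identity (residual) path
        for j, w in enumerate(weights):
            src = u + j - half
            if 0 <= src < num_units:
                for c, value in enumerate(unit_features[src]):
                    mixed[c] += w * value
        out.append(mixed)
    return out


clips = sample_action_units(num_frames=120, num_units=4, clip_len=8,
                            rng=random.Random(0))
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
mixed = residual_unit_mixing(features, weights=[0.5, 0.0, 0.5])
```

With an all-zero kernel the toy block returns its input unchanged, which mirrors the role the residual connection plays in preserving the clip-level 3D features while the cross-unit term adds inter-clip context.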

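Experimental results

Research questions

RQ1: Can 4D convolutions effectively model long-range spatio-temporal evolution in video?
RQ2: Do residual 4D blocks integrated into a 3D CNN backbone improve video-level representations beyond clip-based approaches?
RQ3: How do the number of action units (U) and the kernel configuration affect accuracy and efficiency?
RQ4: How does V4D perform against TSN and clip-based 3D CNNs across benchmarks?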

Key results

With residual 4D blocks, V4D reaches higher accuracy than the clip-based I3D-S and TSN baselines under similar protocols (e.g., V4D ResNet18 outperforms both I3D-S ResNet18 and TSN+I3D-S ResNet18 on Mini-Kinetics).
Kernel choice affects accuracy: 3x3x3x3 gives strong results, while the cheaper 3x3x1x1 remains competitive in practice.
Placing 4D blocks at res3 and res4 yields larger gains than other placements, and using blocks at both positions improves accuracy further.
V4D is competitive with or better than several state-of-the-art methods, reaching 77.4 top-1 and 93.1 top-5 on Kinetics-400 and 50.4 top-1 on Something-Something-v1 (both with V4D ResNet50).
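The cost gap between the two kernel shapes mentioned above can be checked with simple arithmetic. The sketch below counts the weights of a dense convolution as input channels times output channels times the product of the kernel extents. The channel width of 256 is a hypothetical value chosen only for illustration, and bias terms are ignored.

```python
from math import prod


def conv_weight_count(c_in, c_out, kernel):
    """Number of weights in a dense convolution, ignoring bias terms."""
    return c_in * c_out * prod(kernel)


channels = 256  # hypothetical width, not taken from the paper
full = conv_weight_count(channels, channels, (3, 3, 3, 3))
light = conv_weight_count(channels, channels, (3, 3, 1, 1))
print(full, light, full // light)  # 5308416 589824 9
```

Whatever the channel width, a 3x3x3x3 kernel carries 81 weights per channel pair against 9 for 3x3x1x1, so the lighter kernel needs one ninth of the weights.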


This review was generated by AI and checked by a human editor.