QUICK REVIEW

[Paper Review] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Zhaofan Qiu, Ting Yao | arXiv (Cornell University) | 2017-11-28

Human Pose and Action Recognition | References: 34 | Citations: 251

One-Line Summary

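This paper introduces Pseudo-3D (P3D) blocks, which simulate 3D convolutions inside a Residual Network by combining 2D spatial filters with 1D temporal filters, and presents P3D ResNet variants that improve video representations over conventional 2D and 3D CNNs.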

ABSTRACT

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3\times3\times3$ convolutions with $1\times3\times3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3\times1\times1$ convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.

Motivation and Objectives

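Learn spatio-temporal video representations efficiently without a full 3D CNN.
Develop bottleneck blocks that simulate 3x3x3 convolutions with 1x3x3 spatial filters and 3x1x1 temporal filters.
Explore several block designs (P3D-A/B/C) and mix them within a ResNet to improve performance.
Demonstrate that P3D ResNet outperforms 3D CNNs and frame-based CNNs on multiple video datasets.
Show whether pre-training the spatial filters on images and learning the 1D temporal filters on video data yields strong generalization.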
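
To make the efficiency argument concrete, the following minimal sketch counts the weights in one full 3x3x3 convolution versus a 1x3x3 spatial convolution plus a 3x1x1 temporal convolution. It assumes equal input and output channel widths and ignores biases; the function names and the channel width of 64 are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def conv3d_weights(kt, kh, kw, c_in, c_out):
    """Number of weights in a 3D convolution with a kt x kh x kw kernel (bias ignored)."""
    return kt * kh * kw * c_in * c_out


def full_3d_weights(channels):
    # One 3x3x3 convolution mapping `channels` -> `channels`.
    return conv3d_weights(3, 3, 3, channels, channels)


def pseudo_3d_weights(channels):
    # One 1x3x3 spatial convolution plus one 3x1x1 temporal convolution,
    # each mapping `channels` -> `channels` (an assumption for this sketch).
    spatial = conv3d_weights(1, 3, 3, channels, channels)
    temporal = conv3d_weights(3, 1, 1, channels, channels)
    return spatial + temporal


if __name__ == "__main__":
    c = 64  # hypothetical channel width, for illustration only
    print("full 3x3x3:", full_3d_weights(c))
    print("1x3x3 + 3x1x1:", pseudo_3d_weights(c))
```

Per input/output channel pair, the decomposed form uses 9 + 3 = 12 weights instead of 27. This counts only the two filters compared here, not the complete bottleneck block.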

Proposed Method

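Define the 3D convolution and decouple it into a 2D spatial (1x3x3) component and a 1D temporal (3x1x1) component.
Propose three P3D block designs (A, B, C) that differ in whether the S (spatial) path and the T (temporal) path are connected directly or indirectly.
Adopt a bottleneck configuration with 1x1 reduction and restoration steps around the spatial and temporal filters.
Replace the ResNet blocks with P3D blocks and mix the A/B/C blocks to build a P3D ResNet with structural diversity.
Pre-train on Sports-1M (a large-scale video dataset) and evaluate the network as a generic video representation extractor across multiple tasks.
Compare against ResNet-50, C3D, and other baselines on UCF101, ActivityNet, ASLAN, YUPENN, and Dynamic Scene.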
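
The three block designs can be sketched as different ways of wiring a spatial operator S and a temporal operator T around an identity shortcut. This is a toy illustration of the connection patterns only: it omits the 1x1 reduction and restoration layers and the nonlinearities, and it uses plain Python callables on numbers instead of real convolutions. The function names are hypothetical.

```python
def p3d_a(x, S, T):
    # Cascaded: T is applied to the output of S; only T feeds the sum.
    return x + T(S(x))


def p3d_b(x, S, T):
    # Parallel: S and T both read the block input and both feed the sum.
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    # Mixed: the output of S feeds the sum directly and also passes through T.
    s = S(x)
    return x + s + T(s)


if __name__ == "__main__":
    # Stand-in operators (assumptions for this sketch, not real filters).
    S = lambda v: 2 * v
    T = lambda v: v + 1
    for name, block in [("A", p3d_a), ("B", p3d_b), ("C", p3d_c)]:
        print(name, block(1, S, T))
```

In a real network, S and T would be the 1x3x3 and 3x1x1 convolutions, and x would be a feature tensor over time, height, and width.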

Experimental Results

Research Questions

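RQ1: Can Pseudo-3D blocks effectively replace full 3D convolutions for capturing spatio-temporal information in video?
RQ2: Do the different P3D block designs (A, B, C) offer complementary benefits, and does mixing them improve performance?
RQ3: Is a P3D ResNet whose spatial filters are pre-trained on image data and whose temporal filters are learned on video data more effective than pure 3D CNNs or frame-based methods?
RQ4: How well does P3D ResNet serve as a generic video representation across diverse datasets and tasks?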

Key Results

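The P3D variants outperform ResNet-50 or are competitive with C3D while keeping the increase in model size minimal and preserving runtime efficiency.
The full P3D ResNet, which mixes P3D-A, P3D-B, and P3D-C, yields a further accuracy gain over any single variant, supporting the value of architectural diversity.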
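On Sports-1M, P3D ResNet reaches a clip hit@1 of 47.9%, a video hit@1 of 66.4%, and a video hit@5 of 87.4%; the abstract reports gains of 5.3% and 1.8% over the 3D CNN and the frame-based 2D CNN, respectively.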
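On UCF101, P3D ResNet using frame input only reaches 88.6% top-1 accuracy, surpassing ResNet-152 and C3D, and reaches 93.7% when fused with IDT.
On ActivityNet, P3D ResNet achieves Top-1 75.12%, Top-3 87.71%, and MAP 78.86%, outperforming baselines including IDT, C3D, and ResNet-152.
Visualizations indicate that P3D ResNet captures both spatial patterns and temporal motion, and t-SNE analysis shows clearer semantic clustering of P3D ResNet representations.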
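
The clip-level and video-level hit@k figures above can be read through the following minimal sketch. It assumes a video's score is obtained by averaging per-class scores over its clips; that aggregation rule, the function names, and the toy scores are assumptions for illustration and are not confirmed by this review.

```python
def top_k_labels(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def hit_at_k(scores, true_label, k):
    """True if the ground-truth label is among the k top-scoring classes."""
    return true_label in top_k_labels(scores, k)


def video_scores(clip_scores):
    """Average per-class scores over all clips of one video (assumed rule)."""
    n = len(clip_scores)
    return [sum(per_class) / n for per_class in zip(*clip_scores)]


if __name__ == "__main__":
    # Toy scores for three clips of one video over three classes.
    clips = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
    label = 1
    clip_hits = sum(hit_at_k(c, label, 1) for c in clips)
    print("clip hit@1:", clip_hits, "of", len(clips))
    print("video hit@1:", hit_at_k(video_scores(clips), label, 1))
```

In this toy case the first clip is misclassified on its own, yet the averaged video-level prediction is correct, which illustrates how video hit@1 can exceed clip hit@1.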

This review was generated by AI and reviewed by a human editor.