QUICK REVIEW

[Paper Review] Spatiotemporal Residual Networks for Video Action Recognition

Christoph Feichtenhofer, Axel Pinz | arXiv (Cornell University) | 2016. 11. 07.

Human Pose and Action Recognition | Citations: 494

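One-Line Summary

Introduces spatiotemporal ResNets, which combine the two-stream architecture with residual connections and temporal convolutions, and achieves state-of-the-art action recognition results on UCF101 and HMDB51.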

ABSTRACT

Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

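Motivation and Objectives

Extend ResNets to the spatiotemporal domain for video.
Integrate the appearance (RGB) and motion (optical flow) streams through cross-stream residual connections.
Transform pretrained image ConvNets into spatiotemporal networks by initializing the new filters as temporal residual connections.
Enable end-to-end training so that hierarchical spatiotemporal features can be learned.
Demonstrate state-of-the-art performance on standard action recognition benchmarks.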

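Proposed Method

Adopts a two-stream architecture that uses an ImageNet-pretrained ResNet-50 for both the appearance and the motion stream.
Introduces residual connections between the streams to enable spatiotemporal interaction (motion residuals).
Converts the spatial 1x1 dimensionality-mapping filters into temporal filters and initializes them as temporal residual connections (Eq. 5 and the related discussion in the paper).
Stacks temporal convolutions to enlarge the spatiotemporal receptive field while keeping image ConvNet design principles.
Trains in three stages: separate pretraining of each stream, joint training of the ST-ResNet with cross-stream residuals, and the addition of temporal max-pooling to obtain ST-ResNet*.
Runs fully convolutional inference over 25-frame chunks with temporal max-pooling to cover longer temporal extents.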
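
The steps above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, and the centre-tap initialization are assumptions made for illustration, and the paper's exact initialization (its Eq. 5) is not reproduced in this review. The sketch shows a 1x1 filter inflated into a temporal filter that initially behaves like the original per-frame mapping, an additive injection of motion features into the appearance stream, temporal max-pooling over one 25-frame chunk, and how the temporal receptive field grows as temporal filters are stacked.

```python
import numpy as np


def inflate_1x1_to_temporal(w, taps=3):
    """Turn a 1x1 spatial filter w of shape (C_out, C_in) into a temporal
    filter of shape (taps, C_out, C_in); taps is assumed to be odd.

    Assumption: only the centre tap carries the pretrained weights, so the
    inflated filter initially reproduces the per-frame 1x1 mapping.
    """
    kernel = np.zeros((taps,) + w.shape, dtype=w.dtype)
    kernel[taps // 2] = w
    return kernel


def temporal_conv(x, kernel):
    """Filter x of shape (T, C_in, H, W) along time with zero padding.

    Spatially this stays a 1x1 operation; only neighbouring frames mix.
    """
    taps = kernel.shape[0]
    pad = taps // 2
    num_frames = x.shape[0]
    padded = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out = np.zeros((num_frames, kernel.shape[1]) + x.shape[2:], dtype=x.dtype)
    for k in range(taps):
        # Tap k sees the frame at temporal offset (k - pad).
        out += np.einsum("oc,tchw->tohw", kernel[k], padded[k:k + num_frames])
    return out


def inject_motion(appearance, motion):
    """Cross-stream residual (assumed additive): motion into appearance."""
    return appearance + motion


def temporal_max_pool(x):
    """Max over the time axis of x with shape (T, C, H, W)."""
    return x.max(axis=0)


def temporal_receptive_field(num_layers, taps=3):
    """Frames covered after stacking num_layers stride-1 temporal filters."""
    return 1 + num_layers * (taps - 1)


# Hypothetical toy data: one 25-frame chunk with 8 channels on a 4x4 grid.
rng = np.random.default_rng(0)
frames = rng.standard_normal((25, 8, 4, 4)).astype(np.float32)
flow = rng.standard_normal((25, 8, 4, 4)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)  # stand-in 1x1 filter

fused = inject_motion(frames, flow)
features = temporal_conv(fused, inflate_1x1_to_temporal(w))
pooled = temporal_max_pool(features)
print(features.shape, pooled.shape, temporal_receptive_field(3))
```

Because only the centre tap is non-zero at the start, the inflated filter leaves the pretrained per-frame behaviour intact and training is free to move weight onto the neighbouring frames. Each additional stride-1 temporal filter with three taps widens the temporal receptive field by two frames, which is the gradual growth with depth that the abstract describes.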

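Experimental Results

Research Questions

RQ1: Can residual connections between the appearance and motion streams improve spatiotemporal feature learning for video action recognition?
RQ2: Does extending ResNets with temporal convolutions and ImageNet-pretrained initialization improve performance on standard benchmarks?
RQ3: What effect do end-to-end training and temporal max-pooling have on recognition accuracy?
RQ4: How do temporal stride and receptive field affect action recognition on longer video sequences?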

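Key Results

ST-ResNet improves markedly over the two-stream baseline through cross-stream residuals and temporal convolutions.
ST-ResNet*, the variant with temporal max-pooling, achieves higher accuracy than ST-ResNet on both benchmarks.
On UCF101 and HMDB51, ST-ResNet* yields state-of-the-art results compared with previous ConvNet approaches.
Combining ST-ResNet* with IDT (improved dense trajectory) features further improves HMDB51 performance by a notable margin.
End-to-end training of spatiotemporal networks built from pretrained image networks shows strong generalization and improved performance.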

This review was generated by AI and checked by a human editor.