QUICK REVIEW

[Paper Review] CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos

Zheng Shou, Jonathan Chan | arXiv (Cornell University) | 2017-03-04

Human Pose and Action Recognition | References: 60 | Citations: 62

One-Line Summary

A Convolutional-De-Convolutional (CDC) network built on top of 3D ConvNets predicts frame-level action scores, enabling precise temporal localization in untrimmed videos at high efficiency (≈500 FPS).

ABSTRACT

Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background contents, we need not only to recognize their action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments of pre-determined boundaries. However, a desirable model should move beyond segment-level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates a very high efficiency with the ability to process 500 frames per second on a single GPU server. We will update the camera-ready version and publish the source codes online soon.

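Motivation and Objectives

Motivates the need for fine-grained, frame-level temporal localization that goes beyond segment proposals with pre-determined boundaries.
Jointly learns CDC filters that downsample in space and upsample in time, restoring frame-level temporal resolution.
Designs an end-to-end CDC network on top of 3D ConvNets to produce dense per-frame action scores.
Demonstrates improved per-frame labeling accuracy and temporal localization precision on THUMOS’14 and ActivityNet 2016.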

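Proposed Method

Introduces CDC filters that replace or augment layers of C3D and jointly perform spatial downsampling (4x4) and temporal upsampling (2x).
Converts FC6/FC7 into CDC6/CDC7 so that the network outputs multiple frames and can make frame-level predictions.
Attaches a frame-wise softmax classifier (CDC8) and trains with a frame-level cross-entropy loss.
Trains end-to-end with SGD on 32-frame video windows, initialized from pre-trained C3D for stability.
At test time, generates frame scores within proposal windows and refines segment boundaries using Gaussian KDE over the frame confidences.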
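
To make the method steps above more concrete, here is a minimal NumPy sketch of a CDC-style layer and the frame-level loss. It is an illustration under stated assumptions, not the authors' implementation: the function names `cdc_filter` and `framewise_cross_entropy`, the channel counts, the temporal length of 4 entering CDC6 (a 32-frame window assumed to be reduced 8x by the 3D ConvNet), the 21 output classes, and the choice to model every layer as two independent filters with no temporal overlap are all assumptions made for this example.

```python
import numpy as np

def cdc_filter(x, weights):
    """Simplified CDC-style layer (illustrative, not the paper's exact layer).

    x:       (C_in, T, H, W) feature map
    weights: (K, C_out, C_in, H, W); K independent filters, each collapsing
             the full H x W spatial extent to 1 x 1 (spatial downsampling)
    returns: (C_out, K*T, 1, 1); K output time steps per input time step
             (temporal upsampling by K, with K = 2 in this sketch)
    """
    c_in, t, h, w = x.shape
    k, c_out = weights.shape[:2]
    assert weights.shape == (k, c_out, c_in, h, w)
    # out[k, o, t] = sum over (c, h, w) of weights[k, o, c, h, w] * x[c, t, h, w]
    out = np.einsum("kochw,cthw->kot", weights, x)
    # Interleave the K outputs along time: output index = K * t + k
    out = out.transpose(1, 2, 0).reshape(c_out, t * k)
    return out[:, :, None, None]

def framewise_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over frames.

    logits: (num_classes, L) class scores per frame
    labels: (L,) integer class id per frame
    """
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    return float(-log_probs[labels, np.arange(labels.size)].mean())

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 4))         # assumed input: T = 4, 4x4 spatial
w6 = rng.standard_normal((2, 16, 8, 4, 4))    # CDC6-like: 4x4 -> 1x1, T 4 -> 8
w7 = rng.standard_normal((2, 16, 16, 1, 1))   # CDC7-like: T 8 -> 16
w8 = rng.standard_normal((2, 21, 16, 1, 1))   # CDC8-like: T 16 -> 32, 21 classes (assumed)

y6 = cdc_filter(x, w6)
y7 = cdc_filter(y6, w7)
y8 = cdc_filter(y7, w8)

logits = y8[:, :, 0, 0]                       # (21, 32): one score vector per frame
labels = rng.integers(0, 21, size=32)         # random per-frame labels for the demo
loss = framewise_cross_entropy(logits, labels)
print(y6.shape, y7.shape, y8.shape, loss)
```

The point of the sketch is the shape bookkeeping: each layer doubles the temporal length while the first one also removes the spatial extent, so a window that enters with 4 time steps leaves with one score vector for each of the 32 frames.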

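Experimental Results

Research Questions

RQ1: Can a joint Convolutional-De-Convolutional (CDC) filter downsample spatially and upsample temporally at the same time to produce frame-level action predictions?
RQ2: Do frame-level predictions substantially improve temporal boundary localization compared with segment-level approaches?
RQ3: How does end-to-end CDC-based localization compare with state-of-the-art methods on THUMOS’14 and ActivityNet 2016?
RQ4: Is the CDC approach computationally efficient enough for real-time or near-real-time processing?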

Key Results

Table 1: Per-frame labeling mAP on THUMOS’14
Single-frame CNN	34.7%
Two-stream CNN	36.2%
LSTM	39.3%
MultiLSTM	41.3%
C3D + LinearInterp	37.0%
Conv & De-conv	41.7%
CDC (fix 3D ConvNets)	37.4%
CDC	44.4%
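
For readers unfamiliar with the metric in Table 1, the following is a rough sketch of how a per-frame labeling mAP can be computed: every frame is treated as one sample, frames are ranked by score for each class, average precision (AP) is computed per class, and the APs are averaged. The function names, the toy scores, and the convention of returning 0 for a class with no positive frames are assumptions for this example; the benchmark's official evaluation code may differ in such details.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: frames ranked by score, labels are 0/1 per frame."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order].astype(float)
    num_pos = hits.sum()
    if num_pos == 0:
        return 0.0  # convention chosen here; some protocols skip such classes
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    # Average the precision values at the ranks where a positive frame occurs
    return float((precision_at_k * hits).sum() / num_pos)

def per_frame_map(score_matrix, label_matrix):
    """Mean of per-class AP; both inputs have shape (num_frames, num_classes)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))

# Toy data: 4 frames, 2 classes (made-up numbers, not from the paper)
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.6, 0.9]])
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
print(per_frame_map(scores, labels))
```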

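CDC achieves state-of-the-art per-frame labeling mAP on THUMOS’14 (44.4%), outperforming single-frame, two-stream, LSTM, and earlier C3D-based methods.
With frame-level predictions, CDC delivers higher temporal localization accuracy than S-CNN, the C3D + LinearInterp and Conv & De-conv baselines, and CDC variants across IoU thresholds (0.3–0.7).
On ActivityNet 2016, refining segment boundaries with frame-level predictions improves temporal localization mAP, especially at high IoU (0.75).
The CDC network processes about 500 frames per second on a single GPU (Titan X) and requires about 1 GB of storage, enabling efficient dense prediction on untrimmed videos.
Training and fine-tuning the CDC layers end-to-end on top of the C3D ConvNet yields better discrimination of temporal dynamics than keeping the 3D ConvNet features fixed (44.4% vs. 37.4% per-frame mAP in Table 1).
Fine-grained frame-level predictions make precise boundary adjustment possible even when starting from coarse segment proposals.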
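
The link between boundary refinement and the IoU-based localization results can be illustrated with a small sketch. The review only states that boundaries are refined with Gaussian KDE over frame confidences; the shrink-inward rule, the `mu - sigma` threshold, the use of the plain sample mean and standard deviation as a stand-in for KDE-derived statistics, and the function names `temporal_iou` and `refine_boundaries` are assumptions made here for illustration. The confidence values are synthetic.

```python
import numpy as np

def temporal_iou(seg_a, seg_b):
    """IoU of two inclusive frame intervals given as (start, end)."""
    inter = min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]) + 1
    inter = max(inter, 0)
    union = (seg_a[1] - seg_a[0] + 1) + (seg_b[1] - seg_b[0] + 1) - inter
    return inter / union

def refine_boundaries(frame_conf, start, end):
    """Shrink [start, end] inward until both boundary frames reach mu - sigma."""
    window = frame_conf[start:end + 1]
    # Assumption: sample mean/std stand in for statistics from a Gaussian KDE
    mu, sigma = window.mean(), window.std()
    threshold = mu - sigma
    while start < end and frame_conf[start] < threshold:
        start += 1
    while end > start and frame_conf[end] < threshold:
        end -= 1
    return start, end

# Synthetic per-frame confidences: the action spans frames 10..29 of 40
frame_conf = np.full(40, 0.1)
frame_conf[10:30] = 0.9
ground_truth = (10, 29)

coarse = (4, 35)                                  # coarse segment proposal
refined = refine_boundaries(frame_conf, *coarse)

print(coarse, temporal_iou(coarse, ground_truth))
print(refined, temporal_iou(refined, ground_truth))
```

In this toy case the coarse proposal overlaps the ground truth well enough to pass a loose IoU threshold but not a strict one, while the refined segment passes both, which is the kind of effect the high-IoU results above describe.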

This review was generated by AI and reviewed by a human editor.