QUICK REVIEW

[Paper Review] Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification

Yunlong Bian, Chuang Gan | arXiv (Cornell University) | 2017. 08. 12.

Human Pose and Action Recognition | References: 18 | Citations: 53

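One-line summary

This paper evaluates off-the-shelf temporal modeling approaches for large-scale video classification on multimodal features and proposes four models that reach state-of-the-art results, especially when ensembled.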

ABSTRACT

This paper describes our solution for the video recognition task of ActivityNet Kinetics challenge that ranked the 1st place. Most of existing state-of-the-art video recognition approaches are in favor of an end-to-end pipeline. One exception is the framework of DevNet. The merit of DevNet is that they first use the video data to learn a network (i.e. fine-tuning or training from scratch). Instead of directly using the end-to-end classification scores (e.g. softmax scores), they extract the features from the learned network and then fed them into the off-the-shelf machine learning models to conduct video classification. However, the effectiveness of this line work has long-term been ignored and underestimated. In this submission, we extensively use this strategy. Particularly, we investigate four temporal modeling approaches using the learned features: Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream sequence Model and Fast-Forward Sequence Model. Experiment results on the challenging Kinetics dataset demonstrate that our proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks. Most remarkably, our best single Multi-group Shifting Attention Network can achieve 77.7% in term of top-1 accuracy and 93.2% in term of top-5 accuracy on the validation set.

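Motivation and objectives

Motivates applying off-the-shelf temporal models on top of learned features to improve large-scale video understanding.
Evaluates a range of temporal modeling approaches on Kinetics using RGB, Flow, and Audio features.
Proposes four new temporal modeling approaches and assesses how well they complement one another.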

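Proposed method

Extracts multimodal features using Inception-ResNet-v2 for RGB/Flow and a VGG16-based audio model within the Temporal Segment Network framework.
Proposes four off-the-shelf temporal modeling methods: Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream Sequence Model, and Fast-Forward Sequence Model.
Uses depthwise separable convolutions and attention-based shifting operations for temporal modeling.
Fuses the per-modality attention/group outputs and passes them to a classifier; compares against conventional temporal pooling and LSTM-based baselines.
Evaluates on Kinetics with a fixed-length/segment-based test protocol and reports Top-1/Top-5 accuracy.
Shows the performance gain obtained by ensembling the individual models.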
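
To make the idea of attention over frame-level features concrete, here is a minimal sketch of attention-weighted temporal pooling with several attention groups. It is a generic illustration, not the paper's Multi-group Shifting Attention Network: the shifting operation is not specified in this review and is left out, and the function names (`attention_pool`, `multi_group_pool`) and the scoring vectors are hypothetical stand-ins for learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Pool T frame features (each of length D) into one length-D vector.

    frames: list of T feature vectors, e.g. per-frame RGB features.
    w: length-D scoring vector (assumed to be learned elsewhere).
    Returns the pooled vector and the per-frame attention weights.
    """
    # One scalar relevance score per frame, then normalize over time.
    scores = [sum(wi * xi for wi, xi in zip(w, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def multi_group_pool(frames, group_weights):
    """Run several attention groups and concatenate their pooled outputs."""
    out = []
    for w in group_weights:
        pooled, _ = attention_pool(frames, w)
        out.extend(pooled)
    return out


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # toy input: T=2 frames, D=2 dims
    video_feature = multi_group_pool(frames, [[0.0, 0.0], [5.0, 0.0]])
    print(len(video_feature))  # G * D values, ready for a classifier
```

With G groups over D-dimensional features the concatenated output has G × D values. In a multimodal setup, one such vector per modality could be concatenated before the classifier, which is one plausible reading of the fusion step listed above.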

Experimental results

Research questions

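RQ1: How effective are off-the-shelf temporal modeling approaches on learned multimodal video features for large-scale action recognition?
RQ2: Can the proposed temporal models match or outperform conventional sequence models such as LSTM on Kinetics?
RQ3: Do different temporal modeling approaches complement one another and yield better ensemble performance?
RQ4: How do the contributions of multimodal features (RGB, Flow, Audio) and plain score fusion to the performance gain differ?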

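Key findings

Temporal modeling on multimodal features outperforms plain score fusion of per-modality classifiers.
The proposed Shifting Attention Network and Temporal Xception Network perform on par with or better than conventional sequence models such as LSTM.
The four temporal models are complementary, and their ensemble gives the best performance.
On the Kinetics validation set, the best single model (Shifting Attention Network) achieves 77.7% Top-1 and 93.2% Top-5, and the ensemble reaches 81.5% Top-1 and 95.6% Top-5.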
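
The sketch below illustrates how an ensemble and the Top-1/Top-5 metrics can be computed. It assumes the ensemble is a plain average of per-class scores, which this review does not confirm; the actual combination scheme may differ. The function names and the toy scores are hypothetical.

```python
def average_scores(model_scores):
    """Average per-class scores from several models for one video.

    model_scores: list over models, each a list of per-class scores.
    Assumption: equal-weight averaging; the source does not state the scheme.
    """
    n_models = len(model_scores)
    n_classes = len(model_scores[0])
    return [
        sum(scores[c] for scores in model_scores) / n_models
        for c in range(n_classes)
    ]


def top_k_correct(scores, label, k):
    """True if the ground-truth label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def top_k_accuracy(all_scores, labels, k):
    """Fraction of videos whose label falls in the top-k predictions."""
    hits = sum(top_k_correct(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)


if __name__ == "__main__":
    # Toy example: two models disagree on one video whose true class is 1.
    model_a = [0.6, 0.3, 0.1]
    model_b = [0.1, 0.5, 0.4]
    ensemble = average_scores([model_a, model_b])
    print(top_k_correct(model_a, 1, 1), top_k_correct(ensemble, 1, 1))
```

In the toy example, model A alone ranks the wrong class first, while the averaged scores rank the correct class first. This is the kind of complementarity the ensemble result above relies on.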

This review was generated by AI and checked by a human editor.