QUICK REVIEW

[Paper Review] The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary

Bernard Ghanem, Juan Carlos Niebles | arXiv (Cornell University) | August 11, 2018

Human Pose and Action Recognition | References: 5 | Citations: 55

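One-Line Summary

This paper summarizes the 2018 ActivityNet Challenge and presents its six tasks (three main ActivityNet tasks and three guest tasks), which span temporal proposals, localization, and dense captioning in large-scale video, along with the top submissions.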

ABSTRACT

The 3rd annual installment of the ActivityNet Large-Scale Activity Recognition Challenge, held as a full-day workshop in CVPR 2018, focused on the recognition of daily life, high-level, goal-oriented activities from user-generated videos as those found in internet video portals. The 2018 challenge hosted six diverse tasks which aimed to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the six tasks were based on the ActivityNet dataset, which was introduced in CVPR 2015 and organized hierarchically in a semantic taxonomy. These tasks focused on tracing evidence of activities in time in the form of proposals, class labels, and captions. In this installment of the challenge, we hosted three guest tasks to enrich the understanding of visual information in videos. The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.

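Research Motivation and Goals

Push the limits of semantic visual understanding of daily activities in large-scale, user-generated videos.
Bridge visual content with human captions through diverse tasks and datasets.
Provide standardized evaluation across ActivityNet and the guest datasets using metrics for proposals, localization, and captioning.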

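Proposed Method

Define six tasks, three based on ActivityNet and three guest tasks, to evaluate different aspects of video understanding.
Measure proposal quality with the area under the Average Recall vs. Average Number of proposals per video (AR-AN) curve.
Evaluate temporal localization with mean Average Precision (mAP) at temporal IoU (tIoU) thresholds.
Evaluate dense captioning of events with an average score based on METEOR/BLEU/CIDEr.
Include guest tasks on Kinetics-600, AVA, and Moments in Time to broaden large-scale understanding.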
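
To make the proposal and localization metrics above more concrete, here is a minimal Python sketch of temporal IoU (tIoU) and of recall averaged over a set of tIoU thresholds for a single video. It is an illustration under stated assumptions, not the challenge's official evaluation code: the function names, the toy segments, and the threshold grid (0.50 to 0.95 in steps of 0.05) are all chosen here for the example. The full AR-AN curve would additionally vary the number of proposals per video and average over all videos.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_threshold(ground_truth, proposals, threshold):
    """Fraction of ground-truth segments covered by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(
        any(temporal_iou(gt, p) >= threshold for p in proposals)
        for gt in ground_truth
    )
    return hits / len(ground_truth)


def average_recall(ground_truth, proposals, thresholds):
    """Recall averaged over a list of tIoU thresholds."""
    recalls = [recall_at_threshold(ground_truth, proposals, t) for t in thresholds]
    return sum(recalls) / len(recalls)


# Assumed threshold grid: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical toy data: two annotated segments and two proposals.
ground_truth = [(0.0, 10.0), (20.0, 30.0)]
proposals = [(0.0, 8.0), (21.0, 30.0)]

if __name__ == "__main__":
    print(average_recall(ground_truth, proposals, THRESHOLDS))
```

In this toy case the first proposal overlaps its segment with tIoU 0.8 and the second with tIoU 0.9, so recall stays at 1.0 up to a threshold of 0.8, drops to 0.5 at 0.85 and 0.9, and reaches 0 at 0.95. Averaging over strict thresholds in this way rewards proposals whose boundaries are tight, not merely overlapping.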

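Experimental Results

Research Questions

RQ1: How can temporal action proposals be generated efficiently while remaining discriminative for the target activities?
RQ2: How effective are current methods at localizing and recognizing actions in long, untrimmed videos?
RQ3: How well can models detect, localize, and describe multiple events within a single video (dense captioning)?
RQ4: What insights do the large-scale guest datasets (Kinetics-600, AVA, Moments in Time) offer for broad activity understanding?
RQ5: What are the best-performing approaches across the different tasks and datasets of large-scale activity recognition?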

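Key Results

Task 1 (Temporal Action Proposals): the top three AUC scores were 71.00, 69.30, and 67.78, from Baidu Vis, Shanghai Jiao Tong University, and YH Technologies, respectively.
Task 2 (Temporal Action Localization): the top three average mAP values were 38.53, 35.49, and 35.27.
Task 3 (Dense-Captioning Events in Videos): the top two average METEOR scores were 8.53 and 8.13.
Task A (Trimmed Activity Recognition): the top three average errors were 10.99, 11.69, and 12.20.
Task B (Spatio-temporal Action Localization): the top three mAP@0.5IoU scores were 21.08, 21.03, and 19.60 on the CV track and 20.99, 19.60, and 16.76 on the Full track.
Task C (Trimmed Event Recognition): the top three average accuracies were 52.91, 51.26, and 50.06 on the Full track and 47.72, 45.49, and 45.10 on the Mini track.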


This review was generated by AI and reviewed by a human editor.