QUICK REVIEW

[Paper Review] YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

Jianhua Yang, Kun Dai | arXiv (Cornell University) | 2023-02-14

Human Pose and Action Recognition | Citations: 11

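One-line summary

YOWOv2 is a real-time, anchor-free, multi-level spatio-temporal action detector that combines a 3D backbone, a multi-level 2D backbone, and decoupled fusion heads, achieving a state-of-the-art speed-accuracy trade-off on UCF101-24 and AVA.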

ABSTRACT

Designing a real-time framework for the spatio-temporal action detection task is still a challenge. In this paper, we propose a novel real-time action detection framework, YOWOv2. In this new framework, YOWOv2 takes advantage of both the 3D backbone and 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To achieve this goal, we carefully build a simple and efficient 2D backbone with a feature pyramid network to extract different levels of classification features and regression features. For the 3D backbone, we adopt the existing efficient 3D CNN to save development time. By combining 3D backbones and 2D backbones of different sizes, we design a YOWOv2 family including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and anchor-free mechanism to make the YOWOv2 consistent with the advanced model architecture design. With our improvement, YOWOv2 is significantly superior to YOWO, and can still keep real-time detection. Without any bells and whistles, YOWOv2 achieves 87.0 % frame mAP and 52.8 % video mAP with over 20 FPS on the UCF101-24. On the AVA, YOWOv2 achieves 21.7 % frame mAP with over 20 FPS. Our code is available on https://github.com/yjh0410/YOWOv2.

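Motivation and Objectives

Real-time spatio-temporal action detection that can accurately detect small action instances is needed.
Develop a multi-level, anchor-free detection framework to improve detection of small instances.
Efficiently fuse 3D spatio-temporal features with multi-level 2D spatial features.
Provide a family of Tiny, Medium, and Large models for different compute budgets.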

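Proposed Method

A 3D backbone extracts spatio-temporal features from the video clip.
A multi-level 2D backbone with a feature pyramid network produces decoupled classification and regression features at three levels.
A ChannelEncoder fuses the 2D and 3D features using a self-attention step inspired by DANet.
At each level, decoupled fusion heads fuse F_ST (the spatio-temporal feature) with F_cls and with F_reg separately.
Anchor-free dynamic label assignment (SimOTA) is adopted, so training needs no predefined anchors.
Training uses a loss that combines conf, cls, and reg terms, balanced by a weight lambda.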
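The combined training loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary only says that conf, cls, and reg terms are combined and balanced by lambda, so applying lambda to the reg term, normalizing by the number of positive samples, using binary cross-entropy for conf and cls, and using a plain 1 - IoU regression loss are all assumptions. The function names, the sample dictionary keys, and the default `lam=5.0` are hypothetical and chosen only for the example.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combined_loss(samples, lam=5.0):
    """Combine conf, cls, and reg terms into one scalar (illustrative only).

    Each sample is a dict with hypothetical keys:
      conf       predicted confidence in [0, 1]
      positive   True if label assignment matched this prediction to a ground truth
      cls        list of per-class probabilities (positives only)
      cls_target list of 0/1 class labels (positives only)
      box, gt    predicted and ground-truth boxes (positives only)
    """
    n_pos = max(1, sum(1 for s in samples if s["positive"]))
    # Assumption: confidence is supervised on every sample with a 0/1 target.
    conf = sum(bce(s["conf"], 1 if s["positive"] else 0) for s in samples)
    cls = 0.0
    reg = 0.0
    for s in samples:
        if not s["positive"]:
            continue  # cls and reg terms only apply to positive samples
        cls += sum(bce(p, y) for p, y in zip(s["cls"], s["cls_target"]))
        reg += 1.0 - iou(s["box"], s["gt"])  # assumption: plain IoU loss
    conf, cls, reg = conf / n_pos, cls / n_pos, reg / n_pos
    # Assumption: lambda weights the regression term.
    return {"conf": conf, "cls": cls, "reg": reg, "total": conf + cls + lam * reg}


if __name__ == "__main__":
    demo = [
        {"conf": 0.9, "positive": True, "cls": [0.8, 0.1], "cls_target": [1, 0],
         "box": (0.0, 0.0, 10.0, 10.0), "gt": (0.0, 0.0, 10.0, 5.0)},
        {"conf": 0.2, "positive": False},
    ]
    print(combined_loss(demo))
```

Returning the individual terms alongside the total makes it easy to see how much of the loss the lambda-weighted term contributes when tuning the balance.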

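Experimental Results

Research Questions

RQ1: Can a multi-level anchor-free detector achieve real-time spatio-temporal action detection while improving localization of small actions?
RQ2: Does decoupled fusion of 2D and 3D features outperform coupled fusion for spatio-temporal action detection (STAD)?
RQ3: How do the Tiny, Medium, and Large backbone sizes affect the speed-accuracy trade-off on datasets such as UCF101-24 and AVA?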

Key Results

YOWOv2-Tiny/Medium/Large achieve higher frame mAP and video mAP than YOWO on UCF101-24 with fewer FLOPs and parameters.
The decoupled fusion head outperforms the coupled fusion head, improving F-mAP and V-mAP at a slight cost in speed.
Dynamic label assignment (SimOTA) enables anchor-free training and delivers competitive performance.
On UCF101-24, YOWOv2-L reaches 85.2% F-mAP and 52.0% V-mAP with 16 frames at 30 FPS on an RTX 3090; with 32 frames it rises to 87.0% F-mAP and 52.8% V-mAP, at 22 FPS.
On AVA, YOWOv2-L reaches 21.7% frame mAP (K=16) at over 20 FPS.
YOWOv2-T outperforms YOWO in F-mAP and V-mAP while using far fewer FLOPs and parameters.
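To make the speed-accuracy trade-off concrete, the UCF101-24 figures quoted above for YOWOv2-L can be turned into per-frame time budgets. The sketch below is only arithmetic on those reported numbers; it assumes frames are processed one after another, so that 1000 / FPS approximates the average time per frame, and the function and variable names are hypothetical.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds at a given throughput."""
    return 1000.0 / fps


# Figures quoted in this review for YOWOv2-L on UCF101-24 (RTX 3090),
# keyed by clip length in frames.
REPORTED = {
    16: {"fps": 30.0, "f_map": 85.2, "v_map": 52.0},
    32: {"fps": 22.0, "f_map": 87.0, "v_map": 52.8},
}


def trade_off(short_clip, long_clip):
    """Accuracy gained and per-frame time added when moving to the longer clip."""
    a, b = REPORTED[short_clip], REPORTED[long_clip]
    return {
        "extra_ms_per_frame": frame_time_ms(b["fps"]) - frame_time_ms(a["fps"]),
        "f_map_gain": b["f_map"] - a["f_map"],
        "v_map_gain": b["v_map"] - a["v_map"],
    }


if __name__ == "__main__":
    for name, value in trade_off(16, 32).items():
        print(f"{name}: {value:.2f}")
```

Under that assumption, the longer clip adds roughly 12 ms per frame in exchange for 1.8 points of F-mAP and 0.8 points of V-mAP.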


This review was generated by AI and reviewed by a human editor.