QUICK REVIEW

[Paper Review] YouTube-8M: A Large-Scale Video Classification Benchmark

Sami Abu-El-Haija, Nisarg Kothari|arXiv (Cornell University)|2016. 09. 27.

Multimodal Machine Learning Applications|References: 32|Citations: 920

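One-line summary

This paper introduces YouTube-8M, a multi-label video classification benchmark with about 8.3 million videos (over 500K hours), 4,800 labels, precomputed frame-level features, and baseline models. It evaluates frame-level and video-level representations and shows that they transfer to Sports-1M and ActivityNet.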

ABSTRACT

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high-precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one-frame-per-second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and make both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.

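Motivation and objectives

Introduce a large-scale, general-purpose multi-label video classification benchmark built from YouTube data.
Provide a vocabulary of 4,800 visually recognizable Knowledge Graph entities spanning diverse top-level categories.
Provide fixed, precomputed frame-level features and standardized train/validate/test splits to enable scalable research.
Demonstrate baseline models on the fixed frame-level features and on fixed video-level representations, and explore transfer learning to other benchmarks.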

Proposed method

Build the multi-label vocabulary from roughly 10,000 visually recognizable entities, keeping only those with at least 200 videos.
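Collect about 8.26 million videos (≈500K hours) for feature extraction and annotation, processing up to the first 6 minutes of each video at 1 frame per second (at most 360 frames per video).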
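Decode each video at 1 frame per second and extract the 2048-dimensional pool_3/_reshape feature from Inception; apply PCA + whitening to reduce it to 1024 dimensions, then 8-bit quantization, for 8× compression overall.
Release the fixed frame-level features and labels for all videos, together with train/validate/test splits (train:validate:test = 5,786,881:1,652,167:825,602).
Train simple frame-level and video-level models (one-vs-all logistic classifiers, an online SVM with hinge loss, and mixture-of-experts variants), and explore Deep Bag-of-Frames (DBoF) and LSTM models on the frame features.
Aggregate frame features into a video-level representation (mean, standard deviation, and top-K ordinal statistics), normalize it with PCA whitening, and train binary classifiers on this compact representation.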
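The aggregation step in the last item can be illustrated with a minimal sketch. It is not the authors' code: the function name `video_level_features`, the use of the population standard deviation, and the padding rule for short videos are assumptions made for illustration, and the subsequent PCA whitening step is omitted.

```python
from statistics import fmean, pstdev


def video_level_features(frames, top_k=2):
    """Aggregate per-frame feature vectors into one fixed-length video vector.

    frames: non-empty list of equal-length lists of floats, one per frame.
    Returns [means..., stds..., top-1 per dim..., ..., top-k per dim...].
    """
    if not frames:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension across all frames.
    dims = list(zip(*frames))
    means = [fmean(d) for d in dims]
    # Population standard deviation (an assumption; a sample estimate would also work).
    stds = [pstdev(d) for d in dims]
    # Ordinal statistics: the k largest values of each dimension.
    # If a video has fewer than k frames, the smallest value is repeated (assumption).
    ordered = [sorted(d, reverse=True) for d in dims]
    tops = [col[min(rank, len(col) - 1)] for rank in range(top_k) for col in ordered]
    return means + stds + tops


# Toy video: 3 frames with 2-dimensional features (real features have 1024 dimensions).
frames = [
    [1.0, 0.0],
    [3.0, 4.0],
    [2.0, 2.0],
]
video_vector = video_level_features(frames, top_k=2)
print(len(video_vector), video_vector)
```

Whatever the number of frames, the output length is always `dims * (2 + top_k)`, which is what lets simple binary classifiers be trained on videos of different durations.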

Experimental results

Research questions

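RQ1: Does a large, diverse multi-label video dataset enable learning general video representations beyond action-centric benchmarks?
RQ2: Do fixed frame-level features and fixed video-level representations support scalable multi-label video classification at this scale?
RQ3: Do representations learned on YouTube-8M transfer to other benchmarks such as Sports-1M and ActivityNet?
RQ4: How does the choice of model (logistic regression, hinge-loss SVM, mixture of experts, LSTM, and so on) affect multi-label video classification performance?
RQ5: How do dataset scale and label noise affect evaluation and the baselines?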

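Key results

YouTube-8M contains about 8.26 million videos, 4,800 classes, and about 1.9 billion frames after processing (1 FPS over the first 6 minutes of each video).
Precomputed frame features (2048-dimensional Inception activations, compressed with PCA + whitening and 8-bit quantization) let researchers train scalable baselines without heavy computation.
Baseline models on the fixed frame features and video-level representations can be trained with TensorFlow on a single machine, and some converge in less than a day on this data.
Video representations learned on YouTube-8M generalize to other benchmarks such as Sports-1M and ActivityNet, with a marked improvement on ActivityNet (mAP rises from 53.8% to 77.6%).
On a human-rated subset of the test set, the machine-generated labels reach 78.8% precision and 14.5% recall against the human ground truth, which highlights the missing-label problem and the opportunity for models that handle incorrect or missing labels.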
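The 8× compression of the frame features mentioned in this review can be checked with a small sketch of uniform 8-bit quantization. The helper names, the clipping range [-2, 2], and the assumption that the raw features are stored as 32-bit floats are illustrative choices, not details confirmed by the review.

```python
def quantize_8bit(vector, lo=-2.0, hi=2.0):
    """Uniformly map floats in [lo, hi] to integer codes 0..255, clipping outliers."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in vector)


def dequantize_8bit(codes, lo=-2.0, hi=2.0):
    """Approximate inverse of quantize_8bit."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


# Storage per frame: 2048 float32 values before (assumed), 1024 one-byte codes after.
raw_bytes = 2048 * 4
compressed_bytes = 1024 * 1
compression_ratio = raw_bytes / compressed_bytes

feature = [-2.5, -1.0, 0.0, 0.5, 2.0]  # the first value lies outside the range and is clipped
codes = quantize_8bit(feature)
restored = dequantize_8bit(codes)
# Reconstruction error for the in-range values only.
max_error = max(abs(a - b) for a, b in zip(feature[1:], restored[1:]))
print(compression_ratio, list(codes), max_error)
```

For values inside the range, the reconstruction error stays within half a quantization step, which is the trade-off accepted in exchange for the smaller download.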


This review was generated by AI and reviewed by a human editor.