QUICK REVIEW

[Paper Review] YouTube-8M: A Large-Scale Video Classification Benchmark

Sami Abu-El-Haija, Nisarg Kothari|arXiv (Cornell University)|Sep 27, 2016

Multimodal Machine Learning Applications | References: 32 | Citations: 920

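One-Line Summary

This paper introduces YouTube-8M, a large-scale multi-label video classification benchmark with about 8.3 million videos (over 500,000 hours) and 4,800 labels, released with pre-extracted frame features and baselines. It evaluates frame-based and video-level representations and demonstrates transfer to Sports-1M and ActivityNet.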

ABSTRACT

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high-precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one-frame-per-second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and make both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.

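Motivation and Objectives

Introduce a large-scale, general-purpose multi-label video classification benchmark built on YouTube data.
Provide a visually recognizable vocabulary of 4,800 Knowledge Graph entities spanning diverse top-level categories.
Provide precomputed frame-level features and standardized train/validate/test splits to enable scalable research.
Present baseline models on fixed frame features and fixed video representations, and explore transfer learning to other benchmarks.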

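Proposed Method

Build a visual, multi-label vocabulary of about 10,000 visually recognizable entities (filtered to entities with at least 200 videos).
Collect about 8.26 million videos (about 500K hours), with 1,400+ frames per video.
Decode each video at 1 frame per second; extract 2048-dimensional pool_3/_reshape features from Inception; apply PCA + whitening to reduce them to 1024 dimensions; quantize to 8 bits for 8x compression.
Provide fixed frame-level features for all videos and release train/validate/test splits (train:validate:test = 5,786,881 : 1,652,167 : 825,602).
Train simple frame-based and video-level models: one-vs-all logistic classifiers, online SVMs with hinge loss, and Mixture-of-Experts variants; explore Deep Bag-of-Frames (DBoF) and LSTMs over frame features.
Aggregate frame features into video-level representations using the mean, variance, and top-K order statistics, normalize them with PCA whitening, and train binary classifiers on these compact representations.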
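The feature steps listed above (PCA + whitening, 8-bit quantization, and pooling frames into a video-level vector) can be sketched roughly as follows. This is a minimal NumPy illustration on toy data, not the authors' pipeline: the function names, the clipping range of [-2, 2], the choice of top_k=5, and the small 64-to-32 dimensions (standing in for 2048 to 1024) are all assumptions made for the example.

```python
import numpy as np

def fit_pca_whitening(frames, out_dim, eps=1e-6):
    """Fit PCA + whitening on an (n_frames, in_dim) matrix of frame features."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    cov = centered.T @ centered / (len(frames) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:out_dim]      # keep the top out_dim components
    # Dividing by sqrt(eigenvalue) gives each kept component unit variance.
    projection = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mean, projection

def quantize_8bit(x, lo=-2.0, hi=2.0):
    """Uniform 256-level quantization; lo/hi are assumed clipping bounds."""
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)

def dequantize_8bit(q, lo=-2.0, hi=2.0):
    """Map 8-bit codes back to approximate float values."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

def video_level_features(frames, top_k=5):
    """Concatenate per-dimension mean, variance, and top-K order statistics."""
    top = np.sort(frames, axis=0)[::-1][:top_k]      # (top_k, dim), largest first
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0), top.ravel()])

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 64))                     # toy stand-in for 2048-d frame features
mean, proj = fit_pca_whitening(raw, out_dim=32)      # toy stand-in for 2048 -> 1024
whitened = (raw - mean) @ proj
codes = quantize_8bit(whitened)
video_vec = video_level_features(dequantize_8bit(codes))

# Storage ratio if the raw features were kept as 32-bit floats (an assumption).
ratio = raw.astype(np.float32).nbytes / codes.nbytes
```

In this toy setup the ratio works out to 8, which is consistent with the 8x figure if the raw 2048-dimensional features are stored as 32-bit floats: halving the dimension gives 2x and going from 4 bytes to 1 byte per value gives another 4x.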

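Experimental Results

Research Questions

RQ1: Does a large, diverse multi-label video dataset enable learning general video representations beyond action-centric benchmarks?
RQ2: At this scale, how well do fixed frame-level features and fixed video-level representations support scalable multi-label video classification?
RQ3: Do representations learned on YouTube-8M transfer to other benchmarks such as Sports-1M and ActivityNet?
RQ4: How does model choice (logistic regression, hinge-loss SVM, Mixture-of-Experts, LSTM) affect multi-label video classification performance?
RQ5: How do dataset scale and label noise affect evaluation and baselines?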

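Key Findings

YouTube-8M contains about 8.26 million videos, 4,800 classes, and ≈1.9 billion frames after processing the first 6 minutes of each video at 1 FPS.
Precomputed frame features (2048-dimensional), reduced with PCA + whitening and 8-bit quantization, enable scalable baselines without demanding excessive compute from researchers.
Some baseline models built on fixed frame features and video-level representations can be trained on a single machine with TensorFlow and converge on this data in under a day.
Video representations learned on YouTube-8M generalize to other benchmarks such as Sports-1M and ActivityNet, with a notable gain on ActivityNet (mAP from 53.8% to 77.6%).
On a human-rated test subset, the dataset labels reach 78.8% precision and 14.5% recall against ground truth, highlighting the missing-label problem and the opportunity to model erroneous and missing labels.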
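To make the precision and recall figures above concrete, here is a small sketch of how the two numbers can be computed when a dataset's labels are compared with fuller human ratings. The micro-averaging choice, the function name, and the toy label sets are assumptions for illustration; the review does not state how the paper aggregated these figures.

```python
def precision_recall(dataset_labels, human_labels):
    """Micro-averaged precision and recall over per-video label sets."""
    true_pos = sum(len(d & h) for d, h in zip(dataset_labels, human_labels))
    n_dataset = sum(len(d) for d in dataset_labels)
    n_human = sum(len(h) for h in human_labels)
    precision = true_pos / n_dataset if n_dataset else 0.0
    recall = true_pos / n_human if n_human else 0.0
    return precision, recall

# Hypothetical toy data: sparse dataset labels vs. fuller human-rated labels.
dataset_labels = [{"guitar"}, {"car"}, {"dog", "beach"}]
human_labels = [
    {"guitar", "concert", "musician", "stage"},
    {"car", "racing", "vehicle"},
    {"dog", "pet", "animal"},
]

precision, recall = precision_recall(dataset_labels, human_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The toy data shows the same pattern as the reported numbers: most labels that are present are correct, so precision is high, but many applicable labels are missing, so recall is low.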

This review was generated by AI and checked by a human editor.