QUICK REVIEW

[Paper Review] Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

Gunnar A. Sigurdsson|arXiv (Cornell University)|Apr 6, 2016

Human Pose and Action Recognition | References: 31 | Citations: 198

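One-line summary

This paper introduces the Hollywood in Homes approach, which crowdsources the end-to-end creation and annotation of videos of everyday activities. The result is the Charades dataset, 9,848 richly annotated videos of activities in people's homes, together with baselines for action recognition and description generation.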

ABSTRACT

Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.

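Motivation and objectives

Motivate the need for realistic, diverse data on everyday life, beyond what YouTube, movies, and lab recordings offer.
Propose a crowdsourced data collection pipeline covering script writing, video recording, and annotation, designed to capture mundane daily activities.
Build a large, diverse dataset (Charades) with rich annotations of temporally localized actions and object interactions.
Provide baseline evaluations for action recognition and automatic description generation on Charades.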

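Proposed method

Crowdsourced script generation, with prompts built from a scene and a vocabulary of 40 objects and 30 actions.
Crowdsourced video recording, in which workers act out the scripted sentences in their own homes for about 30 seconds.
Crowdsourced verification and annotation, covering temporal localization of 157 action classes, interacted objects, and free-text descriptions.
A three-stage Amazon Mechanical Turk (AMT) workflow: script generation, video recording, and annotation/verification.
Train/test splits constructed so that no worker appears in both sets and the category distribution is balanced.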
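
The worker-disjoint split described in the last point can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `(video_id, worker_id)` input format, the function name `split_by_worker`, and the `test_fraction` parameter are all hypothetical, and the sketch does not attempt to balance the category distribution.

```python
import random
from collections import defaultdict


def split_by_worker(videos, test_fraction=0.2, seed=0):
    """Split videos so that no worker appears in both train and test.

    `videos` is a list of (video_id, worker_id) pairs. This input format
    is an assumption for illustration, not the Charades release format.
    """
    # Group video ids by the worker who recorded them.
    by_worker = defaultdict(list)
    for video_id, worker_id in videos:
        by_worker[worker_id].append(video_id)

    # Shuffle workers deterministically so the split is reproducible.
    workers = sorted(by_worker)
    random.Random(seed).shuffle(workers)

    target = test_fraction * len(videos)
    train, test = [], []
    for worker in workers:
        # Assign whole workers to test until the target size is reached,
        # then send every remaining worker to train.
        if len(test) < target:
            test.extend(by_worker[worker])
        else:
            train.extend(by_worker[worker])
    return train, test
```

Because whole workers are assigned to one side, the test set can overshoot the requested fraction by up to one worker's videos. Splitting by worker rather than by video keeps the same person and home from appearing in both sets.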

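Experimental results

Research questions

RQ1: Can crowdsourced, scripted videos recorded at home provide realistic and diverse data on everyday life, beyond what entertainment video offers?
RQ2: What baseline performance do standard and state-of-the-art methods achieve for action recognition and caption generation on Charades?
RQ3: How do object-action interactions and scene context manifest in a crowdsourced, controlled-vocabulary dataset compared with uncontrolled online videos?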

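Key findings

Charades contains 9,848 videos (30.1 seconds on average) and 66,500 temporally localized action intervals across 157 action classes.
The dataset covers 46 object classes and a vocabulary of 30 verbs, which gives rise to a wide range of action-object interactions.
Baseline action recognition with improved dense trajectories (IDT), CNN-based methods, and two-stream networks yields relatively modest mAP: IDT is the strongest single method at 17.2% mAP, and the combination of methods reaches 18.6%.
For sentence prediction, S2VT is the strongest description-generation baseline, and its CIDEr score still leaves room for improvement relative to human-written descriptions.
The data reveals co-occurring actions and context-rich interactions that reflect real everyday activity, highlighting the challenges of fine-grained action recognition and video captioning.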
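
For readers unfamiliar with the mAP figures above, here is a minimal sketch of mean average precision for multi-label video classification, where a video can carry several action labels at once. It assumes per-video, per-class scores and binary labels given as nested lists. The function names are hypothetical, and the official Charades evaluation may differ in details such as tie handling or the treatment of unlabeled videos.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive."""
    # Rank videos from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; rows are videos, columns are classes.

    Classes with no positive example are skipped (an assumption of
    this sketch, since AP is undefined for them).
    """
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        if any(labels):
            aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps)


# Toy example: 3 videos, 2 action classes.
scores = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6]]
labels = [[1, 0], [0, 1], [1, 0]]
toy_map = mean_average_precision(scores, labels)
```

In the toy example, class 0 has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and class 1 has its single positive at rank 1 (AP = 1), so `toy_map` is about 0.917. Averaging per class rather than per video means rare actions count as much as frequent ones.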

This review was generated by AI and checked by a human editor.