QUICK REVIEW

[Paper Review] Object Level Visual Reasoning in Videos

Fabien Baradel, Nathalia Neverova|arXiv (Cornell University)|Jun 16, 2018

Human Pose and Action Recognition|References: 37|Citations: 89

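One-line summary

This paper introduces the Object Relation Network (ORN), which reasons about semantically meaningful object interactions in videos. By combining Mask-RCNN-based object detection with relational reasoning, it achieves state-of-the-art results on Something-Something, VLOG, and EPIC Kitchens.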

ABSTRACT

Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context. The next open challenges in activity recognition require a level of understanding that pushes beyond this and call for models with capabilities for fine distinction and detailed comprehension of interactions between actors and objects in a scene. We propose a model capable of learning to reason about semantically meaningful spatiotemporal interactions in videos. The key to our approach is a choice of performing this reasoning at the object level through the integration of state of the art object detection networks. This allows the model to learn detailed spatial interactions that exist at a semantic, object-interaction relevant level. We evaluate our method on three standard datasets (Twenty-BN Something-Something, VLOG and EPIC Kitchens) and achieve state of the art results on all of them. Finally, we show visualizations of the interactions learned by the model, which illustrate object classes and their interactions corresponding to different activity classes.

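Motivation and objectives

Motivate fine-grained understanding of human-object interactions that goes beyond global motion and scene cues.
Leverage explicit object detection to perform spatiotemporal reasoning over object relations in videos.
Develop an end-to-end trainable architecture that reasons about object instances across time.
Show that object-level reasoning improves over activity-only baselines on challenging datasets.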

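Proposed method

Introduce the Object Relation Network (ORN), which reasons across space and time over detected object instances.
Use Mask-RCNN to obtain object masks and class predictions, and extract per-object features with ROI-Pooling.
Model pairs of inter-frame object relations with a function h_theta, aggregate them with a global function g, and propagate the result through a recurrent f_phi (GRU) to capture long-range dependencies.
Combine the object reasoning representation with a separate activity head that captures global motion context.
Train with a joint loss: the activity classification loss plus an auxiliary object-class consistency loss that aligns object features with semantic classes.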
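
The pairwise-relation, aggregation, and recurrence steps listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It assumes that object pairs are formed between consecutive frames, that g is a plain sum over pairs, that h_theta is a single ReLU layer, and that f_phi is a bias-free GRU cell. The function names, dimensions, and random weights are all hypothetical, and the per-object features stand in for what ROI-Pooling would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(obj_dim, hidden_dim):
    """Random weights for a hypothetical h_theta layer and GRU cell f_phi."""
    s = 0.1
    return {
        # h_theta: one layer applied to a concatenated pair of object features
        "W_h": rng.normal(0, s, (2 * obj_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        # f_phi: GRU weights for the update gate z, reset gate r, candidate n
        "W_z": rng.normal(0, s, (2 * hidden_dim, hidden_dim)),
        "W_r": rng.normal(0, s, (2 * hidden_dim, hidden_dim)),
        "W_n": rng.normal(0, s, (2 * hidden_dim, hidden_dim)),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h_theta(pairs, p):
    """Pairwise relation function: ReLU layer applied to each object pair."""
    return np.maximum(pairs @ p["W_h"] + p["b_h"], 0.0)

def g_aggregate(prev_objs, cur_objs, p):
    """Assumed form of g: sum h_theta over all (previous, current) pairs."""
    k_prev, k_cur = len(prev_objs), len(cur_objs)
    left = np.repeat(prev_objs, k_cur, axis=0)    # (k_prev * k_cur, D)
    right = np.tile(cur_objs, (k_prev, 1))        # (k_prev * k_cur, D)
    pairs = np.concatenate([left, right], axis=1)
    return h_theta(pairs, p).sum(axis=0)          # (H,)

def gru_step(x, h, p):
    """f_phi: one step of a GRU cell without bias terms."""
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ p["W_z"])
    r = sigmoid(xh @ p["W_r"])
    n = np.tanh(np.concatenate([x, r * h]) @ p["W_n"])
    return (1.0 - z) * n + z * h

def orn_forward(frames, p, hidden_dim):
    """frames: list of (K_t, D) arrays holding per-object features per frame."""
    h = np.zeros(hidden_dim)
    for prev_objs, cur_objs in zip(frames[:-1], frames[1:]):
        h = gru_step(g_aggregate(prev_objs, cur_objs, p), h, p)
    return h

# Three frames with 3, 2, and 4 detected objects, each an 8-dim feature.
frames = [rng.normal(size=(k, 8)) for k in (3, 2, 4)]
params = init_params(obj_dim=8, hidden_dim=16)
video_repr = orn_forward(frames, params, hidden_dim=16)
print(video_repr.shape)  # (16,)
```

Because the assumed g sums over pairs, the result does not depend on the order in which objects are listed within a frame, and the number of detections may vary from frame to frame. In the described model this object-level vector would then be combined with the activity head's features before classification.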

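Experimental results

Research questions

RQ1: Can semantically grounded object-level relational reasoning improve video activity recognition beyond traditional global-motion models?
RQ2: Does explicit inter-frame object interaction reasoning with recurrence (ORN) improve performance on fine-grained video understanding tasks?
RQ3: How does using semantically defined object instances affect video activity recognition compared with pixel-level relational reasoning?
RQ4: What is the effect of training the object head and the activity head jointly versus separately?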

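Key findings

On VLOG, the proposed method reaches 44.7% mAP, 4.2 points above the previous best (40.5%).
On Something-Something, the proposed method beats the state of the art by 2.3 points.
On EPIC Kitchens, it reaches 40.89% accuracy, about 6.4 to 7.9 points higher depending on the baseline.
Ablations show that adding object-level reasoning yields notable gains (0.8 to 2.5+ points) over the activity-head baseline across the datasets.
Using semantically defined objects instead of pixel-level reasoning improves results by about 2 points on EPIC and about 2.3 points on VLOG.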
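
The VLOG figure above is reported as mAP, while the EPIC Kitchens figure is an accuracy, so the two are not directly comparable. The review does not say how mAP is computed. The sketch below shows one common definition (non-interpolated average precision per class, averaged over classes) and may differ from the exact evaluation protocol used for VLOG. The function names and toy scores are hypothetical.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean precision at each positive."""
    order = np.argsort(-scores, kind="stable")  # rank samples by score, descending
    hits = labels[order]
    if hits.sum() == 0:
        return float("nan")                      # class has no positive samples
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(score_matrix, label_matrix):
    """Average the per-class AP over classes, skipping classes with no positives."""
    aps = [
        average_precision(score_matrix[:, c], label_matrix[:, c])
        for c in range(score_matrix.shape[1])
    ]
    return float(np.nanmean(aps))

# Toy example: 4 clips, 2 classes (rows are clips, columns are classes).
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]])
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
print(round(mean_average_precision(scores, labels), 4))  # 0.9167
```

In the toy example the first class has AP 5/6 (its positives are ranked 1st and 3rd) and the second has AP 1.0, which gives a mean of about 0.9167.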


This review was generated by AI and checked by a human editor.