QUICK REVIEW

[Paper Review] Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

Jinwoo Choi, Chen Gao|arXiv (Cornell University)|Dec 11, 2019

Human Pose and Action Recognition|Cited by 99

ONE-LINE SUMMARY

This paper proposes a debiasing framework for video action recognition that mitigates scene bias with a scene-adversarial loss and a human-masked entropy loss, improving transfer to classification, localization, and detection tasks.

ABSTRACT

Human activities often occur in specific scene contexts, e.g., playing basketball on a basketball court. Training a model using existing video datasets thus inevitably captures and leverages such bias (instead of using the actual discriminative cues). The learned representation may not generalize well to new action classes or different tasks. In this paper, we propose to mitigate scene bias for video representation learning. Specifically, we augment the standard cross-entropy loss for action classification with 1) an adversarial loss for scene types and 2) a human mask confusion loss for videos where the human actors are masked out. These two losses encourage learning representations that are unable to predict the scene types and the correct actions when there is no evidence. We validate the effectiveness of our method by transferring our pre-trained model to three different tasks, including action classification, temporal localization, and spatio-temporal action detection. Our results show consistent improvement over the baseline model without debiasing.

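MOTIVATION AND OBJECTIVES

Motivate and quantify scene representation bias in action recognition datasets.
Propose a debiasing training objective that learns scene-invariant features.
Improve generalization through transfer learning to multiple action understanding tasks.
Evaluate the debiasing method on action classification, temporal localization, and spatio-temporal detection.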

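PROPOSED METHOD

Pre-train a CNN on Mini-Kinetics-200 with the standard cross-entropy loss on action labels.
Add a scene adversarial loss, with a scene classifier placed on top of the feature extractor, to learn scene-invariant features.
Add a human mask confusion loss that masks out the humans in a video and maximizes the entropy of the predicted actions for these masked videos.
Train the scene-adversarial objective adversarially using a gradient reversal layer.
During training, mask humans with an off-the-shelf detector and replace their pixels with the frame mean.
Fine-tune the debiased representation on downstream tasks (action classification, localization, detection).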
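
The three loss terms and the masking step listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mask_humans`, `grad_reverse`, `debias_losses`), the grayscale 2D-list frame format, the box convention, and the weight `lam` are all hypothetical, and a real system would use a deep learning framework with automatic differentiation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy against a target distribution (one-hot or soft)."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mask_humans(frame, boxes):
    """Replace pixels inside each human box with the frame mean.

    Assumed format: frame is a 2D list of grayscale values; each box is
    (top, left, bottom, right) with exclusive bottom/right, as might come
    from an off-the-shelf detector.
    """
    h, w = len(frame), len(frame[0])
    mean = sum(sum(row) for row in frame) / (h * w)
    masked = [row[:] for row in frame]
    for top, left, bottom, right in boxes:
        for y in range(max(0, top), min(h, bottom)):
            for x in range(max(0, left), min(w, right)):
                masked[y][x] = mean
    return masked

def grad_reverse(grad, lam=1.0):
    """Backward pass of a gradient reversal layer.

    The forward pass is the identity; on the way back the gradient is
    multiplied by -lam, so the feature extractor is pushed to make the
    scene classifier worse while the classifier itself still improves.
    """
    return [-lam * g for g in grad]

def debias_losses(action_logits, action_label, scene_logits, scene_target,
                  masked_action_logits):
    """Return (L_CE, L_Adv, L_Ent) for a single example."""
    one_hot = [1.0 if k == action_label else 0.0
               for k in range(len(action_logits))]
    l_ce = cross_entropy(softmax(action_logits), one_hot)
    # Scene classification loss; it reaches the features through grad_reverse.
    l_adv = cross_entropy(softmax(scene_logits), scene_target)
    # Negative entropy: minimizing it maximizes entropy on human-masked input.
    l_ent = -entropy(softmax(masked_action_logits))
    return l_ce, l_adv, l_ent
```

With uniform predictions on a human-masked clip, `l_ent` reaches its minimum of `-log(num_classes)`, which is the state the human mask confusion loss pushes toward: no confident action prediction when the actor is not visible.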

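EXPERIMENTAL RESULTS

Research Questions

RQ1: Does the proposed debiasing reduce scene representation bias in video datasets?
RQ2: Do debiased representations transfer better to unseen action classes and tasks beyond the pre-training data?
RQ3: How does each of the two proposed debiasing losses affect generalization?
RQ4: How do different pseudo scene labels affect the effectiveness of debiasing?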

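Key Findings

Debiasing reduced scene-dependent features: scene classifier accuracy on the Mini-Kinetics-200 validation set dropped from 29.7% to 2.9%.
Debiased pre-training consistently improved transfer performance for action classification on HMDB-51, UCF-101, and Diving48.
Debiasing also improved temporal action localization on THUMOS-14 and spatio-temporal action detection on JHMDB.
Scene-adversarial training with soft pseudo scene labels outperformed training with hard labels.
Both L_Adv and L_Ent contributed, and using both gave the best results.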
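
The difference between hard and soft pseudo scene labels can be illustrated with a short sketch. It assumes the pseudo labels come from the output probabilities of some off-the-shelf scene classifier; the names `hard_label`, `soft_label`, `teacher`, and `student`, and the example numbers, are hypothetical and chosen only for illustration.

```python
import math

def hard_label(scene_probs):
    """One-hot target at the most probable scene class."""
    best = max(range(len(scene_probs)), key=lambda k: scene_probs[k])
    return [1.0 if k == best else 0.0 for k in range(len(scene_probs))]

def soft_label(scene_probs):
    """Full (renormalized) distribution used as the target."""
    total = sum(scene_probs)
    return [p / total for p in scene_probs]

def cross_entropy(pred_probs, target, eps=1e-12):
    """Cross-entropy of predictions against a one-hot or soft target."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target))

# Hypothetical outputs: pseudo-label source vs. the scene head being trained.
teacher = [0.6, 0.3, 0.1]
student = [0.5, 0.4, 0.1]

loss_hard = cross_entropy(student, hard_label(teacher))
loss_soft = cross_entropy(student, soft_label(teacher))
```

The hard target keeps only the top scene class, while the soft target also carries the pseudo-label source's uncertainty about secondary scene classes, so the adversarial signal covers every scene the clip plausibly resembles.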

This review was generated by AI and checked by a human editor.