QUICK REVIEW

[Paper Review] Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

Xiaojiang Peng, Limin Wang|arXiv (Cornell University)|May 18, 2014

Human Pose and Action Recognition | References: 47 | Citations: 128

One-Sentence Summary

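This paper presents a comprehensive study of the Bag of Visual Words (BoVW) pipeline for video action recognition and identifies the best-performing settings at each stage: feature extraction, encoding, and fusion. It proposes a hybrid representation that combines Fisher vector (FV) and super vector coding (SVC) features and, using representation-level fusion, achieves state-of-the-art performance of 61.1% on HMDB51, 92.3% on UCF50, and 87.9% on UCF101.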

ABSTRACT

Video based action recognition is one of the important and challenging problems in computer vision research. Bag of Visual Words model (BoVW) with local features has become the most popular method and obtained the state-of-the-art performance on several realistic datasets, such as the HMDB51, UCF50, and UCF101. BoVW is a general pipeline to construct a global representation from a set of local features, which is mainly composed of five steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v) pooling and normalization. Many efforts have been made in each step independently in different scenarios and their effect on action recognition is still unknown. Meanwhile, video data exhibits different views of visual pattern, such as static appearance and motion dynamics. Multiple descriptors are usually extracted to represent these different views. Many feature fusion methods have been developed in other areas and their influence on action recognition has never been investigated before. This paper aims to provide a comprehensive study of all steps in BoVW and different fusion methods, and uncover some good practice to produce a state-of-the-art action recognition system. Specifically, we explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. We conclude that every step is crucial for contributing to the final recognition rate. Furthermore, based on our comprehensive study, we propose a simple yet effective representation, called hybrid representation, by exploring the complementarity of different BoVW frameworks and local descriptors. Using this representation, we obtain the state-of-the-art on the three challenging datasets: HMDB51 (61.1%), UCF50 (92.3%), and UCF101 (87.9%).

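Motivation and Objectives

To systematically evaluate how each component of the BoVW pipeline affects action recognition performance.
To investigate the effectiveness of different fusion strategies when combining multiple descriptors.
To identify good practice for building a robust and accurate BoVW-based action recognition system.
To develop a simple yet effective hybrid representation that exploits the complementarity of different encoding methods and descriptors.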

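Proposed Method

The authors evaluated ten encoding methods, eight pooling and normalization strategies, and three fusion methods on multiple datasets (HMDB51, UCF50, UCF101).
Local spatio-temporal features (iDT with HOG, HOF, and MBH descriptors) were used, with feature pre-processing applied to reduce correlation within the descriptors.
A hybrid representation was proposed that combines the Fisher vector (FV) and super vector coding (SVC) outputs computed from multiple descriptors: HOG, HOF, MBHx, and MBHy.
Features were fused at the representation level, with power normalization and intra ℓ₂ normalization applied to improve robustness.
For classification, an SVM with an RBF kernel was trained on the fused representation.
Ablation studies were run at each stage of the pipeline to isolate the contribution of each component.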
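
The normalization and representation-level fusion steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the exponent `alpha=0.5`, the final whole-vector ℓ₂ step, and the random stand-in encodings are all assumptions made for the example.

```python
import numpy as np

def power_normalize(v, alpha=0.5):
    """Signed power normalization: sign(v) * |v|**alpha (alpha is an assumed value)."""
    return np.sign(v) * np.abs(v) ** alpha

def intra_l2_normalize(v, num_blocks, eps=1e-12):
    """L2-normalize each per-codeword block, then the whole vector (assumed final step)."""
    blocks = v.reshape(num_blocks, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    blocks = blocks / np.maximum(norms, eps)
    out = blocks.reshape(-1)
    return out / max(np.linalg.norm(out), eps)

def fuse_representations(encodings, num_blocks):
    """Representation-level fusion: normalize each encoding, then concatenate."""
    return np.concatenate(
        [intra_l2_normalize(power_normalize(e), num_blocks) for e in encodings]
    )

# Hypothetical stand-ins for the per-descriptor encodings of one video:
# K codewords, 6 values per codeword, one vector per descriptor type.
rng = np.random.default_rng(0)
K = 4
encodings = {name: rng.normal(size=K * 6) for name in ["HOG", "HOF", "MBHx", "MBHy"]}

video_vector = fuse_representations(list(encodings.values()), num_blocks=K)
print(video_vector.shape)  # (96,)
```

Normalizing each descriptor's encoding before concatenation keeps any single descriptor from dominating the fused vector, which is one plausible reading of why the normalization step matters for fusion.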

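Experimental Results

Research Questions

RQ1: How do different local features and encoding methods within the BoVW framework affect action recognition performance?
RQ2: What is the relative impact of pooling and normalization strategies on final recognition accuracy?
RQ3: When fusing multiple descriptors, which fusion strategy performs best: descriptor-level, representation-level, or score-level fusion?
RQ4: Can a hybrid representation combining FV and SVC encodings improve performance by exploiting their complementary statistics (first- and second-order vs. zeroth- and first-order)?
RQ5: Which design choices are critical for achieving state-of-the-art action recognition with BoVW?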

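Key Findings

Every stage of the BoVW pipeline (feature extraction, pre-processing, codebook generation, encoding, and pooling) significantly affects the final recognition accuracy, and a suboptimal choice at one stage can cancel out improvements made at the others.
Representation-level fusion consistently outperformed descriptor-level and score-level fusion, particularly with reconstruction-based encoding methods (e.g., SA-k, LLC, VQ).
Fusing the Fisher vector (FV) and super vector coding (SVC) representations yielded a marked performance gain thanks to their complementary statistics (first- and second-order vs. zeroth- and first-order).
The proposed hybrid representation achieved 61.1% accuracy on HMDB51, exceeding the previous best result by 3.9%, and set a new state of the art on UCF50 (92.3%) and UCF101 (87.9%), outperforming recent deep learning and more complex encoding methods.
The study found that supervector-based encoding methods (e.g., FV, SVC) are less sensitive to the choice of fusion strategy because they rely on stable, compact codebooks, whereas reconstruction-based methods benefit more from representation-level fusion.
The gains from fusion stem mainly from the complementarity of different descriptors and encoding schemes, not merely from the increase in feature dimensionality.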
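
To make the "complementary statistics" point concrete, the sketch below computes zeroth-, first-, and second-order statistics per codeword from a set of local descriptors. It is a simplified illustration under stated assumptions, not the paper's FV or SVC formulas: it uses soft assignments from a hypothetical diagonal-covariance Gaussian mixture for both encodings and omits the exact scaling factors each method applies. All function names and parameter values are made up for the example.

```python
import numpy as np

def soft_assignments(X, means, variances, weights):
    """Posterior of each diagonal-Gaussian component for every descriptor."""
    diff = X[:, None, :] - means[None, :, :]                       # (N, K, D)
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2
    )                                                              # (N, K)
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)                # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def codeword_statistics(X, means, variances, weights):
    """Zeroth-, first-, and second-order statistics per codeword (unscaled)."""
    gamma = soft_assignments(X, means, variances, weights)         # (N, K)
    diff = (X[:, None, :] - means[None, :, :]) / np.sqrt(variances)
    zeroth = gamma.mean(axis=0)                                    # (K,)
    first = np.einsum("nk,nkd->kd", gamma, diff) / len(X)          # (K, D)
    second = np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1.0) / len(X)
    return zeroth, first, second

# Hypothetical data: N local descriptors of dimension D, K codewords.
rng = np.random.default_rng(0)
N, K, D = 200, 3, 5
X = rng.normal(size=(N, D))
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
weights = np.full(K, 1.0 / K)

zeroth, first, second = codeword_statistics(X, means, variances, weights)
fv_like = np.concatenate([first.ravel(), second.ravel()])   # first + second order
svc_like = np.concatenate([zeroth, first.ravel()])          # zeroth + first order
hybrid = np.concatenate([fv_like, svc_like])                # fused representation
print(fv_like.shape, svc_like.shape, hybrid.shape)
```

The two encodings share first-order information but each carries one statistic the other lacks, which is consistent with the finding that the gain comes from complementarity and not just from a longer vector.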

This review was generated by AI and checked by a human editor.