QUICK REVIEW

[Paper Review] Multi-label Image Recognition by Recurrently Discovering Attentional Regions

Zhouxia Wang, Tianshui Chen|arXiv (Cornell University)|Nov 8, 2017

Text and Document Classification Technologies | References: 22 | Citations: 53

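One-line summary

This paper proposes a region-proposal-free, end-to-end framework for multi-label image recognition that learns attentional regions with a spatial transformer and an LSTM, and captures the dependencies among those regions.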

ABSTRACT

This paper proposes a novel deep architecture to address multi-label image recognition, a fundamental and practical task towards general visual understanding. Current solutions for this task usually rely on an extra step of extracting hypothesis regions (i.e., region proposals), resulting in redundant computation and sub-optimal performance. In this work, we achieve the interpretable and contextualized multi-label image classification by developing a recurrent memorized-attention module. This module consists of two alternately performed components: i) a spatial transformer layer to locate attentional regions from the convolutional feature maps in a region-proposal-free way and ii) an LSTM (Long-Short Term Memory) sub-network to sequentially predict semantic labeling scores on the located regions while capturing the global dependencies of these regions. The LSTM also outputs the parameters for computing the spatial transformer. On large-scale benchmarks of multi-label image classification (e.g., MS-COCO and PASCAL VOC 07), our approach demonstrates superior performances over other existing state-of-the-arts in both accuracy and efficiency.

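Motivation and objectives

Address the inefficiency of pipelines that depend on hypothesis regions (region proposals) in multi-label image recognition.
Develop an end-to-end architecture that automatically discovers semantically meaningful attentional regions without external proposals.
Capture long-range contextual dependencies among the attended regions to improve labeling accuracy.
Provide constraints that guide spatial transformer localization toward more interpretable regions.
Demonstrate state-of-the-art performance on VOC 2007 and MS-COCO together with improved efficiency.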

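Proposed method

Incorporate a spatial transformer layer into the CNN to localize attentional regions on the convolutional feature maps without region proposals.
Use an LSTM to sequentially predict label scores for each attended region and to output the localization parameters for the next step.
Attend to K regions iteratively and fuse the region scores with category-wise max pooling to obtain the final label scores.
Apply a category-level Euclidean loss for multi-label classification.
Introduce three localization constraints (anchor, scale, positive) that diversify the spatial locations, control region size, and avoid mirroring, and impose them as a combined localization loss.
Train end-to-end with the Adam optimizer on the joint loss L = L_cls + gamma * L_loc.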
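
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameterization theta = (s_x, s_y, t_x, t_y), the exact form of the three penalties, and the values of alpha, beta, and gamma are all hypothetical choices made for this example. Only the affine sampling grid (standard for spatial transformers), the category-wise max pooling, the Euclidean classification loss, and the form L = L_cls + gamma * L_loc come from the summary above.

```python
import numpy as np

def affine_grid(theta, size):
    """Sampling grid of a scale-and-translate spatial transformer.

    theta = (s_x, s_y, t_x, t_y) in normalized [-1, 1] coordinates
    (assumed parameterization). Returns an array of shape (size, size, 2)
    holding the source (x, y) location for every target pixel.
    """
    s_x, s_y, t_x, t_y = theta
    lin = np.linspace(-1.0, 1.0, size)
    x_t, y_t = np.meshgrid(lin, lin)          # target coordinates
    return np.stack([s_x * x_t + t_x, s_y * y_t + t_y], axis=-1)

def fuse_region_scores(region_scores):
    """Category-wise max pooling over K regions; input shape (K, C)."""
    return region_scores.max(axis=0)

def euclidean_cls_loss(scores, targets):
    """Half the squared Euclidean distance between scores and labels."""
    return 0.5 * float(np.sum((scores - targets) ** 2))

def localization_penalty(theta, anchor, alpha=0.5, beta=0.1):
    """Illustrative anchor + scale + positive penalties (assumed forms)."""
    s_x, s_y, t_x, t_y = theta
    c_x, c_y = anchor
    # Anchor: keep the region centre close to its assigned anchor point.
    anchor_term = 0.5 * ((t_x - c_x) ** 2 + (t_y - c_y) ** 2)
    # Scale: penalize regions larger than alpha.
    scale_term = max(abs(s_x) - alpha, 0.0) ** 2 + max(abs(s_y) - alpha, 0.0) ** 2
    # Positive: penalize scales below beta, which rules out mirroring.
    positive_term = max(0.0, beta - s_x) + max(0.0, beta - s_y)
    return anchor_term + scale_term + positive_term

def joint_loss(cls_loss, loc_loss, gamma=0.1):
    """L = L_cls + gamma * L_loc (gamma value is a placeholder)."""
    return cls_loss + gamma * loc_loss

# Toy example: K = 3 attended regions, C = 4 categories.
region_scores = np.array([[0.9, 0.1, 0.2, 0.0],
                          [0.2, 0.8, 0.1, 0.0],
                          [0.1, 0.3, 0.7, 0.1]])
targets = np.array([1.0, 1.0, 1.0, 0.0])

grid = affine_grid((0.5, 0.5, 0.2, -0.1), size=3)
fused = fuse_region_scores(region_scores)
l_cls = euclidean_cls_loss(fused, targets)
l_loc = localization_penalty((0.7, -0.2, 0.5, 0.0), anchor=(0.5, 0.5))
total = joint_loss(l_cls, l_loc)
print(fused, l_cls, l_loc, total)
```

In a full model the LSTM would emit a new theta at each of the K steps and the grid would drive bilinear sampling of the feature map; both are omitted here to keep the sketch short.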

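Experimental results

Research questions

RQ1: Can a proposal-free attention mechanism localize discriminative regions for multi-label classification?
RQ2: Does combining attentional regions discovered by the spatial transformer with memory-augmented region encoding improve both accuracy and efficiency over proposal-based methods?
RQ3: Do the localization constraints yield more diverse, appropriately scaled, non-mirrored attentional regions and improve performance?
RQ4: How does multi-scale/multi-view testing affect performance on VOC 2007 and MS-COCO?
RQ5: How does using attentional regions instead of object proposals affect recognition performance?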

Key findings

Methods	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	mbike	person	plant	sheep	sofa	train	tv	mAP
CNN-SVM	88.5	81.0	83.5	82.0	42.0	72.5	85.3	81.6	59.9	58.5	66.5	77.8	81.8	78.8	90.2	54.8	71.1	62.6	87.2	71.8	73.9
CNN-RNN	96.7	83.1	94.2	92.8	61.2	82.1	89.1	94.2	64.2	83.6	70.0	92.4	91.7	84.2	93.7	59.8	93.2	75.3	99.7	78.6	84.0
VeryDeep	98.9	95.0	96.8	95.4	69.7	90.4	93.5	96.0	74.2	86.6	87.8	96.0	96.3	93.1	97.2	70.0	92.1	80.3	98.1	87.0	89.7
RLSD	96.4	92.7	93.8	94.1	71.2	92.5	94.2	95.7	74.3	90.0	74.2	95.4	96.2	92.1	97.9	66.9	93.5	73.7	97.5	87.6	88.5
HCP	98.6	97.1	98.0	95.6	75.3	94.7	95.8	97.3	73.1	90.2	80.0	97.3	96.1	94.9	96.3	78.3	94.7	76.2	97.9	91.5	90.9
FeV+LV	97.9	97.0	96.6	94.6	73.6	93.9	96.5	95.5	73.7	90.3	82.8	95.4	97.7	95.9	98.6	77.6	88.7	78.0	98.3	89.0	90.6
Ours (512)	98.5	96.7	95.6	95.7	73.7	92.1	95.8	96.8	76.5	92.9	87.2	96.6	97.5	92.8	98.3	76.9	91.3	83.6	98.6	88.1	91.3
Ours (640)	97.7	97.3	96.4	95.8	74.6	91.9	96.5	96.7	75.2	89.9	87.1	96.0	96.9	93.2	98.4	81.3	93.4	81.3	98.3	88.5	91.3
Ours	98.6	97.4	96.3	96.2	75.2	92.4	96.5	97.1	76.5	92.0	87.7	96.8	97.5	93.8	98.5	81.6	93.7	82.8	98.6	89.3	91.9
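
The mAP column is the unweighted mean of the 20 per-class AP values in each row. A minimal Python check, using the values copied from the final "Ours" row (the variable and function names are illustrative and not from the paper):

```python
# Per-class AP values from the final "Ours" row of the table above.
ours_ap = [98.6, 97.4, 96.3, 96.2, 75.2, 92.4, 96.5, 97.1, 76.5, 92.0,
           87.7, 96.8, 97.5, 93.8, 98.5, 81.6, 93.7, 82.8, 98.6, 89.3]

def mean_average_precision(per_class_ap):
    """Unweighted mean of per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)

ours_map = mean_average_precision(ours_ap)
print(f"mAP = {ours_map:.1f}")
```

The result rounds to the 91.9 reported in the mAP column, which confirms that the row was extracted consistently.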

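The proposal-free approach achieves state-of-the-art mean average precision (mAP) on PASCAL VOC 2007 (single scale of 512 or 640, and multi-scale/multi-crop) and on MS-COCO.
It is more accurate and more efficient than proposal-based methods, with substantially faster inference (roughly 150–200 ms for 10-view testing on a GPU).
Attentional regions yield mAP competitive with or better than hundreds of object proposals (e.g., 5 attentional regions outperform about 500 proposals).
The localization constraints (anchor, scale, positive) substantially improve mAP on VOC 2007 and MS-COCO, with the A+S+P combination giving the best results.
Multi-scale and multi-crop fusion brings additional gains, improving results by aggregating patch features across scales.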


This review was generated by AI and checked by a human editor.