QUICK REVIEW

[Paper Review] Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing

Ke Gong, Xiaodan Liang|arXiv (Cornell University)|Mar 16, 2017

Multimodal Machine Learning Applications|References: 31|Citations: 51

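One-line summary

Introduces the Look into Person (LIP) large-scale human parsing benchmark and a self-supervised structure-sensitive learning (SSL) method that forces parsing results to be consistent with the inferred body-joint structure. SSL improves parsing accuracy on the LIP and PASCAL-Person-Part datasets.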

ABSTRACT

Human parsing has recently attracted a lot of research interests due to its huge application potentials. However existing datasets have limited number of images and annotations, and lack the variety of human appearances and the coverage of challenging cases in unconstrained environment. In this paper, we introduce a new benchmark "Look into Person (LIP)" that makes a significant advance in terms of scalability, diversity and difficulty, a contribution that we feel is crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels, which are captured from a wider range of viewpoints, occlusions and background complexity. Given these rich annotations we perform detailed analyses of the leading human parsing approaches, gaining insights into the success and failures of these methods. Furthermore, in contrast to the existing efforts on improving the feature discriminative capability, we solve human parsing by exploring a novel self-supervised structure-sensitive learning approach, which imposes human pose structures into parsing results without resorting to extra supervision (i.e., no need for specifically labeling human joints in model training). Our self-supervised learning framework can be injected into any advanced neural networks to help incorporate rich high-level knowledge regarding human joints from a global perspective and improve the parsing results. Extensive evaluations on our LIP and the public PASCAL-Person-Part dataset demonstrate the superiority of our method.

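Motivation and objectives

Build a large-scale, diverse human parsing benchmark that covers real-world appearance variation and challenging conditions.
Analyze state-of-the-art human parsing methods to identify their strengths and failure modes under diverse conditions.
Propose a self-supervised structure-sensitive learning framework that enforces human body structure and semantic consistency without additional joint annotations.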

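Proposed method

Annotate the new Look into Person (LIP) dataset: 50,462 images labeled with 19 semantic part labels plus a background label.
Analyze state-of-the-art parsing methods on LIP to understand performance gaps and structure-related failures.
Introduce a self-supervised structure-sensitive loss that weights the parsing loss using joints (head, upper body, lower body, limbs, shoes) inferred from the parsing maps.
Compute joint-structure heatmaps from both the parsing result and the ground-truth parsing, and minimize the L2 loss between the predicted and ground-truth joint heatmaps as the structure term.
Define the final loss as L_Structure = L_Joint × L_Parsing, which allows end-to-end integration into existing networks (e.g., Attention to Scale, DeepLabV2).
Evaluate SSL on LIP and the public PASCAL-Person-Part dataset, showing improvements in mean IoU and in per-class IoU, most notably for small or visually ambiguous parts.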
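
The loss described in the steps above can be sketched in plain Python. This is an illustrative simplification, not the authors' implementation: it compares joint center points directly instead of heatmaps, the label ids in JOINT_GROUPS are made up for the example (they are not LIP's actual label ids), only three coarse joints are used, and joints missing from either map are simply skipped. All function names are hypothetical.

```python
import math

# Hypothetical grouping of part labels into coarse "joints".
# The label ids are invented for this example, not LIP's real ids.
JOINT_GROUPS = {
    "head": {1, 2},
    "upper_body": {3},
    "lower_body": {4},
}


def joint_centers(label_map, groups=JOINT_GROUPS):
    """Return {joint: (row, col)} centers of mass; absent joints are omitted."""
    sums = {name: [0.0, 0.0, 0] for name in groups}
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            for name, labels in groups.items():
                if label in labels:
                    acc = sums[name]
                    acc[0] += r
                    acc[1] += c
                    acc[2] += 1
    return {n: (s[0] / s[2], s[1] / s[2]) for n, s in sums.items() if s[2] > 0}


def joint_loss(pred_map, gt_map, groups=JOINT_GROUPS):
    """Half the mean squared distance between predicted and ground-truth centers."""
    pred = joint_centers(pred_map, groups)
    gt = joint_centers(gt_map, groups)
    # Assumption of this sketch: joints missing from either map are skipped.
    shared = [n for n in gt if n in pred]
    if not shared:
        return 0.0
    total = sum(
        (pred[n][0] - gt[n][0]) ** 2 + (pred[n][1] - gt[n][1]) ** 2 for n in shared
    )
    return total / (2 * len(shared))


def parsing_loss(probs, gt_map):
    """Mean pixel-wise cross-entropy; probs[r][c] is a list of class probabilities."""
    eps = 1e-12
    total, count = 0.0, 0
    for r, row in enumerate(gt_map):
        for c, label in enumerate(row):
            total -= math.log(max(probs[r][c][label], eps))
            count += 1
    return total / count


def structure_sensitive_loss(probs, gt_map, groups=JOINT_GROUPS):
    """Structure loss = joint loss x parsing loss, following the summary above."""
    # Predicted label map: per-pixel argmax over class probabilities.
    pred_map = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return joint_loss(pred_map, gt_map, groups) * parsing_loss(probs, gt_map)
```

In this sketch the joint term acts as a scalar weight on the pixel-wise parsing loss, so a prediction whose part regions sit far from the ground-truth regions is penalized more heavily, and no joint annotations are needed because both sets of joints come from parsing maps.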

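Experimental results

Research questions

RQ1: How large and diverse must a human parsing dataset be to capture real-world appearance variation, occlusion, and viewpoints?
RQ2: Current state-of-the-art parsing models lack consistency with human body configuration and structure; can a structure-aware self-supervised signal improve their predictions without additional annotations?
RQ3: Does a joint-structure-based weighting scheme improve pixel-wise parsing accuracy, particularly for small parts and left/right ambiguity?
RQ4: Does the proposed SSL approach transfer across datasets (LIP and PASCAL-Person-Part) and across network backbones?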

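Key findings

LIP is a 50,462-image benchmark with 19 part labels plus background, offering more diversity and difficulty than earlier datasets.
State-of-the-art parsing methods show meaningful performance gaps on LIP; structural priors and multi-scale features improve results.
The proposed self-supervised structure-sensitive learning (SSL) consistently improves parsing performance on both LIP and PASCAL-Person-Part, outperforming the baselines by a clear margin.
Per-class IoU gains are most pronounced for small or highly ambiguous parts (e.g., sunglasses, gloves, socks) and for distinguishing left and right limbs.
SSL aligns parsing outputs with more plausible human body configurations, addressing the implausible results observed with structure-agnostic methods.
The SSL signal can be incorporated into existing architectures (e.g., Attention to Scale, DeepLabV2) with minimal architectural changes and no additional joint annotations.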
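
For readers unfamiliar with the metric behind these findings, per-class IoU and mean IoU can be computed roughly as follows. This is a minimal sketch under stated assumptions: it scores a single pair of label maps, whereas benchmark evaluation normally accumulates counts over the whole test set, and the function names are hypothetical rather than taken from the paper's code.

```python
def per_class_iou(pred_map, gt_map, num_classes):
    """Return IoU per class; None where a class appears in neither map."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred_map, gt_map):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Correct pixel: counts toward both intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # Wrong pixel: enlarges the union of both classes involved.
                union[p] += 1
                union[g] += 1
    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(pred_map, gt_map, num_classes):
    """Average IoU over the classes that appear in at least one map."""
    ious = [v for v in per_class_iou(pred_map, gt_map, num_classes) if v is not None]
    return sum(ious) / len(ious) if ious else 0.0
```

Because each class contributes equally to the mean regardless of its pixel count, small parts such as gloves or socks influence mean IoU as much as large ones, which is why gains on small parts are visible in this metric.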

This review was generated by AI and checked by a human editor.