QUICK REVIEW

[Paper Review] Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing

Ke Gong, Xiaodan Liang|arXiv (Cornell University)|Mar 16, 2017

Multimodal Machine Learning Applications|References: 31|Cited by: 51

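One-Sentence Summary

The paper introduces Look into Person (LIP), a large-scale human parsing benchmark, together with a self-supervised structure-sensitive learning (SSL) method that forces parsing results to stay consistent with the body joint structure inferred from them. SSL improves parsing accuracy on both the LIP and PASCAL-Person-Part datasets.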

ABSTRACT

Human parsing has recently attracted a lot of research interests due to its huge application potentials. However existing datasets have limited number of images and annotations, and lack the variety of human appearances and the coverage of challenging cases in unconstrained environment. In this paper, we introduce a new benchmark "Look into Person (LIP)" that makes a significant advance in terms of scalability, diversity and difficulty, a contribution that we feel is crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels, which are captured from a wider range of viewpoints, occlusions and background complexity. Given these rich annotations we perform detailed analyses of the leading human parsing approaches, gaining insights into the success and failures of these methods. Furthermore, in contrast to the existing efforts on improving the feature discriminative capability, we solve human parsing by exploring a novel self-supervised structure-sensitive learning approach, which imposes human pose structures into parsing results without resorting to extra supervision (i.e., no need for specifically labeling human joints in model training). Our self-supervised learning framework can be injected into any advanced neural networks to help incorporate rich high-level knowledge regarding human joints from a global perspective and improve the parsing results. Extensive evaluations on our LIP and the public PASCAL-Person-Part dataset demonstrate the superiority of our method.

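Motivation and Objectives

Create a large-scale, diverse human parsing benchmark that covers real-world appearance variation and challenging scenarios.
Analyze leading human parsing methods to identify their strengths and failure modes under diverse conditions.
Propose a self-supervised structure-sensitive learning framework that enforces consistency between semantic parts and human body structure without additional joint annotations.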

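Proposed Method

Annotate a new Look into Person (LIP) dataset of 50,462 images with 19 semantic part labels plus one background label.
Analyze state-of-the-art parsing methods on LIP to understand performance gaps and structure-related failures.
Introduce a self-supervised structure-sensitive loss that weights the parsing loss using joints inferred from the parsing maps (head, upper body, lower body, limbs, shoes).
Compute joint-structure heatmaps from both the parsing result and the ground-truth labels, then minimize the L2 loss between the predicted and ground-truth joint heatmaps as the structure term.
Define the final loss as Structure = JointLoss × ParsingLoss, which allows end-to-end integration with existing networks such as Attention to Scale and DeepLabV2.
Evaluate SSL on LIP and the public PASCAL-Person-Part dataset, reporting gains in mean IoU and per-class IoU, particularly for small or visually ambiguous parts.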
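
The loss construction described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the label ids in JOINT_REGIONS, the Gaussian width sigma, the zero heatmap used for absent joints, and the mean-squared form of the joint loss are all assumptions made for illustration. Only the overall recipe (joints as region centers, heatmaps, an L2 joint term multiplied by the parsing loss) comes from the summary.

```python
import numpy as np

# Hypothetical grouping of part-label ids into joint regions; the ids are
# illustrative and do not reproduce the actual LIP label map.
JOINT_REGIONS = {
    "head": [1, 2],
    "upper_body": [3],
    "lower_body": [4],
    "left_arm": [5],
    "right_arm": [6],
}

def joint_centers(label_map, regions=JOINT_REGIONS):
    """Return {joint: (row, col)} centroids, or None for an absent region."""
    centers = {}
    for name, ids in regions.items():
        rows, cols = np.nonzero(np.isin(label_map, ids))
        centers[name] = (rows.mean(), cols.mean()) if rows.size else None
    return centers

def joint_heatmaps(label_map, regions=JOINT_REGIONS, sigma=2.0):
    """One Gaussian heatmap per joint; absent joints stay all-zero (assumption)."""
    h, w = label_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(regions), h, w))
    for k, center in enumerate(joint_centers(label_map, regions).values()):
        if center is not None:
            cy, cx = center
            maps[k] = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return maps

def joint_loss(pred_labels, target):
    """Mean squared (L2) difference between predicted and ground-truth heatmaps."""
    diff = joint_heatmaps(pred_labels) - joint_heatmaps(target)
    return float(np.mean(diff ** 2))

def parsing_loss(probs, target):
    """Mean pixel-wise cross-entropy; probs has shape (C, H, W)."""
    h, w = target.shape
    picked = probs[target, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.log(picked + 1e-12).mean())

def structure_loss(probs, target):
    """Structure = JointLoss x ParsingLoss, with joints taken from the argmax map."""
    pred_labels = probs.argmax(axis=0)
    return joint_loss(pred_labels, target) * parsing_loss(probs, target)
```

Because the joints are derived from the parsing maps themselves, no joint annotations enter the computation. In this sketch a prediction whose joint layout matches the ground truth drives the product to zero, while a structurally implausible prediction amplifies the ordinary parsing loss. A real training setup would need a differentiable or otherwise trainable version inside a deep learning framework.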

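Experimental Results

Research Questions

RQ1: How large and diverse must a human parsing dataset be to capture real-world appearance variation, occlusion, and viewpoints?
RQ2: Do current state-of-the-art parsing models produce results that are inconsistent with the human body layout, and can a structure-aware self-supervised signal improve predictions without additional annotations?
RQ3: Does a weighting scheme based on joint structure improve pixel-level parsing accuracy, especially for small parts and ambiguous left/right decisions?
RQ4: Does the proposed SSL method transfer across datasets (LIP and PASCAL-Person-Part) and across network backbones?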

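Key Findings

LIP is a benchmark of 50,462 images with 19 part labels plus background, offering greater diversity and difficulty than earlier datasets.
State-of-the-art parsing methods show a marked performance gap on LIP; combining structural priors with multi-scale features improves results.
The proposed self-supervised structure-sensitive learning (SSL) consistently improves parsing performance over the baselines on both LIP and PASCAL-Person-Part.
Per-class IoU gains are largest for small or highly ambiguous parts (such as sunglasses, gloves, and socks) and for distinguishing left from right limbs.
SSL helps parsing outputs align with plausible human configurations, correcting the implausible results observed with structure-agnostic methods.
The SSL signal can be injected into existing architectures (such as Attention to Scale and DeepLabV2) with minimal architectural changes and no additional joint annotations.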
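
The findings above are stated in terms of mean IoU and per-class IoU. As a reminder of how these metrics are commonly computed for segmentation, here is a minimal NumPy sketch based on a confusion matrix. The function names are illustrative, and skipping classes that appear in neither prediction nor ground truth is an assumption, not necessarily the evaluation protocol used in the paper.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_iou(conf):
    """IoU per class; NaN for classes absent from both prediction and target."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return np.where(union > 0, inter / np.maximum(union, 1), np.nan)

def mean_iou(conf):
    """Average IoU over the classes that actually occur (assumption)."""
    return float(np.nanmean(per_class_iou(conf)))
```

Small parts contribute few pixels to the confusion matrix, so a handful of mislabeled pixels moves their per-class IoU substantially. That is one reason per-class numbers are worth reading alongside the mean.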

This review was generated by AI and checked by human editors.