[Paper Walkthrough] Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing
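This paper introduces a large-scale multi-human parsing dataset (MHP v2.0) with 25,403 images and 58 fine-grained categories, together with a novel deep Nested Adversarial Network (NAN) for end-to-end multi-human parsing. NAN consists of three GAN-like sub-nets, for semantic saliency prediction, instance-agnostic parsing, and instance-aware clustering, trained within a nested adversarial framework.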
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily on visually understanding humans in crowded scenes, such as group behavior analysis, person re-identification and autonomous driving. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which was recently defined as the multi-human parsing task. In this paper, we present a new large-scale database "Multi-Human Parsing (MHP)" for algorithm development and evaluation, and advance the state of the art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive future research for multi-human parsing.
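Research Motivation and Goals
- Advance holistic understanding of humans in crowded scenes, beyond what detection and instance segmentation alone can provide.
- Provide a large-scale, richly annotated multi-human parsing benchmark covering fine-grained semantic categories.
- Develop a unified end-to-end model in a nested adversarial setting so that parsing and instance discrimination are learned jointly.
- Achieve efficient, single-forward-pass multi-human parsing suitable for real-world applications.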
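Proposed Method
- Introduce MHP v2.0, a large-scale dataset of 25,403 images and 58 semantic categories covering body parts, clothing, and accessories.
- Introduce NAN, a three-branch GAN-like framework for semantic saliency prediction, instance-agnostic parsing, and instance-aware clustering.
- Train each sub-net under both an adversarial loss and a task-specific loss, forming a nested, mutually reinforcing structure that allows end-to-end backpropagation.
- Use semantic saliency as a prior to aid parsing, combine it with instance-agnostic parsing, and finally perform instance-aware clustering without region proposals.
- Provide training details, including network initialization, loss terms, and the end-to-end optimization objective.
- Evaluate on MHP v2.0 and other datasets, where NAN outperforms existing methods.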
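To make the idea of pairing a task-specific loss with an adversarial loss in each sub-net more concrete, here is a minimal sketch in plain Python. It is not the paper's actual objective: this summary does not give the loss formulas or weights, so the binary cross-entropy adversarial term, the `adv_weight` coefficients, the simple summation across sub-nets, and all numeric values below are assumptions chosen for illustration only.

```python
import math


def binary_cross_entropy(prob, target):
    """BCE for a single probability/label pair, clamped to avoid log(0)."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_loss(task_loss, disc_score, adv_weight):
    """Task-specific loss plus a weighted adversarial term (assumed form).

    disc_score is the discriminator's probability that the sub-net's
    prediction is "real"; the generator is rewarded when it approaches 1.
    """
    return task_loss + adv_weight * binary_cross_entropy(disc_score, 1.0)


def nested_objective(subnets):
    """Sum the per-sub-net losses so one backward pass could cover all three.

    Summation is an assumption; the paper may combine the terms differently.
    """
    return sum(
        generator_loss(s["task_loss"], s["disc_score"], s["adv_weight"])
        for s in subnets
    )


# Hypothetical per-sub-net values, in the order the summary lists them.
subnets = [
    {"name": "semantic_saliency", "task_loss": 0.30, "disc_score": 0.5, "adv_weight": 0.01},
    {"name": "instance_agnostic_parsing", "task_loss": 0.50, "disc_score": 0.5, "adv_weight": 0.01},
    {"name": "instance_aware_clustering", "task_loss": 0.20, "disc_score": 0.5, "adv_weight": 0.01},
]

total = nested_objective(subnets)
print(f"combined objective: {total:.4f}")
```

In a real implementation the three terms would be tensors produced by the sub-nets and their discriminators, and the discriminators would be updated with their own losses in alternation; the sketch only shows how the generator-side terms might be combined.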
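Experimental Results
Research Questions
- RQ1: Does the nested adversarial learning framework improve holistic human parsing in crowded scenes?
- RQ2: Does a large-scale, fine-grained dataset (MHP v2.0) better support learning instance-level parsing of body parts and fashion items under occlusion and interaction?
- RQ3: Can the end-to-end NAN deliver accurate parsing and instance discrimination in a single forward pass, without heavy pre-processing or post-processing?
- RQ4: How does incorporating the semantic saliency prior and instance-agnostic parsing affect instance-aware clustering?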
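Key Findings
- NAN outperforms recent state-of-the-art methods for multi-human parsing on MHP v2.0 and other benchmarks.
- The model performs multi-human parsing in a single pass at competitive speed, avoiding costly region proposals.
- NAN's nested adversarial structure enables end-to-end training, and the model performs well under joint optimization of multiple losses.
- The MHP v2.0 dataset offers rich annotations (58 categories) and real-world diversity in viewpoint, occlusion, and interaction.
- Experiments cover MHP v2.0, MHP v1.0, PASCAL-Person-Part, and Buffy to assess how well NAN generalizes.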
This walkthrough was generated by AI and reviewed by a human editor.