QUICK REVIEW

[Paper Review] Learning Human-Object Interaction Detection using Interaction Points

Tiancai Wang, Tong Yang|arXiv (Cornell University)|Mar 31, 2020

Multimodal Machine Learning Applications|References: 53|Cited by: 35

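One-Sentence Summary

This paper proposes a fully convolutional, anchor-free HOI detector that frames interaction modeling as keypoint detection of interaction points and predicts HOI triplets by grouping those points with human and object detections.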

ABSTRACT

Understanding interactions between humans and objects is one of the fundamental problems in visual classification and an essential step towards detailed scene understanding. Human-object interaction (HOI) detection strives to localize both the human and an object as well as the identification of complex interactions between them. Most existing HOI detection approaches are instance-centric where interactions between all possible human-object pairs are predicted based on appearance features and coarse spatial information. We argue that appearance features alone are insufficient to capture complex human-object interactions. In this paper, we therefore propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs. Our network predicts interaction points, which directly localize and classify the interaction. Paired with the densely predicted interaction vectors, the interactions are associated with human and object detections to obtain final predictions. To the best of our knowledge, we are the first to propose an approach where HOI detection is posed as a keypoint detection and grouping problem. Experiments are performed on two popular benchmarks: V-COCO and HICO-DET. Our approach sets a new state-of-the-art on both datasets. Code is available at https://github.com/vaesl/IP-Net.

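Motivation and Objectives

Motivation: move beyond appearance-based, instance-centric HOI detection architectures, which scale poorly when there are many human-object pairs.
Introduce a new HOI representation based on interaction points and interaction vectors that directly localizes and classifies interactions.
Develop a fully convolutional network that detects interaction points and vectors and groups them with detected humans and objects to form HOI triplets.
Demonstrate state-of-the-art performance on two benchmarks (V-COCO and HICO-DET) and validate each component through ablation studies.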

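Proposed Method

Treats HOI detection as a keypoint detection and grouping problem, inspired by anchor-free object detection.
Extracts features with an Hourglass backbone and feeds them to two parallel branches: an interaction point heatmap and an unsigned interaction vector map.
Trains the interaction point heatmap with Gaussian supervision, using a focal-style loss to balance positive and negative samples.
Trains the interaction vectors to predict the absolute horizontal and vertical lengths toward the human/object centers (unsigned vectors).
At inference, extracts the top-k interaction points, recovers their interaction vectors, and forms interaction boxes.
Groups interaction points with detected human/object boxes through a soft-constraint mechanism that checks IoU with the human/object boxes and the corner distances to a reference box.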
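
The encode, decode, and grouping steps listed above can be illustrated with a small sketch. It is not the authors' implementation (that lives in the linked IP-Net repository), and it rests on several assumptions: the interaction point is taken to be the midpoint of the human and object centers, the reference box is taken to be the box spanned by those two centers, only the corner-distance check is shown (the IoU check is omitted), and the names `encode_target`, `interaction_box`, `reference_box`, `group`, and the `max_dist` threshold are hypothetical.

```python
from itertools import product
from math import hypot


def center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def encode_target(human_box, object_box):
    """Training target for one pair (assumption: point = midpoint of centers)."""
    hx, hy = center(human_box)
    ox, oy = center(object_box)
    point = ((hx + ox) / 2.0, (hy + oy) / 2.0)
    # Unsigned vector: absolute horizontal/vertical offsets to the human center.
    vector = (abs(hx - point[0]), abs(hy - point[1]))
    return point, vector


def interaction_box(point, vector):
    """The four candidates (px +/- vx, py +/- vy) are the corners of this box."""
    (px, py), (vx, vy) = point, vector
    return (px - vx, py - vy, px + vx, py + vy)


def reference_box(human_box, object_box):
    """Assumed reference box: the box spanned by the two detected centers."""
    hx, hy = center(human_box)
    ox, oy = center(object_box)
    return (min(hx, ox), min(hy, oy), max(hx, ox), max(hy, oy))


def corner_distances(a, b):
    """Distances between matching top-left, top-right, bottom-left corners."""
    return (
        hypot(a[0] - b[0], a[1] - b[1]),
        hypot(a[2] - b[2], a[1] - b[1]),
        hypot(a[0] - b[0], a[3] - b[3]),
    )


def group(point, vector, humans, objects, max_dist=10.0):
    """Return (human_idx, object_idx) pairs whose reference box lies close
    to the interaction box. max_dist is an illustrative pixel threshold."""
    ibox = interaction_box(point, vector)
    matches = []
    for (hi, h), (oi, o) in product(enumerate(humans), enumerate(objects)):
        if max(corner_distances(ibox, reference_box(h, o))) < max_dist:
            matches.append((hi, oi))
    return matches


if __name__ == "__main__":
    humans = [(0, 0, 20, 40), (200, 200, 220, 240)]
    objects = [(40, 20, 60, 40)]
    point, vector = encode_target(humans[0], objects[0])
    print(group(point, vector, humans, objects))
```

Because the vector is unsigned, the point and vector alone cannot say which corner of the interaction box holds the human and which holds the object. That ambiguity is why the decoded box has to be compared against boxes built from actual detections. The sketch loops over every human-object pair for clarity and makes no attempt to reproduce the efficiency the paper reports.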

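Experimental Results

Research Questions

RQ1: Can HOI detection be effectively formulated as a keypoint detection and grouping problem rather than a multimodal, instance-centric pipeline?
RQ2: Compared with conventional multi-stream methods, do interaction points and vectors improve HOI localization and classification?
RQ3: What effect do the proposed interaction grouping and the auxiliary components (angle-filter, dist-ratio-filter, center-pool) have on HOI detection performance?
RQ4: Are the proposed method and its components scalable and effective on standard HOI benchmarks (V-COCO, HICO-DET)?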

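Key Findings

The proposed IP-Net achieves state-of-the-art mAP_role on V-COCO (51.0 without HICO-DET pre-training and 52.3 with it), and likewise on HICO-DET under both the Default and Known Object settings.
Ablation studies show that combining the interaction grouping scheme with the interaction box and corner-distance constraints substantially improves performance (e.g., from 46.2 to 50.5 and then to 51.0 mAP_role on V-COCO).
Center-pool and the two-branch interaction generation (point heatmap and unsigned vectors) bring sizable gains over the baseline, for an overall absolute improvement of 11.4 percentage points.
Although grouping appears quadratic in theory, heatmap filtering and the soft constraints keep it efficient in practice, with approximately linear complexity (<5 ms).
Compared with a fixed threshold, dynamic thresholding of interaction scores improves performance on both rare and non-rare classes on HICO-DET.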

This review was generated by AI and checked by a human editor.