QUICK REVIEW

[Paper Review] Detecting and Recognizing Human-Object Interactions

Georgia Gkioxari, Ross Girshick|arXiv (Cornell University)|Apr 24, 2017

Multimodal Machine Learning Applications | References: 28 | Cited by: 60

One-Sentence Summary

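This paper proposes InteractNet, a model built on Faster R-CNN with a human-centric branch that predicts the location of an action's target object, in order to detect and recognize <human, verb, object> triplets in images. It achieves state-of-the-art role AP on V-COCO and strong results on HICO-DET, and it is an efficient system that is trained jointly, end to end.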

ABSTRACT

To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person -- their pose, clothing, action -- is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

Motivation and Goals

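Model human-object interactions in real-world images as <human, verb, object> triplets.
Use the person's appearance (pose, action) to predict where the target object is likely to be, which shrinks the search space.
Jointly train an end-to-end system that fuses human-centric action cues with standard object detection and pairwise interaction reasoning.
Demonstrate effectiveness on the V-COCO and HICO-DET datasets at a practical inference speed.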

Proposed Method

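Extend Faster R-CNN with a human-centric branch that classifies actions and, for each action, predicts a density over target object locations.
Model the target object location as a 4-D Gaussian whose mean μ_h^a is conditioned on the person's appearance and the action; g_{h,o}^a is the likelihood term that measures how well an object box b_o agrees with μ_h^a.
Compute the triplet score S_{h,o}^a = s_h · s_o · s_h^a · g_{h,o}^a, and use cascaded inference to keep the runtime complexity at O(n).
Optionally replace s_h^a with an interaction branch s_{h,o}^a, which scores the action from the combined appearance of the person and the object.
Train all branches jointly with a multi-task objective that includes object detection, action classification, and target localization losses.
At inference, for each detected person and action, select the object that maximizes s_o · s_{h,o}^a · g_{h,o}^a and form the <human, verb, object> triplet.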
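The scoring and selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be (center x, center y, width, height), the object box is assumed to be encoded relative to the human box with Fast R-CNN-style offsets, sigma=0.3 is an illustrative value, and the function names (encode_target, compatibility, triplet_score, best_object) are hypothetical. It uses the human-centric action score s_h^a rather than the optional interaction branch.

```python
import math

def encode_target(b_h, b_o):
    """Encode object box b_o relative to human box b_h (assumed offset encoding)."""
    xh, yh, wh, hh = b_h  # assumed format: center x, center y, width, height
    xo, yo, wo, ho = b_o
    return ((xo - xh) / wh, (yo - yh) / hh, math.log(wo / wh), math.log(ho / hh))

def compatibility(b_h, b_o, mu, sigma=0.3):
    """Gaussian term g_{h,o}^a: close to 1 when b_o matches the predicted mean mu."""
    target = encode_target(b_h, b_o)
    sq_dist = sum((t - m) ** 2 for t, m in zip(target, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def triplet_score(s_h, s_o, s_a, g):
    """S_{h,o}^a = s_h * s_o * s_h^a * g_{h,o}^a."""
    return s_h * s_o * s_a * g

def best_object(b_h, s_h, s_a, mu, objects, sigma=0.3):
    """Pick the best object for one (human, action) pair.

    objects is a list of (box, s_o) tuples. Returns (index, score), or
    (None, 0.0) when there are no candidate objects.
    """
    best_idx, best_score = None, 0.0
    for i, (b_o, s_o) in enumerate(objects):
        g = compatibility(b_h, b_o, mu, sigma)
        score = triplet_score(s_h, s_o, s_a, g)
        if best_idx is None or score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

# Usage: one person, one action, two candidate objects (made-up numbers).
human_box = (100.0, 100.0, 50.0, 100.0)
predicted_mu = (1.0, 0.0, 0.0, -0.5)  # hypothetical output of the human-centric branch
candidates = [((300.0, 300.0, 50.0, 50.0), 0.9),   # confident detection, wrong place
              ((150.0, 100.0, 50.0, 60.0), 0.6)]   # weaker detection, near the target
idx, score = best_object(human_box, 0.9, 0.8, predicted_mu, candidates)
```

In this sketch the per-box quantities (s_h, s_o, s_h^a, μ_h^a) would each be computed once per detected box, and only the inexpensive compatibility term is evaluated per human-object pair, which is one plausible reading of how the model avoids scoring every pair with a network pass.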

Experimental Results

Research Questions

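RQ1: Do human-centric cues improve localization of the target object of a human action, and thereby the accuracy of triplet detection?
RQ2: Does jointly training object detection, action classification, and target localization in one end-to-end framework improve interaction recognition?
RQ3: How does the form of the target localization density (unimodal vs. multimodal) affect detection accuracy across actions?
RQ4: What is the effect of the optional interaction branch, which conditions action scores on both human and object appearance?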

Key Findings

Model	AP_agent (19 actions)	AP_agent (all actions)	AP_role (19 actions)	AP_role (all actions)
baseline [13] (Res50-FPN reimplementation)	62.1	?	31.0	?
InteractNet w/o target localization	65.1	?	31.9	?
InteractNet w/o interaction branch	65.5	?	36.8	?
InteractNet (full)	68.0	?	37.5	?

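On the V-COCO test set, InteractNet achieves an AP_role of 40.0 over all actions, an absolute gain of 8.2 points (26% relative) over the strong baseline (31.8).
On HICO-DET, InteractNet achieves a relative improvement of about 27% over the previous method.
Ablations show that target localization is a key contributor: removing it drops AP_role from 37.5 to 31.9.
The method runs at about 135 ms per image on a single Nvidia M40 GPU, which makes it practical.
An FPN backbone clearly outperforms a plain ResNet-50, especially on small objects.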
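
The absolute and relative gains quoted in the findings above follow from simple arithmetic on the reported AP_role values. The short sketch below reproduces that arithmetic; the helper name gains is hypothetical, and the inputs are only the numbers stated in this review.

```python
def gains(baseline, ours):
    """Return (absolute gain in AP points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

# V-COCO AP_role: strong baseline vs. InteractNet, as quoted above.
abs_gain, rel_gain = gains(31.8, 40.0)   # 8.2 points, about 25.8%, which rounds to 26%

# Ablation: without vs. with target localization, as quoted above.
abl_gain, abl_rel = gains(31.9, 37.5)    # 5.6 points, about 17.6%
```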

This review was generated by AI and reviewed by human editors.