QUICK REVIEW

[Paper Review] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

Hangjie Yuan, Jianwen Jiang | arXiv (Cornell University) | Sep 5, 2022

Multimodal Machine Learning Applications | Cited by 29

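One-Sentence Summary

RLIP-ParSe introduces relational language-image pre-training, combining an architecture of parallel entity detection and sequential relation inference with data and label handling strategies. It improves HOI detection in zero-shot, few-shot, and fine-tuning settings and is more robust to noisy labels.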

ABSTRACT

The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a synthetic data generation framework, Label Sequence Extension, that expands the scale of language data available within each minibatch; (3) mechanisms to account for ambiguity, Relation Quality Labels and Relation Pseudo-Labels, to mitigate the influence of ambiguous/noisy samples in the pre-training data. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. Code will be available at https://github.com/JacobYuan7/RLIP.

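Motivation and Objectives

Align pre-training with downstream HOI detection by supervising it with relational language.
Propose the ParSe architecture, which decouples subject, object, and relation representations to improve cross-modal learning.
Introduce Label Sequence Extension (LSE) to enlarge the language supervision available in each minibatch.
Use Relation Quality Labels (RQL) and Relation Pseudo-Labels (RPL) to mitigate label noise and semantic ambiguity.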

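Proposed Method

ParSe: a DETR-like architecture with parallel subject/object detection and sequential relation inference, which yields decoupled entity and relation representations.
RLIP: learn cross-modal correspondence between image features and free-form text descriptions of both entities and relations.
Label Sequence Extension (LSE): synthesize negative samples by extending the label sequence with descriptions drawn from inside and outside the minibatch.
Relation Quality Labels (RQL): mitigate label noise by scaling the confidence assigned to relation text according to the quality of the subject/object detections.
Relation Pseudo-Labels (RPL): mitigate semantic ambiguity by propagating relation labels to similar labels, as measured by text-embedding similarity.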
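
The three label-handling ideas above (LSE, RQL, RPL) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, the toy 2-D embeddings, the similarity threshold, and in particular the use of a simple product of subject and object quality as the relation confidence are all assumptions made for illustration only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extend_label_sequence(batch_labels, vocabulary, target_len, rng):
    """LSE-style sketch: keep the in-batch labels and pad the sequence
    with labels sampled from outside the batch, which act as negatives."""
    in_batch = list(dict.fromkeys(batch_labels))  # de-duplicate, keep order
    pool = [w for w in vocabulary if w not in in_batch]
    n_extra = max(0, min(target_len - len(in_batch), len(pool)))
    return in_batch + rng.sample(pool, n_extra)


def relation_quality_label(subject_quality, object_quality):
    """RQL-style sketch: scale the relation target by detection quality.
    ASSUMPTION: quality scores lie in [0, 1] and are combined by product."""
    return subject_quality * object_quality


def relation_pseudo_labels(annotated, label_sequence, embeddings, threshold):
    """RPL-style sketch: propose extra labels whose text embedding is close
    to an annotated relation (hypothetical threshold on cosine similarity)."""
    pseudo = []
    for candidate in label_sequence:
        if candidate in annotated:
            continue
        if any(cosine(embeddings[candidate], embeddings[a]) >= threshold
               for a in annotated):
            pseudo.append(candidate)
    return pseudo


# Toy usage with made-up labels and embeddings.
rng = random.Random(0)
vocab = ["ride", "sit on", "hold", "carry", "eat"]
sequence = extend_label_sequence(["ride", "ride"], vocab, 4, rng)

toy_embeddings = {"hold": [1.0, 0.0], "carry": [0.9, 0.1], "eat": [0.0, 1.0]}
pseudo = relation_pseudo_labels({"hold"}, ["hold", "carry", "eat"],
                                toy_embeddings, threshold=0.9)
soft_target = relation_quality_label(0.8, 0.5)
```

In this toy setup, "carry" is picked up as a pseudo-label for "hold" because their embeddings nearly coincide, while "eat" is not. A poorly localized subject or object lowers the relation target instead of leaving it at 1.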

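Experimental Results

Research Questions

RQ1: Can relational language-image pre-training outperform conventional object-centric pre-training for HOI detection?
RQ2: Does decoupling subject, object, and relation representations (ParSe) better support cross-modal alignment for HOI tasks?
RQ3: How do synthesized negative samples (LSE) and label-noise handling (RQL, RPL) affect zero-shot, few-shot, and fine-tuned HOI performance?
RQ4: Is RLIP-ParSe robust to noisy relation annotations and semantic synonyms in free-form text?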

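Key Findings

RLIP pre-training on VG (Visual Genome) outperforms plain object-detection pre-training for HOI detection.
Under some protocols, RLIP-ParSe achieves strong zero-shot HOI detection performance, surpassing several fine-tuned methods.
Compared with conventional pre-training, RLIP improves few-shot transfer, especially when data is scarce.
RQL and RPL increase robustness to relation-label noise and synonyms, making training more stable under noisy supervision.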


This review was generated by AI and reviewed by human editors.