QUICK REVIEW

[Paper Review] Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

Lin Li, Jun Xiao|arXiv (Cornell University)|May 21, 2023

Multimodal Machine Learning Applications | Cited by 9

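One-Sentence Summary

This paper proposes RECODE, a zero-shot VRD method that augments CLIP with LLM-generated descriptive visual cues for the subject, object, and spatial components of each relation, and uses chain-of-thought-based cue weighting to better discriminate between similar relations.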

ABSTRACT

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.

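Research Motivation and Goals

Highlight the weaknesses of class-based prompts in zero-shot VRD and motivate description-based (cue-based) prompts
Propose RECODE, which decomposes each relation category into subject, object, and spatial cues generated by LLMs
Show that a chain-of-thought prompting scheme can produce reasonable weights for combining the cues
Demonstrate improved zero-shot VRD performance on four benchmark datasets (VG, GQA, HICO-DET, V-COCO)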

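Proposed Method

Decompose each relation into subject, object, and spatial components
Use an LLM to generate description-based visual cues for each component
Represent spatial relations with a finite set of simulated spatial images to keep computation tractable
Compute the similarity between visual embeddings (CLIP) and the embeddings of the semantic cues (LLM-generated prompts)
Fuse the cues using LLM-generated weights for the subject, object, and spatial components; the weights are produced by a chain-of-thought prompting strategy
Optionally apply a filtering module to remove unreasonable predictions (guidance/filtering)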
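
The scoring and fusion steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the CLIP visual embeddings and cue embeddings are already available as plain vectors, treats all three components uniformly, averages the cues within a component, and models the optional filter as a simple allow-list. The names cosine, component_score, relation_score, predict, and the toy data are all hypothetical.

```python
import math

COMPONENTS = ("subject", "object", "spatial")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def component_score(visual_emb, cue_embs):
    """Mean similarity between one visual embedding and a component's cues.

    Plain averaging over cues is an assumption of this sketch.
    """
    return sum(cosine(visual_emb, cue) for cue in cue_embs) / len(cue_embs)


def relation_score(visual, cues, weights):
    """Weighted sum of the subject, object, and spatial component scores."""
    return sum(weights[c] * component_score(visual[c], cues[c]) for c in COMPONENTS)


def predict(visual, relations, allowed=None):
    """Score every candidate relation; `allowed` stands in for the optional filter."""
    scores = {
        name: relation_score(visual, spec["cues"], spec["weights"])
        for name, spec in relations.items()
        if allowed is None or name in allowed
    }
    return max(scores, key=scores.get), scores


# Toy 2-D vectors standing in for real embeddings (illustrative values only).
visual = {"subject": [1.0, 0.0], "object": [0.0, 1.0], "spatial": [1.0, 1.0]}
weights = {"subject": 0.4, "object": 0.3, "spatial": 0.3}
relations = {
    "riding": {
        "cues": {"subject": [[1.0, 0.0]], "object": [[0.0, 1.0]], "spatial": [[1.0, 1.0]]},
        "weights": weights,
    },
    "holding": {
        "cues": {"subject": [[0.0, 1.0]], "object": [[1.0, 0.0]], "spatial": [[1.0, -1.0]]},
        "weights": weights,
    },
}

best, scores = predict(visual, relations)
print(best, scores)
```

In this toy setup the weights are fixed by hand; in the method described above they would come from the chain-of-thought prompt, and could differ per relation.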

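Experimental Results

Research Questions

RQ1: Can zero-shot VRD be improved by moving from class-based prompts to composite descriptive cues?
RQ2: Do LLM-generated subject, object, and spatial cues improve discrimination between fine-grained relations?
RQ3: Can chain-of-thought prompting provide reasonable weights for combining multiple visual cues in VRD?
RQ4: How does RECODE perform on standard VRD benchmarks compared with baseline prompts?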

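Key Findings

RECODE significantly outperforms the class-based CLIP baseline on VG and GQA, with gains in both R@K and mR@K across settings.
Guided cue generation that includes high-level object category information (animal/human/product) improves cue quality and relation discriminability.
Adding spatial cues and LLM-derived weights yields further gains over the cue-only setting.
The full RECODE⋆ (with filtering) achieves the best results, showing robust improvements across datasets and metrics.
On the HOI datasets (HICO-DET and V-COCO), RECODE shows modest but consistent improvements over the baseline.
Ablation and architecture studies indicate that the method is robust across different CLIP backbones and that each component (cues, spatial, weights, filtering) contributes to performance.
Qualitative analysis (attention maps) suggests that description-based prompts direct CLIP toward more relevant image regions.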
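
For readers less familiar with the metrics named in the findings, R@K is the fraction of ground-truth relation triplets recovered among the top-K predictions, and mR@K averages recall over predicate categories so that frequent predicates do not dominate. The sketch below is a simplified illustration, not the paper's evaluation code: it assumes predictions are already sorted by descending score, matches triplets by labels only (no box overlap check), and pools counts across images for mR@K. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def recall_at_k(ground_truth, ranked_predictions, k):
    """Average over images of the share of ground-truth triplets in the top-k."""
    per_image = []
    for gt, preds in zip(ground_truth, ranked_predictions):
        if not gt:
            continue  # images without annotations are skipped
        top_k = set(preds[:k])
        per_image.append(len(gt & top_k) / len(gt))
    return sum(per_image) / len(per_image) if per_image else 0.0


def mean_recall_at_k(ground_truth, ranked_predictions, k):
    """Recall per predicate (pooled over images), then averaged over predicates."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gt, preds in zip(ground_truth, ranked_predictions):
        top_k = set(preds[:k])
        for triplet in gt:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy annotations: one set of (subject, predicate, object) triplets per image.
ground_truth = [
    {("person", "riding", "horse"), ("person", "wearing", "hat")},
    {("dog", "on", "sofa")},
]
# Toy predictions per image, sorted by descending score.
ranked_predictions = [
    [("person", "riding", "horse"), ("person", "on", "horse"), ("person", "wearing", "hat")],
    [("dog", "near", "sofa"), ("dog", "on", "sofa")],
]

for k in (1, 2):
    print(k, recall_at_k(ground_truth, ranked_predictions, k),
          mean_recall_at_k(ground_truth, ranked_predictions, k))
```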


This review was generated by AI and reviewed by a human editor.