QUICK REVIEW

[Paper Review] BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation

Naina Dhingra, Florian Ritter|arXiv (Cornell University)|Jan 1, 2021

Multimodal Machine Learning Applications | References: 1 | Cited by: 2

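One-Sentence Summary

BGT-Net is a bidirectional GRU transformer network for scene graph generation. A bidirectional GRU (BiGRU) lets objects exchange information in both directions to enrich their representations, and two transformer encoders then predict object classes and edge context. Combining frequency softening with bias adaptation mitigates the bias caused by the long-tailed relationship distribution, and the authors report state-of-the-art results on the Visual Genome, Open Images, and VRD datasets.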

ABSTRACT

Scene graphs are nodes and edges consisting of objects and object-object relationships, respectively. Scene graph generation (SGG) aims to identify the objects and their relationships. We propose a bidirectional GRU (BiGRU) transformer network (BGT-Net) for the scene graph generation for images. This model implements novel object-object communication to enhance the object information using a BiGRU layer. Thus, the information of all objects in the image is available for the other objects, which can be leveraged later in the object prediction step. This object information is used in a transformer encoder to predict the object class as well as to create object-specific edge information via the use of another transformer encoder. To handle the dataset bias induced by the long-tailed relationship distribution, softening with a log-softmax function and adding a bias adaptation term to regulate the bias for every relation prediction individually showed to be an effective approach. We conducted an elaborate study on experiments and ablations using open-source datasets, i.e., Visual Genome, Open-Images, and Visual Relationship Detection datasets, demonstrating the effectiveness of the proposed model over state of the art.

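Research Motivation and Goals

Address the challenge posed by the long-tailed relationship distribution in scene graph generation datasets.
Strengthen object representations by letting information flow in both directions among all objects detected in an image.
Improve relationship prediction accuracy by modeling object-specific edge context with a transformer encoder.
Reduce the negative effect of dataset bias on rare-relationship prediction without hurting performance on frequent relationships.
Reach state-of-the-art performance on both scene graph detection and classification tasks across several benchmark datasets.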

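Proposed Method

Use a bidirectional GRU (BiGRU) layer for object-to-object communication, so that each object aggregates context from all other objects in the image.
Feed the aggregated object information to a transformer encoder with scaled dot-product attention to predict object classes.
Use a second transformer encoder to extract object-specific edge context features for relationship prediction.
Apply a log-softmax function to the subject-object relationship distribution to soften it.
Introduce a bias adaptation (BA) mechanism that adjusts the bias for each subject-object pair individually, based on scene-specific input.
Combine frequency softening with bias adaptation to handle the long-tailed relationship distribution in datasets such as Visual Genome.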
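
The last three steps can be illustrated with a small sketch. It is not the authors' implementation: the summary does not give the exact formulas, so the code assumes that softening means applying log-softmax to the normalized relation frequencies of one subject-object class pair, and that bias adaptation is a scalar gate in (0, 1) computed from an edge feature and multiplied into that bias before it is added to the visual logits. The names `softened_frequency_bias`, `bias_gate`, and `relation_distribution`, the gate weights, and all numbers are hypothetical.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]


def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]


def softened_frequency_bias(counts):
    """Assumed reading of 'frequency softening': log-softmax applied to the
    normalized relation frequencies of one (subject, object) class pair.
    The inputs lie in [0, 1], so the resulting bias terms differ by at most 1,
    which is much flatter than raw log-frequencies."""
    total = sum(counts)
    freqs = [c / total for c in counts]
    return log_softmax(freqs)


def bias_gate(edge_feature, gate_weights):
    """Hypothetical bias adaptation: a sigmoid gate computed from a
    scene-specific edge feature for this subject-object pair."""
    score = sum(f * w for f, w in zip(edge_feature, gate_weights))
    return 1.0 / (1.0 + math.exp(-score))


def relation_distribution(visual_logits, counts, gate):
    """Add the gated, softened frequency bias to the visual logits."""
    bias = softened_frequency_bias(counts)
    return softmax([v + gate * b for v, b in zip(visual_logits, bias)])


# Made-up counts for three relations of one class pair (frequent to rare).
counts = [900, 90, 10]
# Made-up visual evidence that favors the rare third relation.
visual_logits = [0.0, 0.0, 1.5]
gate = bias_gate(edge_feature=[0.2, -0.4], gate_weights=[1.0, 0.5])
probs = relation_distribution(visual_logits, counts, gate)
print([round(p, 3) for p in probs])
```

Under these assumptions, a gate near 0 leaves the prediction to the visual evidence, while a gate near 1 applies the full (already softened) dataset prior, so the influence of relation frequency can differ from one subject-object pair to the next.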

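Experimental Results

Research Questions

RQ1: Does bidirectional object communication through a BiGRU improve object representation learning in scene graph generation?
RQ2: Does using two dedicated transformer encoders, one for object class prediction and one for edge context modeling, improve relationship prediction?
RQ3: Do frequency softening and bias adaptation mitigate the performance drop on rare relationships without hurting the prediction of frequent ones?
RQ4: How does BGT-Net compare with state-of-the-art models such as MOTIFS on standard SGG benchmarks?
RQ5: How well does the model generalize across diverse datasets such as Visual Genome, Open Images, and Visual Relationship Detection?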

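Key Findings

BGT-Net achieves state-of-the-art performance on Visual Genome, outperforming previous SOTA models under both the scene graph detection and the scene graph classification protocols.
Frequency softening and bias adaptation give the model a marked improvement in recall on rare relationships.
Qualitative results show that BGT-Net produces scene graphs that are more semantically accurate and more consistent with the image than those of MOTIFS, with a higher share of correct or plausible predictions (marked in orange in the figures) and fewer wrong ones.
Ablation experiments confirm that both the BiGRU-based object communication and the dual transformer encoder structure contribute substantially to the performance gains.
Bias adaptation reduces overconfidence on frequent relationships while improving predictions for infrequent ones, most notably under the SGCls protocol.
The model remains accurate at the object prediction stage: object prediction errors are rare, which suggests that this stage is robust.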

This review was generated by AI and reviewed by a human editor.