QUICK REVIEW

[Paper Digest] Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling

Shengqiong Wu, Hao Fei|arXiv (Cornell University)|May 19, 2023

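Topic Modeling · Cited by 21

One-Sentence Summary

Introduces a multimodal relation extraction framework that denoises internal features with the graph information bottleneck and enriches context with latent multimodal topics, reaching state-of-the-art results on the benchmark MRE dataset.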

ABSTRACT

Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation. To combat that, we propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting. First, we represent the fine-grained semantic structures of the input image and text with the visual and textual scene graphs, which are further fused into a unified cross-modal graph (CMG). Based on CMG, we perform structure refinement with the guidance of the graph information bottleneck principle, actively denoising the less-informative features. Next, we perform topic modeling over the input image and text, incorporating latent multimodal topic features to enrich the contexts. On the benchmark MRE dataset, our system outperforms the current best model significantly. With further in-depth analyses, we reveal the great potential of our method for the MRE task. Our codes are open at https://github.com/ChocoWu/MRE-ISE.

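Research Motivation and Goals

Address internal-information over-utilization in multimodal relation extraction by screening the text and image inputs at a fine-grained feature level.
Address external-information under-exploitation by enriching the context with latent multimodal topic modeling.
Propose a cross-modal graph backbone that fuses textual and visual scene graphs and refines the result with the graph information bottleneck.
Introduce a latent multimodal topic (Lamo) module that integrates visual and textual topic keywords into the CMG.
Show that combining internal screening with external exploitation yields significant gains on the MRE dataset, and analyze when each component helps most.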
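
The graph information bottleneck named above can be read, in its generic form, as a trade-off between keeping information about the relation label and discarding information about the input graph. The sketch below uses assumed notation: G is the cross-modal graph, z the refined representation, Y the relation label, and β a trade-off weight. The paper's own formulation may differ in detail.

```latex
\[
\min_{z} \; -I(z; Y) \;+\; \beta \, I(z; G)
\]
```

The first term rewards representations that stay predictive of the relation; the second penalizes whatever z still carries over from G, which is what drives the pruning of less informative nodes and edges.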

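Proposed Method

Represent the input image and text with a visual scene graph (VSG) and a textual scene graph (TSG).
Fuse the VSG and TSG into a cross-modal graph (CMG) in which nodes of the two modalities are linked.
Apply graph information bottleneck (GIB)-guided feature refinement (Gene) to prune task-irrelevant nodes and edges.
Develop latent multimodal topic (Lamo) modeling to extract the most relevant textual and visual topics and integrate them into the CMG.
Integrate cross-modal topics by attending over the textual and visual topic keywords and concatenating the result with the refined CMG features.
Train with a warm start: first optimize Gene with the GIB loss, then pre-train Lamo with the Lamo loss, and finally train all components jointly end to end with the cross-entropy loss.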
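
The topic-integration step in the list above (attention over topic keywords, then concatenation with the refined graph feature) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of scaled dot-product attention, and the toy two-dimensional vectors are all assumptions, and `o_text` / `o_image` merely echo the o^T and o^I labels in the ablation table.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keywords):
    """Scaled dot-product attention of one query vector over keyword vectors."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, kw)) / math.sqrt(dim) for kw in keywords]
    weights = softmax(scores)
    # Weighted sum of the keyword vectors, one output value per dimension.
    return [sum(w * kw[i] for w, kw in zip(weights, keywords)) for i in range(dim)]


def integrate_topics(graph_feature, text_keywords, visual_keywords):
    """Concatenate the refined graph feature with attended topic features (hypothetical helper)."""
    o_text = attend(graph_feature, text_keywords)     # textual topic feature
    o_image = attend(graph_feature, visual_keywords)  # visual topic feature
    return graph_feature + o_text + o_image           # list concatenation


# Toy inputs, invented for illustration only.
graph_feature = [1.0, 0.0]
text_keywords = [[1.0, 0.0], [0.0, 1.0]]
visual_keywords = [[0.5, 0.5]]
fused = integrate_topics(graph_feature, text_keywords, visual_keywords)
print(len(fused))
```

The fused vector is three times the input dimension, so a downstream classifier would need to be sized accordingly; a real model would also learn projection matrices for the query and keywords rather than use raw vectors.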

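Experimental Results

Research Questions

RQ1: Does fine-grained internal-information screening improve multimodal relation extraction by pruning irrelevant visual and textual features?
RQ2: Does exploiting external information through latent multimodal topics enrich the context and yield gains beyond pruning alone?
RQ3: How do Gene (GIB) and Lamo interact to improve relation prediction under different degrees of text-vision relevance?
RQ4: How do the cross-modal graph structure and scene-graph quality affect MRE performance?
RQ5: In which data scenarios (high versus low text-vision relevance) do internal screening and external exploitation each contribute most?

Main Findings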

Method	Acc.	Pre.	Rec.	F1
Text-based Methods - BERT	-	63.85	55.79	59.55
Text-based Methods - PCNN	72.67	62.85	49.69	55.49
Text-based Methods - MTB	72.73	64.46	57.81	60.86
Text-based Methods - DP-GCN	74.60	64.04	58.44	61.11
Multimodal Methods - BERT(Text+Image)	74.59	63.07	59.53	61.25
Multimodal Methods - BERT+SG	74.09	62.95	62.65	62.80
Multimodal Methods - MEGA	76.15	64.51	68.44	66.41
Multimodal Methods - VisualBERT	-	57.15	59.48	58.30
Multimodal Methods - ViLBERT	-	64.50	61.86	63.16
Multimodal Methods - RDS	-	66.83	65.47	66.14
Multimodal Methods - HVPNet	-	83.64	80.78	81.85
Multimodal Methods - MKGformer	92.31	82.67	81.25	81.95
Ours	94.06	84.69	83.38	84.03
w/o Gene (Eq. 11)	92.42	82.41	81.83	82.12
w/o I(z,G) (Eq. 13)	93.64	83.61	82.34	82.97
w/o Lamo (Eq. 4)	92.86	82.97	81.22	82.09
w/o o^T	93.05	83.95	82.53	83.23
w/o o^I	93.63	84.03	83.18	83.60
w/o VSG&TSG	93.12	83.51	82.67	83.09
w/o CMG	93.97	84.38	83.20	83.78
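
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 can be recomputed from the two columns beside it. The sketch below does this for three rows copied from the table; the helper name `f1` and the choice of rows are illustrative. Small deviations are expected because the published precision and recall are themselves rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "BERT": (63.85, 55.79, 59.55),
    "DP-GCN": (64.04, 58.44, 61.11),
    "Ours": (84.69, 83.38, 84.03),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1(p, r)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```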

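The proposed framework achieves state-of-the-art results on the benchmark MRE dataset, reaching 84.03 F1 against 81.95 for the strongest multimodal baseline, MKGformer.
GIB-guided feature refinement denoises the input features by pruning nodes and edges, yielding more task-focused representations.
Latent multimodal topic modeling (Lamo) supplies coherent textual and visual topic features that enrich the context and improve prediction.
Ablations show that Gene and Lamo each contribute substantially: removing either lowers F1 from 84.03 to about 82.1. Removing the scene graphs (83.09) or the CMG (83.78) costs less.
The analysis indicates that Gene helps more when text-vision relevance is high, whereas Lamo helps more when cross-modal relevance is low; combining the two gives robust gains across scenarios.
Qualitative case studies show that the refined graphs retain task-relevant edges and topic keywords that guide relation inference.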
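
To make the ablation comparison concrete, the F1 drop of each variant relative to the full model can be ranked directly from the table. This is a small worked calculation using the reported F1 values only; the variable names are illustrative.

```python
FULL_F1 = 84.03  # F1 of the full model ("Ours") from the table

# Reported F1 of each ablated variant, copied from the table.
ablation_f1 = {
    "w/o Gene": 82.12,
    "w/o I(z,G)": 82.97,
    "w/o Lamo": 82.09,
    "w/o o^T": 83.23,
    "w/o o^I": 83.60,
    "w/o VSG&TSG": 83.09,
    "w/o CMG": 83.78,
}

# Sort variants by how much F1 they lose, largest drop first.
drops = sorted(
    ((round(FULL_F1 - score, 2), name) for name, score in ablation_f1.items()),
    reverse=True,
)

for drop, name in drops:
    print(f"{name}: -{drop:.2f} F1")
```

Ranked this way, removing Lamo or Gene costs the most (just under two F1 points each), while removing the CMG costs the least.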

This digest was generated by AI and reviewed by a human editor.