QUICK REVIEW

[Paper Review] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

F. Richard Yu, Jiji Tang | arXiv (Cornell University) | 2020. 06. 30.

Multimodal Machine Learning Applications | References: 32 | Citations: 118

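One-line summary

ERNIE-ViL introduces Scene Graph Prediction tasks to inject structured scene knowledge into vision-language pre-training, achieves state-of-the-art results on five downstream tasks, and ranks first on the VCR leaderboard with an absolute improvement of 3.7 percentage points.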

ABSTRACT

We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint representations of vision-language. ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction tasks in the pre-training phase. Specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn the joint representations characterizing the alignments of the detailed semantics across vision and language. After pre-training on large scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performances on all these tasks and ranks the first place on the VCR leaderboard with an absolute improvement of 3.7%.

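Motivation and objectives

Motivates improving vision-language pre-training by capturing detailed cross-modal semantics: objects, attributes, and relationships.
Integrates structured knowledge from scene graphs into pre-training to strengthen cross-modal alignment.
Demonstrates that scene-graph-guided pre-training yields gains on multiple cross-modal benchmarks.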

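Proposed method

Constructs Scene Graph Prediction tasks that mask and predict the objects, attributes, and relationships in the scene graph parsed from the sentence.
Uses a two-stream cross-modal Transformer architecture that jointly models image regions and text through cross-modal attention.
Pre-trains with a combination of the Scene Graph Prediction losses, Masked Language Modeling (MLM), Masked Region Prediction, and Image-Text Matching.
The scene graph parsed from the text supplies the prediction targets for Object Prediction, Attribute Prediction, and Relationship Prediction.
Object, attribute, and relationship nodes are each masked with a type-specific strategy and recovered from the context of the remaining text and the image regions.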
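
The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_scene_graph_nodes`, the hand-written `scene_graph` dictionary (standing in for the output of a scene graph parser), the whitespace tokenization, the `[MASK]` placeholder, and the default `mask_prob` are all assumptions made for illustration. It only shows how caption tokens belonging to one node type could be hidden and kept as prediction targets.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, for illustration only


def mask_scene_graph_nodes(tokens, scene_graph, node_type, mask_prob=0.3, seed=0):
    """Mask caption tokens that match scene-graph nodes of one type.

    Returns (masked_tokens, targets), where targets maps the index of each
    masked position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    node_words = set(scene_graph[node_type])
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Only tokens that are nodes of the requested type are candidates.
        if tok in node_words and rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets


# Toy caption and a hand-written scene graph (hypothetical parser output).
tokens = "a brown dog chases a white cat on the grass".split()
scene_graph = {
    "objects": ["dog", "cat", "grass"],
    "attributes": ["brown", "white"],
    "relationships": ["chases", "on"],
}

# mask_prob=1.0 masks every attribute node so the effect is easy to see.
masked, targets = mask_scene_graph_nodes(tokens, scene_graph, "attributes", mask_prob=1.0)
print(" ".join(masked))
print(targets)
```

In this toy setup the model would see the masked caption together with the image regions and be trained to recover the tokens stored in `targets`. Calling the function with `"objects"` or `"relationships"` gives the corresponding Object Prediction and Relationship Prediction inputs.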

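Experimental results

Research questions

RQ1: Can including structured scene-graph knowledge during pre-training improve fine-grained vision-language understanding?
RQ2: Do the Scene Graph Prediction tasks lead to better cross-modal alignment of objects, attributes, and relationships?
RQ3: How does ERNIE-ViL perform against prior pre-training methods on standard vision-language benchmarks such as VCR, VQA, RefCOCO+, and Flickr-based retrieval?
RQ4: How does the choice between in-domain and out-of-domain pre-training data affect results when scene-graph-guided objectives are used?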

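Key results

Achieves state-of-the-art results on five downstream vision-language tasks.
On VCR, ERNIE-ViL-large shows substantial gains over the baselines and takes first place on the VCR leaderboard, with an absolute improvement of 3.7 percentage points over prior methods on Q→AR.
With Scene Graph Prediction, region-to-phrase grounding (RefCOCO+) improves by a meaningful 2.4% on the test set.
Pre-training with Scene Graph Prediction yields measurable gains across tasks, and the improvement appears in comparisons with models initialized from ERNIE-2.0 or BERT.
Cloze tests show that models trained with the SGP tasks predict objects, attributes, and relationships more accurately, suggesting a stronger understanding of detailed cross-modal semantics.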
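
As a rough illustration of how a cloze-style evaluation like the one mentioned above could be scored, the sketch below computes top-k accuracy separately for each node type. The function name `cloze_accuracy`, the tuple layout of `examples`, and the toy predictions are assumptions for illustration; none of the numbers are results from the paper.

```python
from collections import defaultdict


def cloze_accuracy(examples, k=1):
    """Top-k accuracy per node type.

    Each example is (node_type, gold_token, ranked_predictions), where
    ranked_predictions lists the model's guesses from most to least likely.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for node_type, gold, ranked in examples:
        totals[node_type] += 1
        # Count a hit when the gold token is among the k best guesses.
        if gold in ranked[:k]:
            hits[node_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Invented toy predictions for masked scene-graph nodes.
examples = [
    ("object", "dog", ["dog", "cat", "horse"]),
    ("object", "cat", ["dog", "cat", "bird"]),
    ("attribute", "brown", ["black", "brown", "white"]),
    ("relationship", "on", ["on", "in", "under"]),
]

print(cloze_accuracy(examples, k=1))
print(cloze_accuracy(examples, k=2))
```

Reporting accuracy per node type rather than a single pooled number makes it possible to see whether gains come from objects, attributes, or relationships.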


This review was generated by AI and reviewed by a human editor.