QUICK REVIEW

[Paper Review] Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

Ruichi Yu, Ang Li|arXiv (Cornell University)|2017. 07. 28.

Multimodal Machine Learning Applications|References: 28|Citations: 57

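One-Line Summary

This paper presents a teacher-student framework that distills internal and external linguistic knowledge into a visual relationship detector, improving predicate prediction, especially in the zero-shot setting.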

ABSTRACT

Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict the predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for the long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj,obj) pair. Then, we distill the knowledge into a deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).

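Research Motivation and Goals

Capture and predict visual relationships as ⟨subject, predicate, object⟩ triplets by modeling the three components jointly.
Regularize the deep visual model with linguistic knowledge to handle the long-tail distribution and unseen relationships.
Exploit internal (training annotations) and external (publicly available text) linguistic statistics through knowledge distillation.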

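Proposed Method

Model the predicate jointly with the subject and object representations and their spatial configuration.
Build a teacher network from the linguistic knowledge P(pred|subj,obj) and distill it into the student network during training.
Collect linguistic knowledge from training annotations and Wikipedia, and combine the two to form the teacher's guidance.
Condition the predicate probabilities on semantic embeddings of the subject and object and on spatial features.
Train end to end with a loss that mixes ground-truth supervision with the teacher's guidance (a KL-like distillation term).
Evaluate with Recall@k on the VRD and Visual Genome datasets, including the zero-shot split.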
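
The steps listed above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the add-k smoothing, the `strength` exponent used to build the teacher, and the `alpha` mixing weight are all assumptions made for illustration. It shows one plausible way to estimate P(pred|subj,obj) from annotated triplets, reweight the student's output with that prior to obtain a teacher distribution, and mix hard-label and teacher supervision in a single loss.

```python
import math
from collections import Counter, defaultdict


def conditional_predicate_prior(triplets, predicates, smoothing=1.0):
    """Estimate P(pred | subj, obj) from (subj, pred, obj) triplets.

    Add-k smoothing is an assumption of this sketch; it keeps unseen
    predicates from receiving exactly zero probability.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        prior[pair] = [(pred_counts[p] + smoothing) / total for p in predicates]
    return prior


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def teacher_distribution(student_probs, prior_probs, strength=1.0):
    """Hypothetical teacher: student output reweighted by the linguistic prior."""
    unnorm = [s * (p ** strength) for s, p in zip(student_probs, prior_probs)]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


def distillation_loss(student_probs, teacher_probs, label_index, alpha=0.5):
    """alpha * CE(hard label, student) + (1 - alpha) * CE(teacher, student)."""
    eps = 1e-12
    hard = -math.log(student_probs[label_index] + eps)
    soft = -sum(t * math.log(s + eps)
                for t, s in zip(teacher_probs, student_probs))
    return alpha * hard + (1 - alpha) * soft


# Toy data, invented for illustration only.
predicates = ["on", "ride", "next to"]
triplets = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("cup", "on", "table"),
]
prior = conditional_predicate_prior(triplets, predicates)
student = softmax([0.2, 0.1, 0.0])  # stand-in for the visual model's output
teacher = teacher_distribution(student, prior[("person", "horse")])
loss = distillation_loss(student, teacher, predicates.index("ride"))
```

In this toy case the student slightly prefers "on" for the (person, horse) pair, while the prior pulls the teacher toward "ride", so the soft term nudges the student toward the linguistically likelier predicate. External statistics (for example, counts mined from text) could be passed through the same `conditional_predicate_prior` function and blended with the internal prior; how the paper weights the two sources is not specified here.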

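Experimental Results

Research Questions

RQ1 Can linguistic statistics (internal and external) regularize a deep visual relationship model and improve its generalization?
RQ2 How does combining the teacher and student networks affect performance on seen data and in the zero-shot setting?
RQ3 How do semantic and spatial representations affect predicate prediction accuracy?
RQ4 Does an external knowledge source (e.g., Wikipedia) help or hurt when integrated with the internal training data?

Key Results

Linguistic knowledge (LK) distillation substantially improves predicate prediction over a purely data-driven baseline.
With LK distillation, zero-shot recall on VRD improves from 8.45% to 19.17%.
Combining teacher and student predictions (T+S) performs best, outperforming the baseline in both the seen-data and zero-shot settings.
Using semantic representations of the subject and object together with spatial features improves predictive power and generalization.
External knowledge alone can be noisy, but LK distillation still helps when it is combined with internal knowledge and visual data.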
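
To make the recall figures above concrete, here is a simplified sketch of Recall@k and of a zero-shot filter. It is an illustration under stated assumptions, not the benchmark's evaluation code: triplets are matched by label only (real VRD evaluation may also require bounding-box overlap), and the names `recall_at_k` and `zero_shot_subset` are hypothetical.

```python
def recall_at_k(scored_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found in each image's top-k predictions.

    scored_predictions: dict image_id -> list of (score, triplet)
    ground_truth:       dict image_id -> set of triplets
    """
    hits = 0
    total = 0
    for image_id, gt_triplets in ground_truth.items():
        ranked = sorted(scored_predictions.get(image_id, []),
                        key=lambda item: item[0], reverse=True)
        top_k = {triplet for _, triplet in ranked[:k]}
        hits += len(gt_triplets & top_k)
        total += len(gt_triplets)
    return hits / total if total else 0.0


def zero_shot_subset(ground_truth, training_triplets):
    """Keep only ground-truth triplets that never occur in the training set."""
    seen = set(training_triplets)
    return {image_id: {t for t in gt if t not in seen}
            for image_id, gt in ground_truth.items()}


# Toy data, invented for illustration only.
predictions = {
    "img1": [(0.9, ("person", "ride", "horse")),
             (0.5, ("person", "on", "horse")),
             (0.1, ("person", "next to", "horse"))],
    "img2": [(0.8, ("cup", "on", "table")),
             (0.6, ("cup", "next to", "table"))],
}
ground_truth = {
    "img1": {("person", "ride", "horse")},
    "img2": {("cup", "next to", "table")},
}
training_triplets = [("person", "ride", "horse")]

overall_r1 = recall_at_k(predictions, ground_truth, k=1)
zero_shot_gt = zero_shot_subset(ground_truth, training_triplets)
zero_shot_r1 = recall_at_k(predictions, zero_shot_gt, k=1)
zero_shot_r2 = recall_at_k(predictions, zero_shot_gt, k=2)
```

Restricting the ground truth to unseen triplets is what makes the zero-shot number harder to raise: a model that only memorizes frequent training triplets scores well overall but poorly on this subset.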


This review was generated by AI and reviewed by a human editor.