QUICK REVIEW

[Paper Review] CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

Yuxin Peng, Jinwei Qi | arXiv (Cornell University) | 2017-10-14

Advanced Image and Video Retrieval Techniques | References: 41 | Citations: 65

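One-Line Summary

CM-GANs uses a cross-modal GAN with weight-sharing autoencoders and two kinds of discriminators to learn a discriminative cross-modal common representation, and achieves state-of-the-art cross-modal retrieval performance on several datasets.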

ABSTRACT

It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause the heterogeneity gap that makes it challenging to correlate such heterogeneous data. Generative adversarial networks (GANs) have shown its strong ability of modeling data distribution and learning discriminative representation, existing GANs-based works mainly focus on generative problem to generate new data. We have different goal, aim to correlate heterogeneous data, by utilizing the power of GANs to model cross-modal joint distribution. Thus, we propose Cross-modal GANs to learn discriminative common representation for bridging heterogeneity gap. The main contributions are: (1) Cross-modal GANs architecture is proposed to model joint distribution over data of different modalities. The inter-modality and intra-modality correlation can be explored simultaneously in generative and discriminative models. Both of them beat each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with weight-sharing constraint are proposed to form generative model. They can not only exploit cross-modal correlation for learning common representation, but also preserve reconstruction information for capturing semantic consistency within each modality. (3) Cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They can mutually boost to make common representation more discriminative by adversarial training process. To the best of our knowledge, our proposed CM-GANs approach is the first to utilize GANs to perform cross-modal common representation learning. Experiments are conducted to verify the performance of our proposed approach on cross-modal retrieval paradigm, compared with 10 methods on 3 cross-modal datasets.

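Motivation and Objectives

Bridge the heterogeneity gap between the image and text modalities to support cross-modal retrieval.
Propose a cross-modal GAN framework that models the joint distribution in order to learn a discriminative common representation.
Enforce inter-modality correlation through adversarial training while preserving intra-modality semantic reconstruction.
Introduce weight-sharing cross-modal autoencoders that learn a shared representation while retaining modality-specific information.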

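Proposed Method

Introduces cross-modal convolutional autoencoders (G_I and G_T) with shared final-layer weights, which learn the common representations (s_p^i, s_p^t) and produce the reconstructed representations (r_p^i, r_p^t).
Uses two parallel GANs with two kinds of discriminators: intra-modality discriminators (D_I, D_T), which distinguish originals from reconstructions, and inter-modality discriminators (D_Ci, D_Ct), which operate on the cross-modal common representations.
Formulates two adversarial losses (L_GAN1 for intra-modality reconstruction, L_GAN2 for inter-modality correlation) and combines them into a single minimax objective.
Trains adversarially by alternating updates of the discriminative and generative models, so that the two promote each other in learning a discriminative common representation.
Applies weight sharing in the final encoder layer and a softmax constraint to enforce semantic alignment across modalities.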
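
The weight-sharing idea in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the tanh activation, the random initialisation, and the helper names (`make_layer`, `encode`) are all assumptions chosen for illustration, and plain Python lists stand in for a real deep learning framework. The only point it shows is that each modality has its own early layer, while one final layer (followed by softmax) is reused by both encoders, so both outputs live in the same common space.

```python
import math
import random

def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(rng, n_out, n_in):
    """Hypothetical helper: a small, randomly initialised linear layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Toy dimensions (assumptions, not values from the paper).
IMG_DIM, TXT_DIM, HID_DIM, COMMON_DIM = 6, 4, 5, 3

# Modality-specific layers: each modality keeps its own parameters.
img_layer = make_layer(rng, HID_DIM, IMG_DIM)
txt_layer = make_layer(rng, HID_DIM, TXT_DIM)
# Final encoder layer: ONE parameter set used by both encoders (weight sharing).
shared_layer = make_layer(rng, COMMON_DIM, HID_DIM)

def encode(x, modality_layer):
    """Map an input to the common space:
    modality-specific layer -> tanh -> shared layer -> softmax."""
    hidden = [math.tanh(v) for v in linear(*modality_layer, x)]
    return softmax(linear(*shared_layer, hidden))

image_feature = [rng.random() for _ in range(IMG_DIM)]
text_feature = [rng.random() for _ in range(TXT_DIM)]

s_i = encode(image_feature, img_layer)  # plays the role of s_p^i
s_t = encode(text_feature, txt_layer)   # plays the role of s_p^t
```

Because `shared_layer` is a single object, any update to it would affect both encoders at once, which is what ties the two modalities to one common space in this sketch. The decoders, discriminators, and adversarial losses are deliberately left out.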

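Experimental Results

Research Questions

RQ1: Can a GAN-based architecture learn a discriminative common representation that correlates heterogeneous data from different modalities (image and text)?
RQ2: Does cross-modal adversarial training with both intra-modality and inter-modality discriminators improve cross-modal retrieval performance?
RQ3: Do weight-sharing cross-modal autoencoders enable inter-modality correlation while effectively preserving intra-modality semantics?

Key Results

CM-GANs achieves the best retrieval accuracy when compared with 10 state-of-the-art cross-modal retrieval methods on three datasets.
The approach is also effective on the authors' XMediaNet dataset.
The weight-sharing cross-modal convolutional autoencoders capture inter-modality correlation while preserving semantic consistency within each modality.
The proposed cross-modal adversarial mechanism is shown to improve the learning of a discriminative common representation.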
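
To make the retrieval setting concrete, here is a rough sketch of how cross-modal retrieval over common representations could be scored. The review does not state the similarity function or the metric, so cosine similarity and average precision are assumptions here, and the vectors and labels are made-up toy values, not results from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list, counting same-label items as relevant."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Toy common representations: one image query against a text gallery.
image_query = [0.8, 0.1, 0.1]
image_query_label = "cat"
text_gallery = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.1, 0.3],
    [0.2, 0.2, 0.6],
]
text_labels = ["cat", "dog", "cat", "bird"]

order = rank_gallery(image_query, text_gallery)
ap = average_precision([text_labels[i] for i in order], image_query_label)
print(order, ap)
```

Averaging this score over many queries, in both the image-to-text and text-to-image directions, would give one plausible summary number for comparing methods.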


This review was generated by AI and checked by a human editor.