QUICK REVIEW

[Paper Review] Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions

Quanzeng You, Hailin Jin|arXiv (Cornell University)|2018. 01. 30.

Multimodal Machine Learning Applications | References: 31 | Citations: 47

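One-Line Summary

This paper presents two end-to-end models that inject sentiment into image captions, making positive/negative caption generation controllable without degrading visual-semantic alignment, and outperforming previous sentiment-captioning methods.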

ABSTRACT

Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. However, for human beings, image caption writing is quite flexible and diverse, where additional language dimensions, such as emotion, humor and language styles, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our model outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can also easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments.

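Motivation and Objectives

Argues for image captions that go beyond factual description and convey sentiment.
Proposes end-to-end models that inject sentiment into caption generation without degrading image-text alignment.
Enables controllable sentiment generation by conditioning on explicit sentiment labels.
Shows that the sentiment-aware models outperform state-of-the-art baselines on the sentiment captioning task.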

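Proposed Method

Direct Injection: concatenates a sentiment unit (-1, 0, 1) to the RNN input at every generation step to bias word choice.
Sentiment Flow: introduces a sentiment cell that propagates an initial sentiment signal through the LSTM, with a sentiment loss guiding the final sentiment state.
Trains end-to-end on MS-COCO plus SentiCap data, using sentiment labels and an optional sentiment loss.
Uses ResNet-152 as the CNN encoder, 256-d embeddings, and a 512-d RNN, trained with the Adam optimizer.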
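
As a rough illustration of the Direct Injection idea described above, the following Python sketch appends a single sentiment unit to each step's word embedding before it would be passed to the RNN. The function names, the toy random embeddings, and the label-to-value mapping (negative = -1, neutral = 0, positive = 1) are assumptions made for illustration. The 256-d embedding size comes from the review, and the recurrent network itself is omitted.

```python
import random

EMBED_DIM = 256  # word-embedding size reported in the review
# Assumed mapping from sentiment label to the injected unit value.
SENTIMENTS = {"negative": -1, "neutral": 0, "positive": 1}


def inject_sentiment(word_embedding, sentiment):
    """Return one step's RNN input: the embedding plus one sentiment unit."""
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {sentiment!r}")
    return list(word_embedding) + [float(SENTIMENTS[sentiment])]


def build_step_inputs(embeddings, sentiment):
    """Attach the same sentiment unit at every generation step."""
    return [inject_sentiment(e, sentiment) for e in embeddings]


# Toy caption of three tokens with random stand-in embeddings.
rng = random.Random(0)
caption_embeddings = [
    [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(3)
]
positive_inputs = build_step_inputs(caption_embeddings, "positive")
negative_inputs = build_step_inputs(caption_embeddings, "negative")
```

In this sketch the two input sequences differ only in their last component, which is what would let a trained decoder shift its word choices with the requested sentiment while seeing the same visual and lexical context.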

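Experimental Results

Research Questions

RQ1: Can sentiment be injected into image captions while keeping them semantically consistent with the image?
RQ2: Which architectural scheme, Direct Injection or Sentiment Flow, better supports controllable sentiment caption generation?
RQ3: Does introducing a sentiment loss improve the model's ability to distinguish and propagate sentiment across the caption sequence?
RQ4: Given positive/negative examples, how well does the model generate captions that match the sentiment label?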

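Key Findings

Both proposed models outperform the baselines on standard metrics on the sentiment captioning benchmark.
Direct Injection provides a stronger per-step sentiment signal and achieves a higher rate of sentiment-bearing captions, especially for negative captions.
Sentiment Flow delivers balanced performance on the POS and NEG sets and benefits from the sentiment loss in several configurations.
The models support controllable generation by switching the sentiment label at test time, producing captions whose sentiment words match the label and are distributed across the described image content.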
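
The review does not say how the rate of sentiment-bearing captions is computed. One plausible reading, sketched below in Python, counts the fraction of generated captions that contain at least one word from a sentiment lexicon. The lexicons, the example captions, and the function name are hypothetical stand-ins, not outputs or resources from the paper.

```python
# Hypothetical toy lexicons; a real evaluation would use a curated vocabulary.
POSITIVE_WORDS = {"beautiful", "happy", "delicious"}
NEGATIVE_WORDS = {"dirty", "lonely", "ugly"}


def sentiment_caption_rate(captions, lexicon):
    """Fraction of captions containing at least one word from the lexicon."""
    if not captions:
        return 0.0
    hits = sum(1 for c in captions if set(c.lower().split()) & lexicon)
    return hits / len(captions)


# Made-up captions standing in for model output under each sentiment label.
positive_captions = [
    "a happy dog runs on a beautiful beach",
    "a dog runs on a beach",
]
negative_captions = [
    "a lonely dog runs on a dirty beach",
    "a dog sits on a dirty street",
]
pos_rate = sentiment_caption_rate(positive_captions, POSITIVE_WORDS)
neg_rate = sentiment_caption_rate(negative_captions, NEGATIVE_WORDS)
```

Under these toy inputs, one of the two positive captions and both negative captions contain a lexicon word. A whitespace tokenizer is used for brevity, so punctuation attached to a word would prevent a match.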

This review was generated by AI and reviewed by a human editor.