QUICK REVIEW

[Paper Review] Adaptively Aligned Image Captioning via Adaptive Attention Time

Lun Huang, Wenmin Wang | arXiv (Cornell University) | 2019-09-19

Multimodal Machine Learning Applications | Citations: 39

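One-line summary

This paper introduces Adaptive Attention Time (AAT), a differentiable mechanism that decides how many attention steps to take at each decoding step in image captioning, and that improves over fixed one-step attention and recurrent attention models.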

ABSTRACT

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming one-to-one mapping from source image regions and target caption words, which is never possible. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and doesn't introduce any noise to the parameter gradients. In this paper, we empirically show that AAT improves over state-of-the-art methods on the task of image captioning. Code is available at https://github.com/husthuaan/AAT.

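Motivation and objectives

Improve image captioning by removing the one-to-one image-region-to-word assumption of standard attention models.
Enable an adaptive, per-word number of attention steps so that image regions and caption words can be aligned flexibly.
Allow adaptive computation during decoding while preserving interpretability and stability.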

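Proposed method

Proposes Adaptive Attention Time (AAT), which learns how many attention steps to take at each decoding step.
Embeds AAT in a two-layer LSTM encoder-decoder, with an attention module that can perform several attention steps per word.
Inspired by Adaptive Computation Time (ACT), uses a neural network (the confidence network) to decide when to stop attending and output a word.
Introduces multi-head attention to better capture interactions among image regions.
Adds a time-cost penalty during training to balance accuracy against computation.
Shows that base, recurrent, and adaptive attention models are special cases of AAT, which places them in a single framework.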
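
The halting idea in the list above can be illustrated with a minimal, standard-library Python sketch. It follows generic ACT-style halting (accumulate a confidence value per attention step, stop once the total passes 1 - eps, and charge a cost for the steps used); it is not the paper's exact formulation. The single-head dot-product attention, the residual state update, and the names `adaptive_attention_step`, `conf_w`, `conf_b`, `max_steps`, and `eps` are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(query, regions):
    """Single-head dot-product attention over image region features (assumed form)."""
    weights = softmax([dot(query, r) for r in regions])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(len(query))]


def adaptive_attention_step(query, regions, conf_w, conf_b, max_steps=4, eps=0.01):
    """Run a variable number of attention steps for one decoding step.

    Returns (context, n_steps, time_cost). The context is a weighted average
    of the intermediate states; the weights are the per-step confidences,
    with the last step receiving the remainder so that they sum to 1.
    """
    state = list(query)
    context = [0.0] * len(query)
    halt_sum = 0.0
    for n in range(1, max_steps + 1):
        attended = attend(state, regions)
        # Hypothetical state update: a plain average stands in for the LSTM.
        state = [0.5 * s + 0.5 * a for s, a in zip(state, attended)]
        # Hypothetical confidence network: one linear layer plus a sigmoid.
        p = sigmoid(dot(conf_w, state) + conf_b)
        if halt_sum + p >= 1.0 - eps or n == max_steps:
            remainder = 1.0 - halt_sum
            context = [c + remainder * s for c, s in zip(context, state)]
            # ACT-style cost: number of steps taken plus the remainder.
            return context, n, n + remainder
        halt_sum += p
        context = [c + p * s for c, s in zip(context, state)]


def total_loss(caption_loss, time_cost, lam=1e-4):
    """Caption loss plus the time-cost penalty, weighted by lambda."""
    return caption_loss + lam * time_cost
```

With a strongly positive `conf_b` the loop halts after one attention step, which corresponds to ordinary one-step attention; with a strongly negative `conf_b` it always runs `max_steps` steps, which behaves like a fixed-step recurrent attention model. Larger values of `lam` push the learned behaviour toward fewer steps.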

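Experimental results

Research questions

RQ1: Can adaptive attention steps at each decoding step improve caption quality over one-step or fixed-step attention models?
RQ2: How does AAT balance caption quality against computational cost, measured in attention time?
RQ3: What effect do the number of attention heads and the choice of additive vs. dot-product attention have in this framework?
RQ4: Does the adaptive attention mechanism generalize beyond image captioning to other encoder-decoder tasks?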

Main results

Model	Cross-entropy BLEU-4	Cross-entropy METEOR	Cross-entropy ROUGE-L	Cross-entropy CIDEr-D	Cross-entropy SPICE	Self-critical BLEU-4	Self-critical METEOR	Self-critical ROUGE-L	Self-critical CIDEr-D	Self-critical SPICE
LSTM	29.6	25.2	52.6	94.0	-	31.9	25.5	54.3	106.3	-
ADP-ATT	33.2	26.6	-	108.5	-	-	-	-	-	-
SCST	30.0	25.9	53.4	99.4	-	34.2	26.7	55.7	114.0	-
Up-Down	36.2	27.0	56.4	113.5	20.3	36.3	27.7	56.9	120.1	21.4
RFNet	35.8	27.4	56.8	112.5	20.5	36.5	27.7	57.3	121.9	21.2
GCN-LSTM	36.8	27.9	57.0	116.3	20.9	38.2	28.5	58.3	127.6	22.0
SGAE	-	-	-	-	-	38.4	28.4	58.6	127.8	22.1
AAT (Ours)	37.0	28.1	57.3	117.2	21.2	38.7	28.6	58.5	128.6	22.2

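On MS COCO (Karpathy split), AAT outperforms the base and recurrent attention models on METEOR, CIDEr-D, and SPICE while using an average of 2.55 attention steps per decoding step.
With lambda = 1e-4, AAT achieves strong performance while keeping the average number of attention steps relatively low (2.54–2.84 in the ablations).
Multi-head additive attention (8 heads) gives the best trade-off, reaching CIDEr-D 128.6 and SPICE 22.2 under self-critical training.
Compared with the earlier state-of-the-art Up-Down model, AAT improves BLEU-4, METEOR, ROUGE-L, CIDEr-D, and SPICE in both training stages; CIDEr-D, for example, rises from 113.5 to 117.2 with cross-entropy training and from 120.1 to 128.6 with self-critical training.
A single AAT model reaches 128.6 CIDEr-D on the MS COCO test set, which was state of the art at the time.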
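
The comparison with Up-Down can be checked directly against the table. The short Python sketch below copies the Up-Down and AAT rows from the results table and computes the per-metric gain in each training stage; the names `UP_DOWN`, `AAT`, and `gains` are illustrative only.

```python
METRICS = ["BLEU-4", "METEOR", "ROUGE-L", "CIDEr-D", "SPICE"]

# Scores copied from the results table above.
UP_DOWN = {
    "cross-entropy": [36.2, 27.0, 56.4, 113.5, 20.3],
    "self-critical": [36.3, 27.7, 56.9, 120.1, 21.4],
}
AAT = {
    "cross-entropy": [37.0, 28.1, 57.3, 117.2, 21.2],
    "self-critical": [38.7, 28.6, 58.5, 128.6, 22.2],
}


def gains(ours, baseline):
    """Per-metric improvement of `ours` over `baseline`, rounded to one decimal."""
    return {
        stage: {
            metric: round(a - b, 1)
            for metric, a, b in zip(METRICS, ours[stage], baseline[stage])
        }
        for stage in ours
    }


if __name__ == "__main__":
    for stage, per_metric in gains(AAT, UP_DOWN).items():
        print(stage, per_metric)
```

Every gain is positive in both stages, and the largest one is CIDEr-D under self-critical training.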


This review was generated by AI and reviewed by a human editor.