QUICK REVIEW

[Paper Review] Image Captioning: Transforming Objects into Words

Simao Herdade, Armin Kappeler|arXiv (Cornell University)|2019. 06. 14.

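Multimodal Machine Learning Applications | References: 26 | Citations: 93

One-Line Summary

Introduces the Object Relation Transformer, which incorporates geometric attention over the spatial relationships between objects and improves image captioning performance on MS-COCO, achieving state-of-the-art results among single-model approaches.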

ABSTRACT

Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, that builds upon this approach by explicitly incorporating information about the spatial relationship between input detected objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset.

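Motivation and Objectives

Improve image captioning by explicitly modeling the spatial relationships between detected objects.
Integrate geometric attention into a Transformer-based encoder for caption generation.
Demonstrate quantitative and qualitative improvements on MS-COCO over the baseline and prior methods.

Proposed Method

Detect objects with Faster R-CNN (ResNet-101) and extract a 2048-dimensional feature per box.
Replace the standard Transformer encoder attention with a combination of appearance and geometric attention, where the geometric weights are derived from relative box positions and sizes.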
Compute the relative geometry $\lambda(m,n)$ and embed it to produce $\omega_G$, then form the combined attention weight $\omega^{mn} = \frac{\omega_G^{mn} \exp(\omega_A^{mn})}{\sum_{l} \omega_G^{ml} \exp(\omega_A^{ml})}$.
Train with cross-entropy loss, then fine-tune with Self-Critical Sequence Training (optimizing CIDEr-D), and decode with beam search.
Evaluate on MS-COCO 2014 Captions using CIDEr-D, SPICE, BLEU, METEOR, and ROUGE-L.
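
The combined attention formula in the method steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: boxes are assumed to be `(cx, cy, w, h)` tuples, the exact form of the relative geometry vector is not given in this review and is assumed here, the learned embedding that produces the geometric weight is replaced by a hypothetical linear projection followed by a ReLU, and the small clamp `tiny` is added only to keep the normalization well-defined. All function and parameter names are made up for the example.

```python
import math

def relative_geometry(box_m, box_n, eps=1e-3):
    """4-d relative geometry between two boxes given as (cx, cy, w, h).

    Assumed form (not stated in this review): log-scaled centre offsets
    normalised by the size of box m, plus log width and height ratios.
    """
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return [
        math.log(max(abs(xm - xn), eps) / wm),
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    ]

def geometric_weight(lam, w_g, b_g=0.0):
    """omega_G: ReLU of a hypothetical linear projection of lambda(m, n)."""
    return max(0.0, sum(w * x for w, x in zip(w_g, lam)) + b_g)

def combined_attention(boxes, omega_a, w_g, b_g=0.0, tiny=1e-6):
    """Row-normalised omega_G^{mn} * exp(omega_A^{mn}) for every box pair."""
    n = len(boxes)
    out = []
    for m in range(n):
        row_max = max(omega_a[m])  # shift cancels in the ratio; avoids overflow
        scores = []
        for k in range(n):
            g = geometric_weight(relative_geometry(boxes[m], boxes[k]), w_g, b_g)
            # Clamp so a row whose geometric weights are all zero still normalises.
            scores.append(max(g, tiny) * math.exp(omega_a[m][k] - row_max))
        total = sum(scores)
        out.append([s / total for s in scores])
    return out

# Toy inputs: three boxes and a made-up matrix of appearance logits omega_A.
boxes = [(50.0, 50.0, 40.0, 40.0), (120.0, 60.0, 30.0, 50.0), (60.0, 150.0, 80.0, 20.0)]
omega_a = [[0.2, 1.0, -0.5], [0.0, 0.3, 0.1], [1.5, -1.0, 0.4]]
weights = combined_attention(boxes, omega_a, w_g=[0.5, -0.25, 1.0, 0.75], b_g=1.0)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so it can be used like ordinary attention weights. When the geometric weight is the same constant for every pair, the expression reduces to a plain softmax over the appearance logits, which shows that the geometric term acts as a multiplicative prior on top of standard attention.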

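Experimental Results

Research Questions

RQ1: Does incorporating spatial relationships between detected objects through geometric attention improve image captioning performance?
RQ2: How does the Object Relation Transformer compare with the standard Transformer and strong baselines on MS-COCO?
RQ3: What effect does geometric attention have on the SPICE subcategories related to relations and counting?

Key Results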

Algorithm	CIDEr-D	SPICE	BLEU-1	BLEU-4	METEOR	ROUGE-L
Att2all	114	-	-	34.2	26.7	55.7
Up-Down	120.1	21.4	79.8	36.3	27.7	56.9
Visual-policy	126.3	21.6	-	38.6	28.3	58.5
GCN-LSTM	127.6	22.0	80.5	38.2	28.5	58.3
SGAE	127.8	22.1	80.8	38.4	28.4	58.6
Ours	128.3	22.6	80.5	38.6	28.7	58.4

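The Object Relation Transformer improves CIDEr-D, SPICE, BLEU-1, BLEU-4, METEOR, and ROUGE-L over the standard Transformer, with statistically significant gains on several metrics.
Geometric attention raises the SPICE Relation and Count scores, suggesting better relational reasoning and object counting in the generated captions.
In the ablation study, adding object relations to the Transformer yields larger gains in CIDEr-D and BLEU when combined with beam search.
Compared with ordering boxes by size or by left-to-right/top-to-bottom position, geometric attention improves CIDEr-D, showing that it is more effective than simple positional encodings.
Qualitative examples show improved spatial awareness and relation descriptions, producing more accurate relations (e.g., "two chairs under an umbrella").
With geometric attention, the SPICE Count subcategory improves substantially, from 11.30 to 17.51.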
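
To put these numbers in perspective, the short script below recomputes two things from figures already reported in this review: the margin of "Ours" over the best listed baseline for each metric in the results table, and the relative size of the SPICE Count improvement. It is only an arithmetic sketch; the variable and function names are made up for the example, and missing table entries (shown as "-") are treated as unavailable rather than as zero.

```python
# Scores copied from the results table in this review; None marks a "-" entry.
METRICS = ["CIDEr-D", "SPICE", "BLEU-1", "BLEU-4", "METEOR", "ROUGE-L"]
RESULTS = {
    "Att2all":       [114.0, None, None, 34.2, 26.7, 55.7],
    "Up-Down":       [120.1, 21.4, 79.8, 36.3, 27.7, 56.9],
    "Visual-policy": [126.3, 21.6, None, 38.6, 28.3, 58.5],
    "GCN-LSTM":      [127.6, 22.0, 80.5, 38.2, 28.5, 58.3],
    "SGAE":          [127.8, 22.1, 80.8, 38.4, 28.4, 58.6],
    "Ours":          [128.3, 22.6, 80.5, 38.6, 28.7, 58.4],
}

def margin_over_best_baseline(results, ours="Ours"):
    """Difference between `ours` and the best other row, per metric."""
    margins = {}
    for i, metric in enumerate(METRICS):
        baselines = [
            scores[i]
            for name, scores in results.items()
            if name != ours and scores[i] is not None
        ]
        margins[metric] = round(results[ours][i] - max(baselines), 1)
    return margins

margins = margin_over_best_baseline(RESULTS)
for metric, delta in margins.items():
    print(f"{metric}: {delta:+.1f}")

# Relative gain of the SPICE Count subcategory (11.30 to 17.51).
count_before, count_after = 11.30, 17.51
count_gain_pct = round((count_after - count_before) / count_before * 100, 1)
print(f"SPICE Count relative gain: {count_gain_pct}%")
```

By this arithmetic, the model leads the listed baselines on CIDEr-D, SPICE, and METEOR, ties the best BLEU-4, and trails slightly on BLEU-1 and ROUGE-L, while the Count subcategory improves by roughly 55% in relative terms.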

This review was generated by AI and reviewed by a human editor.