QUICK REVIEW

[Paper Review] Actor-Critic Sequence Training for Image Captioning

Li Zhang, Flood Sung | arXiv (Cornell University) | June 29, 2017

Multimodal Machine Learning Applications | References: 13 | Citations: 99

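One-Line Summary

This paper trains an image captioning model with actor-critic reinforcement learning to directly optimize non-differentiable language metrics such as CIDEr, and achieves state-of-the-art performance without model ensembling.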

ABSTRACT

Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about such as CIDEr. In this paper we investigate training image captioning methods based on actor-critic reinforcement learning in order to directly optimise non-differentiable quality metrics of interest. By formulating a per-token advantage and value computation strategy in this novel reinforcement learning based captioning model, we show that it is possible to achieve the state of the art performance on the widely used MSCOCO benchmark.

Motivation and Objectives

Improve image captioning by directly optimizing language quality metrics instead of relying on likelihood-based training.
Address the exposure bias of teacher-forced training by treating caption generation as an RL problem.
Develop an actor-critic framework with per-token advantages to guide caption generation.
Demonstrate state-of-the-art performance on MSCOCO using a single model.
Evaluate and compare against strong supervised and RL-based baselines.

Proposed Method

Model image captioning as an encoder–decoder with CNN image features and an LSTM decoder.
Formulate caption generation as a Markov decision process where actions are word tokens.
Use an actor network to produce token distributions and a separate critic network to estimate state values.
Compute per-token advantages using a forward-view TD(λ) formulation with λ = 1, and use them to weight the policy gradient.
Define the reward as the quality score of the completed caption (e.g., CIDEr), which serves as the TD target for the critic and the learning signal for the actor.
Pre-train the actor with cross-entropy loss, then pre-train the critic on samples from the fixed actor, before training both jointly.
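
The per-token advantage step above can be illustrated with a minimal sketch. It assumes a single sampled caption whose only non-zero reward is the final score, so the λ = 1 return at step t reduces to the (optionally discounted) final score, and the advantage is that return minus the critic's value estimate. The function names, the discount parameter `gamma`, and the squared-error critic loss are illustrative assumptions and are not taken from the paper. In practice these quantities would be computed inside an automatic-differentiation framework, with the advantage treated as a constant when updating the actor.

```python
import math
from typing import List, Tuple


def per_token_advantages(
    values: List[float],
    final_reward: float,
    gamma: float = 1.0,  # assumed discount factor; 1.0 means no discounting
) -> Tuple[List[float], List[float]]:
    """Return (returns, advantages) for one sampled caption.

    values[t] is the critic's estimate V(s_t) before token t is emitted.
    Only the completed caption is rewarded, so the lambda = 1 return at
    step t is gamma ** (T - 1 - t) * final_reward.
    """
    T = len(values)
    returns = [gamma ** (T - 1 - t) * final_reward for t in range(T)]
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages


def actor_critic_losses(
    log_probs: List[float],
    values: List[float],
    final_reward: float,
    gamma: float = 1.0,
) -> Tuple[float, float]:
    """Compute illustrative actor and critic losses for one caption.

    log_probs[t] is log pi(a_t | s_t) for the sampled token at step t.
    """
    returns, advantages = per_token_advantages(values, final_reward, gamma)
    # Policy-gradient surrogate: each token's log-probability is weighted
    # by its own advantage.
    actor_loss = -sum(a * lp for a, lp in zip(advantages, log_probs))
    # The critic regresses its value estimates toward the returns.
    critic_loss = sum((v - g) ** 2 for v, g in zip(values, returns))
    return actor_loss, critic_loss


if __name__ == "__main__":
    # Hypothetical numbers for a three-token caption scored 1.2 by the metric.
    log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
    values = [0.9, 1.0, 1.1]
    actor_loss, critic_loss = actor_critic_losses(log_probs, values, 1.2)
    print(f"actor loss: {actor_loss:.4f}, critic loss: {critic_loss:.4f}")
```

Because each token gets its own advantage, a token emitted from a state the critic already rated highly receives a smaller update than one emitted from a state it underrated, even though both share the same final score.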

Experimental Results

Research Questions

RQ1: Can actor-critic reinforcement learning directly optimize non-differentiable language metrics in image captioning?
RQ2: Do per-token advantages and a separate value network improve training stability and performance over previous RL approaches?
RQ3: What is the impact of RL-based training on MSCOCO captioning performance compared to supervised and other RL methods?

Key Findings

The proposed actor-critic model achieves state-of-the-art performance on MSCOCO without model ensembles (ranked third on the official test server).
On the development set, the method improves CIDEr-D from 1.007 (supervised baseline) to 1.162 with single-model greedy decoding, an absolute gain of 0.155 (about 15.4% relative).
The approach outperforms attention-based and memory-enhanced baselines and several RL-based methods in CIDEr-D and other metrics.
The proposed method trains more efficiently than some RL baselines, in part because it requires no attention cells and fewer Monte Carlo samples.

This review was generated by AI and checked by a human editor.