QUICK REVIEW

[Paper Review] Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abhishek Das, Satwik Kottur|arXiv (Cornell University)|2017. 03. 20.

Multimodal Machine Learning Applications | References: 31 | Citations: 91

One-line summary

This paper trains visual question answering and dialog agents in a goal-driven way, applying end-to-end deep reinforcement learning to a cooperative image-guessing game between Q-bot and A-bot.

ABSTRACT

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.

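Research motivation and goals

Advance visually grounded conversational AI that can understand and discuss images.
Propose a cooperative two-agent setting in which one agent asks questions and the other answers, so that the questioner can identify an image it has not seen.
Show that end-to-end deep RL can ground language and improve dialog quality beyond supervised learning baselines.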

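Proposed method

Formalizes a cooperative image-guessing game between Q-bot (the questioner) and A-bot (the answerer).
Represents dialog as discrete natural-language tokens and grounds predictions in image embeddings through a feature regression network.
Trains both agents and the grounding predictor end-to-end with deep RL (REINFORCE), from pixels through multi-round dialog to game reward.
Uses two-level hierarchical encoder-decoder policies for Q-bot and A-bot, with a shared token vocabulary.
Moves from purely supervised learning to goal-driven optimization by maximizing the improvement in the predicted image representation.
Pretrains on supervised VisDial data first, then fine-tunes with RL to improve performance.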
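
The reward idea and the REINFORCE update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-round reward is taken to be the drop in Euclidean distance between the predicted and true image embeddings, and the policy is a tiny softmax over token logits rather than the neural encoder-decoder the paper uses. The names `round_rewards` and `reinforce_update`, the learning rate, and the 2-D embeddings are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def round_rewards(predictions, target):
    """Reward for round t = how much closer the guess moved to the target.

    predictions[0] is the guess before any dialog; predictions[t] is the
    guess after round t. Returns one reward per dialog round.
    (Assumed form: distance before the round minus distance after it.)
    """
    distances = [euclidean(p, target) for p in predictions]
    return [distances[t - 1] - distances[t] for t in range(1, len(distances))]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step for a softmax policy over discrete tokens.

    For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi_k,
    so each logit moves along reward * (one_hot_k - pi_k).
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical 2-D embeddings: the guess improves over two dialog rounds.
target = [3.0, 4.0]
guesses = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]]
rewards = round_rewards(guesses, target)  # distances go 5 -> 4 -> 0

# Reinforce the token (index 2) emitted in the round that earned rewards[1].
logits = [0.0, 0.0, 0.0]
new_logits = reinforce_update(logits, action=2, reward=rewards[1])
```

With this reward shape, the per-round rewards sum to the total reduction in distance over the dialog, so a round that moves the guess away from the true image is penalized and the agents are pushed toward exchanges that actually improve the guess.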

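Experimental results

Research questions

RQ1: Can two cooperative dialog agents learn grounded communication about visual content without human supervision?
RQ2: Does supervised pretraining followed by reinforcement learning yield better image-guessing performance than purely supervised dialog training?
RQ3: How should agents structure their questions and answers to maximize information gain about an unseen image?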

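Key findings

In a synthetic world with ungrounded symbols, the agents invent their own language mapping that ties symbols to visual attributes.
On real images (VisDial), RL fine-tuned agents outperform the supervised baseline on the image-guessing task.
The RL-trained Q-bot learns a questioning strategy matched to A-bot's strengths, which leads to more informative dialog and better team performance.
Grounded language emerges end-to-end through interaction, even when perception is imperfect.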
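
One way to picture how image-guessing performance could be scored is to rank a lineup of candidate images by their distance to the embedding the questioner predicts, then check where the true image lands. The sketch below assumes Euclidean distance and made-up 2-D embeddings; `rank_lineup` and `rank_of_target` are hypothetical helper names, not functions from the paper's code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_lineup(predicted, lineup):
    """Return lineup indices sorted from closest to farthest from `predicted`."""
    return sorted(range(len(lineup)), key=lambda i: euclidean(predicted, lineup[i]))


def rank_of_target(predicted, lineup, target_index):
    """1-based position of the true image in the sorted lineup (1 is best)."""
    return rank_lineup(predicted, lineup).index(target_index) + 1


# Hypothetical lineup of four image embeddings; index 2 is the unseen image.
lineup = [[0.0, 0.0], [5.0, 5.0], [1.0, 2.0], [9.0, 1.0]]
predicted = [1.5, 2.0]

order = rank_lineup(predicted, lineup)
target_rank = rank_of_target(predicted, lineup, target_index=2)
```

A lower rank for the true image means the dialog carried more useful information about it, which is the sense in which a fine-tuned agent pair can be compared against a supervised baseline.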


This review was generated by AI and checked by a human editor.