QUICK REVIEW

[Paper Review] Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

Xinpeng Chen, Lin Ma | arXiv (Cornell University) | December 8, 2018

Multimodal Machine Learning Applications | Citations: 63

One-Line Summary

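SSG is an end-to-end single-stage model that localizes the referent of a referring expression in an image without region proposals. It reaches competitive accuracy, including state-of-the-art results on ReferItGame, at real-time speed, grounding about 40 referents per second on RefCOCO on a GPU.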

ABSTRACT

In this paper, we propose a novel end-to-end model, namely Single-Stage Grounding network (SSG), to localize the referent given a referring expression within an image. Different from previous multi-stage models which rely on object proposals or detected regions, our proposed model aims to comprehend a referring expression through one single stage without resorting to region proposals as well as the subsequent region-wise feature extraction. Specifically, a multimodal interactor is proposed to summarize the local region features regarding the referring expression attentively. Subsequently, a grounder is proposed to localize the referring expression within the given image directly. For further improving the localization accuracy, a guided attention mechanism is proposed to enforce the grounder to focus on the central region of the referent. Moreover, by exploiting and predicting visual attribute information, the grounder can further distinguish the referent objects within an image and thereby improve the model performance. Experiments on RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our proposed SSG without relying on any region proposals can achieve comparable performance with other advanced models. Furthermore, our SSG outperforms the previous models and achieves the state-of-art performance on the ReferItGame dataset. More importantly, our SSG is time efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25ms (40 referents per second) on average with a Nvidia Tesla P40, accomplishing more than 9× speedups over the existing multi-stage models.

Motivation and Goals

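Motivate the need for real-time referring expression grounding that does not depend on region proposals.
Propose the Single-Stage Grounding network (SSG), an end-to-end single-stage model made up of a multimodal encoder, a multimodal interactor, and a grounder.
Introduce guided attention and attribute prediction to improve localization accuracy.
Demonstrate efficiency and competitive accuracy on the standard datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame).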

Proposed Method

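Encode the image with a YOLO-v3-based backbone to obtain local region features.
Encode the referring expression with a two-layer Bi-LSTM over ELMo embeddings.
Use an attention-based multimodal interactor to build a joint image-text representation.
Ground the expression by predicting a bounding box and a confidence score directly from the joint representation.
Apply four loss terms: localization (MSE), confidence (binary cross-entropy), guided attention (center bias), and attribute prediction (multi-label).
Train on a weighted sum of these losses; at inference, only the localization module is active.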
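To make the interactor and the combined objective more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the dot-product attention scoring, the negative-log form of the guided attention loss, the mean binary cross-entropy for attributes, the unit loss weights, and every function name are assumptions chosen for illustration. The review only states that attention summarizes region features with respect to the expression and that the four losses are combined as a weighted sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_feats, text_feat):
    """Summarize region features by attention against the text vector.

    Assumption: plain dot-product scoring; the paper's interactor may differ.
    region_feats: list of N vectors (one per grid cell), each of length D.
    text_feat: one vector of length D (hypothetical sentence encoding).
    """
    scores = [sum(r * t for r, t in zip(region, text_feat))
              for region in region_feats]
    weights = softmax(scores)
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(len(text_feat))]
    return weights, attended


def mse(pred, target):
    """Mean squared error between predicted and ground-truth box values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability, clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def guided_attention_loss(weights, center_idx):
    """Assumed form: penalize low attention on the referent's center cell."""
    return -math.log(max(weights[center_idx], 1e-7))


def attribute_loss(probs, labels):
    """Assumed form: mean binary cross-entropy over attribute labels."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)


def total_loss(loc, conf, att, attr, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights here are placeholders."""
    return sum(lam * term for lam, term in zip(lambdas, (loc, conf, att, attr)))


# Toy example: two grid cells, a 2-d text vector, made-up predictions.
weights, attended = attend([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
loss = total_loss(
    loc=mse([0.4, 0.5, 0.2, 0.3], [0.5, 0.5, 0.25, 0.3]),
    conf=bce(0.9, 1.0),
    att=guided_attention_loss(weights, center_idx=0),
    attr=attribute_loss([0.8, 0.1], [1.0, 0.0]),
)
```

In this sketch the guided attention and attribute terms only affect training. Dropping them at inference leaves the box and confidence prediction unchanged, which is consistent with the note that only the localization module is active at inference.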

Experimental Results

Research Questions

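RQ1: Can an end-to-end single-stage model achieve competitive localization accuracy without region proposals?
RQ2: Does guided attention toward the center of the referent improve localization?
RQ3: Can auxiliary attribute prediction further disambiguate referents and raise accuracy?
RQ4: Is the single-stage approach computationally efficient enough for real-time localization on standard datasets?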

Key Findings

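SSG achieves competitive performance on RefCOCO, RefCOCO+, and RefCOCOg without region proposals.
SSG achieves state-of-the-art performance on the ReferItGame dataset.
With GPU acceleration, SSG grounds about 40 referents per second on RefCOCO (416×416 input).
In ablation studies, adding the confidence, guided attention, and attribute prediction losses improves performance.
Inference is markedly faster than multi-stage methods (more than 9× per the abstract), demonstrating real-time localization capability.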
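The throughput figure follows directly from the per-expression latency reported in the abstract (25 ms on average). The short sketch below reproduces that arithmetic and the latency implied for a method that is 9× slower. The function names are hypothetical, and the 225 ms value is only the bound implied by "more than 9×", not a measured baseline.

```python
def referents_per_second(latency_ms):
    """Convert average per-expression latency (ms) into throughput."""
    return 1000.0 / latency_ms


def implied_baseline_latency_ms(latency_ms, speedup):
    """Latency a baseline would need for the given speedup factor to hold."""
    return latency_ms * speedup


print(referents_per_second(25.0))            # 40.0
print(implied_baseline_latency_ms(25.0, 9))  # 225.0
```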


This review was generated by AI and reviewed by a human editor.