QUICK REVIEW

[Paper Review] Video Description: A Survey of Methods, Datasets and Evaluation Metrics

Nayyer Aafaq, Ajmal Mian | UWA Profiles and Research Repository (UWA) | 2018. 06. 01.

Multimodal Machine Learning Applications | References: 38 | Citations: 95

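One-Line Summary

A comprehensive review of video description research that traces classical, statistical, and deep learning methods; compares datasets and evaluation metrics; and discusses open challenges and future directions.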

ABSTRACT

Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, helping the visually impaired and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets and evaluation metrics have been proposed in the literature, calling the need for a comprehensive survey to focus research efforts in this flourishing new direction. This paper fills the gap by surveying the state of the art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics like SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object and verb detection with template based language models to generate sentences. However, the release of large datasets revealed that these methods can not cope with the diversity in unconstrained open domain videos. Classical approaches were followed by a very short era of statistical methods which were soon replaced with deep learning, the current state of the art in video description. Our survey shows that despite the fast-paced developments, video description research is still in its infancy due to the following reasons. Analysis of video description models is challenging because it is difficult to ascertain the contributions, towards accuracy or errors, of the visual features and the adopted language model in the final description. Existing datasets neither contain adequate visual diversity nor complexity of linguistic structures. Finally, current evaluation metrics ...

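Research Motivation and Objectives

Survey the evolution of video description methods from classical approaches to deep learning.
Compare benchmark datasets in terms of domain, scale, and diversity.
Analyze evaluation metrics and how well they correlate with human judgment.
Identify the current limitations of datasets and metrics, and propose directions for future research.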

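Proposed Method

Categorize video description methods into classical SVO/template-based, statistical, and deep learning approaches.
Describe architectural trends such as CNN-LSTM/GRU encoders, attention, and semantic attributes.
Discuss dataset characteristics and how large-scale open-domain datasets drive method development.
Review the evaluation metrics (BLEU, ROUGE, METEOR, CIDEr, SPICE, WMD) and their agreement with human judgment.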
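To make the classical SVO/template-based category concrete, here is a minimal sketch in Python. It assumes that upstream visual detectors have already produced a subject, verb, and object for the clip; the `SVO` class, the `third_person` helper, and the sentence template are hypothetical names chosen for illustration and are not taken from the surveyed paper.

```python
from dataclasses import dataclass


@dataclass
class SVO:
    """Hypothetical container for detector outputs (subject, verb, object)."""
    subject: str
    verb: str
    obj: str


def third_person(verb: str) -> str:
    """Naive English third-person singular inflection (illustrative only)."""
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"


def describe(triple: SVO) -> str:
    """Fill a fixed sentence template with the detected triple."""
    return f"A {triple.subject} {third_person(triple.verb)} a {triple.obj}."


print(describe(SVO("man", "ride", "horse")))  # A man rides a horse.
```

Because every sentence comes from one fixed template, the output cannot express anything outside the detectors' vocabulary or the template's grammar, which is the limitation the abstract attributes to classical approaches on open-domain videos.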

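Experimental Results

Research Questions

RQ1: What are the main methodological stages in the evolution of video description, and what are their limitations?
RQ2: How do the benchmark datasets differ in content, complexity, and scale?
RQ3: What are the strengths and weaknesses of current video description evaluation metrics?
RQ4: What future directions would address dataset diversity and the agreement of metrics with human judgment?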

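Key Findings

Video description has evolved from template-based methods to deep learning methods, a shift driven by large multimodal datasets.
Open-domain and longer videos exposed a breadth of vocabulary and linguistic complexity that early methods could not handle.
Evaluation metrics differ in what they measure and often do not fully agree with human judgment.
The current metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE, and WMD) each capture a different aspect of description quality and suffer from instability issues.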
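One reason n-gram metrics can diverge from human judgment is visible in clipped n-gram precision, the building block of BLEU. The sketch below is a simplified illustration only: it omits BLEU's brevity penalty and the geometric mean over n-gram orders, uses whitespace tokenization, and the function names and example sentences are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against reference sentences."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


refs = ["a man is riding a horse"]
print(clipped_precision("a man is riding a horse", refs))       # 1.0
print(clipped_precision("a person rides a horse", refs))        # 0.6
print(clipped_precision("a person rides a horse", refs, n=2))   # 0.25
```

The second candidate is a reasonable paraphrase that a human would likely accept, yet it loses credit for every word that differs from the reference wording. Metrics such as METEOR, SPICE, and WMD try to soften this in different ways, which is part of why the scores measure different things.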


This review was generated by AI and reviewed by a human editor.