QUICK REVIEW

[Paper Digest] Video Description: A Survey of Methods, Datasets and Evaluation Metrics

Nayyer Aafaq, Ajmal Mian|UWA Profiles and Research Repository (UWA)|Jun 1, 2018

Multimodal Machine Learning Applications|References: 38|Cited by: 95

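One-Sentence Summary

A comprehensive survey of video description research that traces classical, statistical, and deep learning methods; compares datasets and evaluation metrics; and discusses open challenges and future directions.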

ABSTRACT

Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, helping the visually impaired, and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets and evaluation metrics have been proposed in the literature, calling for a comprehensive survey to focus research efforts in this flourishing new direction. This paper fills the gap by surveying the state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics like SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object and verb detection with template-based language models to generate sentences. However, the release of large datasets revealed that these methods cannot cope with the diversity in unconstrained open domain videos. Classical approaches were followed by a very short era of statistical methods which were soon replaced with deep learning, the current state of the art in video description. Our survey shows that despite the fast-paced developments, video description research is still in its infancy due to the following reasons. Analysis of video description models is challenging because it is difficult to ascertain the contributions, towards accuracy or errors, of the visual features and the adopted language model in the final description. Existing datasets neither contain adequate visual diversity nor complexity of linguistic structures. Finally, current evaluation metrics ...

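Motivation and Objectives

Survey how video description methods evolved from classical approaches to deep learning.
Compare benchmark datasets in terms of domain, size, and diversity.
Analyze evaluation metrics and how well they agree or correlate with human evaluation.
Identify current limitations of datasets and metrics, and propose directions for future research.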

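Approach

Groups video description methods into classical SVO/template-driven, statistical, and deep learning approaches.
Describes architectural trends such as CNN-LSTM/GRU encoders, attention mechanisms, and semantic attributes.
Discusses dataset characteristics and how large-scale open-domain datasets have driven the development of methods.
Reviews evaluation metrics (BLEU, ROUGE, METEOR, CIDEr, SPICE, WMD) and their agreement with human judgment.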
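
To make the classical SVO/template-driven category concrete, the following is a minimal sketch of template filling, assuming that separate visual detectors have already produced a subject, a verb, and an optional object as plain strings. The function names `with_article` and `describe`, the single fixed template, and the vowel-based article rule are illustrative assumptions, not the method of any specific system covered by the survey.

```python
def with_article(noun):
    """Prefix a noun with 'a' or 'an' using a simple first-letter heuristic."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def describe(subject, verb_gerund, obj=None):
    """Fill a fixed '<subject> is <verb>ing <object>' template.

    The inputs stand in for detector outputs (hypothetical); the verb is
    expected in its '-ing' form so that no morphology handling is needed.
    """
    words = [with_article(subject), "is", verb_gerund]
    if obj is not None:
        words.append(with_article(obj))
    sentence = " ".join(words) + "."
    # Capitalize only the first character, leaving the rest untouched.
    return sentence[0].upper() + sentence[1:]
```

For example, `describe("man", "slicing", "onion")` returns "A man is slicing an onion." Because every sentence comes from the same rigid pattern, a sketch like this also illustrates why template-based approaches struggle with the linguistic diversity of open-domain videos.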

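Results

Research Questions

RQ1: What are the main methodological stages in the evolution of video description, and what are their limitations?
RQ2: How do benchmark datasets for video description differ in content, complexity, and size?
RQ3: What are the strengths and weaknesses of current video description evaluation metrics?
RQ4: Which future directions could address dataset diversity and the agreement between evaluation metrics and human judgment?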

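Main Findings

Video description has moved from template-based methods to deep learning methods supported by large multimodal datasets.
Open-domain and longer videos expose vocabulary and linguistic complexity that earlier methods struggle to handle.
Evaluation metrics differ in what they measure and often do not fully agree with human judgment.
Current metrics such as BLEU, METEOR, ROUGE, CIDEr, SPICE, and WMD each capture different aspects of description quality, and their scores can be unstable.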
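
One way to see why an n-gram overlap metric can diverge from human judgment is to compute the clipped (modified) n-gram precision that BLEU is built on. The sketch below is a simplified illustration, assuming whitespace tokenization and omitting BLEU's brevity penalty and geometric averaging over n-gram orders; the function names are hypothetical and this is not the evaluation code used in the survey.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference sentences.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram appears in any single reference.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    max_ref_counts = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

With the single reference "a man is slicing a tomato", the candidate "someone is cutting a vegetable" gets a unigram precision of 0.4 and a bigram precision of 0.0, even though a human reader would likely accept it as a reasonable description of the same scene. Metrics such as METEOR, SPICE, and WMD were introduced partly to reduce this sensitivity to exact wording.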


This digest was generated by AI and reviewed by a human editor.