[Paper Review] A Dataset for Movie Description
This paper introduces a large-scale dataset of 54,000+ temporally aligned sentence-video pairs from 72 full HD movies, drawing on Descriptive Video Service (DVS) transcripts and movie scripts. It finds that DVS descriptions are more accurate and more visually grounded than scripts. In the benchmarks, SMT-based approaches built on semantically parsed labels outperform nearest neighbor baselines, although the human-written sentences in the corpus still rank better than any automatic method.
Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Motivation & Objective
- To create a large-scale, temporally aligned dataset of video descriptions from Descriptive Video Service (DVS), the description track produced so that visually impaired audiences can follow a movie.
- To compare DVS transcripts with movie scripts as sources for video description, assessing their visual accuracy and relevance.
- To evaluate state-of-the-art video description models on this new dataset using semantic parsing and visual features.
- To demonstrate that DVS provides more precise, visually grounded descriptions than pre-production scripts.
- To enable research on long-term semantic dependencies and plot understanding in open-domain video description.
Proposed method
- Transcribed DVS audio from Blu-ray discs using crowd-sourced transcription, aligning descriptions to full HD movie segments.
- Collected and aligned movie scripts from prior work to create a parallel corpus with DVS.
- Applied a semantic parser to extract subject-verb-object-location triplets from DVS and scripts, filtering by minimum frequency (30 or 100 occurrences).
- Used a statistical machine translation (SMT) framework to generate descriptions from visual features and parsed labels.
- Combined visual features (DT, LSDA, PLACES, HYBRID) in a CRF that predicts the intermediate semantic labels, which the SMT step then translates into sentences.
- Evaluated models via human annotation on 250 test snippets, ranking outputs by correctness, grammar, and relevance.
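The minimum-frequency filtering of parsed labels mentioned above can be illustrated with a small sketch. This is not the authors' code: the function name `filter_labels_by_frequency`, the dictionary layout of a parse, and the toy parses are all hypothetical, and counting a label by its text regardless of role is an assumption. The threshold is set to 2 so the toy data shows an effect; the review reports thresholds of 30 and 100.

```python
from collections import Counter


def filter_labels_by_frequency(parsed_sentences, min_count):
    """Drop labels that occur fewer than `min_count` times in the corpus.

    `parsed_sentences` is a list of dicts mapping a role name
    (subject, verb, object, location) to a label string.
    Assumption: a label is counted by its text, regardless of role.
    """
    counts = Counter(
        label for roles in parsed_sentences for label in roles.values()
    )
    vocabulary = {label for label, n in counts.items() if n >= min_count}
    filtered = [
        {role: label for role, label in roles.items() if label in vocabulary}
        for roles in parsed_sentences
    ]
    return filtered, vocabulary


# Toy, made-up parses (not taken from the dataset).
toy_parses = [
    {"subject": "man", "verb": "walk", "location": "street"},
    {"subject": "man", "verb": "open", "object": "door"},
    {"subject": "woman", "verb": "walk"},
]

filtered, vocabulary = filter_labels_by_frequency(toy_parses, min_count=2)
print(sorted(vocabulary))
print(filtered)
```

A lower threshold keeps more, and rarer, labels, which gives the downstream model a richer vocabulary at the cost of fewer training examples per label.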
Experimental results
Research questions
- RQ1: How does the visual grounding of DVS compare to that of movie scripts in terms of accuracy and relevance to the video content?
- RQ2: Can semantic parsing of DVS and script text improve the performance of video description models compared to direct visual feature matching?
- RQ3: What is the relative contribution of different visual features (e.g., LSDA, PLACES, HYBRID) to video description quality on this dataset?
- RQ4: How do SMT-based approaches using parsed labels compare to nearest neighbor baselines and visual word models in generating video descriptions?
- RQ5: To what extent can this dataset support the modeling of long-term semantic dependencies and narrative structure in open-domain video description?
Key findings
- DVS descriptions are significantly more accurate and visually grounded than movie scripts, which often contain pre-production inaccuracies or irrelevant details.
- The HYBRID visual feature combination achieved the best performance among nearest neighbor baselines, outperforming DT, LSDA, and PLACES.
- SMT-based approaches using text labels from the semantic parser outperformed nearest neighbor baselines and visual word models, with the 30-occurrence threshold yielding better results than 100.
- The use of sense labels from word sense disambiguation performed slightly worse than text labels, likely due to errors in WSD.
- The actual DVS and script sentences from the corpus ranked significantly better than any automatic method, confirming their value as strong baselines.
- The dataset enables modeling of narrative structure and long-term dependencies, offering new opportunities beyond existing image and video description datasets.
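For readers unfamiliar with the nearest neighbor baseline referred to in the findings, the idea can be sketched as follows: a test clip receives the sentence of the training clip whose visual feature vector is closest. This is a minimal illustration and not the paper's implementation. The function names, the two-dimensional toy features, and the sentences are hypothetical, and Euclidean distance is an assumption, since the review does not state which distance measure was used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_description(query_feature, train_features, train_sentences):
    """Return the sentence paired with the closest training feature vector.

    Assumption: distance is Euclidean; the original work may use another measure.
    """
    best_index = min(
        range(len(train_features)),
        key=lambda i: euclidean(query_feature, train_features[i]),
    )
    return train_sentences[best_index]


# Toy, made-up features and sentences (not taken from the dataset).
train_features = [[1.0, 0.0], [0.0, 1.0]]
train_sentences = [
    "Someone walks down the street.",
    "Someone opens the door.",
]

result = nearest_neighbor_description([0.9, 0.1], train_features, train_sentences)
print(result)
```

Because the baseline can only return sentences that already exist in the training set, it cannot compose a new description for an unseen combination of subject, action, and location, which is one intuition for why a translation-style model can do better.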
This review was created by AI and reviewed by human editors.