QUICK REVIEW

[Paper Review] Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu, Ryan Kiros | arXiv (Cornell University) | Jun 22, 2015
Multimodal Machine Learning Applications | 34 references | 298 citations
TL;DR

This paper aligns movie shots with the corresponding paragraphs of a book. A contextual CNN scores shot-paragraph similarity from visual, textual, and dialog signals, and a conditional random field (CRF) turns those scores into an alignment. Two results stand out: dialog that closely follows the book text improves alignment, and when the model borrows paragraphs from unrelated books, the matches read as more coherent with a 200-book pool than with a 10-book pool.

ABSTRACT

Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

Motivation & Objective

  • To enable story-like visual explanations by aligning movie shots with corresponding narrative paragraphs from books.
  • To address the challenge of weak visual signals in video by leveraging textual and dialogic content for grounding.
  • To explore whether meaningful cross-book alignments can emerge when the model is forced to select from unrelated books.
  • To evaluate the impact of increasing the number of candidate books on alignment quality and narrative coherence.

Proposed method

  • Uses a conditional random field (CRF) to model sequential dependencies between movie shots and book paragraphs.
  • Employs a contextual CNN to compute similarity scores between video shots and book paragraphs based on visual, textual, and subtitle features.
  • Incorporates dialog transcripts as strong signals for alignment, especially when visual cues are ambiguous.
  • Performs zero-shot alignment by matching shots to paragraphs across a diverse set of books, including non-corresponding ones.
  • Conducts two experiments: 10-book (limited candidate books) and 200-book (broad corpus) settings to assess generalization and coherence.
  • Uses frame-level visual features and subtitle overlaps to refine shot-to-paragraph alignment in the CRF framework.
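
The chain-structured alignment described above can be illustrated with a minimal sketch. It assumes a precomputed shot-by-paragraph similarity matrix (standing in for the contextual CNN scores) and a simple pairwise term that penalizes large jumps between the paragraphs assigned to consecutive shots. The function name `align_shots`, the `jump_penalty` parameter, and the absolute-distance penalty are illustrative assumptions, not the paper's actual potentials or inference procedure.

```python
import numpy as np

def align_shots(similarity, jump_penalty=0.5):
    """Viterbi decoding over a chain: pick one paragraph index per shot.

    similarity[t, p] is an assumed precomputed score for shot t and paragraph p.
    The pairwise term (hypothetical) discourages consecutive shots from mapping
    to paragraphs that are far apart in the book.
    """
    n_shots, n_paras = similarity.shape
    idx = np.arange(n_paras)
    # pairwise[p, q]: score for moving from paragraph p to paragraph q
    pairwise = -jump_penalty * np.abs(idx[:, None] - idx[None, :])

    score = similarity[0].copy()              # best score ending at each paragraph
    back = np.zeros((n_shots, n_paras), dtype=int)
    for t in range(1, n_shots):
        total = score[:, None] + pairwise     # total[p, q]: come from p, land on q
        back[t] = np.argmax(total, axis=0)    # best predecessor for each q
        score = total[back[t], idx] + similarity[t]

    # Trace the best path backwards from the best final paragraph.
    path = [int(np.argmax(score))]
    for t in range(n_shots - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With `jump_penalty=0` the decoder reduces to an independent argmax per shot; a positive penalty lets neighboring shots pull an ambiguous shot toward a nearby paragraph, which is the intuition behind using sequential context in the alignment.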

Experimental results

Research questions

  • RQ1: Can a joint model of visual, textual, and dialogic signals effectively align movie shots with book paragraphs?
  • RQ2: How does dialog fidelity between movie and book improve alignment accuracy when visual features are weak?
  • RQ3: Can the model generate plausible story-like explanations by borrowing paragraphs from unrelated books?
  • RQ4: Does increasing the number of candidate books (from 10 to 200) lead to more coherent and meaningful cross-book alignments?
  • RQ5: What role do contextualized textual features play in disambiguating visual-sentence alignments?

Key findings

  • Dialogs in movies that closely follow book text significantly improve alignment accuracy by grounding visual content.
  • In the 10-book experiment, top-scoring matches from unrelated books still show low similarity, indicating limited coherence without broader context.
  • In the 200-book experiment, the model produces increasingly coherent and story-like alignments, suggesting that a larger book corpus enhances narrative plausibility.
  • The CRF model successfully leverages contextual cues from surrounding paragraphs to improve alignment precision beyond isolated shot-book matches.
  • Visual and subtitle features alone are insufficient for strong alignment; dialogic consistency is a critical signal for grounding.
  • The model demonstrates the ability to generate plausible, story-like explanations by borrowing from a diverse book corpus, even when the source book does not match the movie.
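
The "borrowing" setting in these findings amounts to retrieving the best-scoring paragraph for a shot from a pool that spans many books. A minimal sketch, assuming shots and paragraphs already live in a shared embedding space and using plain cosine similarity as a stand-in for the learned scoring; `top_match`, its arguments, and the toy vectors are hypothetical and only illustrate why a larger pool can yield higher-scoring matches.

```python
import numpy as np

def top_match(shot_vec, paragraph_vecs, book_ids):
    """Return (book id, paragraph index, cosine score) of the best paragraph.

    shot_vec: 1-D embedding of a shot (assumed to share a space with paragraphs).
    paragraph_vecs: 2-D array, one row per candidate paragraph across all books.
    book_ids: list giving the source book of each paragraph row.
    """
    shot = shot_vec / np.linalg.norm(shot_vec)
    paras = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    sims = paras @ shot                      # cosine similarity per paragraph
    best = int(np.argmax(sims))
    return book_ids[best], best, float(sims[best])
```

Adding more candidate rows can only keep or raise the top score, so a 200-book pool has more chances than a 10-book pool to contain a paragraph close to a given shot.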

This review was created by AI and reviewed by human editors.