QUICK REVIEW

[Paper Review] Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects

Adam R. Kosiorek, Hyunjik Kim | arXiv (Cornell University) | Jun 5, 2018
Generative Adversarial Networks and Image Synthesis | 94 citations
TL;DR

SQAIR extends AIR to video by incorporating a spatio-temporal state-space model, enabling unsupervised discovery, tracking, and generation of moving objects across frames, with improved handling of occlusion and overlap.

ABSTRACT

We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioning on the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.

Motivation & Objective

  • Motivate learning interpretable, temporally consistent object representations in video without supervision.
  • Extend the AIR framework to sequences to model object persistence, appearance, and motion across frames.
  • Develop a discovery-propagation inference mechanism to track and manage objects as they enter, persist, or disappear from the scene.
  • Demonstrate improved object counting, reconstruction, and downstream task usefulness on synthetic and real-world data.

Proposed method

  • Extend AIR to a sequential, probabilistic model with Discovery and Propagation components.
  • Model objects with z^what, z^where, and z^pres across time, using a propagation prior for existing objects and a discovery prior for new ones.
  • Use a temporal RNN and a relation RNN to implement explaining away and capture object interactions over time.
  • Train with an importance-weighted auto-encoder (IWAE) objective and use the VIMCO gradient estimator for discrete variables.
  • Provide two architectures (MLP-SQAIR and CONV-SQAIR) and compare against AIR and VRNN baselines.
  • Maintain interpretability by explicitly encoding presence, location, and appearance of objects through time.
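
The propagation and discovery steps listed above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `ObjectLatent` container, the function names, the four-number layout of `z_where`, and the fixed probabilities are hypothetical, and plain random sampling stands in for the learned priors and RNNs of the actual model. The sketch only shows the bookkeeping: objects from the previous frame are propagated first (and may disappear), then discovery proposes new objects given what is already explained.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectLatent:
    """Hypothetical container for one object's latents at a single time step."""
    obj_id: int
    z_pres: int                                  # 1 if the object is present
    z_where: Tuple[float, float, float, float]   # illustrative (scale_x, scale_y, x, y)
    z_what: List[float]                          # appearance code


def propagate(prev, rng, keep_prob=0.9):
    """Propagation: update objects carried over from the previous frame."""
    survivors = []
    for obj in prev:
        if rng.random() >= keep_prob:
            continue  # object leaves the scene (its z_pres would be 0)
        sx, sy, x, y = obj.z_where
        # Random-walk motion stands in for a learned propagation prior.
        moved = (sx, sy, x + rng.gauss(0, 0.05), y + rng.gauss(0, 0.05))
        survivors.append(ObjectLatent(obj.obj_id, 1, moved, obj.z_what))
    return survivors


def discover(existing, next_id, rng, max_objects=4, new_prob=0.3, what_dim=3):
    """Discovery: propose new objects, limited by how many already exist."""
    found = []
    while len(existing) + len(found) < max_objects and rng.random() < new_prob:
        where = (1.0, 1.0, rng.uniform(-1, 1), rng.uniform(-1, 1))
        what = [rng.gauss(0, 1) for _ in range(what_dim)]
        found.append(ObjectLatent(next_id + len(found), 1, where, what))
    return found


def rollout(num_frames, seed=0):
    """Sample per-frame object sets: propagate first, then discover."""
    rng = random.Random(seed)
    frames, objects, next_id = [], [], 0
    for _ in range(num_frames):
        objects = propagate(objects, rng)
        new = discover(objects, next_id, rng)
        next_id += len(new)
        objects = objects + new
        frames.append(objects)
    return frames
```

Calling `rollout(10)` returns one list of `ObjectLatent` per frame. Because each object keeps its `obj_id` and `z_what` while only `z_where` changes, identity and appearance persist across frames, which is the property the review attributes to the temporally consistent latents.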

Experimental results

Research questions

  • RQ1: Can SQAIR reliably discover, track, and interpret objects in video sequences without supervision?
  • RQ2: Does incorporating temporal consistency improve object counting, appearance preservation, and future-frame generation compared to frame-wise AIR?
  • RQ3: How does SQAIR perform on synthetic moving MNIST data versus real CCTV pedestrian data in terms of likelihood, reconstruction, and latent interpretability?
  • RQ4: What is the impact of temporal propagation and discovery on handling occlusion and object overlap?

Key findings

  • SQAIR achieves a higher marginal log-likelihood (IWAE bound) than the baselines on moving MNIST: CONV-SQAIR reaches log p_theta(x_{1:T}) = 6784.8 and log p_theta(x_{1:T} | z_{1:T}) = 6923.8, with a KL of 134.6, counting accuracy of 0.9974, and addition accuracy of 0.9990.
  • MLP-SQAIR and CONV-SQAIR substantially outperform the AIR and VRNN baselines on both likelihood and reconstruction metrics, with CONV-SQAIR achieving the best overall scores.
  • SQAIR reduces KL divergence compared to VRNN and AIR, indicating better compressibility via temporally coherent object representations.
  • SQAIR can perform conditional generation, producing plausible future frames conditioned on initial frames while preserving appearance and motion through time.
  • On real CCTV data, SQAIR learns to detect and track pedestrians without supervision, with reasonable qualitative reconstruction and conditional generation results, though object counting remains challenging with smaller datasets.
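
For readers unfamiliar with the IWAE bound that the likelihood figures above refer to, here is a minimal sketch of how such a bound is estimated from K importance samples. It is not the paper's code: the function names and the one-dimensional Gaussian toy model are assumptions chosen so the example runs with the standard library alone. The bound is the log of the average importance weight p(x, z) / q(z), computed in log space for numerical stability.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)


def log_mean_exp(log_weights):
    """Numerically stable log of the mean of exp(log_weights)."""
    m = max(log_weights)
    total = sum(math.exp(w - m) for w in log_weights)
    return m + math.log(total / len(log_weights))


def iwae_bound(log_joint, log_q, sample_q, k, rng):
    """Single-draw estimate of the K-sample importance-weighted bound."""
    log_w = []
    for _ in range(k):
        z = sample_q(rng)
        log_w.append(log_joint(z) - log_q(z))  # log importance weight
    return log_mean_exp(log_w)


def toy_example(x=1.5, k=5, seed=0):
    """Toy model (an assumption): z ~ N(0, 1), x | z ~ N(z, 1).

    The exact posterior is N(x / 2, 1 / 2) and the exact marginal is N(0, 2),
    so using the posterior as q makes every importance weight equal to p(x).
    """
    rng = random.Random(seed)
    log_joint = lambda z: log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = lambda z: log_normal(z, x / 2, 0.5)
    sample_q = lambda r: r.gauss(x / 2, math.sqrt(0.5))
    bound = iwae_bound(log_joint, log_q, sample_q, k, rng)
    exact = log_normal(x, 0.0, 2.0)
    return bound, exact
```

With the exact posterior as the proposal, `toy_example()` returns a bound that matches the exact log marginal up to floating-point error. With an approximate proposal the bound is looser, and it tightens in expectation as K grows.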


This review was created by AI and reviewed by human editors.