QUICK REVIEW

[Paper Review] Weakly Supervised Action Labeling in Videos Under Ordering Constraints

Piotr Bojanowski, Rémi Lajugie|arXiv (Cornell University)|Jul 4, 2014

Human Pose and Action Recognition|1 reference|44 citations

TL;DR

This paper proposes a weakly supervised method for temporal action localization in videos using only action order constraints from script-like annotations. By jointly learning action classifiers and assigning labels to video segments under temporal ordering constraints, the approach achieves state-of-the-art performance on a large-scale Hollywood video dataset, outperforming fully supervised baselines when only 25% of data is fully annotated.

ABSTRACT

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with ordering constraints. Each video clip is divided into small time intervals and each time interval of each video clip is assigned one action label, while respecting the order in which the action labels appear in the given annotations. We show that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner. We evaluate the proposed model on a new and challenging dataset of 937 video clips with a total of 787,720 frames containing sequences of 16 different actions from 69 Hollywood movies.
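To make the ordering constraint concrete, here is a minimal sketch in Python of what "respecting the order" could mean for one clip. The function name `respects_order` and the toy label sequences are illustrative assumptions, not taken from the paper; the sketch also assumes that the same action never appears twice in a row in the annotation.

```python
from itertools import groupby


def respects_order(interval_labels, ordered_actions):
    """Return True if the per-interval labels follow the annotated order.

    Hypothetical helper: collapsing runs of identical consecutive labels
    must reproduce the annotated action list exactly.
    """
    collapsed = [label for label, _ in groupby(interval_labels)]
    return collapsed == list(ordered_actions)


# Ordered annotation for one clip, using the abstract's example actions.
annotation = ["walk", "sit", "answer phone"]

# Six time intervals, one label each.
valid = ["walk", "walk", "sit", "sit", "sit", "answer phone"]
invalid = ["sit", "walk", "walk", "answer phone", "answer phone", "answer phone"]

print(respects_order(valid, annotation))    # True
print(respects_order(invalid, annotation))  # False
```

Under this reading, the weak supervision fixes which actions occur and in what order, and leaves only the boundaries between them to be estimated.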

Motivation & Objective

Address the challenge of temporal action localization in videos with minimal human annotation, leveraging only action order information from movie scripts.
Overcome the limitations of fully supervised methods that require expensive time-stamped annotations by utilizing weakly supervised signals.
Formulate the action labeling problem as a joint optimization over action classifiers and temporal assignments under ordering constraints.
Demonstrate that temporal ordering constraints significantly improve model performance, even when full supervision is limited.
Evaluate the method on a large, realistic dataset of 937 Hollywood video clips with 16 actions and 787,720 frames, showing strong generalization under weak supervision.

Proposed method

Model each video clip as a sequence of small time intervals, assigning one action label per interval while respecting the order of actions in the script.
Formulate the learning problem as a discriminative optimization that jointly learns action classifiers and assigns labels under temporal ordering constraints.
Use a convex optimization framework based on the Frank-Wolfe algorithm to minimize a cost function that enforces correct action order and improves classifier discriminability.
Incorporate both weak supervision (action order) and, optionally, partial full supervision (time-stamped annotations) in a semi-supervised setting.
Estimate implicit action classifiers from the optimal assignment matrix using a closed-form expression derived from the optimization solution.
Apply a square loss baseline for comparison, which only uses fully annotated data without leveraging ordering constraints.
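The assignment and classifier steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `best_ordered_assignment` is a hypothetical dynamic program that picks the cheapest order-respecting labeling for a given per-interval cost matrix, and `ridge_classifier` assumes a regularized square loss, for which the minimizer has the familiar ridge regression closed form. The regularization weight `lam` and the toy data are made up, and bias terms are ignored.

```python
import numpy as np


def best_ordered_assignment(cost):
    """Cheapest labeling that visits the annotated actions in order.

    cost[t, k] is the (hypothetical) cost of giving interval t the k-th
    annotated action. The returned index sequence starts at action 0,
    ends at action K-1, and moves forward by 0 or 1 at each interval.
    """
    T, K = cost.shape
    if T < K:
        raise ValueError("need at least one interval per annotated action")
    dp = np.full((T, K), np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0, 0] = cost[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]
            advance = dp[t - 1, k - 1] if k > 0 else np.inf
            if advance < stay:
                dp[t, k], back[t, k] = advance + cost[t, k], k - 1
            else:
                dp[t, k], back[t, k] = stay + cost[t, k], k
    # Backtrack from the last action at the last interval.
    path = np.empty(T, dtype=int)
    path[-1] = K - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def ridge_classifier(X, Z, lam=1e-2):
    """Closed-form ridge weights mapping features X (T x d) to labels Z (T x K)."""
    T, d = X.shape
    return np.linalg.solve(X.T @ X + T * lam * np.eye(d), X.T @ Z)


# Toy example: 4 intervals, 3 annotated actions.
cost = np.array([[0., 5., 5.],
                 [0., 5., 5.],
                 [5., 0., 5.],
                 [5., 5., 0.]])
path = best_ordered_assignment(cost)
print(path)  # [0 0 1 2]

# Recover a linear classifier from the assignment (random toy features).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
Z = np.eye(3)[path]          # one-hot assignment matrix
W = ridge_classifier(X, Z)   # shape (6, 3): one weight column per action
```

The dynamic program runs in O(TK) time per clip, which is one reason an order-constrained labeling is cheap to compute once per-interval costs are available.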

Experimental results

Research questions

RQ1: Can temporal ordering constraints from weakly annotated scripts improve action localization and classification in videos without requiring time-stamped annotations?
RQ2: How does the performance of a weakly supervised method that exploits action order compare to fully supervised baselines when only a fraction of data is fully annotated?
RQ3: To what extent do ordering constraints enhance classifier learning when combined with weak supervision?
RQ4: Can a joint optimization of action classifiers and temporal label assignments outperform methods that treat classification and localization separately?
RQ5: Does the proposed method generalize well to complex, real-world video data from Hollywood films with diverse action sequences?

Key findings

The proposed method outperforms the fully supervised baseline (using square loss) when only 25% of the data is fully annotated, demonstrating the value of weak supervision with ordering constraints.
On average, the method achieves higher alignment accuracy than baselines on the most frequent actions such as "Open Door", "Sit Down", and "Stand Up".
In the semi-supervised setting, the model consistently outperforms the supervised baseline (SL) even with limited full annotations, showing that ordering constraints enhance learning efficiency.
The method significantly improves over the Bojanowski et al. baseline, which lacks ordering constraints and performs poorly under weak supervision.
The recovered classifiers achieve higher average precision than both the supervised baseline and the Bojanowski et al. baseline, especially in the weakly supervised regime.
The use of the Frank-Wolfe algorithm enables efficient optimization without projection steps, supporting scalability to large-scale video datasets.
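To illustrate why Frank-Wolfe needs no projection step, here is a generic sketch in Python on a deliberately simple problem: minimizing a quadratic over the probability simplex. The domain, objective, and the name `frank_wolfe_simplex` are assumptions chosen for brevity; in the paper's setting the linear minimization step would presumably run over order-respecting assignments instead of simplex vertices.

```python
import numpy as np


def frank_wolfe_simplex(grad, x0, n_iters=2000):
    """Generic Frank-Wolfe over the probability simplex (illustrative only).

    Each iteration solves a linear problem over the feasible set (here:
    pick the vertex with the smallest gradient entry) and moves toward it
    by a convex combination, so iterates stay feasible without projection.
    """
    x = x0.copy()
    for it in range(n_iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # linear minimization oracle
        gamma = 2.0 / (it + 2.0)       # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x


# Minimize ||x - target||^2 over the simplex; target lies inside it.
target = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda v: 2.0 * (v - target),
                        np.array([1.0, 0.0, 0.0]))
print(x.sum())                      # stays (numerically) equal to 1
print(np.linalg.norm(x - target))   # small after enough iterations
```

The only per-iteration cost beyond the gradient is the linear minimization, which is attractive whenever that subproblem is cheap for the constraint set at hand.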

This review was created by AI and reviewed by human editors.