QUICK REVIEW

[Paper Review] Feature sampling and partitioning for visual vocabulary generation on large action classification datasets

Michael Sapienza, Fabio Cuzzolin|arXiv (Cornell University)|May 29, 2014

Human Pose and Action Recognition|23 references|20 citations

TL;DR

This paper presents a systematic evaluation of feature sampling and partitioning strategies for visual vocabulary generation in action recognition, showing that balanced sampling and per-component or per-category vocabulary learning can significantly improve performance on large-scale datasets. Using Fisher vectors with optimized sampling and partitioning, the authors report state-of-the-art results on five major benchmarks, including 81.24% accuracy on UCF101 and 65.16% mAP on Hollywood2, exceeding previously reported results by up to 37.34 percentage points in accuracy.

ABSTRACT

The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. In this work we provide a critical evaluation of various approaches to building a vocabulary and show that good practises do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.

Motivation & Objective

To evaluate the impact of feature sampling and partitioning strategies on visual vocabulary construction for large-scale action classification.
To address bias in uniform random sampling that favors longer videos and overrepresented action classes.
To investigate whether learning separate visual vocabularies per feature component or per action class improves performance.
To determine the optimal configuration of vocabulary size, sampling strategy, and encoding method for state-of-the-art performance on large datasets.
To provide a comprehensive empirical evaluation of these design choices on the largest and most challenging action recognition benchmarks available.

Proposed method

Proposes a balanced sampling strategy that selects a fixed number of features from each video and action class to prevent bias toward longer or more frequent actions.
Introduces partitioning of feature space by learning separate visual vocabularies for each feature component (e.g., trajectory, HOG, HOF) rather than a single joint vocabulary.
Applies per-category visual vocabulary learning, where a distinct vocabulary is trained for each action class to better capture class-specific features.
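Employs Fisher vector encoding to generate high-dimensional, discriminative video representations from the learned vocabularies; Fisher vectors are built on a Gaussian mixture model vocabulary, while k-means clustering provides the codebook for standard BoF histograms.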
Uses a global bag-of-features (BoF) and Fisher vector pipeline with optimized hyperparameters, including vocabulary size K and dimensionality D.
Implements a systematic experimental protocol across five major datasets (including UCF101, Hollywood2, and HMDB) with multiple train-test splits to ensure robust evaluation.
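
The balanced sampling idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the even split of each class quota over its videos, and the toy data are all assumptions made for illustration. It contrasts pooling every feature and sampling uniformly with drawing a fixed budget per class.

```python
import random
from collections import defaultdict

def uniform_sample(videos, n_total, rng):
    """Pool every feature from every video, then draw n_total at random."""
    pooled = [f for feats, _label in videos for f in feats]
    return rng.sample(pooled, min(n_total, len(pooled)))

def balanced_sample(videos, n_per_class, rng):
    """Draw roughly n_per_class features per class, split evenly over its videos."""
    by_class = defaultdict(list)
    for feats, label in videos:
        by_class[label].append(feats)
    picked = []
    for label in sorted(by_class):
        class_videos = by_class[label]
        # Assumed rule: every video of a class gets the same share of the quota.
        quota = max(1, n_per_class // len(class_videos))
        for feats in class_videos:
            picked.extend(rng.sample(feats, min(quota, len(feats))))
    return picked

# Toy data: each feature is a (label, value) pair so the class mix is easy to count.
rng = random.Random(0)
videos = [
    ([("run", i) for i in range(1000)], "run"),       # one long video
    ([("jump", i) for i in range(50)], "jump"),       # two short videos
    ([("jump", i) for i in range(50, 100)], "jump"),
]
uniform = uniform_sample(videos, 80, rng)
balanced = balanced_sample(videos, 40, rng)
run_share_uniform = sum(1 for lab, _ in uniform if lab == "run") / len(uniform)
run_share_balanced = sum(1 for lab, _ in balanced if lab == "run") / len(balanced)
```

In this toy setup the long "run" video dominates the uniform sample, while the balanced sample gives both classes the same number of features regardless of video length.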

Experimental results

Research questions

RQ1: Does balanced feature sampling (sampling uniformly across videos and action classes) improve performance compared to uniform random sampling on large action recognition datasets?
RQ2: What is the impact of learning separate visual vocabularies for different feature components (e.g., HOG, HOF, trajectory) versus a single joint vocabulary?
RQ3: How does per-category visual vocabulary learning compare to global BoF or Fisher vector encoding in terms of accuracy and generalization?
RQ4: Can small visual vocabularies (e.g., K=128–256) combined with advanced encoding (e.g., Fisher vectors) achieve state-of-the-art performance on large-scale datasets?
RQ5: What is the relative contribution of sampling strategy, vocabulary partitioning, and encoding method to overall performance in action classification?
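
As background for RQ4, it may help to compare representation sizes. Assuming the commonly used Fisher vector form (gradients with respect to the means and variances of a K-component, diagonal-covariance Gaussian mixture over D-dimensional descriptors), the encoding has 2KD dimensions, whereas a BoF histogram has only K. The values of K and D below are illustrative and are not taken from the paper.

```latex
\dim(\text{BoF}) = K, \qquad \dim(\text{FV}) = 2KD
% Illustrative values only: K = 256, D = 64
\dim(\text{FV}) = 2 \times 256 \times 64 = 32768
```

This is one reason a relatively small vocabulary can still yield a very high-dimensional, discriminative representation under Fisher vector encoding.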

Key findings

Balanced sampling outperformed uniform random sampling in 53% of experiments, with the best results on Hollywood2 (65.16% mAP) and HMDB (50.17% accuracy) achieved using balanced sampling.
Learning separate visual vocabularies per feature component (e.g., trajectory, HOG, HOF) led to significant performance gains, especially when combined with Fisher vector encoding.
Per-category visual vocabulary learning outperformed global BoF but was surpassed by Fisher vectors on larger and more complex datasets like UCF101 and HMDB.
The proposed method achieved 81.24% accuracy, 82.35% mAP, and 80.57% F1 on UCF101, 37.34 percentage points above the originally reported result in [12], setting a new state of the art.
The HMDB dataset remained the most challenging, with the largest performance gap between balanced and random sampling, indicating that imbalance has a stronger negative effect on harder datasets.
Computational cost was dominated by feature loading from disk (163.52 CPU hours for UCF101), highlighting the importance of efficient I/O in large-scale video analysis.
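
The per-component partitioning described in the findings can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the column layout in `COMPONENTS` is hypothetical, a plain k-means stands in for vocabulary learning, and hard-assignment BoF histograms stand in for Fisher vectors to keep the example short.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centres):
    """Index of the centre closest to point."""
    return min(range(len(centres)), key=lambda j: sq_dist(point, centres[j]))

def kmeans(points, k, rng, iters=10):
    """Plain Lloyd's algorithm; returns k centres (the visual words)."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(g) for col in zip(*g)]
    return centres

# Hypothetical layout: which columns of a descriptor belong to which component.
COMPONENTS = {"trajectory": slice(0, 2), "hog": slice(2, 5), "hof": slice(5, 8)}

def learn_vocabularies(descriptors, k, rng):
    """One vocabulary per feature component instead of one joint vocabulary."""
    return {name: kmeans([d[cols] for d in descriptors], k, rng)
            for name, cols in COMPONENTS.items()}

def encode(video_descriptors, vocabularies):
    """Concatenate one normalised word histogram per component."""
    encoded = []
    for name, cols in COMPONENTS.items():
        centres = vocabularies[name]
        hist = [0.0] * len(centres)
        for d in video_descriptors:
            hist[nearest(d[cols], centres)] += 1.0
        total = sum(hist)
        encoded.extend(h / total for h in hist)
    return encoded

# Toy data: 200 random 8-dimensional descriptors, the first 30 treated as one video.
rng = random.Random(0)
training = [[rng.random() for _ in range(8)] for _ in range(200)]
vocabularies = learn_vocabularies(training, k=4, rng=rng)
video_vector = encode(training[:30], vocabularies)
```

Each component is clustered in its own subspace, so the final video vector is the concatenation of three 4-bin histograms rather than a single histogram over the joint 8-dimensional space.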

This review was created by AI and reviewed by human editors.