QUICK REVIEW

[Paper Review] Feature sampling and partitioning for visual vocabulary generation on large action classification datasets

Michael Sapienza, Fabio Cuzzolin|arXiv (Cornell University)|May 29, 2014
Human Pose and Action Recognition | 23 references | 20 citations
TL;DR

This paper presents a systematic evaluation of feature sampling and partitioning strategies for visual vocabulary generation in action recognition, showing that balanced sampling and per-component or per-category vocabulary learning improve performance on large-scale datasets. Using Fisher vectors with optimized sampling and partitioning, the authors report state-of-the-art results on five major benchmarks, including 81.24% accuracy on UCF101 and 65.16% mAP on Hollywood2. The UCF101 accuracy is 37.34 percentage points above the originally reported baseline for that dataset.

ABSTRACT

The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. In this work we provide a critical evaluation of various approaches to building a vocabulary and show that good practises do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.

Motivation & Objective

  • To evaluate the impact of feature sampling and partitioning strategies on visual vocabulary construction for large-scale action classification.
  • To address bias in uniform random sampling that favors longer videos and overrepresented action classes.
  • To investigate whether learning separate visual vocabularies per feature component or per action class improves performance.
  • To determine the optimal configuration of vocabulary size, sampling strategy, and encoding method for state-of-the-art performance on large datasets.
  • To provide a comprehensive empirical evaluation of these design choices on the largest and most challenging action recognition benchmarks available.

Proposed method

  • Proposes a balanced sampling strategy that selects a fixed number of features from each video and action class to prevent bias toward longer or more frequent actions.
  • Introduces partitioning of feature space by learning separate visual vocabularies for each feature component (e.g., trajectory, HOG, HOF) rather than a single joint vocabulary.
  • Applies per-category visual vocabulary learning, where a distinct vocabulary is trained for each action class to better capture class-specific features.
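  • Employs Fisher vector encoding, which by construction relies on a Gaussian mixture model vocabulary rather than the hard k-means assignment of standard BoF, to generate high-dimensional, discriminative video representations from the learned vocabularies.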
  • Uses a global bag-of-features (BoF) and Fisher vector pipeline with optimized hyperparameters, including vocabulary size K and dimensionality D.
  • Implements a systematic experimental protocol across five major datasets, including UCF101, Hollywood2, and HMDB, with multiple train-test splits to ensure robust evaluation.
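
The balanced sampling idea in the list above can be illustrated with a short sketch. This is not the authors' code: the video dictionary layout, the function names, and the choice to split a per-class budget evenly across that class's videos are all assumptions made for illustration. It contrasts pooling every feature before sampling (where long videos dominate) with drawing the same number of features for every class.

```python
import random
from collections import defaultdict


def uniform_random_sample(videos, n_total, seed=0):
    """Pool every feature from every video, then sample from the pool.

    Long videos and frequent classes contribute more features to the
    pool, so they tend to dominate the sample.
    """
    rng = random.Random(seed)
    pool = [(v["label"], f) for v in videos for f in v["features"]]
    return rng.sample(pool, min(n_total, len(pool)))


def balanced_sample(videos, per_class, seed=0):
    """Draw roughly `per_class` features for each class.

    Assumption for this sketch: the class budget is divided evenly
    across the videos of that class, and a video with fewer features
    than its share simply contributes all of them.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for v in videos:
        by_class[v["label"]].append(v)

    sample = []
    for label, vids in sorted(by_class.items()):
        per_video = per_class // len(vids)
        for v in vids:
            k = min(per_video, len(v["features"]))
            sample.extend((label, f) for f in rng.sample(v["features"], k))
    return sample
```

With one long "run" video and two short "jump" videos, the balanced sampler returns the same number of features for both classes, whereas the pooled sampler draws mostly from the long video.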

Experimental results

Research questions

  • RQ1: Does balanced feature sampling (sampling uniformly across videos and action classes) improve performance compared to uniform random sampling on large action recognition datasets?
  • RQ2: What is the impact of learning separate visual vocabularies for different feature components (e.g., HOG, HOF, trajectory) versus a single joint vocabulary?
  • RQ3: How does per-category visual vocabulary learning compare to global BoF or Fisher vector encoding in terms of accuracy and generalization?
  • RQ4: Can small visual vocabularies (e.g., K=128–256) combined with advanced encoding (e.g., Fisher vectors) achieve state-of-the-art performance on large-scale datasets?
  • RQ5: What is the relative contribution of sampling strategy, vocabulary partitioning, and encoding method to overall performance in action classification?
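
To make the small-vocabulary question concrete, the sketch below computes a Fisher vector from an already fitted diagonal-covariance Gaussian mixture, using the standard mean and variance gradient formulas. It is a minimal illustration, not the paper's configuration: the function name and argument layout are hypothetical, the mixture parameters are assumed to be given, and the power and L2 normalisation steps often applied afterwards are omitted. The point it shows is that the output has 2·K·D entries, so even a small K yields a high-dimensional representation.

```python
import math


def fisher_vector(X, weights, means, variances):
    """Fisher vector of descriptors X under a diagonal GMM.

    X: list of N descriptors, each a list of D floats.
    weights: K mixture weights; means, variances: K lists of D floats.
    Returns a list of length 2*K*D (mean gradients, then variance gradients).
    """
    N, K, D = len(X), len(weights), len(means[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_var = [[0.0] * D for _ in range(K)]

    for x in X:
        # Log of weight times Gaussian density for each component.
        log_p = []
        for k in range(K):
            lp = math.log(weights[k])
            for d in range(D):
                var = variances[k][d]
                lp -= 0.5 * (math.log(2 * math.pi * var)
                             + (x[d] - means[k][d]) ** 2 / var)
            log_p.append(lp)
        # Soft assignments (posteriors), computed stably.
        m = max(log_p)
        p = [math.exp(v - m) for v in log_p]
        total = sum(p)
        for k in range(K):
            gamma = p[k] / total
            for d in range(D):
                z = (x[d] - means[k][d]) / math.sqrt(variances[k][d])
                g_mu[k][d] += gamma * z
                g_var[k][d] += gamma * (z * z - 1.0)

    fv = []
    for k in range(K):
        fv += [v / (N * math.sqrt(weights[k])) for v in g_mu[k]]
    for k in range(K):
        fv += [v / (N * math.sqrt(2.0 * weights[k])) for v in g_var[k]]
    return fv
```

A BoF histogram over the same vocabulary would have only K entries, which is one reason Fisher vectors can remain competitive with far fewer visual words.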

Key findings

  • Balanced sampling outperformed uniform random sampling in 53% of experiments, with the best results on Hollywood2 (65.16% mAP) and HMDB (50.17% accuracy) achieved using balanced sampling.
  • Learning separate visual vocabularies per feature component (e.g., trajectory, HOG, HOF) led to significant performance gains, especially when combined with Fisher vector encoding.
  • Per-category visual vocabulary learning outperformed global BoF but was surpassed by Fisher vectors on larger and more complex datasets like UCF101 and HMDB.
  • The proposed method achieved 81.24% accuracy, 82.35% mAP, and 80.57% F1 on UCF101, which is 37.34 percentage points higher than the originally reported result in [12], making it a new state-of-the-art result.
  • The HMDB dataset remained the most challenging, with the largest performance gap between balanced and random sampling, indicating that imbalance has a stronger negative effect on harder datasets.
  • Computational cost was dominated by feature loading from disk (163.52 CPU hours for UCF101), highlighting the importance of efficient I/O in large-scale video analysis.
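
The per-component partitioning mentioned in the findings can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the component slice boundaries in `COMPONENTS` are hypothetical and much smaller than real descriptor sizes, and the plain k-means routine is a stand-in for whatever clustering implementation is actually used. The sketch learns one vocabulary per descriptor component and, for comparison, one joint vocabulary over the concatenated descriptor.

```python
import random


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns k centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return centers


# Hypothetical layout of a concatenated descriptor (toy dimensions).
COMPONENTS = {"trajectory": (0, 4), "hog": (4, 10), "hof": (10, 16)}


def per_component_vocabularies(features, components, k):
    """Learn one k-word vocabulary per descriptor component."""
    return {
        name: kmeans([f[start:end] for f in features], k)
        for name, (start, end) in components.items()
    }


def joint_vocabulary(features, k):
    """Learn a single k-word vocabulary over the full descriptor."""
    return kmeans(features, k)
```

Each component vocabulary lives in that component's own subspace, so a video would be encoded once per component and the resulting representations concatenated, rather than quantising the full descriptor in one step.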


This review was created by AI and reviewed by human editors.