[Paper Review] Feature sampling and partitioning for visual vocabulary generation on large action classification datasets
This paper presents a systematic evaluation of feature sampling and partitioning strategies for visual vocabulary generation in action recognition, demonstrating that balanced sampling and per-component or per-category vocabulary learning significantly improve performance on large-scale datasets. Using Fisher vectors with optimized sampling and partitioning, the authors achieve state-of-the-art results on five major benchmarks, including 81.24% accuracy on UCF101 and 65.16% mAP on Hollywood2, improving on previously reported accuracy by up to 37.34 percentage points.
The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. In this work we provide a critical evaluation of various approaches to building a vocabulary and show that good practices do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.
Motivation & Objective
- To evaluate the impact of feature sampling and partitioning strategies on visual vocabulary construction for large-scale action classification.
- To address bias in uniform random sampling that favors longer videos and overrepresented action classes.
- To investigate whether learning separate visual vocabularies per feature component or per action class improves performance.
- To determine the optimal configuration of vocabulary size, sampling strategy, and encoding method for state-of-the-art performance on large datasets.
- To provide a comprehensive empirical evaluation of these design choices on the largest and most challenging action recognition benchmarks available.
Proposed method
- Proposes a balanced sampling strategy that selects a fixed number of features from each video and action class to prevent bias toward longer or more frequent actions.
- Introduces partitioning of feature space by learning separate visual vocabularies for each feature component (e.g., trajectory, HOG, HOF) rather than a single joint vocabulary.
- Applies per-category visual vocabulary learning, where a distinct vocabulary is trained for each action class to better capture class-specific features.
- Employs k-means clustering for BoF vocabularies and Fisher vector encoding (conventionally built on a Gaussian mixture model) to generate high-dimensional, discriminative video representations from the learned vocabularies.
- Uses a global bag-of-features (BoF) and Fisher vector pipeline with optimized hyperparameters, including vocabulary size K and dimensionality D.
- Implements a systematic experimental protocol across five major datasets (including UCF101, Hollywood2, and HMDB) with multiple train-test splits to ensure robust evaluation.
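The balanced sampling idea described in the list above can be illustrated with a minimal sketch. It assumes a hypothetical data layout (a dict mapping each video id to its class label and feature list) and an even split of a per-class budget across that class's videos; the function name `balanced_sample` and its parameters are illustrative and not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_sample(videos, per_class_budget, seed=0):
    """Draw about `per_class_budget` features per class, split evenly over its videos.

    `videos` maps a video id to a (class_label, list_of_features) pair.
    The data layout and the even split are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group video ids by action class.
    by_class = defaultdict(list)
    for video_id, (label, _) in videos.items():
        by_class[label].append(video_id)

    sampled = []
    for label in sorted(by_class):
        video_ids = sorted(by_class[label])
        # Every video in a class gets the same quota, regardless of its length.
        quota = max(1, per_class_budget // len(video_ids))
        for video_id in video_ids:
            features = videos[video_id][1]
            take = min(quota, len(features))  # short videos contribute what they have
            for feat in rng.sample(features, take):
                sampled.append((label, video_id, feat))
    return sampled


# Toy data: one long "run" video, one short "run" video, and a single "jump" video.
videos = {
    "v1": ("run", list(range(1000))),
    "v2": ("run", list(range(10))),
    "v3": ("jump", list(range(50))),
}
sample = balanced_sample(videos, per_class_budget=20)
counts = defaultdict(int)
for label, video_id, _ in sample:
    counts[video_id] += 1
```

Under uniform random sampling the 1000-feature video would dominate the pool; here each class contributes the same number of features and the long video contributes no more than the short one.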
Experimental results
Research questions
- RQ1: Does balanced feature sampling (sampling uniformly across videos and action classes) improve performance compared to uniform random sampling on large action recognition datasets?
- RQ2: What is the impact of learning separate visual vocabularies for different feature components (e.g., HOG, HOF, trajectory) versus a single joint vocabulary?
- RQ3: How does per-category visual vocabulary learning compare to global BoF or Fisher vector encoding in terms of accuracy and generalization?
- RQ4: Can small visual vocabularies (e.g., K=128–256) combined with advanced encoding (e.g., Fisher vectors) achieve state-of-the-art performance on large-scale datasets?
- RQ5: What is the relative contribution of sampling strategy, vocabulary partitioning, and encoding method to overall performance in action classification?
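To make the per-component question concrete, the following sketch encodes a video with one BoF histogram per feature component and concatenates the results. The descriptor layout in `COMPONENT_SLICES`, the toy two-word vocabularies, and the function names are all assumptions for illustration; the paper's actual descriptor dimensions and encoding details may differ.

```python
import math

# Hypothetical layout: which slice of each descriptor belongs to which component.
COMPONENT_SLICES = {
    "trajectory": slice(0, 2),
    "hog": slice(2, 4),
    "hof": slice(4, 6),
}


def nearest(words, vec):
    """Index of the visual word closest to `vec` in Euclidean distance."""
    return min(range(len(words)), key=lambda i: math.dist(words[i], vec))


def encode_per_component(descriptors, vocabularies):
    """Concatenate one L1-normalised BoF histogram per feature component."""
    encoding = []
    for name, sl in COMPONENT_SLICES.items():
        words = vocabularies[name]
        hist = [0.0] * len(words)
        for desc in descriptors:
            # Quantise only this component's part of the descriptor.
            hist[nearest(words, desc[sl])] += 1.0
        total = sum(hist) or 1.0
        encoding.extend(h / total for h in hist)
    return encoding


# Toy vocabularies: two words per component (real vocabularies would be learned).
vocabularies = {
    "trajectory": [[0.0, 0.0], [1.0, 1.0]],
    "hog": [[0.0, 0.0], [1.0, 1.0]],
    "hof": [[0.0, 0.0], [1.0, 1.0]],
}
descriptors = [
    [0.1, 0.0, 0.9, 1.0, 0.0, 0.1],
    [0.9, 1.0, 1.0, 0.8, 0.1, 0.0],
]
video_encoding = encode_per_component(descriptors, vocabularies)
```

A joint vocabulary would instead quantise the full six-dimensional descriptor against a single word list, so one component with a large numeric range could dominate the distance computation.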
Key findings
- Balanced sampling outperformed uniform random sampling in 53% of experiments, with the best results on Hollywood2 (65.16% mAP) and HMDB (50.17% accuracy) achieved using balanced sampling.
- Learning separate visual vocabularies per feature component (e.g., trajectory, HOG, HOF) led to significant performance gains, especially when combined with Fisher vector encoding.
- Per-category visual vocabulary learning outperformed global BoF but was surpassed by Fisher vectors on larger and more complex datasets like UCF101 and HMDB.
- The proposed method achieved 81.24% accuracy, 82.35% mAP, and 80.57% F1 on UCF101, which is 37.34 percentage points above the accuracy originally reported in [12], making it a new state of the art.
- The HMDB dataset remained the most challenging, with the largest performance gap between balanced and random sampling, indicating that imbalance has a stronger negative effect on harder datasets.
- Computational cost was dominated by feature loading from disk (163.52 CPU hours for UCF101), highlighting the importance of efficient I/O in large-scale video analysis.
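One plausible reading of the per-category vocabulary learning mentioned in the findings above is to cluster each action class's features separately and concatenate the resulting words. The sketch below follows that reading with a tiny Lloyd's k-means written for illustration; the initialisation from the first k points, the helper names, and the toy data are assumptions, not the authors' implementation.

```python
import math


def kmeans(points, k, iters=10):
    """Tiny Lloyd's k-means; centres start at the first k points (illustrative choice)."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(centres[i], p))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def per_category_vocabulary(features_by_class, words_per_class):
    """Learn `words_per_class` words from each class, then concatenate them."""
    vocabulary = []
    for label in sorted(features_by_class):
        vocabulary.extend(kmeans(features_by_class[label], words_per_class))
    return vocabulary


# Toy 2-D features for two hypothetical classes.
features_by_class = {
    "run": [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]],
    "jump": [[5.0, 5.0], [-5.0, -5.0], [5.0, 7.0], [-5.0, -7.0]],
}
vocabulary = per_category_vocabulary(features_by_class, words_per_class=2)
```

With this construction the total vocabulary size grows linearly with the number of classes, which is one reason such a scheme may scale less gracefully to datasets with many categories.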
This review was created by AI and reviewed by human editors.