[Paper Review] On the effectiveness of task granularity for transfer learning
The paper investigates how the level of granularity in a source task (coarse to fine captions) affects the quality of learned features for transfer learning in video understanding, showing that finer-grained tasks yield better transfer performance, and that captioning can serve as an effective source task.
We describe a DNN for video classification and captioning, trained end-to-end, with shared features, to solve tasks at different levels of granularity, exploring the link between granularity in a source task and the quality of learned features for transfer learning. For solving the new task domain in transfer learning, we freeze the trained encoder and fine-tune a neural net on the target domain. We train on the Something-Something dataset with over 220,000 videos, and multiple levels of target granularity, including 50 action groups, 174 fine-grained action categories and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for Something-Something, with impressive fine-grained results. It also yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.
Motivation & Objective
- Investigate the relationship between source-task label granularity and transferable feature quality.
- Develop a unified encoder-decoder model for video classification and captioning with shared representations.
- Evaluate transfer learning from Something-Something features to new domains, including a kitchen-action dataset.
- Introduce 20bn-kitchenware as a transfer-learning benchmark for fine-grained tasks.
Proposed method
- Use a two-channel video encoder (2D spatial CNN and 3D spatiotemporal CNN) feeding into a shared LSTM encoder.
- Jointly train a classification head and a caption decoder using a weighted loss: loss = lambda * classification_loss + (1 - lambda) * captioning_loss.
- Train on four source tasks: coarse-grained action groups, fine-grained action categories, captions with simplified object placeholders, and captions with full object placeholders.
- The caption decoder generates captions conditioned on the encoded video representation; training uses teacher forcing with a fixed caption length (14 words).
- Evaluation includes transfer learning: freeze encoder and train a classifier on target data, comparing features learned under different source granularity levels.
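The weighted objective in the list above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the use of softmax cross-entropy for both terms, the per-token averaging of the captioning loss, the default `lam=0.5`, and the function names `cross_entropy` and `joint_loss` are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example.

    logits: list of unnormalized scores, one per class (or vocabulary token).
    target: index of the correct class (or token).
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def joint_loss(class_logits, class_target, caption_logits, caption_targets, lam=0.5):
    """lam * classification_loss + (1 - lam) * captioning_loss.

    caption_logits: one logit vector per decoding step.
    caption_targets: one target token id per decoding step.
    Averaging the captioning loss over steps is an assumption of this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if not caption_targets or len(caption_logits) != len(caption_targets):
        raise ValueError("need one non-empty logit vector per caption token")
    classification_loss = cross_entropy(class_logits, class_target)
    captioning_loss = sum(
        cross_entropy(step_logits, token)
        for step_logits, token in zip(caption_logits, caption_targets)
    ) / len(caption_targets)
    return lam * classification_loss + (1.0 - lam) * captioning_loss
```

Setting `lam=1.0` reduces the objective to classification only and `lam=0.0` to captioning only, which is one way to compare single-task and joint training under the same code path.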
Experimental results
Research questions
- RQ1: Does training on finer-grained source tasks yield richer features for transfer learning?
- RQ2: How does joint training for classification and captioning compare to single-task training for transfer performance?
- RQ3: What is the impact of different granularity levels (coarse groups, fine-grained actions, simplified captions, full captions) on classification and captioning performance?
- RQ4: How well do Something-Something derived features transfer to a new, fine-grained kitchen-action dataset (20bn-kitchenware)?
Key findings
- Training with more fine-grained tasks tends to produce better features for transfer learning.
- Models trained to jointly perform classification and captioning learn features that transfer better to new tasks.
- For coarse vs fine-grained classification, fine-grained training yielded higher test accuracy (e.g., 50.44% vs 41.7% in the reported setup).
- Captioning as a source task is viable and beneficial; combined captioning and action classification training improves transfer performance.
- The proposed 20bn-kitchenware benchmark shows that Something-Something–pretrained features and temporal models with recurrence outperform baselines when transferring to fine-grained kitchen actions.
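The transfer protocol behind these findings (a frozen encoder with only a classifier trained on the target data) can be sketched in plain Python. Everything here is a toy stand-in rather than the paper's setup: `frozen_encoder` is a hypothetical fixed feature function, the "videos" are short lists of numbers, and the target classifier is assumed to be a single linear softmax layer trained with SGD.

```python
import math


def frozen_encoder(video):
    """Hypothetical stand-in for a pretrained encoder; it has no trainable
    parameters, so nothing in it changes during target-task training."""
    return [sum(video) / len(video), max(video) - min(video)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(b))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only the classifier weights (W, b) on top of fixed features."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(W, b, x))
            for c in range(num_classes):
                # Gradient of softmax cross-entropy w.r.t. the class-c score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy target task: label 0 = static clip, label 1 = clip with motion.
videos = [
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.9, 0.1, 0.9],
]
labels = [0, 0, 1, 1]

features = [frozen_encoder(v) for v in videos]  # computed once, never updated
W, b = train_linear_probe(features, labels, num_classes=2)
predictions = [predict(W, b, x) for x in features]
```

Comparing source granularities under this protocol amounts to swapping in encoders pretrained on different source tasks while keeping the target-side training procedure identical.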
This review was created by AI and reviewed by human editors.