[Paper Review] Describing Common Human Visual Actions in Images
This paper introduces COCO-a, a large-scale, data-driven dataset of 140 visually detectable human actions in monocular images, derived from MS COCO using linguistic analysis of VerbNet and image captions. It provides exhaustive, localized annotations of subjects, objects, actions, postures, emotions, and spatial relationships, enabling robust training and benchmarking for visual scene understanding, action recognition, and image retrieval systems.
Which common human actions and interactions are recognizable in monocular still images? Which involve objects and/or other people? How many is a person performing at a time? We address these questions by exploring the actions and interactions that are detectable in the images of the MS COCO dataset. We make two main contributions. First, a list of 140 common 'visual actions', obtained by analyzing the largest on-line verb lexicon currently available for English (VerbNet) and human sentences used to describe images in MS COCO. Second, a complete set of annotations for those 'visual actions', composed of subject-object and associated verb, which we call COCO-a (a for 'actions'). COCO-a is larger than existing action datasets in terms of number of actions and instances of these actions, and is unique because it is data-driven, rather than experimenter-biased. Other unique features are that it is exhaustive, and that all subjects and objects are localized. A statistical analysis of the accuracy of our annotations and of each action, interaction and subject-object combination is provided.
Motivation & Objective
- To identify and catalog the most common, visually discriminable human actions in everyday images, independent of experimenter bias.
- To create a comprehensive, exhaustive, and localized annotation set of actions, subjects, and objects in the MS COCO dataset.
- To provide a benchmark dataset that supports training and evaluation of visual scene understanding systems, including visual question answering and image retrieval.
- To empirically ground the debate on semantic network representations in scene understanding by using real-world data.
- To explore the frequency, spatial relationships, and contextual cues of human actions and interactions in still images.
Proposed method
- Constructed Visual VerbNet (VVN) by analyzing the largest available English verb lexicon (VerbNet) and human-annotated captions from MS COCO to identify 140 common, visually detectable actions.
- Annotated 10,000 MS COCO images with full subject-object-action triplets, including posture, emotion, and spatial relationships (distance, relative location).
- Kept the action list data-driven, rather than experimenter-biased, by deriving actions from real image descriptions instead of a predefined action list.
- Localized all subjects and objects using pixel-precise segmentation masks from the original MS COCO dataset.
- Used statistical analysis to evaluate annotation accuracy and frequency distribution across actions, interactions, and subject-object pairs.
- Enabled complex querying of rare combinations (e.g., 'cry' + 'sink') to test expressive power and utility for retrieval and learning.
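The annotation structure and the combined queries described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the COCO-a release format: the `Interaction` class, its field names, the label strings, and the `query` helper are all hypothetical, and the sample records are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Interaction:
    """One subject-object-action record (hypothetical schema, not the COCO-a file format)."""
    image_id: int
    subject_id: int                  # instance id of the person performing the actions
    object_id: Optional[int]         # instance id of the target, or None for solo actions
    object_category: Optional[str]   # e.g. "sink" or "person"
    actions: FrozenSet[str]          # visual actions the subject performs on the object
    posture: str = "unknown"
    emotion: str = "neutral"
    distance: str = "unknown"        # assumed labels such as "full contact" or "near"
    location: str = "unknown"        # assumed relative-location labels such as "above"


def query(records, action=None, object_category=None, location=None):
    """Return the records that satisfy every criterion that is not None."""
    hits = []
    for r in records:
        if action is not None and action not in r.actions:
            continue
        if object_category is not None and r.object_category != object_category:
            continue
        if location is not None and r.location != location:
            continue
        hits.append(r)
    return hits


# Invented sample records, for illustration only.
SAMPLE = [
    Interaction(1, 10, 11, "person", frozenset({"accompany", "touch"}),
                posture="standing", distance="full contact", location="in front"),
    Interaction(2, 20, 21, "sink", frozenset({"cry"}),
                posture="standing", emotion="sad", distance="near", location="in front"),
    Interaction(3, 30, 31, "person", frozenset({"fight"}),
                posture="lying", emotion="angry", distance="full contact", location="above"),
]

matches = query(SAMPLE, action="cry", object_category="sink")
print([m.image_id for m in matches])  # prints [2]
```

Storing the actions as a set reflects that one subject can perform several actions on the same object at once; a query for a rare combination then reduces to a conjunction of simple field tests.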
Experimental results
Research questions
- RQ1: Which common human actions and interactions are visually detectable in monocular still images?
- RQ2: What is the frequency and distribution of actions, postures, and spatial relationships in real-world scenes?
- RQ3: How do visual actions involving people, objects, and interactions differ in terms of spatial proximity, posture, and emotional context?
- RQ4: To what extent can linguistic analysis of image captions and verb lexicons help identify a comprehensive, unbiased set of visual actions?
- RQ5: Can a fully annotated, data-driven dataset improve the performance and generalization of visual scene understanding systems?
Key findings
- The study identified 140 common, visually discriminable human actions through linguistic and data-driven analysis, forming the Visual VerbNet (VVN) taxonomy.
- The COCO-a dataset contains 10,000 images with exhaustive annotations of subjects, objects, actions, postures, emotions, and spatial relationships, making it larger and more complete than existing action datasets.
- People most commonly interact with others through actions like 'be in the same group', 'accompany', or 'pose', typically at close distances and in front or side-by-side positions.
- The action 'touch' is most frequently performed with other people, wearable items, or objects in front or below the subject, with high spatial proximity and full or light contact.
- Rare combinations, such as 'fight' + 'above' or 'cry' + 'sink', were successfully retrieved, demonstrating the dataset's utility for complex image retrieval and zero-shot learning.
- Statistical analysis confirmed high annotation accuracy and revealed that actions like 'stand', 'sit', and 'walk' are the most frequent, while rarer actions such as 'kneel' or 'crouch' are underrepresented and may require data augmentation.
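The frequency analysis behind the last finding can be approximated with a simple count over action labels. The sketch below is an assumption-laden illustration: the function names, the `min_count` threshold, and the toy label lists are hypothetical and do not reproduce the paper's statistics.

```python
from collections import Counter


def action_frequencies(annotations):
    """Count how often each action label occurs across all annotated subjects."""
    counts = Counter()
    for actions in annotations:
        counts.update(actions)
    return counts


def underrepresented(counts, min_count):
    """Return, in alphabetical order, the actions seen fewer than min_count times."""
    return sorted(a for a, n in counts.items() if n < min_count)


# Toy data, invented for illustration: one list of action labels per subject.
TOY = [
    ["stand", "touch"],
    ["sit"],
    ["stand", "walk"],
    ["stand"],
    ["sit", "kneel"],
    ["walk"],
    ["crouch"],
]

counts = action_frequencies(TOY)
print(counts.most_common(1))         # prints [('stand', 3)]
print(underrepresented(counts, 2))   # prints ['crouch', 'kneel', 'touch']
```

A threshold of this kind is one way to flag the long tail of rare actions before deciding whether they need additional data.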
This review was created by AI and reviewed by human editors.