QUICK REVIEW

[Paper Review] UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles

Tianjiao Li, Jun Liu|arXiv (Cornell University)|Apr 2, 2021

Human Pose and Action Recognition|40 references|18 citations

TL;DR

This paper introduces UAV-Human, a large-scale, multi-modal benchmark for UAV-based human behavior understanding, comprising 67,428 video sequences and 119 subjects captured across diverse urban and rural environments in day and night conditions. It also proposes a fisheye video action recognition method that learns unbounded transformations guided by flat RGB videos, reaching 23.24% CSv1 accuracy on fisheye data versus 20.76% for the fisheye baseline, which indicates that guided correction helps with severe distortions.

ABSTRACT

Human behavior understanding with unmanned aerial vehicles (UAVs) is of great significance for a wide range of applications, which simultaneously brings an urgent demand of large, challenging, and comprehensive benchmarks for the development and evaluation of UAV-based models. However, existing benchmarks have limitations in terms of the amount of captured data, types of data modalities, categories of provided tasks, and diversities of subjects and environments. Here we propose a new benchmark - UAVHuman - for human behavior understanding with UAVs, which contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. Our dataset was collected by a flying UAV in multiple urban and rural districts in both daytime and nighttime over three months, hence covering extensive diversities w.r.t subjects, backgrounds, illuminations, weathers, occlusions, camera motions, and UAV flying attitudes. Such a comprehensive and challenging benchmark shall be able to promote the research of UAV-based human behavior understanding, including action recognition, pose estimation, re-identification, and attribute recognition. Furthermore, we propose a fisheye-based action recognition method that mitigates the distortions in fisheye videos via learning unbounded transformations guided by flat RGB videos. Experiments show the efficacy of our method on the UAV-Human dataset. The project page: https://github.com/SUTDCV/UAV-Human

Motivation & Objective

To address the lack of large-scale, comprehensive, and diverse benchmarks for UAV-based human behavior understanding.
To collect multi-modal data (RGB, fisheye, IR, night-vision) across varied environments, times, and UAV flight dynamics to reflect real-world complexity.
To develop a robust method for action recognition in highly distorted fisheye video by learning unbounded transformations guided by undistorted RGB videos.
To evaluate state-of-the-art models across multiple tasks: action recognition, pose estimation, person re-identification, and attribute recognition.
To establish a benchmark that enables systematic evaluation and advancement of deep learning models for UAV-based human behavior understanding.

Proposed method

The UAV-Human benchmark was collected using a flying UAV equipped with Azure DK, fisheye, and night-vision cameras over three months in urban and rural areas, capturing data in day and night conditions.
A fisheye-based action recognition method was proposed that learns unbounded spatial transformations to correct distortions, guided by corresponding flat RGB video sequences.
The method employs a GT-Module (Guided Transformation Module) to learn a mapping from fisheye to undistorted space using a supervision signal from RGB videos.
For action recognition, models were trained and evaluated on multiple modalities (RGB, fisheye, depth, IR, and night-vision videos) under two cross-subject evaluation protocols (CSv1 and CSv2).
Pose estimation was evaluated using keypoint annotations on 22,476 frames with 17 keypoints per subject, using state-of-the-art models like HigherHRNet and AlphaPose.
Person re-identification and attribute recognition were evaluated using 41,290 frames with 1,144 identities and 22,263 frames with 7 attributes, respectively, using ResNet and DenseNet baselines.
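
The cross-subject protocols mentioned above can be illustrated with a minimal sketch. It assumes only that each clip carries a subject ID, so that training and test sets share no subjects. The Clip record, the helper names, and the toy subject IDs are hypothetical and are not taken from the paper or its code release; the actual CSv1 and CSv2 subject lists are defined by the benchmark itself.

```python
from collections import namedtuple

# Hypothetical record: one action clip with its subject ID, true label,
# and a model's predicted label. Not the dataset's real annotation format.
Clip = namedtuple("Clip", ["subject_id", "label", "predicted"])


def cross_subject_split(clips, test_subjects):
    """Partition clips so that no subject appears in both train and test."""
    test_subjects = set(test_subjects)
    train = [c for c in clips if c.subject_id not in test_subjects]
    test = [c for c in clips if c.subject_id in test_subjects]
    return train, test


def top1_accuracy(clips):
    """Percentage of clips whose predicted label equals the true label."""
    if not clips:
        raise ValueError("no clips to evaluate")
    correct = sum(1 for c in clips if c.predicted == c.label)
    return 100.0 * correct / len(clips)


# Toy data with three made-up subjects; subjects 2 and 3 are held out.
clips = [
    Clip(1, "wave", "wave"),
    Clip(1, "walk", "walk"),
    Clip(2, "wave", "walk"),
    Clip(2, "walk", "walk"),
    Clip(3, "wave", "wave"),
]
train, test = cross_subject_split(clips, test_subjects=[2, 3])
accuracy = top1_accuracy(test)
print(f"train clips: {len(train)}, test clips: {len(test)}, accuracy: {accuracy:.2f}%")
```

Splitting by subject rather than by clip keeps a model from scoring well simply by memorizing the appearance of individuals it saw during training.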

Experimental results

Research questions

RQ1: How does the performance of action recognition models vary across different video modalities (e.g., fisheye, RGB, IR) in UAV-captured data?
RQ2: Can a learning-based approach effectively correct severe fisheye distortions in UAV video for action recognition?
RQ3: How do skeleton-based representations compare to video-based representations in UAV scenarios with dynamic viewpoints and motion blur?
RQ4: What are the performance limits of current state-of-the-art models in pose estimation, person re-identification, and attribute recognition on UAV-Human?
RQ5: To what extent does the diversity of subjects, environments, and UAV flight dynamics in UAV-Human challenge existing models?

Key findings

The proposed fisheye action recognition method with guided transformation achieved 23.24% CSv1 accuracy on fisheye video, up from 20.76% for the fisheye baseline, showing the value of guided distortion correction.
Skeleton-based methods outperformed video-based methods in action recognition, with Shift-GCN achieving 67.04% accuracy on CSv2, highlighting the robustness of skeletal representations in dynamic UAV views.
Pose estimation models achieved only 56.5–56.9% mAP, indicating high difficulty due to viewpoint changes, scale variations, and occlusions in UAV data.
Person re-identification models achieved up to 85.71% mAP with DG-Net, showing that overhead, moving-camera perspectives pose significant challenges for feature learning.
Attribute recognition performance was lowest for clothing color and style (e.g., 44.4% for UCC/S), reflecting the difficulty of recognizing attributes under diverse viewpoints and long-term data collection.
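
To put the fisheye result in perspective, the gain from the 20.76% baseline to 23.24% can be expressed both in percentage points and relative to the baseline. The short calculation below is only a worked example of that arithmetic using the two CSv1 figures quoted above; the function name is hypothetical.

```python
def improvement(baseline, proposed):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# CSv1 accuracies on fisheye video as quoted in the findings above.
absolute, relative = improvement(baseline=20.76, proposed=23.24)
print(f"{absolute:.2f} points, {relative:.1f}% relative")
# prints "2.48 points, 11.9% relative"
```

The absolute gain is modest, but on a benchmark where accuracies sit in the low twenties it corresponds to roughly a 12% relative improvement.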

This review was created by AI and reviewed by human editors.