QUICK REVIEW

[Paper Digest] RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model

Khaled Alomar, Halil Ibrahim Aysel | arXiv (Cornell University) | Jun 2, 2024

Human Pose and Action Recognition | Cited by 11

One-Sentence Summary

A comprehensive survey of CNNs, RNNs, and Vision Transformers for Human Action Recognition that also proposes a hybrid CNN–ViT model and discusses trends and future directions.

ABSTRACT

Human Action Recognition (HAR) encompasses the task of monitoring human activities across various domains, including but not limited to medical, educational, entertainment, visual surveillance, video retrieval, and the identification of anomalous activities. Over the past decade, the field of HAR has witnessed substantial progress by leveraging Convolutional Neural Networks (CNNs) to effectively extract and comprehend intricate information, thereby enhancing the overall performance of HAR systems. Recently, the domain of computer vision has witnessed the emergence of Vision Transformers (ViTs) as a potent solution. The efficacy of transformer architecture has been validated beyond the confines of image analysis, extending their applicability to diverse video-related tasks. Notably, within this landscape, the research community has shown keen interest in HAR, acknowledging its manifold utility and widespread adoption across various domains. This article aims to present an encompassing survey that focuses on CNNs and the evolution of Recurrent Neural Networks (RNNs) to ViTs given their importance in the domain of HAR. By conducting a thorough examination of existing literature and exploring emerging trends, this study undertakes a critical analysis and synthesis of the accumulated knowledge in this field. Additionally, it investigates the ongoing efforts to develop hybrid approaches. Following this direction, this article presents a novel hybrid model that seeks to integrate the inherent strengths of CNNs and ViTs.

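Motivation and Objectives

Examine how CNNs, RNNs, and Vision Transformers (ViTs) have evolved within HAR.
Analyze the state-of-the-art action recognition literature, including ViTs and hybrid approaches.
Propose a novel hybrid model that combines CNNs and ViTs for HAR, and compare it with existing models.
Discuss emerging trends, challenges, and future research directions in HAR.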

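Proposed Approach

Review the foundational CNN, RNN, and Transformer/ViT literature relevant to HAR.
Explain the progression from vanilla RNNs to attention-based Transformers and the self-attention mechanism.
Describe how Vision Transformers are adapted to spatio-temporal video data for HAR.
Propose and evaluate a novel hybrid model that fuses CNNs and ViTs for HAR.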
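To make the step about adapting ViTs to spatio-temporal video data more concrete, the sketch below counts how many tokens a clip produces when it is cut into non-overlapping blocks, and how that count drives the quadratic cost of self-attention. The function names, the 16-frame 224x224 clip, the 16-pixel patch size, and the grouping of two frames per block (often called a "tubelet") are illustrative assumptions, not the configuration used in the paper.

```python
def num_tokens(frames, height, width, patch, tubelet_depth=1):
    """Count tokens when a clip is cut into non-overlapping blocks of
    size tubelet_depth x patch x patch (hypothetical helper)."""
    if frames % tubelet_depth or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the block size")
    return (frames // tubelet_depth) * (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Full self-attention scores every token against every token."""
    return tokens * tokens


# Assumed clip shape, chosen only for illustration.
clip = {"frames": 16, "height": 224, "width": 224}

# One token per 16x16 patch in every frame.
per_frame = num_tokens(**clip, patch=16)

# One token per 16x16 patch spanning two consecutive frames.
tubelets = num_tokens(**clip, patch=16, tubelet_depth=2)

print(per_frame, attention_pairs(per_frame))
print(tubelets, attention_pairs(tubelets))
```

Under these assumptions, halving the number of tokens cuts the number of attention pairs by a factor of four, which is one reason video transformers pay close attention to how frames are tokenized.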

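Experimental Results

Research Questions

RQ1: How have CNNs, RNNs, and ViTs evolved and improved HAR performance?
RQ2: What benefits does a hybrid CNN–ViT model offer for HAR compared with single-architecture models?
RQ3: What are the current challenges and future directions for transformers and CNN–transformer hybrids in HAR?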

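Key Findings

Transformers and ViTs have become strong alternatives to CNNs in vision tasks and are being extended to video-based HAR.
Self-attention and multi-head attention make it possible to model long-range dependencies and global context in HAR tasks.
The paper proposes a novel CNN–ViT hybrid model that aims to combine the efficient local feature extraction of CNNs with the global context modeling of ViTs.
The survey highlights ongoing efforts to extend ViTs to spatio-temporal video data through temporal integration, spatio-temporal embeddings, and cross-frame attention.
The paper discusses trends such as transfer learning and large-scale pre-training, as well as the potential gains of hybrid models in robustness and interpretability.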
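The finding about self-attention capturing long-range dependencies can be illustrated with a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity (Q = K = V = the input tokens), there is only one head, and the three 2-dimensional "frame features" are made-up values. It is not the attention module of the paper's hybrid model.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections, so Q = K = V = tokens.
    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical per-frame features: frames 0 and 2 look alike, frame 1 differs.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)

for row in weights:
    print([round(w, 3) for w in row])
```

In this toy input, frame 0 gives as much weight to frame 2 as to itself, even though frame 1 sits between them. Every token is compared with every other token in a single step, regardless of distance, which is the property the survey credits for modeling long-range dependencies and global context.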

This digest was generated by AI and reviewed by a human editor.