QUICK REVIEW

[Paper Review] Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

Dongxu Li, Cristian Rodriguez Opazo|arXiv (Cornell University)|Oct 24, 2019

Hand Gesture Recognition Systems|References: 73|Cited by: 53

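One-Sentence Summary

This paper introduces the large-scale Word-Level American Sign Language (WLASL) dataset, which contains more than 21k videos covering 2k words, compares appearance-based and pose-based baselines on it, and proposes Pose-TGCN to jointly model spatial and temporal pose dynamics.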

ABSTRACT

Vision-based sign language recognition aims at helping deaf people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset to facilitate word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios. Specifically we implement and compare two different models, i.e., (i) holistic visual appearance-based approach, and (ii) 2D human pose based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we also propose a novel pose-based temporal graph convolution networks (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which has further boosted the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performances up to 66% at top-10 accuracy on 2,000 words/glosses, demonstrating the validity and challenges of our dataset. Our dataset and baseline deep models are available at https://dxli94.github.io/WLASL/.

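Motivation and Objectives

Advance scalable word-level sign language recognition with a large-scale, signer-diverse dataset collected from web sources.
Provide publicly available appearance-based and pose-based baselines so that future work can be benchmarked against them.
Examine how effective the pose-based temporal graph network (Pose-TGCN) is relative to appearance-based methods for large-vocabulary sign recognition.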

Proposed Method

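Build a large-scale monocular RGB word-level sign language dataset (WLASL) with 21,083 videos, 119 signers, and 2,000 words; ensure signer diversity and annotate dialect variations.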
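Develop appearance-based baselines: a 2D CNN (VGG16) + GRU, and a 3D CNN (I3D) fine-tuned from Kinetics-pretrained features.
Develop pose-based baselines: Pose-GRU, a GRU over 55 2D keypoints; and Pose-TGCN, which applies temporal graph convolution to whole-body keypoint trajectories.
Propose a temporal graph convolutional network (TGCN) in which the human body is modeled as a fully connected graph with a learnable adjacency matrix; residual blocks are stacked, and features are average-pooled along the temporal dimension for classification.
Standard training protocol: scale each person bounding box so that its diagonal is 256 pixels; randomly sample 50-frame clips during training; Adam optimizer; 200 epochs; a 4:1:1 train/validation/test split per word.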
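
The graph-convolution idea in the list above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The layer form `tanh(A @ H @ W)`, the two-layer residual block, the flattened `(x, y)` trajectory layout, and the hidden size of 16 are all assumptions made for illustration. The pooling and classification head is left out because the text above does not specify the tensor layout it operates on.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for parameters a trainer would learn."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GraphConv:
    """One layer: H_out = tanh(A @ H @ W) (assumed form).

    A (K x K) is dense, so every keypoint can exchange information with every
    other keypoint (a fully connected graph). A and W would both be learned.
    """

    def __init__(self, num_nodes, in_dim, out_dim, rng):
        self.A = random_matrix(num_nodes, num_nodes, rng)
        self.W = random_matrix(in_dim, out_dim, rng)

    def __call__(self, h):
        mixed = matmul(matmul(self.A, h), self.W)
        return [[math.tanh(v) for v in row] for row in mixed]


class ResidualBlock:
    """Two graph convolutions plus a skip connection (assumed block layout)."""

    def __init__(self, num_nodes, dim, rng):
        self.gc1 = GraphConv(num_nodes, dim, dim, rng)
        self.gc2 = GraphConv(num_nodes, dim, dim, rng)

    def __call__(self, h):
        out = self.gc2(self.gc1(h))
        return [[x + y for x, y in zip(rh, ro)] for rh, ro in zip(h, out)]


# Toy input: K = 55 keypoints, each described by its (x, y) trajectory over
# T = 50 frames, flattened into 2 * T = 100 features (assumed layout).
rng = random.Random(0)
K, T, HIDDEN = 55, 50, 16
poses = [[rng.random() for _ in range(2 * T)] for _ in range(K)]

encoder = GraphConv(K, 2 * T, HIDDEN, rng)
block = ResidualBlock(K, HIDDEN, rng)
features = block(encoder(poses))
print(len(features), len(features[0]))  # 55 16
```

Because `A` mixes information across keypoints and `W` mixes information across the flattened trajectory, a single layer touches both the spatial and the temporal structure of the pose sequence, which is the property the method description emphasizes.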

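Experimental Results

Research Questions

RQ1: Can a large-scale, signer-diverse word-level sign language dataset support robust learning over thousands of words?
RQ2: How do appearance-based and pose-based methods compare on large-vocabulary word-level sign recognition?
RQ3: Does the pose-based temporal graph method (Pose-TGCN) outperform standard pose and appearance baselines in sign recognition?
RQ4: How do vocabulary size and the number of samples affect the performance of word-level SLR models?

Key Findings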

Model	WLASL100_top1	WLASL100_top5	WLASL100_top10	WLASL300_top1	WLASL300_top5	WLASL300_top10	WLASL1000_top1	WLASL1000_top5	WLASL1000_top10	WLASL2000_top1	WLASL2000_top5	WLASL2000_top10
Pose-GRU	46.51	76.74	85.66	33.68	64.37	76.05	30.01	58.42	70.15	22.54	49.81	61.38
Pose-TGCN	55.43	78.68	87.60	38.32	67.51	79.64	34.86	61.73	71.91	23.65	51.75	62.24
VGG-GRU	25.97	55.04	63.95	19.31	46.56	61.08	14.66	37.31	49.36	8.44	23.58	32.58
I3D	65.89	84.11	89.92	56.14	79.94	86.98	47.33	76.44	84.33	32.48	57.31	66.31
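
For readers unfamiliar with the top-1, top-5, and top-10 columns, the sketch below shows one common way top-k accuracy is computed: a prediction counts as correct when the true class is among the k highest-scoring classes. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the paper or its code.

```python
def top_k_accuracy(scores, labels, k):
    """Return the percentage of samples whose true label ranks in the top k."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Hypothetical scores for 3 clips over 4 classes (not from the paper).
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1 is ranked second
    [0.50, 0.25, 0.15, 0.10],  # true class 3 is ranked last
]
labels = [1, 1, 3]

for k in (1, 2, 4):
    print(f"top-{k}: {top_k_accuracy(scores, labels, k):.2f}%")
# top-1: 33.33%
# top-2: 66.67%
# top-4: 100.00%
```

Top-k accuracy can only stay equal or rise as k grows, which is why each model's top-10 figure in the table is at least as high as its top-1 figure.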

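WLASL contains 21,083 videos covering 2,000 words from 119 signers; the dataset is publicly available.
Pose-TGCN is competitive with appearance-based models at top-10 accuracy: it reaches 62.24% on WLASL2000 versus 66.31% for I3D, and 87.60% versus 89.92% on WLASL100.
I3D outperforms VGG-GRU, and Pose-TGCN outperforms Pose-GRU, on every subset in the table; the latter result shows the benefit of jointly modeling spatial and temporal pose information.
Both pose-based and appearance-based methods perform better on the small-vocabulary subsets, and accuracy declines as the vocabulary grows (I3D top-1 falls from 65.89% on WLASL100 to 32.48% on WLASL2000); larger vocabularies introduce more ambiguity and call for more data or more advanced learning strategies.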
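
To make the pose-versus-appearance comparison concrete, the short script below recomputes the top-10 gap between I3D and Pose-TGCN for each subset. The numbers are copied from the table above. The dictionary layout and the helper name `top10_gap` are illustrative choices only.

```python
# Top-10 accuracy (%) copied from the results table.
TOP10 = {
    "I3D": {"WLASL100": 89.92, "WLASL300": 86.98,
            "WLASL1000": 84.33, "WLASL2000": 66.31},
    "Pose-TGCN": {"WLASL100": 87.60, "WLASL300": 79.64,
                  "WLASL1000": 71.91, "WLASL2000": 62.24},
}


def top10_gap(subset):
    """Percentage-point lead of I3D over Pose-TGCN on one subset."""
    return TOP10["I3D"][subset] - TOP10["Pose-TGCN"][subset]


for subset in ("WLASL100", "WLASL300", "WLASL1000", "WLASL2000"):
    print(f"{subset}: I3D leads by {top10_gap(subset):.2f} points")
# WLASL100: I3D leads by 2.32 points
# WLASL300: I3D leads by 7.34 points
# WLASL1000: I3D leads by 12.42 points
# WLASL2000: I3D leads by 4.07 points
```

The gap is smallest on WLASL100 and WLASL2000 and widest on WLASL1000, so "competitive" describes the two ends of the vocabulary range better than the middle.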

This review was generated by AI and reviewed by human editors.