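[Paper Digest] Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble
This paper proposes SAM-SLR-v2, a skeleton-aware multi-modal framework that fuses 2D/3D whole-body skeleton graphs with RGB and RGB-D cues through a Global Ensemble Model, achieving state-of-the-art isolated SLR on several datasets.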
Sign language is commonly used by deaf or mute people to communicate but requires extensive effort to master. It is usually performed with the fast yet delicate movement of hand gestures, body posture, and even facial expressions. Current Sign Language Recognition (SLR) methods usually extract features via deep neural networks and suffer overfitting due to limited and noisy data. Recently, skeleton-based action recognition has attracted increasing attention due to its subject-invariant and background-invariant nature, whereas skeleton-based SLR is still under exploration due to the lack of hand annotations. Some researchers have tried to use off-line hand pose trackers to obtain hand keypoints and aid in recognizing sign language via recurrent neural networks. Nevertheless, none of them outperforms RGB-based approaches yet. To this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multi-modal feature representations towards a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The skeleton-based predictions are fused with other RGB and depth based modalities by the proposed late-fusion GEM to provide global information and make a faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins. Our code will be available at https://github.com/jackyjsy/SAM-SLR-v2
Research Motivation and Goals
- Motivate sign language recognition (SLR) as a challenging task due to fine-grained hand gestures and signer variability.
- Explore skeleton-based representations using full-body keypoints including hands and a graph-based dynamic model.
- Develop multi-modal fusion with an automatic, data-driven ensemble to leverage complementary modalities.
- Demonstrate state-of-the-art performance on several isolated SLR datasets with RGB and RGB-D data.
Proposed Method
- Construct a 2D/3D whole-body skeleton graph (reduced to 27 nodes) from a pretrained pose estimator to model sign dynamics.
- Propose SL-GCN with multi-stream inputs (joint, bone, joint motion, bone motion) and a decoupled GCN with STC self-attention for robust dynamics learning.
- Introduce SSTCN to exploit skeleton features, using a separable four-stage architecture trained with label smoothing and Swish activations.
- Develop 3DCNN baselines (ResNet2+1D variants) for RGB, optical flow, HHA, and depth modalities with pretraining on SLR500.
- Propose Global Ensemble Model (GEM) to learn modality weights automatically for RGB and RGB-D tracks, surpassing fixed late-fusion approaches.
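The four SL-GCN input streams named above (joint, bone, joint motion, bone motion) can be illustrated with a small sketch. It assumes the convention common in skeleton-based action recognition: a bone is a joint's coordinates minus its parent's, and a motion stream is the difference between consecutive frames. The helper names, the toy 3-joint skeleton, and the `parents` list are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List

Frames = List[List[List[float]]]  # frames[t][j] = coordinates of joint j at time t


def bone_stream(frames: Frames, parents: List[int]) -> Frames:
    """Bone vector of each joint: its coordinates minus its parent's coordinates."""
    return [
        [[c - p for c, p in zip(frame[j], frame[parents[j]])] for j in range(len(frame))]
        for frame in frames
    ]


def motion_stream(frames: Frames) -> Frames:
    """Difference between consecutive frames; the last frame is zero-padded.

    Assumes at least one frame.
    """
    out = [
        [[b - a for a, b in zip(ja, jb)] for ja, jb in zip(frames[t], frames[t + 1])]
        for t in range(len(frames) - 1)
    ]
    out.append([[0.0] * len(joint) for joint in frames[-1]])
    return out


def four_streams(frames: Frames, parents: List[int]) -> Dict[str, Frames]:
    """Build the joint, bone, joint-motion, and bone-motion streams."""
    bones = bone_stream(frames, parents)
    return {
        "joint": frames,
        "bone": bones,
        "joint_motion": motion_stream(frames),
        "bone_motion": motion_stream(bones),
    }


# Toy 3-joint skeleton (hypothetical): joint 0 is the root and is its own parent,
# so its bone vector is zero.
parents = [0, 0, 1]
frames = [
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]],
]
streams = four_streams(frames, parents)
```

In a multi-stream setup each stream would feed its own network, and the per-stream predictions would be combined afterwards. The sketch only covers the input preparation.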
Experimental Results
Research Questions
- RQ1: Can whole-body 2D/3D skeleton graphs (including hands) improve isolated SLR performance over RGB-only methods?
- RQ2: Do multi-stream skeleton dynamics (joint/bone and their motions) yield better recognition than single-stream counterparts?
- RQ3: Does a learnable late-fusion ensemble (GEM) outperform manual fixed fusion across seven modalities?
- RQ4: How do skeleton-based methods compare to RGB/RGB-D baselines on the AUTSL, SLR500, and WLASL2000 datasets?
- RQ5: What does each component (graph reduction, STC attention, SSTCN, pretraining) contribute to final accuracy?
Key Findings
| Dataset | Top-1 (SL-GCN streams) | Top-5 (SL-GCN streams) | Top-1 (Single-modality) | Top-5 (Single-modality) | Top-1 (RGB-Flow/HHA/etc.) | Top-5 (RGB-Flow/HHA/etc.) | Notes |
|---|---|---|---|---|---|---|---|
| AUTSL (validation) | 95.02 | N/A | 95.00 (RGB Frames) | 99.47 | 90.41 (RGB Flow) | 98.0? | Ablation indicates deep impact of components on AUTSL validation |
| SL-GCN Multi-stream (AUTSL) | 96.47 | 99.76 | N/A | N/A | N/A | N/A | See Table II for multi-streams results |
| SL-GCN Multi-stream (SLR500) | 98.16 | 99.95 | N/A | N/A | N/A | N/A | See Table II |
| SL-GCN Multi-stream (WLASL2000) | 51.50 | 84.94 | N/A | N/A | N/A | N/A | See Table II |
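The Top-1 and Top-5 columns in the table are top-k accuracies: a sample counts as correct when its true class is among the k highest-scoring classes. A minimal sketch of that metric follows. The function name and the toy scores are illustrative and are not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy scores for three samples over three classes (made-up numbers).
toy_scores: List[List[float]] = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

On this toy data only the first sample is a top-1 hit, while the second also becomes a hit once k is 2, which is why top-5 figures are always at least as high as top-1 figures.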
- Multi-stream SL-GCN with skeleton graphs achieves high top-1/top-5 on AUTSL, SLR500, and WLASL2000 (e.g., multi-stream AUTSL: 96.47/99.76 top-1/top-5).
- Single-modality skeleton streams (2D/3D keypoints) outperform other single modalities on AUTSL (e.g., 2D: 96.47 top-1; 3D: 96.53 top-1).
- Graph reduction (133-node to 27-node) significantly boosts accuracy and helps avoid overfitting.
- SSTCN on skeleton features provides competitive gains over traditional 3D convolutions; Swish activations and label smoothing improve generalization.
- GEM fusion learns modality weights and achieves state-of-the-art results in the RGB and RGB-D tracks on AUTSL (e.g., RGB: 98.00 top-1; RGB-D: 98.10 top-1, with fine-tuning noted as not required).
- Compared with baselines, SAM-SLR-v2 surpasses previous methods by large margins on the evaluated datasets.
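To make the contrast between fixed and learned late fusion concrete, here is a minimal sketch that learns one scalar weight per modality by gradient descent on a softmax cross-entropy loss over the weighted sum of per-modality scores. This is an assumption-laden illustration of learned score-level fusion, not the authors' GEM architecture. The function names, learning rate, epoch count, and toy scores are all hypothetical.

```python
import math
from typing import List

Sample = List[List[float]]  # sample[m][c] = score of class c from modality m


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def fuse(weights: List[float], sample: Sample) -> List[float]:
    """Weighted sum of per-modality class scores."""
    n_classes = len(sample[0])
    return [sum(w * s[c] for w, s in zip(weights, sample)) for c in range(n_classes)]


def learn_fusion_weights(samples: List[Sample], labels: List[int],
                         lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Fit one weight per modality with full-batch gradient descent."""
    n_mod = len(samples[0])
    weights = [1.0 / n_mod] * n_mod  # start from uniform (fixed-fusion) weights
    for _ in range(epochs):
        grad = [0.0] * n_mod
        for sample, label in zip(samples, labels):
            probs = softmax(fuse(weights, sample))
            for m, scores in enumerate(sample):
                # d(cross-entropy)/d(w_m) = sum_c (p_c - y_c) * score_m[c]
                grad[m] += sum((probs[c] - (c == label)) * scores[c]
                               for c in range(len(probs)))
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


# Toy data (made up): modality 0 is reliable, modality 1 is misleading or flat.
samples = [
    [[2.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0]],
    [[2.0, 0.0], [1.0, 1.0]],
]
labels = [0, 1, 0]
weights = learn_fusion_weights(samples, labels)
predictions = [max(range(2), key=lambda c: fuse(weights, s)[c]) for s in samples]
```

On this toy data the learned weight of the reliable modality rises above the uniform starting value and that of the misleading modality falls below it, which is the behaviour a fixed, hand-tuned fusion cannot adapt to automatically.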
This digest was generated by AI and reviewed by a human editor.