QUICK REVIEW

[Paper Review] Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble

Songyao Jiang, Bin Sun | arXiv (Cornell University) | October 12, 2021

Hand Gesture Recognition Systems | References: 69 | Citations: 30

One-Line Summary

The paper introduces SAM-SLR-v2, a skeleton-aware multi-modal framework that fuses 2D/3D whole-body skeleton graphs with RGB/RGB-D cues through a Global Ensemble Model, achieving state-of-the-art isolated SLR performance on multiple datasets.

ABSTRACT

Sign language is commonly used by deaf or mute people to communicate but requires extensive effort to master. It is usually performed with the fast yet delicate movement of hand gestures, body posture, and even facial expressions. Current Sign Language Recognition (SLR) methods usually extract features via deep neural networks and suffer overfitting due to limited and noisy data. Recently, skeleton-based action recognition has attracted increasing attention due to its subject-invariant and background-invariant nature, whereas skeleton-based SLR is still under exploration due to the lack of hand annotations. Some researchers have tried to use off-line hand pose trackers to obtain hand keypoints and aid in recognizing sign language via recurrent neural networks. Nevertheless, none of them outperforms RGB-based approaches yet. To this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multi-modal feature representations towards a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The skeleton-based predictions are fused with other RGB and depth based modalities by the proposed late-fusion GEM to provide global information and make a faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins. Our code will be available at https://github.com/jackyjsy/SAM-SLR-v2

Motivation and Objectives

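Frames SLR as a challenging task because of constrained hand gestures and signer variability.
Explores skeleton-based representations that include hand and whole-body keypoints, together with graph-based models of their dynamics.
Develops multi-modal fusion through an automated, data-driven ensemble that exploits complementary modalities.
Demonstrates state-of-the-art performance on several isolated SLR datasets using RGB and RGB-D data.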

Proposed Method

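Builds 2D/3D whole-body skeleton graphs from a pretrained pose estimator (reduced to 27 nodes) and models gesture dynamics.
Introduces SL-GCN, which takes multi-stream input (joint, bone, joint motion, bone motion) and uses a decoupled GCN with STC self-attention to learn robust dynamics.
Introduces SSTCN, a separable 4-stage architecture with Swish activation and label smoothing, to exploit skeleton features and improve generalization.
Provides 3D CNN baselines (a ResNet2+1D variant) that can be pretrained, covering the RGB, optical flow, HHA, and depth modalities.
Proposes GEM (Global Ensemble Model), which learns modality weights automatically for the RGB and RGB-D tracks and outperforms fixed late-fusion approaches.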
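The four SL-GCN input streams (joint, bone, joint motion, bone motion) can be illustrated with a minimal sketch. The review does not give the exact definitions, so this assumes the convention common in multi-stream skeleton models: a bone is the child joint minus its parent joint, and motion is the difference between consecutive frames. `TOY_BONES`, `bone_stream`, `motion_stream`, and `build_streams` are hypothetical names, and the three-joint skeleton is a toy stand-in for the 27-node graph.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # one (x, y) keypoint
Frame = List[Point]           # all keypoints in one video frame

# Hypothetical toy skeleton as (child, parent) index pairs.
# This is NOT the 27-node graph described in the review.
TOY_BONES = [(1, 0), (2, 1)]


def bone_stream(frames: List[Frame], bones: List[Tuple[int, int]]) -> List[Frame]:
    """Bone vectors per frame: child joint minus parent joint (assumed convention)."""
    return [
        [(f[c][0] - f[p][0], f[c][1] - f[p][1]) for c, p in bones]
        for f in frames
    ]


def motion_stream(frames: List[Frame]) -> List[Frame]:
    """Forward temporal difference; the last frame is zero-padded (assumed convention)."""
    out: List[Frame] = []
    for t, cur in enumerate(frames):
        if t + 1 < len(frames):
            nxt = frames[t + 1]
            out.append([(b[0] - a[0], b[1] - a[1]) for a, b in zip(cur, nxt)])
        else:
            out.append([(0.0, 0.0)] * len(cur))
    return out


def build_streams(joints: List[Frame], bones: List[Tuple[int, int]]) -> Dict[str, List[Frame]]:
    """Derive the four streams from a sequence of joint coordinates."""
    bone = bone_stream(joints, bones)
    return {
        "joint": joints,
        "bone": bone,
        "joint_motion": motion_stream(joints),
        "bone_motion": motion_stream(bone),
    }
```

In a multi-stream setup, each stream would typically be fed to its own network and the per-stream predictions combined afterwards; how SL-GCN combines them is not specified in this review.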

Experimental Results

Research Questions

RQ1: Can whole-body 2D/3D skeleton graphs (including hands) improve isolated SLR performance over RGB-only methods?
RQ2,
RQ3,
RQ4,
RQ5: Does a learnable late-fusion ensemble (GEM) outperform a hand-crafted fixed fusion across all seven modalities?

Key Results

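Multi-stream SL-GCN on skeleton graphs achieves high top-1/top-5 accuracy on AUTSL, SLR500, and WLASL2000 (e.g., multi-stream on AUTSL: 96.47/99.76 top-1/top-5).
Single-modality skeleton streams (2D/3D keypoints) outperform the other single modalities on AUTSL (e.g., 2D: 96.47 top-1; 3D: 96.53 top-1).
Graph reduction (133 nodes → 27 nodes) substantially improves accuracy and helps avoid overfitting.
On skeleton features, SSTCN offers a competitive advantage over conventional 3D convolution, and Swish activation and label smoothing improve generalization.
GEM fusion learns modality weights and achieves state-of-the-art results on the AUTSL RGB and RGB-D tracks (e.g., RGB: 98.00 top-1; RGB-D: 98.10 top-1, with no fine-tuning required).
Compared with the baselines, SAM-SLR-v2 outperforms previous methods by a large margin on the evaluated datasets.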
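To make the contrast between fixed and learned late fusion concrete, here is a minimal sketch of weighted score fusion with top-k accuracy. It is not the authors' GEM, whose parameterization is not described in this review. As a simplifying assumption, it fits one scalar weight per modality by gradient descent on cross-entropy. All function names and the toy scores are hypothetical.

```python
import math
from typing import List

Scores = List[List[float]]  # per-modality class scores for one sample


def fuse(scores: Scores, weights: List[float]) -> List[float]:
    """Weighted sum of per-modality class scores for one sample."""
    n_classes = len(scores[0])
    return [sum(w * s[c] for w, s in zip(weights, scores)) for c in range(n_classes)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def fit_weights(samples: List[Scores], labels: List[int],
                lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Fit one scalar weight per modality (an assumed, simplified stand-in for GEM)."""
    n_mod = len(samples[0])
    w = [1.0] * n_mod  # start from equal-weight (fixed) fusion
    for _ in range(epochs):
        grad = [0.0] * n_mod
        for scores, y in zip(samples, labels):
            p = softmax(fuse(scores, w))
            for m in range(n_mod):
                # d(cross-entropy)/d(w_m) = sum_c (p_c - onehot_c) * score_m[c]
                grad[m] += sum(
                    (p[c] - (1.0 if c == y else 0.0)) * scores[m][c]
                    for c in range(len(p))
                )
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad)]
    return w


def topk_accuracy(samples: List[Scores], labels: List[int],
                  weights: List[float], k: int) -> float:
    """Fraction of samples whose true label is among the k highest fused scores."""
    hits = 0
    for scores, y in zip(samples, labels):
        fused = fuse(scores, weights)
        ranked = sorted(range(len(fused)), key=lambda c: fused[c], reverse=True)
        hits += int(y in ranked[:k])
    return hits / len(samples)
```

When one modality is confidently wrong on the toy data, equal weights let it dominate the fused score, while fitted weights down-weight it. This is the intuition behind learning fusion weights from data instead of fixing them by hand.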

This review was generated by AI and reviewed by a human editor.