QUICK REVIEW

[Paper Review] Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

Dongxu Li, Cristian Rodriguez Opazo | arXiv (Cornell University) | October 24, 2019

Hand Gesture Recognition Systems | References: 73 | Citations: 53

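One-line summary

This paper introduces the large-scale Word-Level ASL (WLASL) dataset, containing more than 21,000 videos and 2,000 glosses, compares appearance-based and pose-based baselines, and proposes Pose-TGCN, which jointly models the spatial and temporal dynamics of pose.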

ABSTRACT

Vision-based sign language recognition aims at helping deaf people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset to facilitate word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios. Specifically we implement and compare two different models, i.e., (i) holistic visual appearance-based approach, and (ii) 2D human pose based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we also propose a novel pose-based temporal graph convolution networks (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which has further boosted the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performances up to 66% at top-10 accuracy on 2,000 words/glosses, demonstrating the validity and challenges of our dataset. Our dataset and baseline deep models are available at https://dxli94.github.io/WLASL/.

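Research motivation and goals

Motivate scalable word-level ASL recognition, and show that it is feasible, using a large, signer-diverse dataset collected from internet sources.
Provide publicly available appearance-based and pose-based recognition baselines for benchmarking future research.
Investigate how effective the pose-based temporal graph network (Pose-TGCN) is relative to appearance-based methods on large vocabularies.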

Proposed method

WLASL: build a large-scale, single-camera RGB, word-level ASL dataset with 21,083 videos, 119 signers, and 2,000 glosses; ensure signer diversity and annotate dialect variations.
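Develop appearance-based baselines: a 2D CNN (VGG16) + GRU, and a 3D CNN (I3D) fine-tuned from features pretrained on Kinetics.
Develop pose-based baselines: Pose-GRU, a GRU over 55 2D keypoints; and TGCN, temporal graph convolution layers over whole-body keypoint trajectories.
Propose the Temporal Graph Convolution Network (TGCN): model the human body as a fully connected graph with a learnable adjacency matrix, stack graph convolutions in residual blocks, and classify after average pooling over time.
Standard training protocol: resize frames so that the bounding-box diagonal is 256 pixels; sample a random 50-frame clip during training; Adam optimizer; 200 epochs; train/validation/test split of 4:1:1 per gloss.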
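
To make the TGCN description more concrete, here is a minimal NumPy sketch of the forward pass only. It is not the authors' implementation. The layer form `tanh(adj @ h @ weight)`, the two-layer residual block, the hidden size, the choice of pooling axis, the random initialization, and all function names are assumptions for illustration; normalization, dropout, and training are left out.

```python
import numpy as np


def graph_conv(h, adj, weight):
    """One graph convolution layer: tanh(adj @ h @ weight).

    h      : (K, F_in)     one feature row per keypoint (graph node)
    adj    : (K, K)        dense, learnable adjacency, so any keypoint pair can interact
    weight : (F_in, F_out) learnable transform of per-node features
    """
    return np.tanh(adj @ h @ weight)


def residual_block(h, layers):
    """Stack graph convolutions and add the block input back (skip connection)."""
    out = h
    for adj, weight in layers:
        out = graph_conv(out, adj, weight)
    return h + out


def pose_tgcn_forward(poses, params):
    """Classify one clip. poses: (T, K, 2) array of 2D keypoints over T frames."""
    num_frames, num_keypoints, _ = poses.shape
    # Each keypoint is a node whose feature vector is its whole (x, y) trajectory,
    # so `adj` mixes across keypoints while `weight` mixes across time steps.
    x = poses.transpose(1, 0, 2).reshape(num_keypoints, 2 * num_frames)
    h = graph_conv(x, *params["input"])
    for layers in params["blocks"]:
        h = residual_block(h, layers)
    # ASSUMPTION: pool by averaging each node's trajectory-derived features.
    pooled = h.mean(axis=1)                    # shape (K,)
    logits = pooled @ params["classifier"]     # shape (num_classes,)
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()


def init_params(rng, num_keypoints, num_frames, hidden, num_blocks, num_classes):
    """Random starting values; in training all of these would be learned."""
    def layer(f_in, f_out):
        adj = rng.normal(scale=0.1, size=(num_keypoints, num_keypoints))
        weight = rng.normal(scale=0.1, size=(f_in, f_out))
        return adj, weight

    return {
        "input": layer(2 * num_frames, hidden),
        "blocks": [[layer(hidden, hidden), layer(hidden, hidden)]
                   for _ in range(num_blocks)],
        "classifier": rng.normal(scale=0.1, size=(num_keypoints, num_classes)),
    }


rng = np.random.default_rng(0)
params = init_params(rng, num_keypoints=55, num_frames=50, hidden=64,
                     num_blocks=2, num_classes=100)
clip = rng.normal(size=(50, 55, 2))   # random stand-in for one 50-frame pose clip
probs = pose_tgcn_forward(clip, params)
print(probs.shape)
```

The 55 keypoints and 50 frames mirror the numbers in the list above; the 100 classes correspond to the smallest subset. Because the adjacency is a free parameter rather than a fixed skeleton, the network can learn dependencies between keypoints that are not physically connected, such as the two hands.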

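Experimental results

Research questions

RQ1: Does a large-scale, signer-diverse word-level ASL dataset enable robust learning across thousands of glosses?
RQ2: How do appearance-based and pose-based approaches compare on large-vocabulary word-level sign language recognition?
RQ3: Does the pose-based temporal graph approach (Pose-TGCN) outperform the standard pose and appearance baselines?
RQ4: How do vocabulary size and number of samples affect model performance in word-level SLR?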

Key results

Top-K accuracy (%) by model and WLASL subset:

Model	WLASL100_top1	WLASL100_top5	WLASL100_top10	WLASL300_top1	WLASL300_top5	WLASL300_top10	WLASL1000_top1	WLASL1000_top5	WLASL1000_top10	WLASL2000_top1	WLASL2000_top5	WLASL2000_top10
Pose-GRU	46.51	76.74	85.66	33.68	64.37	76.05	30.01	58.42	70.15	22.54	49.81	61.38
Pose-TGCN	55.43	78.68	87.60	38.32	67.51	79.64	34.86	61.73	71.91	23.65	51.75	62.24
VGG-GRU	25.97	55.04	63.95	19.31	46.56	61.08	14.66	37.31	49.36	8.44	23.58	32.58
I3D	65.89	84.11	89.92	56.14	79.94	86.98	47.33	76.44	84.33	32.48	57.31	66.31
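
The top-1, top-5, and top-10 columns above are top-K accuracies: a clip counts as correct when its true gloss is among the K highest-scoring predictions. The sketch below shows one way to compute this in plain Python. The function name and the toy scores are made up for illustration and are not data from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample score lists (one score per gloss)
    labels: list of true gloss indices, one per sample
    """
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 4 glosses and 3 clips (made-up scores).
scores = [
    [0.10, 0.60, 0.20, 0.10],   # true label 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],   # true label 2 is ranked third
    [0.20, 0.50, 0.25, 0.05],   # true label 3 is ranked last
]
labels = [1, 2, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
print(f"top-1: {top1:.2f}%, top-3: {top3:.2f}%")
```

In the toy example only the first clip is a top-1 hit, while the first two are top-3 hits, which is why top-K accuracy can only stay the same or rise as K grows, as it does in every row of the table.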

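WLASL collects 21,083 videos from 119 signers covering 2,000 glosses; the dataset is publicly available.
Pose-TGCN achieves top-10 accuracy competitive with appearance-based models on large vocabularies (62.24% top-10 on WLASL2000, versus 66.31% for I3D).
I3D outperforms VGG-GRU on every subset and metric, and Pose-TGCN improves on Pose-GRU, showing the benefit of jointly modeling spatial and temporal pose information.
Both pose-based and appearance-based methods perform better on the smaller vocabulary subsets; accuracy falls as the vocabulary grows (for example, I3D top-1 drops from 65.89% on WLASL100 to 32.48% on WLASL2000), suggesting that more data or more advanced training strategies are needed.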
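
The effect of vocabulary size can be read directly off the results table. The short script below copies the top-1 column for each subset and computes two differences in percentage points. The helper names are illustrative; the numbers are the ones reported in the table.

```python
# Top-1 accuracy (%) copied from the results table, keyed by subset size.
TOP1 = {
    "Pose-GRU":  {100: 46.51, 300: 33.68, 1000: 30.01, 2000: 22.54},
    "Pose-TGCN": {100: 55.43, 300: 38.32, 1000: 34.86, 2000: 23.65},
    "VGG-GRU":   {100: 25.97, 300: 19.31, 1000: 14.66, 2000: 8.44},
    "I3D":       {100: 65.89, 300: 56.14, 1000: 47.33, 2000: 32.48},
}


def drop_between(model, small=100, large=2000):
    """Top-1 drop, in percentage points, from the small subset to the large one."""
    return round(TOP1[model][small] - TOP1[model][large], 2)


def gain_over(model, baseline, subset):
    """Top-1 margin, in percentage points, of `model` over `baseline` on one subset."""
    return round(TOP1[model][subset] - TOP1[baseline][subset], 2)


for model in TOP1:
    print(f"{model}: drops {drop_between(model):.2f} points from WLASL100 to WLASL2000")

for subset in (100, 300, 1000, 2000):
    margin = gain_over("Pose-TGCN", "Pose-GRU", subset)
    print(f"WLASL{subset}: Pose-TGCN leads Pose-GRU by {margin:.2f} points")
```

Every model loses top-1 accuracy as the vocabulary grows, and Pose-TGCN's lead over Pose-GRU shrinks from 8.92 points on WLASL100 to 1.11 points on WLASL2000.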


This review was generated by AI and reviewed by a human editor.