QUICK REVIEW

[Paper Review] TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Dongxu Li, Chenchen Xu|arXiv (Cornell University)|Oct 12, 2020

Hand Gesture Recognition Systems|References: 28|Citations: 73

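One-Line Summary

TSPNet uses a temporal semantic pyramid with inter-scale and intra-scale attention to learn sign video representations from multi-scale segments, improving sign language translation without gloss annotations.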

ABSTRACT

Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences. Sign videos consist of continuous sequences of sign gestures with no clear boundaries in between. Existing SLT models usually represent sign visual features in a frame-wise manner so as to avoid needing to explicitly segment the videos into isolated signs. However, these methods neglect the temporal information of signs and lead to substantial ambiguity in translation. In this paper, we explore the temporal semantic structures of sign videos to learn more discriminative features. To this end, we first present a novel sign video segment representation which takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation. Taking advantage of the proposed segment representation, we develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet. Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments and an intra-scale attention to resolve semantic ambiguity by using non-local video context. Experiments show that our TSPNet outperforms the state-of-the-art with significant improvements on the BLEU score (from 9.58 to 13.41) and ROUGE score (from 31.80 to 34.96) on the largest commonly-used SLT dataset. Our implementation is available at https://github.com/verashira/TSPNet.

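Motivation and Objectives

Reduce the reliance of SLT on expensive gloss annotations by exploiting the temporal structure of sign videos.
Develop a multi-scale segment representation that captures both short-term and long-term temporal semantics.
Propose hierarchical feature learning with inter-scale attention for local semantic consistency and intra-scale attention for non-local context.
Enable joint learning of local and non-local video semantics to mitigate segmentation noise and ambiguity.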

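Proposed Method

Build multi-scale sign video segments with sliding windows of different widths (e.g., 8, 12, and 16 frames) and a sliding stride.
Extract segment features with an I3D backbone fine-tuned on word-level sign language recognition (WSLR) datasets.
Introduce a shared positional embedding to encode segment positions across scales.
Enforce local semantic consistency through inter-scale attention between each center (pivot) segment and its neighbors at larger scales.
Resolve local ambiguity with intra-scale self-attention over the enhanced center features.
Optionally, extend to an enlarged surrounding neighborhood that contains all pivots, so that local and non-local semantics are learned jointly.
Use a Transformer decoder to generate the translation from the encoder outputs.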
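
The first and fourth steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stride of 2, the rule that a neighbor is any larger-scale window fully covering the center segment, and the function names `sliding_windows`, `multi_scale_segments`, `larger_scale_neighbors`, and `attend` are all assumptions made for illustration. Plain lists stand in for I3D features, and the attention is ordinary single-query scaled dot-product attention.

```python
import math


def sliding_windows(num_frames, width, stride):
    """Return (start, end) frame ranges, end exclusive, for one window width."""
    if num_frames < width:
        # Assumption: a clip shorter than the window yields one truncated window.
        return [(0, num_frames)]
    return [(s, s + width) for s in range(0, num_frames - width + 1, stride)]


def multi_scale_segments(num_frames, widths=(8, 12, 16), stride=2):
    """Map each window width to its list of segments (stride=2 is assumed)."""
    return {w: sliding_windows(num_frames, w, stride) for w in widths}


def larger_scale_neighbors(pivot, segments, pivot_width):
    """Hypothetical neighbor rule: larger-scale windows that fully cover the pivot."""
    start, end = pivot
    return [
        win
        for width, wins in segments.items()
        if width > pivot_width
        for win in wins
        if win[0] <= start and end <= win[1]
    ]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


segments = multi_scale_segments(num_frames=20)
pivot = segments[8][2]  # third 8-frame window, used here as the center segment
neighbors = larger_scale_neighbors(pivot, segments, pivot_width=8)
```

In a real model, the query passed to `attend` would be the center segment's feature vector and the keys and values would be the features of its larger-scale neighbors, each combined with the shared positional embedding.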

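Experimental Results

Research Questions

RQ1: Do multi-scale sign video segments improve SLT over frame-wise features?
RQ2: Does inter-scale attention improve local semantic consistency across scales, and does intra-scale self-attention exploit non-local context to reduce segmentation ambiguity?
RQ3: Does joint learning of local and non-local semantics further improve translation quality compared with sequential attention?
RQ4: How does TSPNet perform on the RPWT dataset compared with prior baseline models trained without gloss annotations?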

Key Findings

Method	Window widths (frames)	ROUGE-L	BLEU-1	BLEU-2	BLEU-3	BLEU-4
Conv2d-RNN [2]	{1}	29.70	27.10	15.61	10.82	8.35
+ Luong Attn. [2] + [18]	{1}	30.70	29.86	17.52	11.96	9.00
+ Bahdanau Attn. [2] + [17]	{1}	31.80	32.24	19.03	12.83	9.58
TSPNet-Single (Transformer)	{8}	28.93	30.29	17.75	12.35	9.41
TSPNet-Single	{12}	28.10	29.02	17.03	12.08	9.39
TSPNet-Single	{16}	32.36	32.52	20.33	14.75	11.61
TSPNet-Sequential	{8,12,16}	34.77	35.65	22.80	16.60	12.97
TSPNet-Joint	{8,12,16}	34.96	36.10	23.12	16.88	13.41

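TSPNet-Joint achieves the best translation scores on RPWT, with ROUGE-L 34.96 and BLEU-4 13.41.
Multi-scale segments ({8,12,16}) outperform every single-scale variant on both BLEU-4 (12.97 to 13.41 versus at most 11.61) and ROUGE-L (34.77 to 34.96 versus at most 32.36).
Inter-scale attention improves local semantic consistency by integrating multi-scale segments.
Intra-scale self-attention draws on non-local context to resolve ambiguity in local gestures.
Joint local and non-local learning (TSPNet-Joint) outperforms sequential aggregation (TSPNet-Sequential), with BLEU-4 13.41 versus 12.97.
Compared with the strongest Conv2d-RNN baseline (with Bahdanau attention), TSPNet-Joint improves BLEU-4 from 9.58 to 13.41 and ROUGE-L from 31.80 to 34.96.
Training TSPNet-Joint takes about 2 hours on a single NVIDIA V100 GPU, excluding feature extraction.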
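
To put the reported gains in proportion, the short sketch below computes absolute and relative improvements from four rows of the results table. The scores are copied from the table; the helper name `gain` and the dictionary layout are illustrative choices only. On these numbers, the BLEU-4 gain over the Bahdanau-attention baseline is roughly 40% relative, and the ROUGE-L gain is roughly 10% relative.

```python
# Scores copied from the results table in this review.
scores = {
    "Conv2d-RNN + Bahdanau Attn.": {"ROUGE-L": 31.80, "BLEU-4": 9.58},
    "TSPNet-Single {16}": {"ROUGE-L": 32.36, "BLEU-4": 11.61},
    "TSPNet-Sequential": {"ROUGE-L": 34.77, "BLEU-4": 12.97},
    "TSPNet-Joint": {"ROUGE-L": 34.96, "BLEU-4": 13.41},
}


def gain(baseline, model, metric):
    """Return (absolute gain in points, relative gain in percent)."""
    base = scores[baseline][metric]
    new = scores[model][metric]
    absolute = new - base
    return absolute, 100.0 * absolute / base


for baseline in ("Conv2d-RNN + Bahdanau Attn.", "TSPNet-Single {16}", "TSPNet-Sequential"):
    for metric in ("BLEU-4", "ROUGE-L"):
        absolute, relative = gain(baseline, "TSPNet-Joint", metric)
        print(f"TSPNet-Joint vs {baseline} [{metric}]: +{absolute:.2f} ({relative:.1f}%)")
```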

This review was generated by AI and checked by a human editor.