[Paper Review] SF-Net: Structured Feature Network for Continuous Sign Language Recognition
SF-Net learns frame-, gloss-, and sentence-level features in a structured, end-to-end framework to improve continuous sign language recognition without frame-level supervision. It achieves state-of-the-art results on CSL and RWTH-PHOENIX datasets.
Continuous sign language recognition (SLR) aims to translate a signing sequence into a sentence. It is very challenging because sign language is rich in vocabulary, and many signs contain similar gestures and motions. Moreover, it is weakly supervised, as the alignment of signing glosses is not available. In this paper, we propose Structured Feature Network (SF-Net) to address these challenges by effectively learning multiple levels of semantic information in the data. The proposed SF-Net extracts features in a structured manner and gradually encodes information at the frame level, the gloss level and the sentence level into the feature representation. The proposed SF-Net can be trained end-to-end without the help of other models or pre-training. We tested the proposed SF-Net on two large-scale public SLR datasets collected from different continuous SLR scenarios. Results show that the proposed SF-Net clearly outperforms previous methods based on sequence-level supervision in terms of both accuracy and adaptability.
Motivation & Objective
- Address weakly supervised continuous SLR where gloss alignment is unavailable.
- Capture multi-level semantic information by structuring feature learning at frame, gloss, and sentence levels.
- Enable end-to-end training without extra pre-training or auxiliary models.
- Improve recognition accuracy and adaptability across datasets with diverse signing scenarios.
Proposed method
- Extract frame-level features with a combined 2D/3D convolutional framework that sums the outputs of the 2D and 3D branches (residual temporal learning).
- Introduce a gloss-level framing operation to create meta-frames and use an LSTM to model temporal dependencies within meta-frames.
- Apply a gloss-level regularizer based on Kullback–Leibler divergence to align gloss and sentence level distributions.
- Model sentence-level context with a Bi-LSTM over gloss-level features and optimize with CTC loss.
- Use a greedy decoder at test time to obtain the final gloss sequence from sentence-level predictions.
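The steps above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the window size, stride, blank index, the direction of the KL term, and all function names are assumptions made for illustration, and the LSTM and Bi-LSTM layers are omitted. It shows only three generic pieces: grouping frame features into overlapping meta-frames, a Kullback–Leibler divergence between two discrete distributions, and greedy CTC decoding (best label per step, merge repeats, drop blanks).

```python
import math
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (not specified in the review)


def make_meta_frames(frames: Sequence, window: int, stride: int) -> List[list]:
    """Group consecutive frame-level features into overlapping meta-frames.

    `window` and `stride` are hypothetical parameters; trailing frames that
    do not fill a complete window are dropped in this sketch.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        list(frames[start:start + window])
        for start in range(0, len(frames) - window + 1, stride)
    ]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same labels.

    Which distribution plays the role of p and which of q in SF-Net is an
    assumption left open here. Terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def greedy_ctc_decode(scores: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Take the best label at each step, merge repeats, then drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in scores]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

With ten frames, a window of 4 and a stride of 2, `make_meta_frames` yields four meta-frames. A blank between two identical labels keeps both in the decoded output, which is how CTC represents a repeated gloss.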
Experimental results
Research questions
- RQ1: Can a multi-level (frame, gloss, sentence) feature learning architecture improve continuous SLR without frame-level supervision?
- RQ2: Does incorporating 3D convolutions and gloss-level framing improve alignment and recognition accuracy across datasets?
- RQ3: What is the impact of a gloss-level regularizer and its introduction timing on training stability and final performance?
- RQ4: How does SF-Net perform relative to prior sentence-level supervision methods on large-scale CSL and RWTH-PHOENIX-Weather-2014 datasets?
Key findings
- SF-Net outperforms previous sentence-level supervision based methods on CSL and RWTH-PHOENIX-Weather-2014 datasets.
- Incorporating 3D convolution branches yields notable gains in both word-level CSL accuracy and sentence-level RWTH-WER.
- Gloss-level framing with an LSTM significantly improves alignment and reduces decoding errors compared to frame-level-only approaches.
- A gloss-level regularizer improves performance on the more vocabulary-rich RWTH dataset when introduced at appropriate training stages.
- SF-Net achieves state-of-the-art results on CSL (scratch: 4.8, with pretraining: 3.8 WER) and RWTH (scratch: 38.1–40.8 WER depending on setup; improved with pretraining).
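The WER figures above are word error rates. As a rough sketch of how such a number is commonly computed (the review does not describe the paper's exact evaluation script, so the function name and details here are assumptions), WER is the word-level edit distance between the reference and predicted gloss sequences, divided by the reference length and expressed as a percentage:

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(reference)
```

For a four-gloss reference, a prediction with one substituted gloss and one missing gloss has an edit distance of 2 and therefore a WER of 50.0. Lower is better, and the value can exceed 100 when the prediction contains many insertions.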
This review was created by AI and reviewed by human editors.