[Paper Review] Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks
This paper proposes a conditional RNN model that jointly performs online spoken language understanding (SLU) and language modeling, updating its intent and slot predictions as each word arrives. Feeding recurrent intent and slot label contexts back into the model reduces language modeling perplexity by 11.8% and intent detection error by 22.3%, both relative to independently trained models, and the joint model also performs well in noisy ASR settings.
Speaker intent detection and semantic slot filling are two critical tasks in spoken language understanding (SLU) for dialogue systems. In this paper, we describe a recurrent neural network (RNN) model that jointly performs intent detection, slot filling, and language modeling. The neural network model keeps updating the intent estimate as each word in the transcribed utterance arrives and uses it as a contextual feature in the joint model. The language model and the online SLU model are evaluated on the ATIS benchmark data set. On the language modeling task, our joint model achieves an 11.8% relative reduction in perplexity compared to the independently trained language model. On SLU tasks, our joint model outperforms the independently trained models by 22.3% on intent detection error rate, with slight degradation in slot filling F1 score. The joint model also shows advantageous performance in realistic ASR settings with noisy speech input.
Motivation & Objective
- To address the limitation of existing joint SLU models that require full utterance input, making them unsuitable for real-time, online applications.
- To improve language modeling and intent detection performance by jointly training SLU and language modeling components within a single RNN framework.
- To explore the use of recurrent intent and slot label states as contextual features for next-word prediction in online ASR systems.
- To evaluate the model’s robustness under realistic noisy speech input conditions, simulating practical deployment scenarios.
Proposed method
- A conditional RNN architecture is designed to process input word sequences incrementally, updating intent and slot predictions in real time as each word arrives.
- The model incorporates recurrent hidden states that encode both intent and slot label information, which are used as context vectors for next-word prediction.
- A scheduled scaling mechanism is applied to the intent vector's contribution to the context vector, increasing its influence over time to improve language modeling performance.
- The model integrates local and recurrent context features: local intent and slot labels are concatenated with the RNN hidden state, while recurrent states capture long-term dependencies.
- The joint model is trained end-to-end to optimize both language modeling and SLU objectives simultaneously, with shared parameters across tasks.
- The model is evaluated using the ATIS benchmark, with ablation studies on context types and training schedules to isolate contributions.
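The incremental loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the class name `JointOnlineRNN`, the helper `intent_scale`, and the parameters `warmup` and `ramp` are hypothetical, and the schedule shape (zero for the first few words, then a linear ramp to 1) is only one plausible reading of "increasing its influence over time". The unweighted sum in `joint_nll` is likewise an assumption about how the three objectives are combined.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def intent_scale(step, warmup, ramp):
    """Hypothetical schedule: 0 for the first `warmup` words, then a linear ramp to 1."""
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup + 1) / ramp)

class JointOnlineRNN:
    """Toy joint model: one shared hidden state feeds intent, slot, and next-word outputs."""

    def __init__(self, vocab, n_intents, n_slots, hidden, seed=0):
        rng = random.Random(seed)
        ctx = n_intents + n_slots
        self.vocab, self.hidden = vocab, hidden
        self.n_intents, self.n_slots = n_intents, n_slots
        self.W_in = rand_matrix(hidden, vocab + ctx + hidden, rng)
        self.W_intent = rand_matrix(n_intents, hidden, rng)
        self.W_slot = rand_matrix(n_slots, hidden, rng)
        self.W_word = rand_matrix(vocab, hidden + ctx, rng)

    def run(self, word_ids, warmup=2, ramp=3):
        h = [0.0] * self.hidden
        context = [0.0] * (self.n_intents + self.n_slots)
        outputs = []
        for t, w in enumerate(word_ids):
            one_hot = [1.0 if i == w else 0.0 for i in range(self.vocab)]
            # Hidden state sees the current word, the previous label context, and itself.
            h = [math.tanh(a) for a in matvec(self.W_in, one_hot + context + h)]
            intent = softmax(matvec(self.W_intent, h))
            slot = softmax(matvec(self.W_slot, h))
            # Scale the intent part of the context by the schedule factor.
            eta = intent_scale(t, warmup, ramp)
            context = [eta * p for p in intent] + slot
            # Next-word prediction is conditioned on hidden state plus label context.
            next_word = softmax(matvec(self.W_word, h + context))
            outputs.append((intent, slot, next_word))
        return outputs

def joint_nll(outputs, word_ids, intent_id, slot_ids, w_intent=1.0, w_slot=1.0, w_lm=1.0):
    """Sum of intent, slot, and next-word negative log-likelihoods (weights are hypothetical)."""
    loss = 0.0
    for t, (intent, slot, next_word) in enumerate(outputs):
        loss -= w_intent * math.log(intent[intent_id])
        loss -= w_slot * math.log(slot[slot_ids[t]])
        if t + 1 < len(word_ids):  # the last word has no next-word target in this sketch
            loss -= w_lm * math.log(next_word[word_ids[t + 1]])
    return loss

words = [1, 5, 2, 7, 3]
model = JointOnlineRNN(vocab=10, n_intents=3, n_slots=4, hidden=8)
outputs = model.run(words)
loss = joint_nll(outputs, words, intent_id=1, slot_ids=[0, 0, 2, 0, 3])
```

Each call to `run` yields one intent distribution, one slot distribution, and one next-word distribution per input word, which is what makes the model usable before the utterance is complete.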
Experimental results
Research questions
- RQ1: Can a joint RNN model achieve better language modeling and intent detection performance compared to independent training of separate models?
- RQ2: How does incorporating recurrent intent and slot label states improve online SLU and language modeling in real time?
- RQ3: What is the impact of scheduling the intent vector's contribution to the context vector on language modeling perplexity?
- RQ4: How does the joint model perform under realistic noisy speech input conditions, particularly in ASR rescoring pipelines?
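For readers unfamiliar with the rescoring pipeline mentioned in RQ4, the sketch below shows one common form of it: re-ranking an n-best list by adding a weighted language model score to each hypothesis's ASR score. Whether the paper rescores n-best lists or lattices is not stated in this summary, so the n-best setup, the function `rescore_nbest`, the `lm_weight` parameter, and the toy unigram scorer are all assumptions made for illustration.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=0.5):
    """Pick the hypothesis with the best combined score.

    hypotheses: list of (words, asr_score) pairs, asr_score in the log domain.
    lm_logprob: function mapping a word list to a log-probability.
    lm_weight:  hypothetical interpolation weight for the LM score.
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))
    return best[0]

# Toy stand-in for a language model: known words are cheap, unknown words are costly.
KNOWN_WORDS = {"show", "flights", "to", "boston"}

def toy_lm_logprob(words):
    return sum(-1.0 if w in KNOWN_WORDS else -5.0 for w in words)

nbest = [
    (["show", "fights", "to", "boston"], -4.0),   # better ASR score, unlikely wording
    (["show", "flights", "to", "boston"], -4.2),  # slightly worse ASR score
]
chosen = rescore_nbest(nbest, toy_lm_logprob)
```

In a joint model, `lm_logprob` would be supplied by the next-word output of the RNN, so a better language model can directly lower the word error rate of the selected hypothesis.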
Key findings
- The joint model achieves an 11.8% relative reduction in language modeling perplexity compared to an independently trained language model on the ATIS test set.
- The joint model reduces intent detection error by 22.3% relative to the independently trained model, a substantial improvement in online intent classification.
- Incorporating recurrent slot label context improves slot filling F1 scores and reduces intent classification error by 16.8% relative, showing the benefit of modeling label dependencies.
- The model with both recurrent intent and slot label context achieves the best overall performance, maintaining gains in language modeling and intent detection while slightly degrading slot F1.
- In noisy ASR settings, rescoring with the jointly trained RNN LM outperforms rescoring with a 5-gram LM or an independently trained RNN LM, reducing WER to 12.59% and intent error to 4.44%.
- The model's gains hold up in realistic ASR conditions, with only a 2.87% increase in intent error and a 7.77% drop in slot F1 when ASR output is used instead of ground-truth text.
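Several of the findings above are stated as relative changes, which are easy to confuse with absolute ones. The short sketch below shows how perplexity and a relative reduction are typically computed. The function names are hypothetical and the numbers are illustrative only; they are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# Illustrative values only (not results from the paper).
uniform_over_four = [math.log(0.25)] * 6
ppl = perplexity(uniform_over_four)       # a uniform guess over 4 words
change = relative_reduction(10.0, 8.0)    # an error rate falling from 10.0 to 8.0
```

A drop from 10.0 to 8.0 is 2 points absolute but 20% relative, which is the sense in which the 11.8% and 22.3% figures are reported.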
This review was created by AI and reviewed by human editors.