QUICK REVIEW

[Paper Review] End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning

J. D. Williams, Geoffrey Zweig|arXiv (Cornell University)|Jun 3, 2016

Speech and dialogue systems|References: 25|Citations: 122

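One-sentence summary

An end-to-end dialog controller uses an LSTM to map raw dialog history to a distribution over actions. It is trained with supervised learning and policy gradients, and is complemented by domain-specific software that encodes business rules and API access.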

ABSTRACT

This paper presents a model for end-to-end learning of task-oriented dialog systems. The main component of the model is a recurrent neural network (an LSTM), which maps from raw dialog history directly to a distribution over system actions. The LSTM automatically infers a representation of dialog history, which relieves the system developer of much of the manual feature engineering of dialog state. In addition, the developer can provide software that expresses business rules and provides access to programmatic APIs, enabling the LSTM to take actions in the real world on behalf of the user. The LSTM can be optimized using supervised learning (SL), where a domain expert provides example dialogs which the LSTM should imitate; or using reinforcement learning (RL), where the system improves by interacting directly with end users. Experiments show that SL and RL are complementary: SL alone can derive a reasonable initial policy from a small number of training dialogs; and starting RL optimization with a policy trained with SL substantially accelerates the learning rate of RL.

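Motivation and objectives

Reduce manual dialog-state engineering by letting the LSTM infer a representation of dialog history.
Integrate a recurrent neural network with domain-specific software that encodes business rules and APIs, so the system can take real-world actions.
Demonstrate end-to-end training of dialog control using both supervised learning and reinforcement learning.
Show that SL provides a strong initial policy and accelerates subsequent RL optimization.
Enable online retraining so the policy can be adapted during live dialogs.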

Proposed method

Three-component model: an LSTM, domain-specific software with action gating and API access, and a language understanding module.
The LSTM takes a feature vector from entity recognition and developer-provided features to output a distribution over action templates.
An action mask provided by developer code gates the available actions; the LSTM's output distribution is masked and then renormalized over the remaining actions.
Actions are selected by sampling during RL and by choosing the max-probability action otherwise, with history fed back to the LSTM.
RL uses a policy gradient update with a baseline to reduce variance, and a small constant is added to probabilities when masks clip actions.
Supervised learning trains the model to imitate provided example dialogs; RL fine-tunes the policy while ensuring it still reconstructs the training dialogs.
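The masking, action selection, and policy-gradient steps listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the LSTM is replaced by a plain list of logits, and the function names, learning rate, reward, and baseline values are all hypothetical. The small constant mentioned above is left out, because in this toy setup a sampled action always has nonzero probability.

```python
import math
import random


def masked_policy(logits, mask):
    """Softmax over logits, zero out disallowed actions, then renormalize.

    mask[i] is 1 if the (hypothetical) developer code allows action i, else 0.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    gated = [(e / total) * m for e, m in zip(exps, mask)]
    norm = sum(gated)
    if norm == 0:
        raise ValueError("the mask disallows every action")
    return [g / norm for g in gated]


def select_action(probs, explore, rng=random):
    """Sample while exploring (RL); otherwise take the most probable action."""
    if explore:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


def reinforce_step(logits, mask, action, ret, baseline, lr=0.1):
    """One policy-gradient update with a baseline, applied directly to logits.

    For a masked softmax, d log pi(action) / d logit_i = 1[i == action] - pi(i).
    """
    probs = masked_policy(logits, mask)
    advantage = ret - baseline  # the baseline reduces gradient variance
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]
mask = [1, 0, 1]  # action 1 is currently unavailable
probs = masked_policy(logits, mask)
greedy = select_action(probs, explore=False)
updated = reinforce_step(logits, mask, action=2, ret=1.0, baseline=0.0)
print(probs, greedy, updated)
```

In this example the masked action keeps probability zero, and a positive advantage for action 2 raises its logit relative to action 0. In a full system the gradient would flow into the recurrent network's weights rather than into a bare logit vector.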

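Experimental results

Research questions

RQ1 How effectively can an end-to-end LSTM learn dialog control by mapping raw dialog history to actions without a hand-crafted state representation?
RQ2 Does combining supervised learning and reinforcement learning improve data efficiency and policy performance over either one alone?
RQ3 How do action masking and domain-specific APIs affect the learned policy and its ability to execute real-world actions?
RQ4 Can the model be trained and updated online in real time without losing fidelity to the supervised examples?
RQ5 How does a recurrent architecture compare with non-recurrent architectures at maintaining dialog history?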

Key findings

The LSTM can learn to map from dialog history to action templates with minimal hand-crafted state representation.
After one dialog, 70% of dialog turns are correctly predicted; after 20 dialogs, accuracy exceeds 90% per-turn, with nearly 50% of dialogs predicted completely correctly.
A non-recurrent DNN failed to reconstruct the training set when trained on 20 dialogs, while an RNN could, showing the importance of memory for history.
Adding a small amount of supervised learning before reinforcement learning substantially accelerates RL learning and reduces policy variance.
Policies trained with SL are further improved by RL, but RL alone can struggle to discover complete policies; prior SL pretraining improves reliability and performance.
Retraining the LSTM takes less than one second on a standard CPU, enabling online corrections and active learning; ROC analysis indicates low-scoring actions are more likely to be incorrect, guiding efficient labeling.
Active RL with pretraining reduces variability across runs and accelerates convergence when optimizing with policy gradients.
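The findings above report two different accuracy measures (per-turn and whole-dialog) and note that low-scoring actions are better candidates for labeling. A short Python sketch makes the distinction concrete. The dialogs, action names, scores, and threshold are invented for illustration and are not data from the paper.

```python
def turn_and_dialog_accuracy(predicted, reference):
    """Return (per-turn accuracy, fraction of dialogs predicted fully correctly).

    Both arguments are lists of dialogs; each dialog is a list of action labels.
    """
    turns = correct = perfect = 0
    for pred, ref in zip(predicted, reference):
        hits = sum(p == r for p, r in zip(pred, ref))
        turns += len(ref)
        correct += hits
        # A dialog counts as perfect only if every turn matches.
        perfect += int(len(pred) == len(ref) and hits == len(ref))
    return correct / turns, perfect / len(reference)


def flag_for_labeling(scored_turns, threshold):
    """Return ids of turns whose top action probability is below the threshold."""
    return [turn_id for turn_id, score in scored_turns if score < threshold]


# Hypothetical action labels for two short dialogs.
reference = [["greet", "ask_name", "place_call"], ["greet", "place_call"]]
predicted = [["greet", "ask_name", "place_call"], ["greet", "ask_name"]]
turn_acc, dialog_acc = turn_and_dialog_accuracy(predicted, reference)
print(turn_acc, dialog_acc)  # 0.8 0.5

# Hypothetical (turn id, top action probability) pairs.
flagged = flag_for_labeling([("t1", 0.97), ("t2", 0.41), ("t3", 0.88)], 0.5)
print(flagged)  # ['t2']
```

A single wrong turn lowers per-turn accuracy only slightly but makes the whole dialog count as incorrect, which is why the whole-dialog figure in the findings sits well below the per-turn figure.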


This review was generated by AI and checked by a human editor.