QUICK REVIEW

[Paper Review] Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System

Jianhong Wang, Yuan Zhang|arXiv (Cornell University)|May 3, 2021

Topic Modeling | References: 54 | Cited by: 2

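One-Sentence Summary

This paper proposes HDNO, an option-based hierarchical reinforcement learning framework that models the hierarchical structure between the dialogue policy and the natural language generator (NLG) in order to improve both task success rate and utterance comprehensibility in task-oriented dialogue. It decouples the policy from the NLG through asynchronous training and introduces a language-model-based discriminator for reward shaping. HDNO is reported to achieve state-of-the-art performance on MultiWoz 2.0 and 2.1, with notable gains on both automatic and human evaluation metrics.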

ABSTRACT

Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL); however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO, where the latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), as well as suggest the asynchronous updates between dialogue policy and NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with word-level E2E model trained with RL, LaRL and HDSA, showing improvements on the performance evaluated by automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show the explainability for HDNO.

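Research Motivation and Objectives

Resolve the trade-off between task success rate and utterance comprehensibility in task-oriented dialogue systems.
Model the hierarchical relationship between the dialogue policy and the NLG with latent dialogue act representations, without requiring explicit dialogue act annotations.
Achieve stable, convergent training of the dialogue policy and the NLG through asynchronous updates of the two within a hierarchical reinforcement learning framework.
Use a discriminator built on pretrained language models as an additional reward signal to improve the naturalness and coherence of generated system responses.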

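Proposed Method

Proposes HDNO, an option-based hierarchical reinforcement learning framework that uses latent dialogue acts as the option space to model the hierarchical structure between the dialogue policy and the NLG.
Applies hierarchical reinforcement learning with separate exploration and update strategies for the policy and the NLG, with a theoretical guarantee of convergence to a local maximizer.
Introduces asynchronous updates between the policy and the NLG, which decouple their learning dynamics and improve training stability.
Introduces a discriminator built on pretrained language models that supplies a language-level reward signal and improves the comprehensibility of generated utterances.
Uses latent dialogue acts as the shared representation between the policy and the NLG, removing the dependence on hand-designed dialogue act templates.
Trains end-to-end with reinforcement learning, with a reward composed of the task success rate and the discriminator-based naturalness score.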
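Two of the mechanisms listed above, asynchronous updates and discriminator-based reward shaping, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, the additive form of the reward, the strict per-episode alternation, and the toy unigram model standing in for the language-model discriminator are all hypothetical choices made for the example.

```python
import math
from typing import Dict, List


def lm_score(tokens: List[str], unigram_probs: Dict[str, float],
             floor: float = 1e-6) -> float:
    """Average per-token log-probability under a toy unigram model.

    Stands in for a language-model discriminator (an assumption for
    illustration; a real system would use a trained neural LM).
    """
    if not tokens:
        return math.log(floor)
    total = sum(math.log(max(unigram_probs.get(t, floor), floor)) for t in tokens)
    return total / len(tokens)


def shaped_reward(success: bool, tokens: List[str],
                  unigram_probs: Dict[str, float], alpha: float = 0.1) -> float:
    """Task-success reward plus a weighted fluency score.

    The additive form and the weight `alpha` are assumptions.
    """
    return float(success) + alpha * lm_score(tokens, unigram_probs)


def asynchronous_schedule(num_episodes: int) -> List[str]:
    """Pick which level is updated in each episode; the other stays frozen.

    Strict alternation is one simple (assumed) way to realise
    asynchronous updates between the dialogue policy and the NLG.
    """
    return ["policy" if ep % 2 == 0 else "nlg" for ep in range(num_episodes)]


def count_updates(num_episodes: int) -> Dict[str, int]:
    """Skeleton training loop that only records which level was updated."""
    updates = {"policy": 0, "nlg": 0}
    for level in asynchronous_schedule(num_episodes):
        # A real loop would roll out a dialogue here, compute the rewards,
        # and take a gradient step on `level` only.
        updates[level] += 1
    return updates


if __name__ == "__main__":
    probs = {"the": 0.5, "hotel": 0.5}
    print(asynchronous_schedule(4))
    print(shaped_reward(True, ["the", "hotel"], probs))
    print(shaped_reward(True, ["zzz", "qqq"], probs))
```

With the same task outcome, the utterance made of in-vocabulary tokens receives the higher shaped reward, which is the intended effect of adding a fluency term. Updating one level at a time keeps the other level's behaviour fixed during each update, which is the intuition behind decoupling the learning dynamics of the two.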

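Experimental Results

Research Questions

RQ1: Can an option-based hierarchical reinforcement learning framework improve both task success rate and utterance comprehensibility in task-oriented dialogue systems?
RQ2: Do asynchronous updates between the policy and the NLG ensure convergence and stability during joint optimization?
RQ3: Does a language-model-based discriminator effectively improve the naturalness and fluency of generated system responses?
RQ4: How much does HDNO improve over existing end-to-end and hierarchical dialogue models on multi-domain benchmarks such as MultiWoz 2.0 and 2.1?
RQ5: Are the learned latent dialogue acts semantically meaningful and interpretable by humans?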

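Key Findings

On MultiWoz 2.0 and 2.1, HDNO outperforms the word-level end-to-end model trained with RL, LaRL, and HDSA on both automatic and human evaluation metrics.
Adding the language-model-based discriminator markedly improves the comprehensibility of generated utterances without lowering the task success rate.
Asynchronous updates between the policy and the NLG yield a stable training process that converges to a local maximizer, a result supported by the theoretical analysis.
The latent dialogue acts learned by HDNO are semantically meaningful and admit a coherent interpretation, which supports the model's explainability.
HDNO is reported to outperform strong baselines on both success rate and fluency, with observable gains in BLEU, BLEU-4, and human-rated naturalness.
The framework decouples the learning of the policy and the NLG while maintaining high performance, which supports the effectiveness of the option-based hierarchical structure.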


This review was generated by AI and reviewed by human editors.