QUICK REVIEW

[Paper Review] Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization

Lei Shen, Haolan Zhan|arXiv (Cornell University)|Sep 14, 2021

Topic Modeling | References: 36 | Cited by: 4

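One-Sentence Summary

The paper proposes a data filtering method for open-domain dialogue, based on Bayesian optimization, that identifies untrustworthy training samples by combining seven dialogue attributes into a weighted quality measure. The attribute weights are tuned with Bayesian optimization against the validation set, low-scoring samples are filtered out, and a hybrid MLE-NEG training framework speeds up retraining; the paper reports improved response quality on two benchmark datasets.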

ABSTRACT

Being able to reply with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. In order to generate better responses, some approaches have been proposed, such as feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architecture and loss functions, or filtering out untrustworthy samples based on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples from training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt) that aims to optimize an objective function for dialogue generation iteratively on the validation set. Then we score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate" iterations involved in BayesOpt on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and negative training method (NEG). The training method updates parameters of a trained NCMs on two small sets with newly maintained and removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly maintained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and NCMs trained on the filtered datasets achieve better performance.
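The three ingredients described in the abstract can be written compactly. The following is a notational sketch only, not the paper's own formulation: the symbols $a_i$ (attribute scores), $w_i$ (weights), $J$ (the validation objective, written here as a maximization), $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ (newly maintained and newly removed samples), and the balancing factor $\lambda$ are all assumptions introduced for illustration.

```latex
\begin{align}
S(x, y) &= \sum_{i=1}^{7} w_i \, a_i(x, y), \\
w^{*} &= \arg\max_{w} \; J\bigl(\theta(w);\ \mathcal{D}_{\mathrm{val}}\bigr), \\
\mathcal{L}(\theta) &= -\sum_{(x, y) \in \mathcal{D}^{+}} \log p_\theta(y \mid x)
  \;+\; \lambda \sum_{(x, y) \in \mathcal{D}^{-}} \log p_\theta(y \mid x).
\end{align}
```

Minimizing $\mathcal{L}$ raises the log-likelihood of the newly maintained samples (the MLE term) and lowers it for the newly removed ones (the NEG term), which matches the behavior the abstract describes. Here $\theta(w)$ denotes the model obtained after filtering the training data with weights $w$.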

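Research Motivation and Objectives

Address the generic, inconsistent, or irrelevant responses that untrustworthy training samples cause in open-domain dialogue systems.
Improve dialogue generation quality by filtering out low-quality training data rather than relying solely on changes to model architecture or loss functions.
Develop a data filtering method that combines multiple dialogue attributes into a single quality measure instead of relying on one metric.
Use Bayesian optimization to tune the attribute weights so that performance on the validation set is maximized.
Speed up the iterative filter-train-evaluate process on large-scale datasets with a new MLE-NEG fine-tuning strategy.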

Proposed Method

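Define the quality measure 𝑆 as a linear combination of seven dialogue attributes, including specificity, repetitiveness, relatedness, coherence, consistency, and fluency.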
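Optimize the attribute weights in 𝑆 with Bayesian optimization (BayesOpt) on the validation set to maximize an objective function for dialogue generation.
The objective function is built from automatic evaluation metrics such as BLEU, perplexity, Distinct-n, and intra-response diversity.
Score each training sample with the optimized 𝑆, sort the samples in descending order of score, and filter out the lowest-scoring ones.
Introduce a training framework that applies maximum likelihood estimation (MLE) to newly maintained samples and negative training (NEG) to newly removed samples in order to speed up retraining.
Update model parameters on these small, changing sample sets, which makes filter-train-evaluate iterations on large-scale datasets efficient.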
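The scoring, filtering, and bookkeeping steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `drop_fraction` parameter, the three-attribute toy table, and both weight vectors are hypothetical, and the attribute values are taken as precomputed. In the paper's setting the weights would come from BayesOpt.

```python
from typing import Dict, Sequence, Set, Tuple


def quality_score(attributes: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted linear combination of one sample's attribute values."""
    if len(attributes) != len(weights):
        raise ValueError("attributes and weights must have the same length")
    return sum(w * a for w, a in zip(weights, attributes))


def filter_samples(
    attribute_table: Dict[str, Sequence[float]],
    weights: Sequence[float],
    drop_fraction: float,
) -> Set[str]:
    """Rank samples by score (descending) and drop the bottom fraction.

    Returns the ids of the samples that are kept.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(
        attribute_table,
        key=lambda sid: quality_score(attribute_table[sid], weights),
        reverse=True,
    )
    n_keep = len(ranked) - int(len(ranked) * drop_fraction)
    return set(ranked[:n_keep])


def retraining_sets(
    previous_kept: Set[str], current_kept: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Split the change between two filtering rounds into the two small sets.

    newly_maintained: kept now but not before (target of the MLE update).
    newly_removed: kept before but not now (target of the NEG update).
    """
    newly_maintained = current_kept - previous_kept
    newly_removed = previous_kept - current_kept
    return newly_maintained, newly_removed


# Toy data: four samples with three made-up attribute values each.
toy_table = {
    "a": [0.9, 0.1, 0.8],
    "b": [0.2, 0.9, 0.1],
    "c": [0.5, 0.5, 0.5],
    "d": [0.1, 0.2, 0.1],
}
kept_round_1 = filter_samples(toy_table, weights=[1.0, 0.0, 1.0], drop_fraction=0.5)
kept_round_2 = filter_samples(toy_table, weights=[0.0, 1.0, 0.0], drop_fraction=0.5)
newly_maintained, newly_removed = retraining_sets(kept_round_1, kept_round_2)
```

With these toy weights, the first round keeps samples `a` and `c`, the second keeps `b` and `c`, so only `b` would receive an MLE update and only `a` a NEG update. Working on these set differences rather than the full training set is what keeps each iteration cheap.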

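Experimental Results

Research Questions

RQ1: Does a multi-attribute quality measure that combines seven dialogue attributes identify untrustworthy dialogue samples better than filtering based on a single attribute?
RQ2: Is Bayesian optimization effective at learning the attribute weights for dialogue data filtering?
RQ3: Does the proposed MLE-NEG fine-tuning strategy reduce training time during iterative data filtering on large-scale datasets?
RQ4: Does filtering data with the optimized quality measure improve both automatic and human evaluation metrics?
RQ5: Do dialogue models trained on the filtered data outperform models trained on the original data or on data filtered by a single attribute?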

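Key Findings

The proposed method, which combines seven dialogue attributes with weights found by Bayesian optimization, achieves the best results on all automatic evaluation metrics on both benchmark datasets.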
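On the DailyDialog dataset, the model trained on filtered data reaches a BLEU score of 0.80 against 0.46 for the baseline, and a perplexity of 46.06, lower than the baseline's 48.98. The stated 17% relative BLEU improvement is inconsistent with these two scores, which would imply (0.80 − 0.46) / 0.46 ≈ 74%, so the figures should be checked against the paper.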
The model trained on filtered data reaches a Distinct-3 score of 1.70, against 0.27 for the baseline, indicating more diverse responses.
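The MLE-NEG training framework makes retraining efficient and reduces the time spent on iterative filtering over large-scale datasets.
Bayesian optimization explores a broad hypothesis space and steadily improves the validation metric over iterations, as the J-value curve shows.
The method outperforms filtering based on a single attribute (such as coherence, relatedness, or fluency), which supports multi-dimensional quality assessment.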
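Since the findings above quote Distinct-3, a short sketch of how Distinct-n is commonly computed may help in reading those numbers. This uses the usual corpus-level definition (unique n-grams divided by total n-grams over all generated responses); whether the paper uses exactly this variant, and on what scale it reports the result, is not stated in this review, so treat both as assumptions. The function name and the toy responses are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def distinct_n(responses: Iterable[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams across tokenized responses."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for tokens in responses:
        # Responses shorter than n tokens contribute no n-grams.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Two toy responses that share most of their tokens.
toy_responses = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]
distinct_1 = distinct_n(toy_responses, 1)  # 5 unique unigrams out of 8
distinct_2 = distinct_n(toy_responses, 2)  # 4 unique bigrams out of 6
```

Under this definition the ratio always lies between 0 and 1, so a reported value such as 1.70 presumably uses a different scale, for example percent.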

This review was generated by AI and reviewed by human editors.