QUICK REVIEW

[Paper Review] LaMDA: Language Models for Dialog Applications

Romal Thoppilan, Daniel De Freitas|arXiv (Cornell University)|Jan 20, 2022

Topic Modeling | Cited by 705

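One-Sentence Summary

LaMDA is a family of Transformer-based dialog models with up to 137B parameters, pre-trained on public dialog data and web text, then refined through fine-tuning and access to external tools to improve dialog quality, safety, and factual grounding.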

ABSTRACT

We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.

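Research Motivation and Goals

Study how model scale affects dialog quality, safety, and factual grounding.
Develop fine-tuning strategies that use annotated dialog data to improve safety and response quality.
Give the model access to external knowledge through tools to strengthen grounding and factual accuracy.
Evaluate LaMDA in education and content recommendation settings to measure helpfulness and role consistency.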

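Proposed Method

Train decoder-only Transformer models with up to 137B parameters on 1.56T words of public dialog data and web text.
Use a sample-and-rank strategy: generate several candidate responses and select one based on log-likelihood and length.
Fine-tune with discriminative and generative objectives to optimize quality (SSI: sensibleness, specificity, interestingness) and safety.
Augment outputs with an external toolset (information retrieval, calculator, translator), and train the model to issue tool queries and integrate the returned snippets.
Collect large annotated datasets (dialog turns, safety annotations, grounding annotations) for evaluation and fine-tuning.
Pre-condition LaMDA on application-specific dialogs to evaluate role-specific helpfulness and consistency.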
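The sample-and-rank step listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The summary only says that selection depends on log-likelihood and length, so the `Candidate` type, the `sample_fn` callback, and the use of mean per-token log-likelihood as the length-aware score are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical container for one sampled response."""
    text: str
    token_logprobs: List[float]  # log-probability of each generated token


def length_aware_score(candidate: Candidate) -> float:
    """Assumed scoring rule: mean per-token log-likelihood.

    Dividing by length keeps longer responses from being penalized
    simply for containing more tokens.
    """
    n = len(candidate.token_logprobs)
    if n == 0:
        return float("-inf")  # never prefer an empty response
    return sum(candidate.token_logprobs) / n


def sample_and_rank(sample_fn: Callable[[], Candidate],
                    num_candidates: int) -> Candidate:
    """Draw independent candidates and return the highest-scoring one."""
    candidates = [sample_fn() for _ in range(num_candidates)]
    return max(candidates, key=length_aware_score)


# Toy usage: a fixed pool stands in for a real sampler.
pool = iter([
    Candidate("ok", [-0.5, -0.5]),
    Candidate("I like hiking too.", [-0.2, -0.3, -0.1, -0.2]),
    Candidate("Purple monkey.", [-1.0, -2.0, -0.3]),
])
best = sample_and_rank(lambda: next(pool), num_candidates=3)
print(best.text)
```

In a real system `sample_fn` would call the language model's sampler. Here it just pops prepared candidates so the ranking logic can be checked in isolation.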

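Experimental Results

Research Questions

RQ1: How does scale (model size) affect dialog quality, safety, and grounding, relative to human performance?
RQ2: Does fine-tuning combined with scaling outperform scaling alone on quality, safety, and grounding?
RQ3: Does letting the model consult external knowledge sources and tools improve grounding?
RQ4: How does LaMDA perform on helpfulness and role consistency in the education and content recommendation domains?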

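Key Findings

Increasing model scale improves dialog quality (sensibleness, specificity, interestingness).
Scaling alone yields limited gains in safety and grounding relative to human performance.
Fine-tuning combined with scaling yields significant improvements in quality, safety, and grounding.
Augmenting outputs with the external knowledge toolset improves grounding and reduces unsupported claims.
Discriminative fine-tuning and a separate safety predictor help filter out unsafe responses before candidates are ranked.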
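LaMDA pre-conditioned on application-specific dialogs is more helpful and more role-consistent in the education and content recommendation settings.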
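The filter-then-rank finding above can be illustrated with a short sketch. It assumes that classifier scores for safety and for the three SSI dimensions are already available for each candidate. The `ScoredCandidate` type, the 0.8 safety threshold, and the 3:1:1 quality weights are illustrative assumptions for this example and are not taken from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredCandidate:
    """Hypothetical candidate with precomputed classifier scores in [0, 1]."""
    text: str
    safety: float
    sensible: float
    specific: float
    interesting: float


SAFETY_THRESHOLD = 0.8             # assumed cutoff, for illustration only
QUALITY_WEIGHTS = (3.0, 1.0, 1.0)  # assumed weights for sensible/specific/interesting


def quality(c: ScoredCandidate) -> float:
    """Weighted sum of the three SSI scores."""
    w_sens, w_spec, w_int = QUALITY_WEIGHTS
    return w_sens * c.sensible + w_spec * c.specific + w_int * c.interesting


def filter_then_rank(candidates: List[ScoredCandidate]) -> Optional[ScoredCandidate]:
    """Drop candidates below the safety threshold, then pick the best by quality.

    Returns None when every candidate is filtered out, so the caller can
    decide on a fallback response.
    """
    safe = [c for c in candidates if c.safety >= SAFETY_THRESHOLD]
    if not safe:
        return None
    return max(safe, key=quality)


candidates = [
    ScoredCandidate("Here is how to pick your neighbor's lock.", 0.20, 0.9, 0.9, 0.9),
    ScoredCandidate("That's nice.", 0.99, 0.9, 0.1, 0.1),
    ScoredCandidate("Try the coastal trail; it has shade and sea views.", 0.95, 0.9, 0.8, 0.6),
]
chosen = filter_then_rank(candidates)
print(chosen.text if chosen else "(no safe candidate)")
```

Filtering before ranking means an unsafe response can never win on quality alone, which is the design point the finding describes.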

This review was generated by AI and reviewed by a human editor.