QUICK REVIEW

[Paper Review] The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues

Anaïs Tack, Chris Piech|arXiv (Cornell University)|May 16, 2022

Topic Modeling | Cited by 37

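One-Sentence Summary

The paper proposes an AI teacher test based on human pairwise comparisons to evaluate how Blender and GPT-3 compare with human teachers on three pedagogical abilities, and finds that the AI teachers fall behind humans, particularly on helpfulness.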

ABSTRACT

How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: Δ ability = -0.75; GPT-3: Δ ability = -0.93).

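Research Motivation and Objectives

Motivation: AI teachers in educational dialogues need to be evaluated beyond conversational uptake.
Propose a human-in-the-loop pairwise comparison method for measuring pedagogical ability.
Quantify how Blender and GPT-3 compare with human teachers on three pedagogical dimensions.
Open-source the data, code, and methodology to support the autonomous improvement of AI teaching agents.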

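Proposed Method

Run Blender and GPT-3 in real educational dialogues and generate parallel AI teacher responses to student utterances.
Collect human evaluations through online comparative judgments covering the three pedagogical abilities, with random item selection.
Use a Bayesian Bradley-Terry model to infer latent ability parameters and rank responses by ability.
Include an intercept parameter to capture the home-field effect, and handle ties in the pairwise comparisons.
Draw 4,000 HMC samples in Stan to obtain posterior means and 95% HDI credible intervals for the ability estimates.
Compare AI responses with human teacher responses on uptake and on the three pedagogical dimensions.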
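
To make the modeling steps above concrete, here is a minimal Python sketch of the two quantities they rely on: the outcome probabilities of one pairwise comparison, and a 95% HDI computed from posterior samples. It is an illustration under stated assumptions, not the paper's Stan model. The Davidson-style tie term, the function names (outcome_probs, hdi), and the parameter names (home_advantage, tie_weight) are all assumptions introduced here; the summary only says that the model includes an intercept for the home-field effect and handles ties.

```python
import math
import random


def outcome_probs(ability_a, ability_b, home_advantage=0.0, tie_weight=0.0):
    """Return (P(a preferred), P(b preferred), P(tie)) for one comparison.

    Assumption: ties follow a Davidson-style extension of Bradley-Terry,
    and `home_advantage` is an intercept added to item a (the "home" item).
    With home_advantage=0 and tie_weight=0 this is plain Bradley-Terry.
    """
    if tie_weight < 0:
        raise ValueError("tie_weight must be non-negative")
    strength_a = math.exp(ability_a + home_advantage)
    strength_b = math.exp(ability_b)
    tie_strength = tie_weight * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_strength
    return strength_a / total, strength_b / total, tie_strength / total


def hdi(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    n = len(ordered)
    window = min(n, max(1, math.ceil(mass * n)))
    best = (ordered[0], ordered[window - 1])
    for start in range(n - window + 1):
        low, high = ordered[start], ordered[start + window - 1]
        if high - low < best[1] - best[0]:
            best = (low, high)
    return best


if __name__ == "__main__":
    # Hypothetical abilities: a teacher response at 0.0 versus an AI response
    # that is 0.75 lower. The values are illustrative only.
    print(outcome_probs(-0.75, 0.0, home_advantage=0.1, tie_weight=0.5))

    # Stand-in for posterior draws; a real analysis would use the HMC samples.
    rng = random.Random(0)
    draws = [rng.gauss(-0.75, 0.2) for _ in range(4000)]
    print(sum(draws) / len(draws), hdi(draws, 0.95))
```

In a full analysis the abilities, the intercept, and the tie parameter would be given priors and sampled jointly; the hdi helper here only summarizes draws that already exist.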

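Experimental Results

Research Questions

RQ1: Can state-of-the-art conversational agents speak like a teacher, understand a student, and help a student in educational dialogues as well as human teachers do?
RQ2: How do Blender and GPT-3 compare with human teachers on these three pedagogical abilities?
RQ3: For AI teachers, how does conversational uptake relate to the measured pedagogical abilities?
RQ4: To what extent can Bayesian pairwise comparison provide reliable ability scores and rankings for AI teacher responses?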

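Key Findings

Blender (9B) outperforms the other models on conversational uptake in both language and math dialogues, and surpasses some AI responses.
GPT-3's estimated pedagogical ability is lower than that of both Blender and human teachers on all three dimensions.
Compared with human teachers, both Blender and GPT-3 fall significantly behind on all three abilities: speaking like a teacher, understanding a student, and helping a student.
Pedagogical ability estimates are associated with conversational uptake, most strongly for understanding the student.
A large share of human teacher responses were rated positively, but AI responses were also rated positively in many contexts, which suggests that better replies could be sampled from AI outputs.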

This review was generated by AI and reviewed by a human editor.