QUICK REVIEW

[Paper Review] People cannot distinguish GPT-4 from a human in a Turing test

Cameron R. Jones, Benjamin K. Bergen|arXiv (Cornell University)|May 9, 2024

Computability, Logic, AI Algorithms | Cited by 19

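One-sentence summary

In a preregistered two-player Turing test, GPT-4 was judged to be human 54% of the time, outperforming ELIZA but falling short of actual humans, which suggests that GPT-4 can pass a live interactive test under some conditions.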

ABSTRACT

We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

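Motivation and objectives

Evaluate whether modern AI systems can pass an interactive Turing test conducted with human interrogators.
Quantify the pass rates of GPT-4, GPT-3.5, ELIZA, and humans in a controlled setting.
Identify the strategies and cues interrogators use to judge humanness in short conversations.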

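Proposed method

A randomized, controlled, preregistered two-player Turing test built around five-minute conversations.
Four witness types: three AI systems (GPT-4, GPT-3.5, ELIZA) and human witnesses; after each conversation the interrogator judges whether the witness is human.
The prompt instructed GPT-4 and GPT-3.5 to adopt a concise, slang-using persona with deliberate imperfections.
Message delays were used to approximate human response times.
The primary analysis used logistic regression to compare pass rates against chance (50%).
Exploratory analyses examined interrogator strategies, demographic effects, and confidence levels.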
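
The comparison against chance can be illustrated with a minimal sketch. It assumes an intercept-only logistic regression, where the fitted intercept equals the log-odds of the observed pass rate and chance (50%) corresponds to an intercept of 0. The function name and the game counts are hypothetical; the counts were chosen only to reproduce the reported percentages, and the paper's actual model specification may differ.

```python
import math

def intercept_only_logit_test(n_human_verdicts, n_games):
    """Wald test of an intercept-only logistic regression against chance (50%).

    With no predictors, the maximum-likelihood intercept is the log-odds of
    the observed pass rate, and a 50% pass rate corresponds to intercept 0.
    """
    k, n = n_human_verdicts, n_games
    if not 0 < k < n:
        raise ValueError("need at least one verdict of each kind")
    intercept = math.log(k / (n - k))          # log-odds of the pass rate
    std_err = math.sqrt(1 / k + 1 / (n - k))   # Wald standard error
    z = intercept / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return intercept, z, p_value

# Hypothetical counts (assumption): 100 games per witness type.
for name, k, n in [("GPT-4", 54, 100), ("ELIZA", 22, 100)]:
    b0, z, p = intercept_only_logit_test(k, n)
    print(f"{name}: intercept={b0:.3f}, z={z:.2f}, p={p:.4g}")
```

Under these assumed counts, a pass rate near 54% is not distinguishable from chance, while a pass rate near 22% falls well below it. The real sample sizes determine the actual test statistics.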

Figure 1: A selection of conversations between human interrogators (green) and witnesses (grey). One of these four conversations is with a human witness, the rest are with AI. Interrogator verdicts and ground truth identities are below (to allow readers to indirectly participate).

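Experimental results

Research questions

RQ1: Can GPT-4 be distinguished from a human in a five-minute interactive Turing test?
RQ2: How do the pass rates of GPT-4, GPT-3.5, and ELIZA compare with those of human witnesses?
RQ3: What strategies and reasons underlie interrogators' judgments of humanness, and which best predict correct verdicts?

Key findings

GPT-4 was judged to be human 54% of the time and GPT-3.5 50% of the time; both exceeded ELIZA (22%) but fell short of actual humans (67%).
ELIZA's pass rate was significantly below chance, which serves as a manipulation check and shows that the design is sensitive enough to detect differences between witness types.
Interrogators were more accurate when they used strategies focused on human experience, logic or math, or directly accusing the witness of being an AI; they were less accurate when relying on linguistic style alone.
When interrogators judged GPT-4 to be human, their mean confidence was 73%, indicating that these verdicts were not random guesses.
The pass rates of GPT-4 and GPT-3.5 did not differ significantly, indicating that GPT-4 did not outperform GPT-3.5 in this setting.
Age was negatively correlated with interrogator accuracy, whereas expertise with LLMs did not predict better performance.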

Figure 2: Pass rates (left) and interrogator confidence (right) for each witness type. Pass rates are the proportion of the time a witness type was judged to be human. Error bars represent 95% bootstrap confidence intervals. Significance stars above each bar indicate whether the pass rate was signif
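
The 95% bootstrap confidence intervals mentioned in the Figure 2 caption can be illustrated with a short sketch. It assumes a simple percentile bootstrap over per-game verdicts; the function name, the resample count, and the verdict data are hypothetical, and the paper's exact resampling procedure is not described in this review.

```python
import random

def bootstrap_ci(verdicts, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a pass rate.

    verdicts: list of 0/1 values, where 1 means the witness was judged human.
    """
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample the verdicts with replacement and record each pass rate.
    rates = sorted(
        sum(rng.choices(verdicts, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = rates[int(tail * n_resamples)]
    upper = rates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical data (assumption): 54 "human" verdicts out of 100 games.
verdicts = [1] * 54 + [0] * 46
low, high = bootstrap_ci(verdicts)
rate = sum(verdicts) / len(verdicts)
print(f"pass rate = {rate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

A fixed seed keeps the interval reproducible across runs; the interval width shrinks as the number of games grows.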


This review was generated by AI and reviewed by a human editor.