QUICK REVIEW

[Paper Review] Does GPT-4 pass the Turing test?

Cameron R. Jones, Benjamin K. Bergen | arXiv (Cornell University) | Oct 31, 2023

Misinformation and Its Impacts | Cited by 13

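One-Sentence Summary

The study evaluates GPT-4 in a public online Turing test and finds that the best GPT-4 prompt was judged human in 41% of games, compared with 63% for human witnesses; success rates varied substantially across prompts, and no clear link was found between interrogator demographics and accuracy.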

ABSTRACT

We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants' decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.

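Research Motivation and Goals

Evaluate whether GPT-4 is mistaken for a human in an online Turing test.
Compare GPT-4 with GPT-3.5 and ELIZA baselines across a range of prompts.
Examine how prompt design, interrogator strategy, and interrogator characteristics affect the likelihood of passing.
Explore why the Turing test remains relevant for studying naturalistic communication and deception.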

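Proposed Method

Implemented a two-player online Turing test with an interrogator and a witness.
Created 25 AI witnesses using GPT-3.5 or GPT-4, varying the prompt and parameters (temperature, persona, errors, AI-vs-human framing).
Built a chat interface with time and message-length limits, typing delays, and logging of each verdict and its confidence.
Ran the experiment on turingtest.live, randomly assigning each interrogator to a human or an AI witness, with an explore/exploit selector choosing which AI witness to use.
Collected demographic data and examined interrogator strategies, reasons for verdicts, and per-prompt performance.
Compared GPT-4 prompts with GPT-3.5 prompts and the ELIZA baseline.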
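
The explore/exploit witness selection mentioned above can be illustrated with a small sketch. The summary does not say which selection algorithm the study used, so the epsilon-greedy rule below is an assumption chosen for simplicity, and the names `WitnessStats`, `choose_witness`, and `record_verdict` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class WitnessStats:
    """Running tally for one AI witness (a model plus prompt combination)."""
    games: int = 0
    judged_human: int = 0

    @property
    def success_rate(self) -> float:
        # Unplayed witnesses get an optimistic 1.0 so each one is tried at least once.
        return self.judged_human / self.games if self.games else 1.0


def choose_witness(stats: dict[str, WitnessStats], epsilon: float,
                   rng: random.Random) -> str:
    """Epsilon-greedy choice (an assumed rule, not the study's documented one).

    With probability epsilon pick a witness uniformly at random (explore);
    otherwise pick the witness with the highest success rate so far (exploit).
    """
    names = sorted(stats)  # sorted for a deterministic tie-break
    if rng.random() < epsilon:
        return rng.choice(names)
    return max(names, key=lambda name: stats[name].success_rate)


def record_verdict(stats: dict[str, WitnessStats], name: str,
                   judged_human: bool) -> None:
    """Update the tally after an interrogator gives a verdict."""
    stats[name].games += 1
    stats[name].judged_human += int(judged_human)
```

A selector of this kind routes more games to prompts that are doing well, which would explain why the better-performing prompts in the results table have larger game counts than the weaker ones.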

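Experimental Results

Research Questions

RQ1: What is GPT-4's success rate in an online Turing test relative to humans and to baseline models (ELIZA, GPT-3.5)?
RQ2: How does prompt design affect GPT-4's ability to pass as human in an interactive setting?
RQ3: Which interrogator strategies and participant demographics predict correct AI-vs-human identification?
RQ4: What reasons do interrogators give for their AI-vs-human verdicts, and what do these reveal about the cues that signal humanlikeness?
RQ5: Does task experience (number of games played) improve interrogator accuracy or reliability?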

Key Findings

Model	Prompt	Success rate, overall (n)	Success rate, first games (n)
ELIZA	ELIZA	0.27 (60)	0.16 (32)
GPT-3.5	Victor	0.05 (21)	0.00 (7)
GPT-3.5	Juliet	0.14 (21)	0.17 (12)
GPT-4	Foxtrot	0.06 (17)	0.11 (9)
GPT-4	Victor	0.19 (32)	0.27 (15)
GPT-4	AI21	0.19 (42)	0.29 (21)
GPT-4	Juliet	0.25 (63)	0.33 (30)
GPT-4	Quebec	0.31 (59)	0.45 (20)
GPT-4	Papa	0.36 (87)	0.33 (33)
GPT-4	November	0.40 (140)	0.41 (69)
GPT-4	Sierra	0.41 (79)	0.40 (35)
Human	Human	0.63 (260)	0.64 (108)
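
Because several rows rest on only a few dozen games, the point estimates carry wide uncertainty. The sketch below computes 95% Wilson score intervals for the overall success rates in the table. It is an illustration rather than the paper's own analysis: the success counts are reconstructed by rounding rate times n, so they are approximations, and `wilson_interval` is a hypothetical helper name.

```python
import math

# (model, prompt, overall success rate, number of games), copied from the table.
ROWS = [
    ("ELIZA", "ELIZA", 0.27, 60),
    ("GPT-3.5", "Victor", 0.05, 21),
    ("GPT-3.5", "Juliet", 0.14, 21),
    ("GPT-4", "Foxtrot", 0.06, 17),
    ("GPT-4", "Victor", 0.19, 32),
    ("GPT-4", "AI21", 0.19, 42),
    ("GPT-4", "Juliet", 0.25, 63),
    ("GPT-4", "Quebec", 0.31, 59),
    ("GPT-4", "Papa", 0.36, 87),
    ("GPT-4", "November", 0.40, 140),
    ("GPT-4", "Sierra", 0.41, 79),
    ("Human", "Human", 0.63, 260),
]


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for model, prompt, rate, n in ROWS:
    # Approximate count: the table reports rounded rates, not raw counts.
    successes = round(rate * n)
    low, high = wilson_interval(successes, n)
    print(f"{model:8s} {prompt:9s} {rate:.2f}  95% CI [{low:.2f}, {high:.2f}]  n={n}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for small n and for rates near 0, which applies to rows such as GPT-3.5 Victor and GPT-4 Foxtrot.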

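The best GPT-4 prompt reached a 41% success rate, outperforming ELIZA (27%) and the GPT-3.5 baselines (5–14%) but falling well short of human witnesses (63%).
GPT-4's success rate varied widely across prompts, from 6% (Foxtrot) to 40–41% (November, Sierra).
Interrogator accuracy against AI witnesses was not predicted by demographic characteristics or prior interaction with LLMs, and no significant learning effect on accuracy was found within the experiment.
Linguistic style and socioemotional cues, rather than evidence of knowledge or reasoning, were the main factors in AI-vs-human verdicts.
Despite its simplicity, ELIZA fooled human interrogators in 27% of games, highlighting the ELIZA effect and the limits of the Turing test as a measure of intelligence.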
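
To give a rough sense of how large the gap between the best GPT-4 prompt and human witnesses is, the sketch below runs a pooled two-proportion z-test on counts reconstructed from the table (Sierra: about 32 of 79; Human: about 164 of 260). This is an illustration under stated assumptions and not the paper's analysis: the counts are rounded approximations, the test treats games as independent even though interrogators could play several, and `two_proportion_z` is a hypothetical helper name.

```python
import math


def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Approximate counts reconstructed from the rounded rates in the table.
sierra_successes, sierra_games = round(0.41 * 79), 79
human_successes, human_games = round(0.63 * 260), 260

z, p_value = two_proportion_z(sierra_successes, sierra_games,
                              human_successes, human_games)
print(f"GPT-4 (Sierra) vs Human: z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The same helper can be pointed at any other pair of rows, for example GPT-4 Sierra against ELIZA, with the same caveats about rounded counts and non-independent games.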

This review was generated by AI and reviewed by a human editor.