QUICK REVIEW

[Paper Review] Human or Not? A Gamified Approach to the Turing Test

Daniel Jannai, Amos Meron | arXiv (Cornell University) | May 31, 2023

AI in Service Interactions | Cited by 17

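One-Sentence Summary

This paper describes an online, gamified Turing-test-style experiment that drew more than 1.5 million users over one month. Players identified their partner correctly in 68% of games overall, and in 60% of games in which the partner was a bot.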

ABSTRACT

We present "Human or Not?", an online game inspired by the Turing test, that measures the capability of AI chatbots to mimic humans in dialog, and of humans to tell bots from other humans. Over the course of a month, the game was played by over 1.5 million users who engaged in anonymous two-minute chat sessions with either another human or an AI language model which was prompted to behave like humans. The task of the players was to correctly guess whether they spoke to a person or to an AI. This largest scale Turing-style test conducted to date revealed some interesting facts. For example, overall users guessed the identity of their partners correctly in only 68% of the games. In the subset of the games in which users faced an AI bot, users had even lower correct guess rates of 60% (that is, not much higher than chance). This white paper details the development, deployment, and results of this unique experiment. While this experiment calls for many extensions and refinements, these findings already begin to shed light on the inevitable near future which will commingle humans and AI.

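Motivation and Goals

Explore how people perceive human-like versus machine-like conversation in the era of modern AI.
Build a scalable, engaging platform for running Turing-test-style experiments at large scale.
Design AI bots with diverse personas to make detection harder and to study the strategies humans use to identify AI.
Capture and analyze the strategies humans use in short interactive conversations to signal their own humanity and to identify AI.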

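Proposed Method

Create an online chat game with a 20-second response window for each message and a two-minute cap on each conversation.
Give the AI bots varied personas, backstories, and an English-only constraint, and vary the underlying model (e.g., Jurassic-2, GPT-4, Cohere).
Ground bot responses in real-time, context-relevant information (weather, news).
Randomize conversation openings and apply moderation to ensure safety and prevent abuse.
Collect more than 10 million guesses from more than 1.5 million users to derive statistically robust scores.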
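
The two timing rules in the first method step can be sketched as a small session tracker. This is only an illustration under stated assumptions: the class name `ChatSession`, the `submit` method, and the choice to simply reject a late message are hypothetical, and the response window is assumed to run from the previous accepted message (or from the session start). The summary does not say how the real game handles a timeout.

```python
from dataclasses import dataclass, field

RESPONSE_WINDOW_S = 20.0  # from the summary: 20-second response window
SESSION_LIMIT_S = 120.0   # from the summary: two-minute conversation cap


@dataclass
class ChatSession:
    """Hypothetical tracker; all times are seconds since the session started."""
    last_event_s: float = 0.0
    messages: list = field(default_factory=list)

    def submit(self, sender: str, text: str, now_s: float) -> bool:
        """Accept a message only if it respects both time limits."""
        if now_s > SESSION_LIMIT_S:
            return False  # the conversation cap has been reached
        if now_s - self.last_event_s > RESPONSE_WINDOW_S:
            return False  # the sender exceeded the response window (assumed rule)
        self.messages.append((now_s, sender, text))
        self.last_event_s = now_s
        return True
```

For example, a message sent 25 seconds after the previous one is rejected here, while one sent 15 seconds after it is accepted.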

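Experimental Results

Research Questions

RQ1: In short, open-ended conversations, what is the baseline human ability to tell humans from AI?
RQ2: How do bot design choices (persona, language style, information grounding) affect detectability?
RQ3: In a Turing-style setting, which human strategies are most effective at identifying AI or conveying humanity?
RQ4: What behavioral patterns emerge when users try to imitate AI or probe the limits of AI?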

Key Findings

Key result	Value
Overall correct-guess rate	68%
When the partner was a bot	60%
When the partner was a human	73%

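Players guessed correctly in 68% of games overall: 60% when the partner was a bot and 73% when the partner was a human.
Players used a variety of strategies to tell AI from humans (grammar cues, personal or subjective questions, politeness, up-to-date information, and so on), with varying degrees of success.
The bot designers used diverse personas and real-time information grounding to reduce detectability, while human players sometimes made meta-references to the game itself to signal that they were human.
The study demonstrates marked progress in AI's ability to imitate humans and offers a scalable benchmark for future Turing-style evaluations.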
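
The three headline rates can be related with a little arithmetic. If the overall rate is assumed to be the game-weighted average of the two conditional rates, the share of games played against a bot follows from them. The sketch below is a rough consistency check, not a figure reported in the paper: the function name `implied_bot_share` is hypothetical, and because the published percentages are rounded, the result is only approximate.

```python
def implied_bot_share(overall: float, acc_bot: float, acc_human: float) -> float:
    """Solve overall = p * acc_bot + (1 - p) * acc_human for p.

    p is the fraction of games in which the partner was a bot, under the
    assumption that the overall rate is a weighted average of the two
    conditional rates.
    """
    if acc_bot == acc_human:
        raise ValueError("share is not identifiable when both rates are equal")
    return (acc_human - overall) / (acc_human - acc_bot)


# Rounded rates from the table: 68% overall, 60% vs. bots, 73% vs. humans.
share = implied_bot_share(0.68, 0.60, 0.73)
print(f"implied share of bot games: {share:.2f}")
```

With the rounded inputs this gives a share somewhat below 40%, which suggests that human-human games outnumbered human-bot games. Since each input is rounded to the nearest percent, the true share could differ by several points.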

This review was generated by AI and checked by human editors.