QUICK REVIEW

[Paper Digest] A Test for Evaluating Performance in Human-Computer Systems

Andres Campero, Michelle Vaccaro|arXiv (Cornell University)|Jun 24, 2022

IoT and Edge/Fog Computing | Cited by 21

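One-Sentence Summary

This paper proposes a ratio-of-means test (hat-rho) that quantifies how much better a human-computer system performs than humans alone, computers alone, or other baselines, and demonstrates it in three ways: a literature review and two experiments on GPT-3-assisted software tasks.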

ABSTRACT

The Turing test for comparing computer performance to that of humans is well known, but, surprisingly, there is no widely used test for comparing how much better human-computer systems perform relative to humans alone, computers alone, or other baselines. Here, we show how to perform such a test using the ratio of means as a measure of effect size. Then we demonstrate the use of this test in three ways. First, in an analysis of 79 recently published experimental results, we find that, surprisingly, over half of the studies find a decrease in performance, the mean and median ratios of performance improvement are both approximately 1 (corresponding to no improvement at all), and the maximum ratio is 1.36 (a 36% improvement). Second, we experimentally investigate whether a higher performance improvement ratio is obtained when 100 human programmers generate software using GPT-3, a massive, state-of-the-art AI system. In this case, we find a speed improvement ratio of 1.27 (a 27% improvement). Finally, we find that 50 human non-programmers using GPT-3 can perform the task about as well as--and less expensively than--the human programmers. In this case, neither the non-programmers nor the computer would have been able to perform the task alone, so this is an example of a very strong form of human-computer synergy.

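Motivation and Objectives

Propose a quantitative test for evaluating how much human-computer collaboration improves performance relative to a baseline.
Define the ratio of means (rho) and its synergy variant hat-rho as measures of joint performance.
Demonstrate the method in a literature review covering 79 results and in experimental studies involving GPT-3.
Discuss potential uses, including competitions, specialized collective intelligence, and applications beyond Turing-style benchmarks.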

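Proposed Method

Define X_i as the mean performance of system type i, and rho = X_i / X_j as the ratio used to compare against a baseline (H, C, HC, etc.).
Introduce hat_rho = X_HC / max(X_H, X_C) as the measure of human-computer synergy.
Apply a suitable transformation (e.g., f(X) = 1/X) so that lower-is-better measures are aligned with higher-is-better measures.
Use the ratio of means and its confidence interval to assess significance, supplemented by regression to control for task and order effects.
Review 25 papers from 2021 (79 results in total) and compute hat_rho across their different performance measures.
Conduct two original studies using GPT-3: (a) software generation by programmers (H, HC) and (b) a study with non-programmers (HC′), including a cost analysis.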
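The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the sample data, and the choice to apply f(X) = 1/X to the mean (rather than to each observation) are assumptions. The percentile bootstrap used for the confidence interval is also a stand-in, since this digest does not say how the paper computes its intervals.

```python
import random
from statistics import mean


def performance(values, lower_is_better=False):
    """Mean performance X; applies f(X) = 1/X for lower-is-better measures."""
    m = mean(values)
    return 1.0 / m if lower_is_better else m


def rho(x_i, x_j):
    """Ratio of means between system type i and baseline j."""
    return x_i / x_j


def hat_rho(x_hc, x_h, x_c):
    """Synergy ratio: human-computer system vs. the better single baseline."""
    return x_hc / max(x_h, x_c)


def bootstrap_ci(sample_i, sample_j, lower_is_better=False,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho (an assumed, illustrative method)."""
    rng = random.Random(seed)
    ratios = sorted(
        rho(performance(rng.choices(sample_i, k=len(sample_i)), lower_is_better),
            performance(rng.choices(sample_j, k=len(sample_j)), lower_is_better))
        for _ in range(n_boot)
    )
    lo = ratios[int(alpha / 2 * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical completion times in minutes (lower is better); not data from the paper.
h_times = [30.0, 40.0, 50.0]    # humans alone
hc_times = [24.0, 32.0, 40.0]   # humans with a computer

x_h = performance(h_times, lower_is_better=True)
x_hc = performance(hc_times, lower_is_better=True)
x_c = 0.0  # assumption: the computer alone cannot complete the task

point = hat_rho(x_hc, x_h, x_c)
lo, hi = bootstrap_ci(hc_times, h_times, lower_is_better=True)
print(f"hat_rho = {point:.2f}, bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

With these made-up times, the humans-alone mean is 40 minutes and the human-computer mean is 32 minutes, so the point estimate is 40/32 = 1.25. An interval that excludes 1 would indicate a significant difference; with only three observations per group, the interval here is wide.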

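Experimental Results

Research Questions

RQ1: Do human-computer teams achieve positive synergy (hat_rho > 1) relative to the relevant baselines?
RQ2: How large is the improvement (rho) observed in recent human-computer experiments?
RQ3: Can a powerful AI such as GPT-3 significantly raise rho in a software generation task?
RQ4: Can non-programmers using GPT-3 perform as well as or better than programmers, possibly at lower cost?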

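Key Findings

In Study 1, the ratios range from 0.44 to 1.36, with a mean of about 0.96 and a median of about 0.99; 38% of the measurements show positive synergy (hat_rho > 1).
The largest ratio observed in the literature is 1.36, meaning an improvement of at most 36% in this sample.
Study 2 finds hat_rho = 1.27 (CI [1.10, 1.48]): under a quality constraint, humans with GPT-3 were 27% faster than humans alone.
In Study 3, non-programmers using GPT-3 show an "infinite" hat_rho (strong synergy) because neither party could complete the task alone; programmers using GPT-3 show no clear cost advantage in the simple ratio but do show one in the regression analysis.
The cost analysis indicates that, in one comparison, non-programmers using GPT-3 may be cheaper than programmers once regression controls are applied (p = .010), whereas the simple ratios often show no significant cost savings.
Overall, the results span a range of human-computer synergy from modest to strong, and suggest that the effect of GPT-3 on performance and cost depends on the setting.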
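To make the reported ratios easier to read, the short sketch below converts a ratio into a percentage change and shows why hat_rho becomes "infinite" when neither baseline can do the task. The helper names are hypothetical, and returning NaN when every performance is zero is an assumed convention, not something the paper specifies.

```python
import math


def percent_change(ratio):
    """Convert a performance ratio into a percentage change vs. the baseline."""
    return (ratio - 1.0) * 100.0


def hat_rho_safe(x_hc, x_h, x_c):
    """hat_rho = X_HC / max(X_H, X_C), with the zero-baseline case made explicit."""
    best_baseline = max(x_h, x_c)
    if best_baseline == 0:
        # Neither humans nor the computer can perform the task alone.
        return math.inf if x_hc > 0 else math.nan
    return x_hc / best_baseline


# Ratios quoted in the findings above.
for label, ratio in [("Study 1 minimum", 0.44),
                     ("Study 1 maximum", 1.36),
                     ("Study 2", 1.27)]:
    print(f"{label}: ratio {ratio:.2f} -> {percent_change(ratio):+.0f}%")

# Study 3 pattern: both baselines at zero, the combined system above zero.
print(hat_rho_safe(x_hc=1.0, x_h=0.0, x_c=0.0))
```

A ratio below 1, such as the Study 1 minimum of 0.44, corresponds to a decrease: the human-computer system performed worse than the better baseline.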

This digest was generated by AI and reviewed by a human editor.