QUICK REVIEW

[Paper Review] The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

Arseny Moskvichev, Victor Vikram Odouard|arXiv (Cornell University)|May 11, 2023

Topic Modeling | Cited by 22

One-Sentence Summary

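ConceptARC systematically tests abstraction and generalization in the ARC domain by grouping tasks into concept groups and comparing human performance against the top ARC-Kaggle programs and GPT-4; humans outperform the machines on every concept.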

ABSTRACT

The abilities to form and abstract concepts are key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, the systems are rarely evaluated in depth to see if they have actually grasped the concepts they are meant to capture. In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet [2019]. In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around "concept groups" -- sets of problems that focus on specific concepts and that vary in complexity and level of abstraction. We report results on testing humans on this benchmark as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI's GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, showing abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.

Research Motivation and Objectives

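Assess whether AI systems genuinely grasp the abstract concepts in ARC rather than exploiting shortcuts.
Create a concept-centered benchmark (ConceptARC) containing diverse instantiations of core concepts.
Compare human performance with that of state-of-the-art ARC solvers and GPT-4 across concept groups.
Analyze generalization across the variations within each concept group.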

Proposed Method

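Define 16 core concepts and create 10 ARC tasks for each concept, with three test inputs per task.
Design the tasks by hand so that they reward understanding and generalizing the concept rather than shortcuts.
Evaluate humans through an online study, and test the top ARC-Kaggle programs and GPT-4 on the same tasks.
Allow three guesses per test input; the input counts as solved if any guess matches the correct output.
Report results as per-concept accuracy to measure generalization across variations.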
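
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the grid representation (lists of integer rows), the function names `solved` and `per_concept_accuracy`, the concept labels `concept_a` and `concept_b`, and the toy records are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Benchmark size implied by the summary: 16 concepts x 10 tasks x 3 test inputs.
N_CONCEPTS, TASKS_PER_CONCEPT, TEST_INPUTS_PER_TASK = 16, 10, 3
TOTAL_TEST_INPUTS = N_CONCEPTS * TASKS_PER_CONCEPT * TEST_INPUTS_PER_TASK

Grid = List[List[int]]  # assumed representation: rows of integer color codes


def solved(guesses: Sequence[Grid], target: Grid, max_guesses: int = 3) -> bool:
    """Return True if any of the first `max_guesses` guesses equals the target exactly."""
    return any(guess == target for guess in guesses[:max_guesses])


def per_concept_accuracy(
    records: Sequence[Tuple[str, Sequence[Grid], Grid]]
) -> Dict[str, float]:
    """Compute the fraction of solved test inputs per concept.

    Each record is (concept label, list of guesses, correct output grid).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for concept, guesses, target in records:
        total[concept] += 1
        correct[concept] += solved(guesses, target)  # True counts as 1
    return {concept: correct[concept] / total[concept] for concept in total}


# Hypothetical toy records (not data from the paper).
records = [
    ("concept_a", [[[0]], [[1]]], [[1]]),                # second guess matches
    ("concept_a", [[[0]], [[2]], [[3]], [[1]]], [[1]]),  # match only on a 4th guess: not counted
    ("concept_b", [[[1, 2]]], [[1, 2]]),                 # first guess matches
]
accuracy = per_concept_accuracy(records)
```

Exact grid equality is used here because the summary states only that a guess must match the correct output; any partial-credit scheme would be a different metric.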

Experimental Results

Research Questions

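RQ1: Can humans generalize ARC's abstract concepts across diverse task instances?
RQ2: Do state-of-the-art ARC solvers generalize concepts the way humans do?
RQ3: How does GPT-4 perform on concept-based ARC tasks relative to humans and specialized programs?
RQ4: What patterns appear in the kinds of errors humans and machines make on concept-based ARC tasks?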

Key Findings

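Humans clearly outperform the machine solvers on every concept group.
Averaged across concepts, human accuracy is roughly 40 percentage points higher than that of the best ARC-Kaggle program.
GPT-4 performs poorly on ConceptARC overall, with accuracy below 30% on 15 of the 16 concepts.
The top ARC-Kaggle programs improve on their original ARC performance but remain far below human level.
On some tasks humans make "near-miss" errors, whereas machine errors are often harder to interpret.
ConceptARC separates generalization ability more clearly than the original ARC dataset does.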


This review was generated by AI and checked by human editors.