
[Paper Review] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Pan Lu, Baolin Peng|arXiv (Cornell University)|Apr 19, 2023
Topic Modeling | Cited by 91
One-Sentence Summary

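Chameleon is a plug-and-play framework that lets a large language model compose multiple tools (vision models, web search, Python functions, and heuristic-based modules) into natural-language-like programs for multi-modal reasoning tasks. Powered by GPT-4, it sets new state-of-the-art results on ScienceQA and TabMWP.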

ABSTRACT

Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%, lifting the state of the art to 98.78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner. The project is available at https://chameleon-llm.github.io.

Research Motivation and Goals

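  • Address the limitations of standard large language models in accessing up-to-date information, using external tools, and performing precise reasoning.
  • Propose a flexible, plug-and-play framework that synthesizes natural-language-like programs by coordinating a diverse set of tools.
  • Demonstrate effectiveness on the multi-modal ScienceQA benchmark and the tabular TabMWP benchmark, and compare the planning quality of different LLMs.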

Proposed Method

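  • Introduce an inventory of diverse tool modules (LLMs, vision models, web search, Python functions, and heuristic-based modules).
  • Use an LLM-based planner to generate a natural-language-like program that chains modules together to answer a query.
  • Execute the modules sequentially, keeping a cached context and updating the inputs and cache between steps.
  • Avoid domain-specific programming languages by generating natural-language-like plans that are easy to understand and extend.
  • Show that GPT-4 as the planner selects tools more consistently than ChatGPT.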
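
The plan-then-execute loop described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, the `Cache` dictionary layout, and the rule-based `plan` function are all hypothetical stand-ins. In the real system the planner is an LLM prompted with the tool inventory, and the modules call actual models or search engines.

```python
import re
from typing import Callable, Dict, List

# Hypothetical shared cache: every module reads it and returns the keys it adds.
Cache = Dict[str, str]


def image_captioner(cache: Cache) -> Cache:
    # Stand-in for a vision model; a real module would caption the image.
    return {"caption": f"(caption of {cache['image']})"}


def knowledge_retrieval(cache: Cache) -> Cache:
    # Stand-in for an LLM call that retrieves background knowledge.
    return {"knowledge": f"(background facts for: {cache['question']})"}


def solution_generator(cache: Cache) -> Cache:
    # Stand-in for an LLM that reasons over everything cached so far.
    context = " ".join(cache.get(k, "") for k in ("caption", "knowledge")).strip()
    return {"solution": f"Given {context}, the answer is (A)."}


def answer_generator(cache: Cache) -> Cache:
    # Heuristic module: extract the last option letter mentioned in the solution.
    options = re.findall(r"\(([A-E])\)", cache["solution"])
    return {"answer": options[-1] if options else ""}


# Tool inventory: new modules are "plugged in" by adding an entry here.
MODULES: Dict[str, Callable[[Cache], Cache]] = {
    "image_captioner": image_captioner,
    "knowledge_retrieval": knowledge_retrieval,
    "solution_generator": solution_generator,
    "answer_generator": answer_generator,
}


def plan(query: Cache) -> List[str]:
    # Assumption: a fixed rule replaces the LLM planner for illustration only.
    program = ["image_captioner"] if "image" in query else []
    return program + ["knowledge_retrieval", "solution_generator", "answer_generator"]


def run(query: Cache) -> Cache:
    cache = dict(query)  # start from the query, then accumulate module outputs
    for name in plan(query):
        cache.update(MODULES[name](cache))  # update the cache between steps
    return cache


result = run({"question": "Which animal is a mammal?", "image": "photo.png"})
print(result["answer"])
```

Because the program is just an ordered list of module names, swapping the planner or registering a new tool does not require changing the executor, which is the property the plug-and-play design relies on.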
Figure 1: Examples from our Chameleon approach with GPT-4 on ScienceQA [32], a multi-modal question answering benchmark in scientific domains. Chameleon is adaptive to different queries by synthesizing programs to compose various tools and executing them sequentially to get final answers.

Experimental Results

Research Questions

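  • RQ1: Can an LLM-based planner synthesize robust natural-language-like programs that integrate heterogeneous tools to complete real-world reasoning tasks?
  • RQ2: Do plug-and-play modules spanning vision, web search, knowledge retrieval, and computation improve performance on multi-modal and tabular reasoning benchmarks?
  • RQ3: How does planner quality (GPT-4 versus ChatGPT) affect tool selection, plan validity, and final accuracy?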

Key Findings

Model               | #Tuned Params | ALL   | NAT   | SOC   | LAN   | TXT   | IMG   | NO    | G1-6  | G7-12
Chameleon (GPT-4)   | 0M            | 86.54 | 89.83 | 74.13 | 89.82 | 88.27 | 77.64 | 92.13 | 88.03 | 83.72
Chameleon (ChatGPT) | 0M            | 79.93 | 81.62 | 70.64 | 84.00 | 79.77 | 70.80 | 86.62 | 88.03 | 83.72
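  • With GPT-4, Chameleon reaches 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%.
  • On TabMWP, GPT-4-powered Chameleon reaches 98.78% accuracy, a 17.0% improvement over the previous state of the art.
  • The GPT-4-powered planner shows more consistent and rational tool selection than the ChatGPT-powered planner, because it infers potential constraints from the instructions.
  • Ablation studies show that knowledge retrieval and the domain-specific tool modules are critical to performance on both ScienceQA and TabMWP.
  • Chameleon generalizes across domains by using natural-language-like programs to orchestrate diverse tools, without task-specific training.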
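
The headline numbers above imply the prior results they are compared against. The short calculation below is a sketch that assumes the reported gains (11.37% and 17.0%) are absolute differences in accuracy, which is how the abstract's phrasing reads but is not stated explicitly in this review. The variable names are illustrative.

```python
# Accuracies and gains quoted in the findings and the results table.
scienceqa_gpt4 = 86.54
scienceqa_chatgpt = 79.93
scienceqa_gain = 11.37
tabmwp_gpt4 = 98.78
tabmwp_gain = 17.0


def implied_baseline(accuracy: float, gain: float) -> float:
    # Assumption: the gain is an absolute (percentage-point) difference.
    return round(accuracy - gain, 2)


prior_scienceqa = implied_baseline(scienceqa_gpt4, scienceqa_gain)
prior_tabmwp = implied_baseline(tabmwp_gpt4, tabmwp_gain)

# Gap between the two planners on ScienceQA (ALL column of the table).
planner_gap = round(scienceqa_gpt4 - scienceqa_chatgpt, 2)

print(prior_scienceqa, prior_tabmwp, planner_gap)
```

Under that assumption, the implied prior results are 75.17% on ScienceQA and 81.78% on TabMWP, and switching the planner from ChatGPT to GPT-4 accounts for 6.61 points of overall ScienceQA accuracy.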
Figure 2: Two examples from our Chameleon approach with GPT-4 on TabMWP [33], a mathematical reasoning benchmark with tabular contexts. Chameleon demonstrates flexibility and efficiency in adapting to different queries that require various reasoning abilities.


This review was generated by AI and reviewed by human editors.