QUICK REVIEW

[Paper Review] LLMs for Science: Usage for Code Generation and Data Analysis

Mohamed Nejjar, Luca Zacharias|arXiv (Cornell University)|Nov 28, 2023

Scientific Computing and Data Management | Cited by 9

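One-Sentence Summary

This paper empirically evaluates several LLM-based coding tools on scientific tasks, focusing on code generation, data analysis, and data visualization, and discusses their strengths, weaknesses, and risks such as fabricated information.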

ABSTRACT

Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life. Scientific research as an area of work is no exception: the potential of LLM-based tools to assist in the daily work of scientists has become a highly discussed topic across disciplines. However, we are only at the very onset of this subject of study. It is still unclear how the potential of LLMs will materialise in research practice. With this study, we give first empirical evidence on the use of LLMs in the research process. We have investigated a set of use cases for LLM-based tools in scientific research, and conducted a first study to assess to which degree current tools are helpful. In this paper we report specifically on use cases related to software engineering, such as generating application code and developing scripts for data analytics. While we studied seemingly simple use cases, results across tools differ significantly. Our results highlight the promise of LLM-based tools in general, yet we also observe various issues, particularly regarding the integrity of the output these tools provide.

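Research Motivation and Objectives

Explore how well current LLM-based tools support coding-related tasks in scientific work (code generation, data analysis, data visualization).
Assess the correctness, efficiency, and readability of the code and analyses generated across multiple tools.
Identify differences between tools, limitations, and risks in research workflows (such as output integrity and fabricated information).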

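Proposed Method

Select a range of LLM-based tools (ChatGPT GPT-3.5, ChatGPT GPT-4, Google Bard, Bing Chat, YouChat, GitHub Copilot, GitLab Duo).
Define three coding-related use cases: multithreaded matrix multiplication in Java, data analysis in Python, and data visualization in R.
Use two prompt variants per use case and score the outputs against a rubric covering criteria such as correctness, efficiency, and readability.
Run each prompt multiple times to account for non-determinism, and provide a replication package that includes the interaction logs.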
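
To make the first use case concrete, here is a minimal sketch of the kind of program the Java task asks for: a multithreaded matrix multiplication, plus a single-threaded reference that can be used to check correctness. It is not taken from the paper, its prompts, or any tool's output. The class name `MatMulSketch`, the method names, the one-task-per-row strategy, and the thread count are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class for illustration; not from the paper or its replication package.
final class MatMulSketch {

    // Single-threaded reference implementation, used to check the parallel result.
    static double[][] multiplySequential(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int x = 0; x < k; x++) {
                    sum += a[i][x] * b[x][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    // Parallel version: one task per output row. Rows are written by exactly
    // one task each, so the workers need no locking.
    static double[][] multiplyParallel(double[][] a, double[][] b, int threads) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("inner dimensions do not match");
        }
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                pending.add(pool.submit(() -> {
                    for (int j = 0; j < m; j++) {
                        double sum = 0.0;
                        for (int x = 0; x < k; x++) {
                            sum += a[row][x] * b[x][j];
                        }
                        c[row][j] = sum;
                    }
                }));
            }
            // Wait for every row; get() also surfaces any worker failure.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while multiplying", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker task failed", e);
        } finally {
            pool.shutdown();
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] parallel = multiplyParallel(a, b, 2);
        boolean matches = Arrays.deepEquals(parallel, multiplySequential(a, b));
        System.out.println(Arrays.deepToString(parallel) + " matches reference: " + matches);
    }
}
```

Comparing the parallel result against a sequential reference is one plausible way to judge the "correctness" criterion for this task; the rubric the authors actually applied may differ.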

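Experimental Results

Research Questions

RQ1 How well do current LLM tools generate correct and efficient code for typical scientific programming tasks?
RQ2 To what extent do LLMs support data analysis and data visualization tasks in scientific workflows without human intervention?
RQ3 What qualitative differences exist between the tools in code quality, documentation, and user experience for these use cases?
RQ4 What risks (such as fabricated information and data format mismatches) arise when applying LLMs to scientific coding tasks?
RQ5 For the same task, how do results differ between tool families (GPT-based, PaLM-based, Claude-based)?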

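Key Findings

Most tools produced correct, executable code for matrix multiplication on the first attempt; Google Bard required human intervention, and GitLab Duo's output was limited to a single-threaded implementation.
Results for the data analysis and visualization tasks varied considerably: GPT-4 generally needed less intervention and produced more accurate analyses and plots, while Bing Chat and Google Bard often produced misleading results.
Non-determinism and dependence on data format were major challenges; some tools could not handle the data structure or required after-the-fact corrections.
Readability and documentation quality varied: some tools provided useful comments and documentation, while others generated terse or undocumented code.
Overall, GPT-4 performed best on the data analysis and visualization tasks, with a clear advantage in producing runnable code and in plot quality; the other tools had problems with accuracy and with matching the visualization to the request.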

This review was generated by AI and reviewed by a human editor.