QUICK REVIEW

[Paper Review] Towards Optimizing with Large Language Models

Pei-Fu Guo, Ying‐Hsuan Chen|arXiv (Cornell University)|Oct 8, 2023

Natural Language Processing Techniques|Cited by 8

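One-Sentence Summary

The paper evaluates the optimization capabilities of large language models (LLMs) across several optimization tasks via iterative prompting, proposes three evaluation metrics, and examines how data size affects performance.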

ABSTRACT

In this work, we conduct an assessment of the optimization capabilities of LLMs across various tasks and data sizes. Each of these tasks corresponds to unique optimization domains, and LLMs are required to execute these tasks with interactive prompting. That is, in each optimization step, the LLM generates new solutions from the past generated solutions with their values, and then the new solutions are evaluated and considered in the next optimization step. Additionally, we introduce three distinct metrics for a comprehensive assessment of task performance from various perspectives. These metrics offer the advantage of being applicable for evaluating LLM performance across a broad spectrum of optimization tasks and are less sensitive to variations in test samples. By applying these metrics, we observe that LLMs exhibit strong optimization capabilities when dealing with small-sized samples. However, their performance is significantly influenced by factors like data size and values, underscoring the importance of further research in the domain of optimization tasks for LLMs.

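Motivation and Goals

Evaluate LLMs' ability to perform interactive optimization across different tasks and data sizes.
Introduce metrics that quantify the progress, alignment, and stability of LLM-based optimization.
Identify factors that affect LLM optimization performance, such as data size and task type.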

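Proposed Method

Uses four optimization algorithms (gradient descent, hill climbing, grid search, black-box optimization) as case studies for LLMs.
Applies an iterative prompting framework with step-by-step Chain of Thought reasoning, generating and evaluating new solutions at each iteration.
Defines and computes three metrics (Goal, Policy, Uncertainty) to assess optimization progress, alignment with the ground-truth solution, and solution stability.
Generates synthetic datasets in [0,10]^d at varying dimensions to test sensitivity to data size.
Runs GPT-turbo-3.5 (0613) at temperature 0.8 across five dataset sizes, with ten iterations per repetition.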

Figure 1: Overview of our prompting strategy. (1) LLMs formulate the loss function based on given samples. (2) Given algorithm instructions and past results, LLM generates new solution. (3) Calculate loss of new solution and add the solution-score pairs to the prompt of next iteration. (4) Repeat se
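
The loop described in Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the model call is replaced by a hypothetical stub (`make_stub_proposer`) that takes one gradient step, the loss is fixed in code as a linear-regression mean squared error rather than formulated by the LLM, and all function names, the step size, and the task itself are illustrative choices.

```python
import random


def make_dataset(n, d, seed=0):
    """Sample n points uniformly from [0, 10]^d; targets come from a hidden linear model (assumed task)."""
    rng = random.Random(seed)
    true_w = [rng.uniform(0, 10) for _ in range(d)]
    xs = [[rng.uniform(0, 10) for _ in range(d)] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(true_w, row)) for row in xs]
    return xs, ys


def mse_loss(w, xs, ys):
    """Mean squared error of the linear model w on the samples."""
    total = 0.0
    for row, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, row))
        total += (pred - y) ** 2
    return total / len(xs)


def build_prompt(history):
    """Format past solution-score pairs as text for the next iteration."""
    lines = ["Past solutions and their losses:"]
    lines += [f"w={sol} loss={loss:.4f}" for sol, loss in history]
    lines.append("Propose a new w with a lower loss.")
    return "\n".join(lines)


def make_stub_proposer(xs, ys, lr):
    """Hypothetical stand-in for the LLM call: one gradient step from the best past solution."""
    def propose(prompt, history):
        # A real model would read `prompt`; this stub ignores it.
        w, _ = min(history, key=lambda pair: pair[1])
        grad = [0.0] * len(w)
        for row, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, row)) - y
            for j, xj in enumerate(row):
                grad[j] += 2.0 * err * xj / len(xs)
        return [wi - lr * g for wi, g in zip(w, grad)]
    return propose


def optimize(propose, xs, ys, init, steps=10):
    """Generate a solution, score it, and feed the pair back into the next prompt."""
    history = [(init, mse_loss(init, xs, ys))]
    for _ in range(steps):
        prompt = build_prompt(history)
        candidate = propose(prompt, history)
        history.append((candidate, mse_loss(candidate, xs, ys)))
    return history
```

Calling `optimize(make_stub_proposer(xs, ys, lr=0.002), xs, ys, init=[0.0, 0.0])` on a dataset from `make_dataset(n=5, d=2)` returns the full list of solution-loss pairs. Swapping the stub for a real model call only requires a function with the same `(prompt, history)` signature.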

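Experimental Results

Research Questions

RQ1: Can LLMs act as optimizers in interactive, iterative prompting settings across different optimization paradigms?
RQ2: How do data size and task type affect LLMs' optimization performance, stability, and alignment with the ground-truth algorithm?
RQ3: Do the proposed Goal, Policy, and Uncertainty metrics robustly capture optimization performance across tasks and data sizes?
RQ4: In gradient-based and grid-search tasks, to what extent do LLMs match or exceed the ground-truth solution, and where (for example, in metaheuristic hill climbing) do they struggle?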
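
This summary does not reproduce the formulas for the three metrics, so the Python sketch below only illustrates the kind of quantity each one is described as measuring: progress, alignment with a reference, and stability. The three functions are assumed proxies with hypothetical names, not the paper's definitions.

```python
from statistics import mean, pvariance


def goal_proxy(initial_loss, final_losses):
    """Assumed proxy for progress: mean relative loss reduction over repeated runs."""
    return mean((initial_loss - f) / initial_loss for f in final_losses)


def policy_proxy(final_losses, reference_loss, initial_loss):
    """Assumed proxy for alignment: mean gap to a reference solver's loss, scaled by the initial loss."""
    return mean(f - reference_loss for f in final_losses) / initial_loss


def uncertainty_proxy(final_losses):
    """Assumed proxy for stability: population variance of the final losses over repeated runs."""
    return pvariance(final_losses)
```

All three take the final losses from repeated runs of the same task. `initial_loss` must be positive, since it is used as the normalizer.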

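Key Findings

LLMs show strong optimization capability across multiple tasks, especially with small data samples.
Gradient descent is the strongest task, and at some data sizes the LLM even outperforms the ground-truth solution.
Grid search remains strong even when the search space is large, whereas hill climbing poses clear challenges.
Black-box optimization on small samples reveals LLMs' intrinsic optimization ability, but performance declines as data size grows.
Uncertainty tends to be higher at small data sizes and falls as data size grows, indicating that stability improves with more data.
Self-consistency prompting can improve stability for some models (such as GPT-4) but may be ineffective for others (such as GPT-turbo-3.5).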
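
Self-consistency prompting, mentioned in the last finding, generally means sampling several responses to the same prompt and keeping the answer they agree on most. A minimal Python sketch of the aggregation step is shown below, assuming discrete candidate solutions such as grid points. The function name and the majority-vote rule are illustrative, and the summary does not say how the paper aggregates samples.

```python
from collections import Counter


def self_consistent_choice(candidates):
    """Return the candidate solution that appears most often among the sampled responses."""
    counts = Counter(tuple(c) for c in candidates)  # lists are unhashable, so count tuples
    winner, _ = counts.most_common(1)[0]
    return list(winner)
```

For continuous solutions, exact matches are rare, so a vote like this would need rounding or a different aggregate such as a median.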

Figure 2: The Goal Metric and the Policy Metric hover from positive to near zero, signifying (1) substantial optimization capability (2) remarkable alignment between LLM’s output and ground truth. When the Goal Metric is low and the Policy Metric is close to zero, it signifies that the LLM performs


This review was generated by AI and reviewed by human editors.