QUICK REVIEW

[Paper Review] Towards Optimizing with Large Language Models

Pei-Fu Guo, Ying‐Hsuan Chen|arXiv (Cornell University)|Oct 8, 2023

Natural Language Processing Techniques | 8 citations

TL;DR

The paper evaluates the optimization capabilities of Large Language Models (LLMs) across multiple optimization tasks using iterative prompting, introducing three evaluation metrics and examining data-size effects on performance.

ABSTRACT

In this work, we conduct an assessment of the optimization capabilities of LLMs across various tasks and data sizes. Each of these tasks corresponds to unique optimization domains, and LLMs are required to execute these tasks with interactive prompting. That is, in each optimization step, the LLM generates new solutions from the past generated solutions with their values, and then the new solutions are evaluated and considered in the next optimization step. Additionally, we introduce three distinct metrics for a comprehensive assessment of task performance from various perspectives. These metrics offer the advantage of being applicable for evaluating LLM performance across a broad spectrum of optimization tasks and are less sensitive to variations in test samples. By applying these metrics, we observe that LLMs exhibit strong optimization capabilities when dealing with small-sized samples. However, their performance is significantly influenced by factors like data size and values, underscoring the importance of further research in the domain of optimization tasks for LLMs.

Motivation & Objective

Assess LLMs' ability to perform interactive optimization across diverse tasks and data sizes.
Introduce metrics that quantify progress, alignment, and stability of LLM-based optimization.
Identify factors such as data size and task type that influence LLM optimization performance.

Proposed method

Use four optimization tasks (Gradient Descent, Hill Climbing, Grid Search, Black-Box Optimization) as case studies for LLMs.
Apply an iterative prompting framework with Chain of Thought reasoning to generate and evaluate new solutions per iteration.
Define and compute three metrics (Goal, Policy, Uncertainty) to evaluate optimization progress, alignment with ground truth, and solution stability.
Generate synthetic datasets in [0,10]^d with varying dimensionality to test sensitivity to data size.
Use GPT-3.5-turbo (0613) at temperature 0.8, across five dataset sizes, with ten iterations per repetition.
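
The iterative prompting loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (make_samples, make_loss, stub_proposer, optimize) are hypothetical, the mean-squared-distance loss is an assumed stand-in because the review does not state the paper's loss, and stub_proposer replaces the actual LLM call with a random perturbation so that the example runs offline.

```python
import random
from typing import Callable, List, Tuple

Point = List[float]
History = List[Tuple[Point, float]]  # (solution, loss) pairs fed back each step


def make_samples(n: int, d: int, seed: int = 0) -> List[Point]:
    """Draw n synthetic samples uniformly from [0, 10]^d."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 10.0) for _ in range(d)] for _ in range(n)]


def make_loss(samples: List[Point]) -> Callable[[Point], float]:
    """Assumed loss: mean squared distance from a candidate to the samples."""
    def loss(x: Point) -> float:
        total = sum(sum((xi - si) ** 2 for xi, si in zip(x, s)) for s in samples)
        return total / len(samples)
    return loss


def stub_proposer(history: History, rng: random.Random) -> Point:
    """Stand-in for the LLM: perturb the best solution seen so far.

    In the setup the review describes, this step would instead build a prompt
    from the algorithm instructions and the past solution-score pairs.
    """
    best, _ = min(history, key=lambda pair: pair[1])
    return [min(10.0, max(0.0, b + rng.gauss(0.0, 0.5))) for b in best]


def optimize(loss: Callable[[Point], float], start: Point,
             propose: Callable[[History, random.Random], Point],
             iterations: int = 10, seed: int = 0) -> History:
    """Propose, evaluate, and append to the history for a fixed number of steps."""
    rng = random.Random(seed)
    history: History = [(start, loss(start))]
    for _ in range(iterations):
        candidate = propose(history, rng)
        history.append((candidate, loss(candidate)))
    return history


if __name__ == "__main__":
    samples = make_samples(n=5, d=3)
    history = optimize(make_loss(samples), [0.0, 0.0, 0.0], stub_proposer)
    print(min(value for _, value in history))
```

Swapping stub_proposer for a function that queries a model would leave the loop unchanged, which is the main appeal of keeping the evaluator outside the model.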

Figure 1: Overview of our prompting strategy. (1) LLMs formulate the loss function based on given samples. (2) Given algorithm instructions and past results, LLM generates new solution. (3) Calculate loss of new solution and add the solution-score pairs to the prompt of next iteration. (4) Repeat se

Experimental results

Research questions

RQ1: Can LLMs act as optimizers in interactive, iterative-prompting settings across different optimization paradigms?
RQ2: How do data size and task type affect the optimization performance, stability, and alignment of LLMs with ground-truth algorithms?
RQ3: Do the proposed Goal, Policy, and Uncertainty metrics robustly capture optimization performance across tasks and data sizes?
RQ4: To what extent do LLMs achieve or surpass ground-truth performance in gradient-based and grid-search tasks, and where do they struggle (e.g., meta-heuristics like Hill Climbing)?

Key findings

LLMs show robust optimization capabilities, particularly with small data samples across tasks.
Gradient Descent is the strongest performer, surpassing the ground truth in some data-size settings.
Grid Search yields strong performance despite large search spaces, while Hill Climbing presents notable challenges.
Black-Box optimization with small data samples indicates inherent optimization ability of LLMs, though performance declines as data size grows.
Uncertainty tends to be higher with smaller data sizes and decreases as data size increases, indicating stability improves with more data.
Self-consistency prompting can improve stability for some models (e.g., GPT-4) but not others (e.g., GPT-3.5-turbo).
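
To make the three perspectives behind these findings concrete (progress, alignment with a ground-truth algorithm, and stability), here is a small sketch of proxy measures. The paper's exact formulas are not reproduced in this review, so goal_proxy, policy_proxy, and uncertainty_proxy are hypothetical stand-ins with assumed definitions and sign conventions, and they may differ from the Goal, Policy, and Uncertainty metrics as published.

```python
from statistics import mean, pvariance
from typing import Sequence


def goal_proxy(losses: Sequence[float]) -> float:
    """Assumed progress measure: relative loss reduction from first to last step."""
    first, last = losses[0], losses[-1]
    if first == 0:
        return 0.0  # nothing left to improve
    return (first - last) / abs(first)


def policy_proxy(llm_losses: Sequence[float],
                 reference_losses: Sequence[float]) -> float:
    """Assumed alignment measure: final-loss gap to a reference algorithm.

    Values near zero mean the two runs end at a similar loss; negative values
    mean the LLM-driven run ended lower than the reference.
    """
    return llm_losses[-1] - reference_losses[-1]


def uncertainty_proxy(final_solutions: Sequence[Sequence[float]]) -> float:
    """Assumed stability measure: per-coordinate variance of the final
    solutions across repetitions, averaged over coordinates."""
    coordinates = list(zip(*final_solutions))
    return mean(pvariance(c) for c in coordinates)


if __name__ == "__main__":
    print(goal_proxy([10.0, 5.0, 2.0]))
    print(policy_proxy([10.0, 2.0], [10.0, 1.5]))
    print(uncertainty_proxy([[0.0, 0.0], [2.0, 4.0]]))
```

Under these assumed definitions, a lower uncertainty_proxy across repetitions would correspond to the more stable behavior noted in the findings.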

Figure 2: The Goal Metric and the Policy Metric hover from positive to near zero, signifying (1) substantial optimization capability (2) remarkable alignment between LLM’s output and ground truth. When the Goal Metric is low and the Policy Metric is close to zero, it signifies that the LLM performs

This review was created by AI and reviewed by human editors.