[Paper Review] Towards Optimizing with Large Language Models
The paper evaluates the optimization capabilities of Large Language Models (LLMs) across multiple optimization tasks using iterative prompting, introducing three evaluation metrics and examining data-size effects on performance.
In this work, we conduct an assessment of the optimization capabilities of LLMs across various tasks and data sizes. Each of these tasks corresponds to unique optimization domains, and LLMs are required to execute these tasks with interactive prompting. That is, in each optimization step, the LLM generates new solutions from the past generated solutions with their values, and then the new solutions are evaluated and considered in the next optimization step. Additionally, we introduce three distinct metrics for a comprehensive assessment of task performance from various perspectives. These metrics offer the advantage of being applicable for evaluating LLM performance across a broad spectrum of optimization tasks and are less sensitive to variations in test samples. By applying these metrics, we observe that LLMs exhibit strong optimization capabilities when dealing with small-sized samples. However, their performance is significantly influenced by factors like data size and values, underscoring the importance of further research in the domain of optimization tasks for LLMs.
Motivation & Objective
- Assess LLMs' ability to perform interactive optimization across diverse tasks and data sizes.
- Introduce metrics that quantify progress, alignment, and stability of LLM-based optimization.
- Identify factors such as data size and task type that influence LLM optimization performance.
Proposed method
- Use four optimization algorithms (Gradient Descent, Hill Climbing, Grid Search, Black Box Optimization) as case studies for LLMs.
- Apply an iterative prompting framework with Chain of Thought reasoning to generate and evaluate new solutions per iteration.
- Define and compute three metrics (Goal, Policy, Uncertainty) to evaluate optimization progress, alignment with ground truth, and solution stability.
- Generate synthetic datasets in [0,10]^d with varying dimensionality to test sensitivity to data size.
- Use GPT-3.5-turbo (0613) with temperature 0.8 across five dataset sizes, with ten iterations per repetition.
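
The iterative loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_dataset`, `objective`, `stub_proposer`, and `optimize` are hypothetical names, the objective (mean squared distance to the data points) is chosen only for the example, and `stub_proposer` is a random-perturbation stand-in for the step where the paper would prompt the LLM with the history of solutions and their values.

```python
import random
from typing import Callable, List, Tuple

Solution = List[float]
History = List[Tuple[Solution, float]]  # (solution, objective value) pairs


def make_dataset(n: int, d: int, seed: int = 0) -> List[Solution]:
    """Sample n synthetic points uniformly from [0, 10]^d."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 10.0) for _ in range(d)] for _ in range(n)]


def objective(sol: Solution, data: List[Solution]) -> float:
    """Hypothetical objective: mean squared distance from sol to the data."""
    total = sum(sum((s - x) ** 2 for s, x in zip(sol, point)) for point in data)
    return total / len(data)


def stub_proposer(history: History, rng: random.Random) -> Solution:
    """Stand-in for the LLM call (assumption): perturb the best past solution.

    In the setting the review describes, this is where the model would be
    prompted with past solutions and their values and asked for a new one.
    """
    best, _ = min(history, key=lambda item: item[1])
    # Clamp each coordinate so candidates stay inside [0, 10]^d.
    return [min(10.0, max(0.0, v + rng.gauss(0.0, 0.5))) for v in best]


def optimize(
    data: List[Solution],
    propose: Callable[[History, random.Random], Solution],
    iterations: int = 10,
    seed: int = 0,
) -> History:
    """Run the propose -> evaluate -> append-to-history loop."""
    rng = random.Random(seed)
    start = [rng.uniform(0.0, 10.0) for _ in range(len(data[0]))]
    history: History = [(start, objective(start, data))]
    for _ in range(iterations):
        candidate = propose(history, rng)
        # Each new solution is evaluated and fed into the next step.
        history.append((candidate, objective(candidate, data)))
    return history


if __name__ == "__main__":
    points = make_dataset(n=5, d=3, seed=1)
    run = optimize(points, stub_proposer, iterations=10, seed=1)
    print("initial value:", run[0][1])
    print("best value:   ", min(value for _, value in run))
```

Passing the proposer in as a function keeps the loop independent of how candidates are generated, so a real model call could replace `stub_proposer` without changing `optimize`.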

Experimental results
Research questions
- RQ1: Can LLMs act as optimizers in interactive, iterative-prompting settings across different optimization paradigms?
- RQ2: How do data size and task type affect the optimization performance, stability, and alignment of LLMs with ground-truth algorithms?
- RQ3: Do the proposed Goal, Policy, and Uncertainty metrics robustly capture optimization performance across tasks and data sizes?
- RQ4: To what extent do LLMs achieve or surpass ground-truth performance in gradient-based and grid-search tasks, and where do they struggle (e.g., meta-heuristics like Hill Climbing)?
Key findings
- LLMs show robust optimization capabilities, particularly with small data samples across tasks.
- Gradient Descent is the strongest task for the LLM, which sometimes surpasses the ground truth at certain data sizes.
- Grid Search yields strong performance despite large search spaces, while Hill Climbing presents notable challenges.
- Black Box Optimization with small data samples suggests an inherent optimization ability in LLMs, though performance declines as data size grows.
- Uncertainty tends to be higher with smaller data sizes and decreases as data size increases, indicating stability improves with more data.
- Self-consistency prompting can improve stability for some models (e.g., GPT-4) but not others (e.g., GPT-3.5-turbo).

This review was created by AI and reviewed by human editors.