QUICK REVIEW

[Paper Review] LLM-Select: Feature Selection with Large Language Models

Daniel P. Jeong, Zachary C. Lipton | arXiv (Cornell University) | Jul 2, 2024
Natural Language Processing Techniques | 7 citations
TL;DR

The paper shows that large language models can perform feature selection for supervised tasks using only feature names and a description of the prediction task, achieving performance competitive with data-driven methods like LASSO across multiple datasets and prompting strategies.

ABSTRACT

In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could benefit practitioners in domains like healthcare and the social sciences, where collecting high-quality data comes at a high cost.

Motivation & Objective

  • Demonstrate that LLMs can identify informative features for supervised learning using only feature names and a target description.
  • Propose three LLM-based feature selection methods and compare them to traditional baselines across datasets.
  • Evaluate robustness of prompting strategies and decoding methods on small- and large-scale datasets.
  • Analyze how LLM-generated feature importance correlates with standard feature-importance metrics.

Proposed method

  • Propose three prompting-based feature selection approaches: LLM-Score (importance scores), LLM-Rank (ranking), and LLM-Seq (sequential dialogue).
  • Prompt LLMs (GPT-4, GPT-3.5, Llama-2) with each feature concept c (the feature's name) and the target concept c_y to obtain scores, ranks, or dialogue-driven selections.
  • Evaluate on real-world datasets using downstream models (logistic/linear regression, LightGBM, MLP) with varying feature subsets.
  • Compare against data-driven baselines (LASSO, LassoNet, MRMR, MI, forward/backward selection, RFE, random).
  • Test zero-shot prompts and decoding strategies (greedy vs. self-consistency) and analyze prompt variations and model scale.
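
The LLM-Score idea listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the [0, 1] score range, the helper names (build_score_prompt, parse_score, llm_score, select_top_k), and the fake_llm stub with its canned replies are all hypothetical. Averaging n_samples replies is used here as a simple stand-in for self-consistency decoding; with n_samples=1 it corresponds to a single zero-shot query per feature.

```python
from statistics import mean
from typing import Callable, Dict, List


def build_score_prompt(feature: str, target: str) -> str:
    """Zero-shot prompt asking for one importance score (illustrative wording)."""
    return (
        f'Rate the importance of the feature "{feature}" for predicting '
        f'"{target}" on a scale from 0 to 1. Answer with a single number.'
    )


def parse_score(reply: str) -> float:
    """Parse the reply as a float and clamp it to [0, 1]."""
    return min(1.0, max(0.0, float(reply.strip())))


def llm_score(
    features: List[str],
    target: str,
    ask_llm: Callable[[str], str],
    n_samples: int = 1,
) -> Dict[str, float]:
    """Query the model once per feature; average n_samples replies per feature."""
    scores = {}
    for feature in features:
        prompt = build_score_prompt(feature, target)
        scores[feature] = mean(parse_score(ask_llm(prompt)) for _ in range(n_samples))
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; the replies are made up."""
    canned = {"blood pressure": "0.9", "age": "0.7", "favorite color": "0.1"}
    for name, reply in canned.items():
        if f'"{name}"' in prompt:
            return reply
    return "0.0"


scores = llm_score(["favorite color", "blood pressure", "age"], "heart failure", fake_llm)
top2 = select_top_k(scores, k=2)
print(top2)
```

To try this with a real model, fake_llm would be replaced by a function that sends the prompt to an LLM and returns its text reply. The selected names would then be used to subset the training data before fitting a downstream model.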

Experimental results

Research questions

  • RQ1: Can an LLM identify the most predictive features for a target outcome using only feature names and a task description without access to downstream training data?
  • RQ2: How do three prompting strategies (scores, ranking, sequential dialogue) compare in effectiveness for feature selection across datasets?
  • RQ3: What is the impact of model scale and prompting variations on feature selection performance and stability?
  • RQ4: Do LLM-generated feature importance scores correlate with standard feature-importance metrics such as SHAP, Fisher score, and mutual information?
  • RQ5: Is LLM-based feature selection viable on large-scale, high-dimensional real-world datasets (thousands of features)?

Key findings

  • LLMs with sufficient scale (e.g., GPT-4) achieve strong feature selection performance competitive with data-driven baselines like LASSO on real-world data.
  • All three LLM-based methods (Score, Rank, Seq) yield similarly strong performance, with GPT-4 showing consistent results across mechanisms.
  • Zero-shot prompting with greedy decoding often matches or exceeds more complex prompting variations, indicating a strong baseline.
  • Importance scores from LLM-Score correlate more strongly with standard feature-importance metrics as model scale increases, though no single metric consistently dominates.
  • On large-scale datasets (≈3000 features), GPT-4 LLM-Score remains competitive against baselines like MRMR and outperforms random selection, particularly at low feature percentages (e.g., top 10-30%).
  • Results generalize across domains, including healthcare (MIMIC-IV) and folktables datasets, suggesting practicality for domains with costly data collection.
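
The correlation finding above can be made concrete with a rank correlation between LLM scores and a data-driven importance metric. The sketch below computes Spearman's rank correlation using only the Python standard library. It is an illustration, not the paper's analysis code: the choice of Spearman correlation is an assumption, and both the LLM scores and the reference values (standing in for something like mutual information) are made-up numbers.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


features = ["blood pressure", "age", "bmi", "favorite color"]
llm_scores = [0.9, 0.7, 0.6, 0.1]        # made-up LLM importance scores
reference = [0.30, 0.35, 0.10, 0.01]     # made-up data-driven importances

rho = spearman(llm_scores, reference)
print(f"Spearman rho = {rho:.2f}")
```

In this toy example the two rankings disagree only on the order of the top two features, so the correlation is high but below 1.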

This review was created by AI and reviewed by human editors.