QUICK REVIEW

[Paper Review] LESS: Selecting Influential Data for Targeted Instruction Tuning

Mengzhou Xia, Sadhika Malladi | arXiv (Cornell University) | 2024-02-06

Intelligent Tutoring Systems and Adaptive Learning | Citations: 14

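One-Line Summary

LESS is an optimizer-aware data selection method that uses Adam-compatible influence estimation and a low-rank gradient datastore to select a small, high-impact subset of the training data. Training on roughly 5% of the data often outperforms training on the full dataset, and the selected data transfers across model sizes and model families.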

ABSTRACT

Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.

Motivation and Objectives

Frame targeted instruction tuning as selecting data that minimizes loss on a specific downstream task.
Adapt influence-based data selection to Adam and variable-length instruction data.
Develop a scalable gradient datastore using LoRA and random projections for efficient data selection.
Demonstrate transferability of selected data across model sizes and families.
Provide qualitative evidence that LESS selects data aligning with the needed reasoning skills for target tasks.

Proposed Method

Adapts a first-order training-influence formulation to Adam, defining Inf_Adam as a gradient-based influence metric.
Uses LoRA to enable parameter-efficient warmup training for gradient feature extraction.
Constructs a gradient datastore by projecting gradients into a low-dimensional space via random projections (Johnson–Lindenstrauss) to enable efficient similarity computations.
Computes per-subtask validation gradient averages and scores candidate data using a max over subtasks of Inf_Adam to select a 5% training subset.
Performs data selection offline with a selection model M_S and trains the target model M_T on the chosen subset, enabling transfer (LESS-T).
Evaluates using three downstream datasets (MMLU, TydiQA, BBH) across multiple base models (Llama-2-7B, Llama-2-13B, Mistral-7B).
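The selection steps listed above can be sketched in a few lines of NumPy. This is a minimal toy illustration, not the authors' implementation: the function names, the dimensions, the random stand-in gradients, and the toy optimizer moments `m` and `v` are all assumptions made for this example. It assumes the Adam-aware feature for a training example is Adam's update direction computed from that example's gradient, that the random projection uses random-sign entries, and that candidates are scored by cosine similarity against each subtask's average validation gradient.

```python
import numpy as np

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update direction for per-example gradients, given optimizer moments m and v."""
    m_hat = (beta1 * m + (1 - beta1) * grad) / (1 - beta1 ** step)
    v_hat = (beta2 * v + (1 - beta2) * grad ** 2) / (1 - beta2 ** step)
    return m_hat / (np.sqrt(v_hat) + eps)

def random_projection(dim_in, dim_out, rng):
    """Random-sign projection matrix, scaled so norms are roughly preserved."""
    signs = rng.choice([-1.0, 1.0], size=(dim_in, dim_out))
    return signs / np.sqrt(dim_out)

def cosine_rows(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return a @ b.T

def select_top_fraction(train_feats, subtask_val_feats, fraction=0.05):
    """Score each candidate by its best-matching subtask and keep the top fraction."""
    scores = cosine_rows(train_feats, subtask_val_feats).max(axis=1)
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores)[:k], scores

# Toy data: random vectors stand in for LoRA gradients (hypothetical sizes).
rng = np.random.default_rng(0)
n_train, n_subtasks, dim_in, dim_out = 200, 3, 512, 64
proj = random_projection(dim_in, dim_out, rng)

train_grads = rng.normal(size=(n_train, dim_in))
m = rng.normal(scale=0.1, size=dim_in)    # toy first-moment estimate
v = rng.uniform(0.5, 1.5, size=dim_in)    # toy second-moment estimate
train_feats = adam_direction(train_grads, m, v, step=10) @ proj

val_grads = rng.normal(size=(n_subtasks, dim_in))  # one averaged gradient per subtask
val_feats = val_grads @ proj

selected, scores = select_top_fraction(train_feats, val_feats, fraction=0.05)
```

Because the datastore holds only the projected features, scoring a new target task in this sketch needs only that task's projected validation gradients, which is what makes the stored features reusable.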

Experimental Results

Research Questions

RQ1: Can targeted instruction tuning be improved by selecting data that directly minimizes loss on a target validation task?
RQ2: How can influence-based data selection be made compatible with Adam and variable-length instruction data?
RQ3: Is a low-dimensional gradient datastore sufficient and efficient for selecting influential data?
RQ4: Does data selected by a small model transfer effectively to larger models or different model families?
RQ5: Does LESS select data based on underlying reasoning skills rather than surface cues?

Key Findings

Training on a 5% LESS-selected subset often outperforms training on the full dataset across diverse tasks and models.
Data selected by LESS transfers well: data chosen by a small model boosts performance for larger models and for models from different families.
LESS consistently outperforms baselines like random selection, BM25, DSIR, and RDS across MMLU, TydiQA, and BBH.
Using a small warmup subset (5%) with multiple gradient checkpoints improves influence estimation and final accuracy; more warmup data and more checkpoints typically help.
Qualitative analysis shows LESS selects data that align with the required reasoning skills for the target task, not just surface text similarity.
In the transfer setting (LESS-T), data selected with Llama-2-7B gradients yields strong results when training Llama-2-13B or Mistral-7B.
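The finding about multiple gradient checkpoints can be made concrete with a small sketch. It assumes, as in first-order influence approximations, that the per-checkpoint cosine similarities are combined as a learning-rate-weighted sum before taking the maximum over subtasks. The array shapes, learning-rate values, and function names below are hypothetical and chosen only for illustration.

```python
import numpy as np

def unit_rows(x, eps=1e-12):
    """Normalize the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def aggregate_influence(train_feats, val_feats, learning_rates):
    """
    train_feats: (n_ckpt, n_train, k) projected training features per checkpoint
    val_feats:   (n_ckpt, n_subtasks, k) projected validation features per checkpoint
    learning_rates: (n_ckpt,) weight for each checkpoint
    Returns one score per training example: max over subtasks of the weighted sum.
    """
    cos = np.einsum("cnk,csk->cns", unit_rows(train_feats), unit_rows(val_feats))
    weighted = np.tensordot(np.asarray(learning_rates), cos, axes=(0, 0))
    return weighted.max(axis=1)

# Toy data: 4 checkpoints, 50 candidates, 2 subtasks, 16-dimensional features.
rng = np.random.default_rng(0)
lrs = np.array([2e-5, 1.5e-5, 1e-5, 5e-6])
train_feats = rng.normal(size=(4, 50, 16))
val_feats = rng.normal(size=(4, 2, 16))
scores = aggregate_influence(train_feats, val_feats, lrs)
```

In this sketch each additional checkpoint adds another similarity term to the sum, so a candidate must align with the target gradients at several points in training to score highly.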


This review was generated by AI and reviewed by a human editor.