QUICK REVIEW

[Paper Review] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Srinivasan Iyer, Xi Victoria Lin | arXiv (Cornell University) | 2022-12-22

Topic Modeling | Citations: 85

One-Line Summary

This paper builds OPT-IML Bench from 2000 NLP tasks to study instruction-tuning decisions, and trains OPT-IML 30B and 175B to achieve three levels of generalization across multiple benchmarks.

ABSTRACT

Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.

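Motivation and Objectives

Characterize how instruction-tuning decisions affect downstream generalization when both model size and benchmark size are scaled.
Create OPT-IML Bench, a large 2000-task NLP benchmark consolidated from 8 existing benchmarks, to study generalization across task categories, within seen categories, and at the instance level.
Train instruction-tuned OPT-IML models (30B and 175B) and evaluate them on diverse benchmarks to establish best practices.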

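Proposed Method

Consolidate 8 instruction-tuning benchmarks into OPT-IML Bench, with about 1,991 tasks and more than 100 categories.
Unify the instruction format into a two-part schema (instruction/output) and build train/validation/test splits with fully held-out, partially held-out, and fully supervised evaluation settings.
Fine-tune OPT-30B and OPT-175B with a next-token prediction objective that predicts the target (label) sequence conditioned on the source (instruction/input) sequence.
Use document attention masking during sequence packing so that each example attends only to its own tokens.
Experiment with dataset mixtures, task count and diversity, demonstrations, and benchmark proportions to analyze their effect on each level of generalization.
Release the OPT-IML models and the OPT-IML Bench evaluation framework.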
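The packing and masking steps above can be illustrated with a small sketch. It is not the authors' training code: the function names `pack_examples` and `document_attention_mask`, the token ids, and the choice to flag target tokens for the loss are assumptions made for illustration. The sketch concatenates several (source, target) examples into one sequence, marks which tokens belong to the target, and builds a causal mask that blocks attention across example boundaries.

```python
from typing import List, Tuple

# Hypothetical representation: (source token ids, target token ids).
Example = Tuple[List[int], List[int]]


def pack_examples(examples: List[Example]):
    """Concatenate examples into one sequence and record bookkeeping.

    Returns the packed token ids, a per-token document id, and a per-token
    flag that is True only for target tokens (assumed here to be the
    tokens that contribute to the loss).
    """
    tokens: List[int] = []
    doc_ids: List[int] = []
    loss_mask: List[bool] = []
    for doc_id, (source, target) in enumerate(examples):
        tokens.extend(source + target)
        doc_ids.extend([doc_id] * (len(source) + len(target)))
        loss_mask.extend([False] * len(source) + [True] * len(target))
    return tokens, doc_ids, loss_mask


def document_attention_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    A position attends only to itself and earlier positions (causal),
    and only within the same packed example (document).
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two toy examples packed into a single sequence of six tokens.
examples = [([11, 12], [13]), ([21], [22, 23])]
tokens, doc_ids, loss_mask = pack_examples(examples)
mask = document_attention_mask(doc_ids)
```

In this toy case, position 3 (the first token of the second example) cannot attend to position 2 (the last token of the first example), even though it comes earlier in the packed sequence. In a real training loop the boolean mask would be handed to the attention layer and the loss flags would be applied after the usual one-position shift for next-token prediction; both details are left out here.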

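Experimental Results

Research Questions

RQ1: How do scaling factors such as the number of tasks, benchmark diversity, and instruction format affect generalization to held-out task categories?
RQ2: What are the trade-offs among instruction-tuning decisions, such as demonstrations, reasoning data, and dialogue data, when scaling model and benchmark sizes?
RQ3: How does varying the maximum task mixing rate (EPS) change zero-shot and few-shot performance at each generalization level?
RQ4: How does adjusting benchmark proportions (the share of each dataset) affect cross-benchmark generalization?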

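Key Results

OPT-IML outperforms the base OPT model on all four instruction-tuning benchmarks in both zero-shot and few-shot settings.
Using diverse benchmarks and a broad range of tasks improves generalization to held-out categories and tasks.
Varying the maximum mixing rate shows that raising EPS helps up to a threshold, with diminishing returns beyond it.
Balancing benchmark proportions can improve performance in the held-out and partially supervised settings, underscoring the value of diverse training data.
OPT-IML achieves performance competitive with models fine-tuned on each individual benchmark.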
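To make the mixing-rate result concrete, here is a minimal sketch of size-proportional task sampling with a cap. The review does not define EPS, so this assumes it denotes the per-task cap used in examples-proportional sampling; the function name `capped_proportional_weights`, the task names, and the sizes are hypothetical. Each task is sampled in proportion to its number of examples, but no task counts for more than the cap.

```python
from typing import Dict


def capped_proportional_weights(task_sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Return sampling probabilities proportional to min(task size, cap).

    `cap` plays the role assumed here for the EPS setting: a larger cap
    lets big tasks dominate the mixture, a smaller cap flattens it.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    if total == 0:
        raise ValueError("at least one task must have examples")
    return {name: count / total for name, count in capped.items()}


# Hypothetical task sizes: one very large task and two smaller ones.
sizes = {"big_task": 100_000, "mid_task": 5_000, "small_task": 500}
weights = capped_proportional_weights(sizes, cap=1_000)
# big_task and mid_task are both capped at 1,000 examples, so they get
# equal weight (0.4 each), while small_task keeps its true share (0.2).
```

With the cap at 1,000 the largest task no longer swamps the mixture; raising the cap toward 100,000 would push nearly all sampling probability onto `big_task`.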

This review was generated by AI and reviewed by a human editor.