QUICK REVIEW

[Paper Review] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Damai Dai, Yutao Sun | arXiv (Cornell University) | December 20, 2022

Topic Modeling | Citations: 25

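One-Line Summary

This paper explains in-context learning (ICL) as implicit finetuning, which exposes a dual form between Transformer attention and gradient descent, and shows that momentum-based attention improves both ICL and language modeling.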

ABSTRACT

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at https://aka.ms/icl.

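Motivation and Objectives

Motivation: understand how large GPT models perform in-context learning without any parameter updates.
Present a theoretical view in which Transformer attention implements a dual form of gradient descent.
Empirically compare ICL and explicit finetuning on real NLP tasks to test the implicit-finetuning view.
Introduce a momentum-based attention mechanism, inspired by gradient descent with momentum, to improve performance.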

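Proposed Method

Derive the dual form between Transformer attention and gradient descent, showing that attention can act like a gradient-based update.
Frame ICL as meta-optimization: the pretrained GPT produces meta-gradients from the demonstration examples and applies them through attention.
Compare ICL and finetuning on six classification tasks to show how similar they are in predictions, attention outputs, and token-level attention.
Design and evaluate momentum-based attention (MoAttn), which applies an exponential moving average (EMA) to the attention values to simulate a gradient update with momentum.
Run language modeling experiments to test whether momentum-based attention lowers perplexity and improves downstream ICL tasks.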
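
The dual form in the first step can be checked numerically on a toy linear layer. The sketch below is an illustration, not the authors' code: it assumes the accumulated gradient update is a sum of outer products of error signals and training inputs, and it uses unnormalized linear attention (no softmax). All names (`W0`, `X_train`, `E`, `gd_forward`, `linear_attention_forward`) and the random data are hypothetical.

```python
import random

random.seed(0)
d_in, d_out, n_train = 4, 3, 5

def rand_matrix(rows, cols):
    """Matrix of standard-normal entries, stored as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

W0 = rand_matrix(d_out, d_in)         # initial weights of a linear layer
X_train = rand_matrix(n_train, d_in)  # training inputs x'_i (hypothetical data)
E = rand_matrix(n_train, d_out)       # error signals e_i, assumed pre-scaled by the learning rate

def gd_forward(x):
    """Accumulate delta_W = sum_i outer(e_i, x'_i), then apply (W0 + delta_W) to x."""
    W = [row[:] for row in W0]
    for e, xt in zip(E, X_train):
        for r in range(d_out):
            for c in range(d_in):
                W[r][c] += e[r] * xt[c]
    return matvec(W, x)

def linear_attention_forward(x):
    """W0 x plus unnormalized linear attention: values e_i, keys x'_i, query x."""
    out = matvec(W0, x)
    for e, xt in zip(E, X_train):
        score = dot(xt, x)            # key-query dot product, no softmax
        for r in range(d_out):
            out[r] += e[r] * score
    return out

x = [random.gauss(0.0, 1.0) for _ in range(d_in)]
out_gd = gd_forward(x)
out_attn = linear_attention_forward(x)
max_gap = max(abs(a - b) for a, b in zip(out_gd, out_attn))
print(max_gap < 1e-9)  # the two views agree up to floating-point error
```

The two functions compute the same quantity in a different order: one updates the weights first, the other attends over the training examples at query time. Softmax attention in a real Transformer only approximates this identity.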

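Experimental Results

Research Questions

RQ1: Can Transformer attention be interpreted as performing a gradient-descent-like update (the dual form) that underlies ICL?
RQ2: Does ICL empirically behave like explicit finetuning in terms of predictions and internal representations?
RQ3: Does introducing momentum into attention further improve ICL and language modeling, thereby supporting the meta-optimization view?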
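
One way to make RQ2 concrete is to compare the direction in which ICL and finetuning each move a representation away from the zero-shot baseline. The sketch below is an assumption made for illustration and is not necessarily the exact metric used in the paper: it takes three hypothetical vectors for the same query (zero-shot, ICL, finetuned) and returns the cosine similarity of the two updates.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def update_similarity(h_zero_shot, h_icl, h_finetuned):
    """Cosine between (ICL - zero-shot) and (finetuned - zero-shot) updates."""
    icl_update = [a - z for a, z in zip(h_icl, h_zero_shot)]
    ft_update = [a - z for a, z in zip(h_finetuned, h_zero_shot)]
    return cosine(icl_update, ft_update)

# Hypothetical attention outputs for one query token.
h_zsl = [0.0, 0.0, 0.0]
h_icl = [1.0, 2.0, 0.0]
h_ft = [2.0, 4.0, 0.0]   # same direction as the ICL update, twice the size
sim = update_similarity(h_zsl, h_icl, h_ft)
opposite = update_similarity([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
```

A value near 1 means the two settings push the representation in the same direction, a value near 0 means the updates are unrelated, and a negative value means they point in opposite directions.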

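Key Findings

ICL and explicit finetuning share a dual view of gradient descent; ICL relies on meta-gradients produced by forward computation.
Empirical evidence from six classification tasks shows that ICL behaves similarly to finetuning in both predictions and attention dynamics.
ICL tends to produce attention updates and attention weights that resemble those produced by finetuning, which suggests similar changes in representation.
MoAttn consistently improves language modeling perplexity and ICL accuracy over vanilla attention.
Momentum-based attention demonstrates the practical value of the meta-optimization view for future model design.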
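
To illustrate the idea behind MoAttn, here is a minimal single-head sketch in plain Python. It assumes that the momentum term at position t is an exponentially decayed sum of the value vectors from earlier positions, added to ordinary causal softmax attention. The decay factor `eta`, the summation range, and the lack of normalization are assumptions of this sketch, so the paper's exact formulation may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position t attends to positions 0..t."""
    d = len(Q[0])
    outputs = []
    for t, q in enumerate(Q):
        scores = [sum(k * x for k, x in zip(K[i], q)) / math.sqrt(d)
                  for i in range(t + 1)]
        weights = softmax(scores)
        outputs.append([sum(w * V[i][c] for i, w in enumerate(weights))
                        for c in range(len(V[0]))])
    return outputs

def momentum_attention(Q, K, V, eta=0.9):
    """Causal attention plus a decayed sum of earlier values (assumed EMA form)."""
    base = causal_attention(Q, K, V)
    momentum = [0.0] * len(V[0])      # no earlier values at the first position
    outputs = []
    for t in range(len(Q)):
        outputs.append([b + m for b, m in zip(base[t], momentum)])
        # Recurrence for sum over i < t+1 of eta**(t+1-i) * v_i.
        momentum = [eta * (m + v) for m, v in zip(momentum, V[t])]
    return outputs

# Hypothetical toy sequence of three tokens with 2-dimensional heads.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = causal_attention(Q, K, V)
out = momentum_attention(Q, K, V, eta=0.9)
```

Setting `eta=0` recovers vanilla causal attention, which makes it easy to compare the two variants in the same code path.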


This review was generated by AI and reviewed by a human editor.