QUICK REVIEW

[Paper Review] Provably Learning Attention with Queries

Satwik Bhattamishra, Kulin Shah | arXiv (Cornell University) | 2026-01-23

Stochastic Gradient Optimization Techniques | Citations: 0

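One-line summary

This work gives provable algorithms for recovering the parameters of single-head softmax attention from value queries, extends them to low-rank and noisy-oracle settings, and shows that multi-head attention parameters are not identifiable without additional structure.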

ABSTRACT

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the corresponding real-valued output. We begin with the simplest case, a single-head softmax-attention regressor. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention exactly with $O(d^2)$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, motivated by the regime where the head dimension $r \ll d$, we provide a randomised algorithm that learns single-head attention-based models with $O(rd)$ queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to $\varepsilon$ accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we show that multi-head attention parameters are not identifiable from value queries in general -- distinct parameterisations can induce the same input-output map. Hence, guarantees analogous to the single-head setting are impossible without additional structural assumptions.

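Motivation and objectives

Motivate and formalize the problem of learning attention-based sequence models from black-box value queries.
Show exact parameter recovery for single-head attention with polynomial query complexity.
Extend the result to one-layer Transformers, assuming ReLU FFNs can be learned.
Develop an algorithm for the low-rank regime that reduces query complexity via compressed sensing.
Analyze robustness to additive noise in the oracle's outputs and the identifiability of multi-head attention.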

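Proposed method

Single-head attention is modeled as $f_{W,v}(X) = \alpha(X,W)^{\top}(Xv)$, where $\alpha$ is the softmax over the scores $s_i = x_i^{\top} W x_N$.
Short query sequences isolate the softmax and reduce recovery to linear equations, which shows that $(W^*, v^*)$ can be recovered exactly with $O(d^2)$ value queries.
Combining this with an FFN learner in a two-stage approach yields a method for learning one-layer Transformers with single-head attention.
In the low-rank regime, $\mathrm{rank}(W^*) \le r$, rank-one measurements are designed and compressed sensing is applied to show recovery with $O(rd)$ queries.
Robustness is analyzed by deriving $\varepsilon$-accurate recovery from approximate value queries under mild norm bounds and a margin assumption.
Multi-head attention parameters are shown to be non-identifiable from value queries in general.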
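The short-sequence idea can be illustrated with a minimal NumPy sketch. It is not taken from the paper and may differ from the algorithm behind Theorem 4.1. It assumes an exact oracle, $v^* \neq 0$, and tokens stored as rows of `X`. The names `attention_value`, `recover_parameters`, and the direction matrix `U` are hypothetical. A length-1 query has softmax weight 1, so it returns $x^{\top} v$. A length-2 query $[x_N + u,\ x_N]$ returns $x_N^{\top} v + a\,u^{\top} v$ with $a = \sigma(u^{\top} W x_N)$, so the logit of $a$ is a linear measurement of $W$.

```python
import numpy as np

def attention_value(X, W, v):
    """f_{W,v}(X) = softmax(s)^T (X v), with s_i = x_i^T W x_N (rows of X are tokens)."""
    scores = X @ W @ X[-1]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return float(weights @ (X @ v))

def recover_parameters(oracle, d):
    """Illustrative recovery of (W, v) from d + d^2 exact value queries (assumes v != 0)."""
    eye = np.eye(d)
    # Stage 1: a length-1 sequence has softmax weight 1, so f([e_i]) = v_i.
    v_hat = np.array([oracle(eye[i][None, :]) for i in range(d)])
    # Choose directions u_i = e_i +/- e_k (k = largest |v_k|) so that u_i^T v != 0.
    k = int(np.argmax(np.abs(v_hat)))
    signs = np.where(v_hat * v_hat[k] >= 0, 1.0, -1.0)
    U = eye + np.outer(signs, eye[k])  # row i is u_i; det(U) = 2, so U is invertible
    logits = np.empty((d, d))
    for j in range(d):
        x_last = eye[j]
        for i in range(d):
            out = oracle(np.stack([x_last + U[i], x_last]))
            # out = x_last^T v + a * u_i^T v, where a = sigmoid(u_i^T W e_j)
            a = (out - x_last @ v_hat) / (U[i] @ v_hat)
            logits[i, j] = np.log(a / (1.0 - a))  # equals (U W)[i, j]
    return np.linalg.solve(U, logits), v_hat

# Demo on a random instance (hypothetical ground truth, fixed seed).
rng = np.random.default_rng(0)
d = 5
W_true = 0.5 * rng.standard_normal((d, d))
v_true = rng.standard_normal(d)
calls = []

def oracle(X):
    calls.append(X.shape[0])  # record each query's sequence length
    return attention_value(X, W_true, v_true)

W_hat, v_hat = recover_parameters(oracle, d)
print("queries used:", len(calls))
print("max error in W:", np.abs(W_hat - W_true).max())
```

The sketch spends $d$ queries on $v$ and $d^2$ on $W$, matching the $O(d^2)$ count. Taking logits is numerically fragile when the attention weight is close to 0 or 1, which is one reason the noisy-oracle setting calls for norm and margin conditions.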

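Results

Research questions

RQ1: Can the parameters of single-head softmax attention be recovered exactly from value queries?
RQ2: How does query complexity scale with the embedding dimension $d$ for general single-head attention versus low-rank $W^*$, and can compressed sensing reduce it?
RQ3: Can one-layer Transformers be learned from value queries by using FFN learning as a subroutine?
RQ4: How robust are the recovery guarantees when the oracle returns noisy or approximate outputs?
RQ5: Are multi-head attention parameters identifiable from value queries, and can identifiability be achieved under specific structural assumptions?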

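Key results

Single-head attention parameters can be recovered exactly with $O(d^2)$ value queries (Theorem 4.1).
Assuming a value-query learner for FFNs exists, a two-stage method learns one-layer Transformers with single-head attention.
In the low-rank regime, when $\mathrm{rank}(W^*) \le r$, recovery is possible with $O(rd)$ queries via compressed sensing.
With approximate value queries, $\varepsilon$-accurate recovery is still possible under mild norm bounds and a margin condition.
In general, multi-head attention parameters are not identifiable from value queries, so guarantees analogous to the single-head case do not hold without additional structure.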
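One simple way non-identifiability can arise is sketched below. This is an illustration, not necessarily the paper's construction. It assumes a multi-head model whose output is the sum of single-head outputs, which this review does not spell out. The names `head_value` and `multi_head_value` are hypothetical. If two heads share the same $W$, their attention weights coincide, so only $v_1 + v_2$ affects the output and any shift $\delta$ between the two value vectors is invisible to value queries.

```python
import numpy as np

def head_value(X, W, v):
    """Single-head output softmax(s)^T (X v), with s_i = x_i^T W x_N."""
    scores = X @ W @ X[-1]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return float(weights @ (X @ v))

def multi_head_value(X, heads):
    """Assumed multi-head model: the sum of the single-head outputs."""
    return sum(head_value(X, W, v) for W, v in heads)

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d))
v1, v2, delta = rng.standard_normal((3, d))

# Two distinct parameterisations: move delta from one head's value vector to the other's.
heads_a = [(W, v1), (W, v2)]
heads_b = [(W, v1 + delta), (W, v2 - delta)]

# Compare the two models on random query sequences of several lengths.
queries = [rng.standard_normal((n, d)) for n in (1, 2, 5, 8)]
gap = max(abs(multi_head_value(X, heads_a) - multi_head_value(X, heads_b))
          for X in queries)
print("largest output gap:", gap)
```

The gap stays at floating-point rounding level on every query, although the two parameter sets differ, so no learner with value-query access alone could tell them apart.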


This review was generated by AI and checked by a human editor.