QUICK REVIEW

[Paper Review] Provably Learning Attention with Queries

Satwik Bhattamishra, Kulin Shah|arXiv (Cornell University)|Jan 23, 2026

Stochastic Gradient Optimization Techniques|Cited by 0

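One-Sentence Summary

The paper gives provable algorithms that recover the parameters of single-head softmax attention from value queries, extends them to low-rank and noisy-oracle settings, and shows that multi-head attention is not identifiable without additional structure.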

ABSTRACT

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the corresponding real-valued output. We begin with the simplest case, a single-head softmax-attention regressor. We show that for a model with width $d$, there is an elementary algorithm to learn the parameters of single-head attention exactly with $O(d^2)$ queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, motivated by the regime where the head dimension $r \ll d$, we provide a randomised algorithm that learns single-head attention-based models with $O(rd)$ queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to $\varepsilon$ accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we show that multi-head attention parameters are not identifiable from value queries in general -- distinct parameterisations can induce the same input-output map. Hence, guarantees analogous to the single-head setting are impossible without additional structural assumptions.

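Motivation and Objectives

Motivate and formalize the problem of learning attention-based sequence models from black-box value queries.
Show that single-head attention parameters can be recovered exactly with polynomially many queries.
Extend the result to one-layer Transformers, assuming a learner for ReLU FFNs exists.
In the low-rank regime, use a compressed-sensing design to reduce the query complexity.
Analyze robustness to additive oracle noise and the identifiability limits of multi-head attention.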

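Proposed Method

Model single-head attention as $f_{W,v}(X) = \alpha(X,W)^{\top}(Xv)$, where $\alpha$ is the softmax of the scores $s_i = x_i^{\top} W x_N$.
Isolate the softmax by truncating the query sequences, reduce the problem to linear equations, and prove that $(W^*, v^*)$ can be recovered exactly with $O(d^2)$ value queries.
Show that combining the two-step method with an FFN learner yields a learner for one-layer Transformers with single-head attention.
In the low-rank case ($\mathrm{rank}(W^*) \le r$), design rank-one measurements and apply compressed sensing to recover the parameters with $O(rd)$ queries.
Study robustness by analyzing approximate value queries, obtaining $\varepsilon$-accurate recovery under mild norm bounds and margin conditions.
Prove that, in general, multi-head attention parameters are not identifiable from value queries.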
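
The truncation idea for the single-head case can be sketched in Python. This is a minimal illustration under stated assumptions, not necessarily the paper's construction: a length-1 query makes the softmax trivial and reveals $v$, and a length-2 query leaves a single sigmoid whose logit is linear in $W$. The function names `attention` and `recover`, the Gaussian probe matrix `U`, and the conventions that tokens are rows of $X$ and $x_N$ is the last row are all assumptions made for this example.

```python
import numpy as np

def attention(X, W, v):
    """Value oracle f_{W,v}(X) = alpha(X, W)^T (X v); rows of X are tokens (assumed)."""
    scores = X @ W @ X[-1]                 # s_i = x_i^T W x_N, with x_N the last token
    weights = np.exp(scores - scores.max())
    alpha = weights / weights.sum()        # softmax over positions
    return float(alpha @ (X @ v))

def recover(oracle, d, rng):
    """Recover (W, v) from d + d^2 value queries (illustrative sketch)."""
    basis = np.eye(d)
    # Step 1: a length-1 sequence has alpha = 1, so f([e_j]) = v_j.
    v_hat = np.array([oracle(e[None, :]) for e in basis])
    # Step 2: for the length-2 sequence [z + u, z],
    #   f = z.v + sigmoid(u^T W z) * (u.v),
    # so the logit of (f - z.v) / (u.v) equals u^T W z, which is linear in W.
    U = 0.5 * rng.normal(size=(d, d))      # generic probes: u.v != 0 almost surely if v != 0
    M = np.empty((d, d))
    for k, u in enumerate(U):
        for l, z in enumerate(basis):
            y = oracle(np.stack([z + u, z]))
            a = (y - z @ v_hat) / (u @ v_hat)
            M[k, l] = np.log(a / (1.0 - a))    # = u_k^T W e_l
    W_hat = np.linalg.solve(U, M)          # the measurements satisfy M = U W
    return W_hat, v_hat

rng = np.random.default_rng(0)
d = 4
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
v_true = rng.normal(size=d)
W_hat, v_hat = recover(lambda X: attention(X, W_true, v_true), d, rng)
print(np.abs(W_hat - W_true).max(), np.abs(v_hat - v_true).max())
```

The sketch uses $d$ queries for $v$ and $d^2$ queries for $W$, which matches the $O(d^2)$ order stated above. It assumes exact oracle outputs and moderate scores; with noisy outputs the logit step would need the norm and margin conditions mentioned in the robustness item.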

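Results

Research Questions

RQ1: Can the parameters of single-head softmax attention be recovered exactly from value queries?
RQ2: How does query complexity scale with the embedding dimension $d$ in the single-head setting, and can compressed sensing reduce it when $W^*$ is low-rank?
RQ3: Can a one-layer Transformer be learned from value queries by using an FFN learner as a subroutine?
RQ4: How robust are the recovery guarantees when the oracle's outputs are noisy or approximate?
RQ5: Can multi-head attention parameters be identified from value queries, and under what structural assumptions is identifiability possible?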
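
For intuition on RQ2, the sketch below shows why on the order of $rd$ bilinear measurements $u^{\top} W z$ can determine a rank-$r$ matrix, compared with $d^2$ for a general one. It is not the paper's compressed-sensing argument: it swaps in a simpler two-sided randomized sketch, and it assumes that $W$ has rank exactly $r$ and that each value query can be turned into one bilinear measurement. The names `recover_low_rank` and `measure` are hypothetical.

```python
import numpy as np

def recover_low_rank(measure, d, r, rng):
    """Rebuild a rank-r d x d matrix from 2*r*d bilinear measurements u^T W z."""
    basis = np.eye(d)
    Omega = rng.normal(size=(d, r))   # random right probes
    Psi = rng.normal(size=(d, r))     # random left probes
    # Column sketch Y = W @ Omega: entry (i, j) is e_i^T W omega_j (r*d measurements).
    Y = np.array([[measure(e, Omega[:, j]) for j in range(r)] for e in basis])
    # Row sketch R = Psi^T @ W: entry (k, j) is psi_k^T W e_j (r*d measurements).
    R = np.array([[measure(Psi[:, k], e) for e in basis] for k in range(r)])
    C = Psi.T @ Y                     # = Psi^T W Omega, computed without new measurements
    return Y @ np.linalg.solve(C, R)  # equals W when rank(W) = r and C is invertible

rng = np.random.default_rng(1)
d, r = 8, 2
W_true = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank-r ground truth
calls = [0]                                                  # measurement counter

def measure(u, z):
    calls[0] += 1
    return float(u @ W_true @ z)

W_hat = recover_low_rank(measure, d, r, rng)
print(calls[0], d * d, np.abs(W_hat - W_true).max())
```

Writing $W = AB^{\top}$ with $A, B \in \mathbb{R}^{d \times r}$, the product $Y C^{-1} R$ collapses back to $AB^{\top}$ whenever the $r \times r$ matrices $B^{\top}\Omega$ and $\Psi^{\top}A$ are invertible, which holds almost surely for Gaussian probes.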

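Main Findings

Single-head attention parameters can be recovered exactly with $O(d^2)$ value queries (Theorem 4.1).
Assuming a value-query learner for FFNs exists, the two-step method learns a one-layer Transformer with single-head attention.
In the low-rank case $\mathrm{rank}(W^*) \le r$, compressed sensing achieves recovery with $O(rd)$ queries.
With approximate value queries, $\varepsilon$-accurate recovery remains possible under mild norm bounds and margin conditions.
In general, multi-head attention parameters are not identifiable from value queries, so guarantees analogous to the single-head case are impossible without additional structure.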
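
The non-identifiability finding can be illustrated with one simple degenerate instance, which may differ from the construction used in the paper. The sketch assumes a multi-head model that sums single-head outputs, $F(X) = \sum_h \alpha(X, W_h)^{\top}(X v_h)$; that form, and the names `head` and `multi_head`, are assumptions of this example. When two heads share the same $W$, only $v_1 + v_2$ affects the output, so shifting mass between the value vectors changes the parameters but not the function.

```python
import numpy as np

def head(X, W, v):
    """Single-head output alpha(X, W)^T (X v); rows of X are tokens (assumed)."""
    scores = X @ W @ X[-1]
    weights = np.exp(scores - scores.max())
    return float((weights / weights.sum()) @ (X @ v))

def multi_head(X, heads):
    """Assumed multi-head model: the sum of the single-head outputs."""
    return sum(head(X, W, v) for W, v in heads)

rng = np.random.default_rng(2)
d = 5
W = rng.normal(size=(d, d)) / np.sqrt(d)
v1, v2, delta = rng.normal(size=(3, d))

params_a = [(W, v1), (W, v2)]
params_b = [(W, v1 + delta), (W, v2 - delta)]   # distinct parameters, same v1 + v2

# Compare the two models on random query sequences of several lengths.
queries = [rng.normal(size=(n, d)) for n in (1, 2, 3, 7)]
gap = max(abs(multi_head(X, params_a) - multi_head(X, params_b)) for X in queries)
print(gap)
```

No set of value queries can separate `params_a` from `params_b`, because the two output maps agree on every input up to floating-point rounding.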

This review was generated by AI and reviewed by human editors.