QUICK REVIEW

[Paper Review] Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Yelysei Bondarenko, Markus Nagel|arXiv (Cornell University)|Jun 22, 2023

Domain Adaptation and Few-Shot Learning | Cited by 10

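One-line summary

This paper introduces two architectural modifications to attention, clipped softmax and gated attention, that suppress activation outliers in transformers and make straightforward INT8 post-training quantization possible without extra fine-tuning. It demonstrates improved quantized performance across BERT, OPT, and ViT.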

ABSTRACT

Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
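
The abstract's claim that softmax must be driven by ever larger inputs to approach exact zeros can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper; the `softmax` helper and the chosen logit gaps are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One token (index 0) competes with three others. The larger the gap
# between its logit and the rest, the closer the other probabilities
# get to zero, but for these gaps they stay strictly positive.
for gap in (1.0, 5.0, 10.0, 20.0):
    probs = softmax([gap, 0.0, 0.0, 0.0])
    print(f"gap={gap:5.1f}  smallest prob={min(probs):.3e}")
```

Getting a probability close to zero requires a large gap between logits, which is the pressure toward large pre-softmax values that the abstract describes.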

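Motivation and objectives

Identify the cause of the activation outliers that hinder quantization of transformers across vision and language models.
Propose architectural modifications to the attention mechanism that suppress outliers without hurting floating-point (FP) performance.
Evaluate the effect on post-training quantization (PTQ) performance across several model families.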
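
Why a single outlier hurts quantization can be illustrated with a toy example. The sketch below assumes symmetric per-tensor INT8 quantization with round-to-nearest, which is one common scheme and not necessarily the exact setup used in the paper; the activation values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantize-dequantize (round to nearest)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        dequantized.append(q * scale)
    return dequantized

def mean_abs_error(original, reconstructed):
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)

# Hypothetical activations: small, well-behaved values ...
normal = [0.1 * i for i in range(-10, 11)]
# ... and the same values plus one large outlier.
with_outlier = normal + [300.0]

err_normal = mean_abs_error(normal, quantize_int8(normal))
err_outlier = mean_abs_error(with_outlier, quantize_int8(with_outlier))
print(f"no outlier:   {err_normal:.4f}")
print(f"with outlier: {err_outlier:.4f}")
```

The outlier stretches the quantization scale so far that the small values all collapse onto the same integer level, which is why reducing outliers matters for PTQ.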

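Proposed method

Analyze the outlier phenomenon in transformer attention heads and its interaction with softmax and the residual connection.
Propose clipped softmax, a drop-in replacement for softmax that can produce exact zeros and thereby reduces outliers.
Propose gated attention, which modulates the attention output with a lightweight per-head gating network.
Apply INT8 post-training quantization and compare quantized and FP performance along with outlier metrics.
Evaluate BERT-base, OPT-125m, and ViT-S/16 on standard benchmarks.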
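
The two modifications listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. Clipped softmax is written as clip((zeta - gamma) * softmax(x) + gamma, 0, 1). Gated attention is written as sigmoid(G(x)) * softmax(QK^T / sqrt(d)) V for a single head. The default `gamma` and `zeta`, the linear form of the gate `G`, and the names `gate_w` and `gate_b` are assumptions made for this example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_softmax(logits, gamma=-0.03, zeta=1.0):
    """Stretch softmax output from (0, 1) to (gamma, zeta), then clip to [0, 1].

    With gamma < 0, sufficiently small probabilities become exact zeros.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(logits)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_head(x, q, k, v, gate_w, gate_b):
    """Single head: sigmoid(G(x)) * softmax(q k^T / sqrt(d)) v.

    x, q, k, v: lists of T token vectors (each a list of floats).
    gate_w, gate_b: hypothetical linear gate G(x_t) = gate_w . x_t + gate_b,
    giving one scalar gate per token.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        scores = [sum(a * b for a, b in zip(q[t], k_s)) / math.sqrt(d) for k_s in k]
        probs = softmax(scores)
        attended = [sum(p * v_s[j] for p, v_s in zip(probs, v)) for j in range(len(v[0]))]
        gate = sigmoid(sum(w * xi for w, xi in zip(gate_w, x[t])) + gate_b)
        out.append([gate * a for a in attended])
    return out

# Clipped softmax: a modest logit gap is enough to reach exact zeros.
probs = clipped_softmax([4.0, 0.0, 0.0, 0.0])
print(probs)

# Gated attention: a strongly negative gate drives the head output toward
# zero without extreme attention logits.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
closed = gated_attention_head(tokens, tokens, tokens, tokens,
                              gate_w=[0.0, 0.0], gate_b=-20.0)
print(closed)
```

In both cases the head can produce a "no-op" update cheaply, either through exact zeros in the attention probabilities or through a gate close to zero, so the inputs to the softmax no longer need to grow very large.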

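Experimental results

Research questions

RQ1: What causes the strong activation outliers in transformer attention, in both NLP and CV architectures?
RQ2: Can architectural changes to the attention mechanism suppress these outliers and improve quantization without extra fine-tuning?
RQ3: Do clipped softmax and gated attention improve PTQ results while preserving or improving FP performance?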

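Key findings

FP16 and W8A8 report perplexity (lower is better) for BERT-base and OPT-125m, and accuracy in % (higher is better) for ViT-S/16.
Model	Method	FP16	Max inf. norm	Avg. kurtosis	W8A8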
BERT-base	Vanilla	4.49	735	3076	1294
BERT-base	Clipped softmax	4.39	21.5	80	4.52
BERT-base	Gated attention	4.45	39.2	201	4.65
OPT-125m	Vanilla	15.84	340	1778	21.18
OPT-125m	Clipped softmax	16.29	63.2	19728	37.20
OPT-125m	Gated attention	15.55	8.7	18.9	16.02
ViT-S/16	Vanilla	80.75	359	1018	69.24
ViT-S/16	Clipped softmax	80.89	73.7	22.9	79.77
ViT-S/16	Gated attention	81.01	79.8	19.9	79.82

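Clipped softmax and gated attention sharply reduce the maximum infinity norm and, in most settings, the kurtosis of the attention layer outputs.
Both methods maintain, and sometimes improve, FP performance while improving W8A8 quantized performance; the exception is clipped softmax on OPT.
Gated attention gives the best W8A8 results on OPT and ViT in the reported experiments; on BERT, clipped softmax is slightly ahead (4.52 vs. 4.65 perplexity).
On OPT, gated attention cuts the maximum infinity norm from 340 to 8.7 and the average kurtosis from 1778 to 18.9, and improves quantized perplexity from 21.18 to 16.02.
BERT and ViT benefit from both methods on the quantization metrics, whereas clipped softmax on OPT underperforms: its average kurtosis rises to 19728 and its W8A8 perplexity worsens to 37.20.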
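
The two outlier metrics in the table, infinity norm and kurtosis, can be computed as in the sketch below. The kurtosis here is the plain fourth standardized moment, which is an assumption; the paper's exact estimator and its averaging over layers and data may differ. The activation vectors are hypothetical and are not data from the paper.

```python
def inf_norm(values):
    """Largest absolute value in the vector."""
    return max(abs(v) for v in values)

def kurtosis(values):
    """Fourth standardized moment (a normal distribution gives about 3)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2)

# Hypothetical activations with a narrow spread ...
flat = [(-1.0) ** i * (0.5 + 0.01 * (i % 7)) for i in range(1000)]
# ... and the same vector with one entry replaced by a large outlier.
spiky = flat[:-1] + [200.0]

for name, vec in (("flat", flat), ("spiky", spiky)):
    print(f"{name:5s}  inf norm={inf_norm(vec):8.2f}  kurtosis={kurtosis(vec):10.2f}")
```

A single extreme value is enough to inflate both metrics by orders of magnitude, which is the pattern the vanilla rows of the table show.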

This review was generated by AI and checked by a human editor.