QUICK REVIEW

[Paper Review] Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Xiuying Wei, Yuncheng Zhang|arXiv (Cornell University)|Sep 27, 2022

Topic Modeling|Cited by 28

One-line summary

The paper introduces an outlier suppression framework built on Gamma Migration and Token-Wise Clipping. With 6-bit PTQ and 4-bit QAT, it brings low-bit quantized Transformer models such as BERT, RoBERTa, and BART close to full-precision accuracy.

ABSTRACT

Transformer architecture has become the fundamental element of the widespread natural language processing (NLP) models. With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, their proposed methods increase the computation overhead and still leave the outliers there. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol{\gamma}$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the importance of outliers varies greatly where some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impacts. Motivated by these findings, we propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. The Gamma Migration migrates the outlier amplifier to subsequent modules in an equivalent transformation, contributing to a more quantization-friendly model without any extra burden. The Token-Wise Clipping takes advantage of the large variance of token range and designs a token-wise coarse-to-fine pipeline, obtaining a clipping range with minimal final quantization loss in an efficient way. This framework effectively suppresses the outliers and can be used in a plug-and-play mode. Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.

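Motivation and objectives

Investigate the root causes and impact of structured outliers in Transformer quantization.
Develop a plug-and-play framework that suppresses outliers without increasing inference cost.
Demonstrate improvements on classification, QA, and summarization tasks with BERT, RoBERTa, and BART.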

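Proposed method

Identify the LayerNorm gamma as an amplifier of outliers in Transformer activations.
Propose Gamma Migration, which moves the gamma scaling into subsequent modules through a transformation that is equivalent in full precision (FP).
Introduce Token-Wise Clipping, which uses a token-wise coarse-to-fine pipeline to find the clipping range with minimal quantization loss.
Quantize the output after Gamma Migration, reducing activation quantization error at no extra computational cost.
Estimate the clipping range quickly in a coarse step using per-token representatives, then fine-tune the clipping parameters.
Show plug-and-play compatibility with existing PTQ/QAT methods.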
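As a rough illustration of the Gamma Migration step listed above, the following Python sketch checks numerically that removing gamma from LayerNorm and folding it into the weights of the next linear layer leaves the full-precision output unchanged, while shrinking the range of the tensor that would be quantized. The helper names (`non_scaling_layer_norm`, `migrate_gamma`), the toy dimensions, and all numeric values are illustrative assumptions and are not taken from the paper's code. The sketch also assumes every gamma entry is nonzero and leaves out any residual branch that consumes the LayerNorm output.

```python
import math


def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization of one token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def layer_norm(x, gamma, beta):
    """Standard LayerNorm: gamma * normalize(x) + beta."""
    return [n * g + b for n, g, b in zip(normalize(x), gamma, beta)]


def non_scaling_layer_norm(x, gamma, beta):
    """LayerNorm without the gamma scaling: normalize(x) + beta / gamma.

    Assumes gamma has no zero entries (illustrative simplification).
    """
    return [n + b / g for n, g, b in zip(normalize(x), gamma, beta)]


def linear(x, weight, bias):
    """y = W x + bias, with weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def migrate_gamma(weight, gamma):
    """Fold gamma into the next layer: W' = W diag(gamma)."""
    return [[w * g for w, g in zip(row, gamma)] for row in weight]


def max_abs(values):
    return max(abs(v) for v in values)


# Toy inputs (hypothetical values chosen only for this demonstration).
x = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1, -0.4, 0.9]
beta = [0.1, -0.2, 0.0, 0.3, 0.05, -0.1, 0.2, -0.05]
gamma = [1.0] * 8
gamma[3] = 12.0  # one large gamma entry acts as the outlier amplifier
weight = [[math.cos(i + 2 * j) for j in range(8)] for i in range(4)]
bias = [0.01 * i for i in range(4)]

# Original path: LayerNorm followed by the linear layer.
ln_out = layer_norm(x, gamma, beta)
y_ref = linear(ln_out, weight, bias)

# Migrated path: non-scaling LayerNorm followed by the gamma-folded weights.
ns_out = non_scaling_layer_norm(x, gamma, beta)
y_mig = linear(ns_out, migrate_gamma(weight, gamma), bias)

max_diff = max(abs(a - b) for a, b in zip(y_ref, y_mig))
range_before = max_abs(ln_out)  # range of the activation to quantize, original
range_after = max_abs(ns_out)   # range of the activation to quantize, migrated
print(max_diff, range_before, range_after)
```

Here `range_before` is dominated by the channel with the large gamma, whereas `range_after` stays close to the normalized scale. That narrower range is what makes the migrated activation easier to quantize, and the folding happens once offline, so inference cost is unchanged.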

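Experimental results

Research questions

RQ1: How do outliers arise in Transformer activations, and how does the LayerNorm gamma contribute to them?
RQ2: Can outlier amplification be migrated to later layers at no extra cost to make quantization more robust?
RQ3: Can token-wise coarse-to-fine clipping efficiently minimize quantization loss across tokens with widely varying ranges?
RQ4: Do Gamma Migration and Token-Wise Clipping improve quantization performance for BERT, RoBERTa, and BART in both PTQ and QAT settings?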

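Key findings

Gamma Migration eases the activation quantization burden while keeping the impact on weights minimal.
Token-Wise Clipping efficiently finds a clipping range that minimizes the final quantization loss.
The framework improves PTQ/QAT performance across multiple tasks on BERT, RoBERTa, and BART.
6-bit PTQ approaches full-precision performance, and 4-bit QAT achieves near-FP accuracy in some settings.
The framework achieves state-of-the-art results for Transformer quantization, reaching full-precision-level performance with 6-bit PTQ on some models.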
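The coarse stage of Token-Wise Clipping can be sketched as a small grid search. The version below is a minimal sketch under several assumptions: the per-token representatives are taken to be each token's maximum and minimum, the lower bound uses the mirrored quantile `1 - alpha`, and the loss is a local mean squared error on the activations themselves, used as a stand-in for the final quantization loss that the paper minimizes. The function names, the `alphas` grid, and the synthetic activations are hypothetical, and the fine stage that tunes the clipping parameters afterwards is not shown.

```python
import math


def fake_quantize(value, clip_low, clip_high, bits):
    """Clip to [clip_low, clip_high], round to a uniform grid, dequantize."""
    levels = 2 ** bits - 1
    step = (clip_high - clip_low) / levels
    clipped = min(max(value, clip_low), clip_high)
    return clip_low + round((clipped - clip_low) / step) * step


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def quantization_loss(tokens, clip_low, clip_high, bits):
    """Mean squared error between activations and their quantized values.

    Stand-in loss: a local error, not the model's final output loss.
    """
    total, count = 0.0, 0
    for token in tokens:
        for value in token:
            err = value - fake_quantize(value, clip_low, clip_high, bits)
            total += err * err
            count += 1
    return total / count


def coarse_clipping_search(tokens, bits, alphas):
    """Grid-search a quantile alpha over per-token max/min statistics."""
    token_max = sorted(max(token) for token in tokens)
    token_min = sorted(min(token) for token in tokens)
    best = None
    for alpha in alphas:
        clip_high = quantile(token_max, alpha)
        clip_low = quantile(token_min, 1.0 - alpha)
        loss = quantization_loss(tokens, clip_low, clip_high, bits)
        if best is None or loss < best[0]:
            best = (loss, alpha, clip_low, clip_high)
    return best


# Synthetic activations: 64 tokens of 16 values; every 16th token is an outlier.
tokens = [
    [(20.0 if i % 16 == 0 else 1.0) * math.sin(0.7 * i + j) for j in range(16)]
    for i in range(64)
]
alphas = [k / 20 for k in range(10, 21)]  # 0.50, 0.55, ..., 1.00

best_loss, best_alpha, clip_low, clip_high = coarse_clipping_search(tokens, 6, alphas)

# Baseline: no clipping, i.e. the full min/max range of the activations.
global_low = min(min(token) for token in tokens)
global_high = max(max(token) for token in tokens)
no_clip_loss = quantization_loss(tokens, global_low, global_high, 6)
print(best_alpha, clip_low, clip_high, best_loss, no_clip_loss)
```

Searching over one scalar `alpha` on per-token statistics keeps the coarse stage cheap: each candidate range costs one quantile lookup rather than an optimization over individual activations. Because `alpha = 1.0` is in the grid, the selected range is never worse than the unclipped baseline under the chosen loss.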


This review was generated by AI and checked by a human editor.