QUICK REVIEW

[Paper Review] Zipformer: A faster and better encoder for automatic speech recognition

Zengwei Yao, Liyong Guo|arXiv (Cornell University)|Oct 17, 2023

Speech Recognition and Synthesis | Cited by 28

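One-sentence summary

Zipformer is a faster, more memory-efficient ASR encoder that combines a U-Net-like downsampling structure, BiasNorm, the Swoosh activations, and the ScaledAdam optimizer, and it achieves state-of-the-art results on LibriSpeech, Aishell-1, and WenetSpeech.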

ABSTRACT

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.

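Motivation and objectives

Motivate the need for faster, more memory-efficient encoders in end-to-end ASR systems.
Propose the Zipformer architecture, which gains efficiency from temporal downsampling and reuse of attention weights.
Introduce BiasNorm, the SwooshR/SwooshL activations, and the ScaledAdam optimizer to improve training and inference.
Evaluate Zipformer on LibriSpeech, Aishell-1, and WenetSpeech, with ablations to understand the contribution of each component.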

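Proposed method

Propose a U-Net-like encoder with multiple stacks that downsample the input to progressively lower frame rates.
Redesign the Zipformer block with an expanded set of modules, including Non-Linear Attention (NLA) and Bypass connections, that reuse attention weights.
Replace LayerNorm with BiasNorm so that some length information is retained through normalization.
Introduce two activation functions, SwooshR and SwooshL, suited to the needs of different modules.
Develop ScaledAdam, a scale-aware optimizer that learns each parameter's scale and scales updates by the parameter RMS, enabling faster convergence.
Run extensive experiments and ablation studies on LibriSpeech, Aishell-1, and WenetSpeech, comparing against state-of-the-art models.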
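
To make the normalization and activation items above concrete, here is a minimal pure-Python sketch. It assumes the definitions given in the Zipformer paper: BiasNorm(x) = x / RMS(x - b) * exp(gamma), SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, and SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035. The function names, the list-of-floats representation, and the `eps` guard are illustrative assumptions, not the icefall implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x: float) -> float:
    # Offset 0.313261687 makes the function pass (almost exactly) through 0 at x = 0.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    # Same shape shifted to the right, so the output is slightly negative at x = 0.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale, eps=1e-8):
    """Sketch of BiasNorm over one frame.

    x, bias   : lists of floats, one entry per channel (bias is learnable).
    log_scale : learnable scalar gamma; the output is multiplied by exp(gamma).
    eps       : small guard against division by zero (an assumption of this sketch).
    """
    rms = math.sqrt(sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x))
    scale = math.exp(log_scale) / (rms + eps)
    return [xi * scale for xi in x]
```

With a zero bias the output is invariant to rescaling the input, much like an RMS-style norm. With a non-zero bias, `bias_norm([3.0, 4.0], [1.0, 1.0], 0.0)` and `bias_norm([6.0, 8.0], [1.0, 1.0], 0.0)` differ, which is one way to see how some length information can survive normalization.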

Figure 1: Overall architecture of Zipformer.
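
The U-Net-like layout in Figure 1 can be sketched as a per-stack frame-rate schedule. The numbers below are assumptions that this review does not state: a 50 Hz input to the encoder stacks and downsampling factors of (1, 2, 4, 8, 4, 2), which is how the paper's default configuration is commonly described. Plain averaging stands in for the paper's learned downsampling, frames are single floats for brevity, and `run_stack` and its `layer` argument are hypothetical names.

```python
BASE_RATE_HZ = 50.0            # assumed frame rate entering the first stack
FACTORS = (1, 2, 4, 8, 4, 2)   # assumed per-stack downsampling factors


def downsample(frames, factor):
    """Average each group of `factor` consecutive frames (last group may be short)."""
    groups = [frames[i:i + factor] for i in range(0, len(frames), factor)]
    return [sum(g) / len(g) for g in groups]


def upsample(frames, factor, length):
    """Repeat each frame `factor` times, then trim to the original length."""
    repeated = [f for f in frames for _ in range(factor)]
    return repeated[:length]


def run_stack(frames, factor, layer=lambda xs: xs):
    """Run `layer` at a reduced frame rate and return to the input rate.

    `layer` is a placeholder for the stack's Zipformer blocks (identity here).
    """
    if factor == 1:
        return layer(frames)
    low_rate = downsample(frames, factor)
    return upsample(layer(low_rate), factor, len(frames))


# Frame rate seen by each stack under the assumed factors.
stack_rates_hz = [BASE_RATE_HZ / f for f in FACTORS]
```

Because the middle stacks see fewer frames, their attention and feed-forward cost drops accordingly, which is the intuition behind the efficiency claim; the sketch does not model the bypass connections that combine each stack's input and output.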

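Experimental results

Research questions

RQ1: How can the encoder architecture in end-to-end ASR be made faster and more memory-efficient without sacrificing accuracy?
RQ2: Do temporal downsampling and sharing of attention weights in the Zipformer block improve efficiency and performance?
RQ3: Do the choices of normalization and activation functions (BiasNorm, SwooshR, SwooshL) improve training stability and accuracy?
RQ4: Does the scale-aware optimizer (ScaledAdam) outperform Adam when training Zipformer models?
RQ5: How does Zipformer perform on LibriSpeech, Aishell-1, and WenetSpeech compared with state-of-the-art models?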

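Key findings

Zipformer-S/M/L are competitive with state-of-the-art results on LibriSpeech, Aishell-1, and WenetSpeech while using fewer FLOPs and parameters.
Zipformer-L and Zipformer-L* reach a WER close to that of Conformer-L on LibriSpeech with roughly half the FLOPs and memory usage.
Zipformer converges faster during training and speeds up GPU inference by more than 50% while avoiding excessive memory usage.
Ablation studies indicate that downsampling, shared attention weights, BiasNorm, the Swoosh activations, and ScaledAdam each contribute positively to performance and efficiency.
ScaledAdam outperforms Adam on LibriSpeech in both convergence and final WER/CER, with large gains on the test-clean and test-other metrics.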
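
The two ideas attributed to ScaledAdam (scaling the update by the tensor's RMS, and learning the tensor's scale) can be illustrated with a toy optimizer over a flat list of floats. This is a simplified sketch under stated assumptions, not the paper's exact update rule or the icefall code: the class name, `scale_lr_factor`, `min_rms`, the default hyperparameters, and the choice to treat the scale gradient as the gradient with respect to the log of a single per-tensor scale are all illustrative.

```python
import math


def rms(values):
    """Root-mean-square of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


class ScaledAdamSketch:
    """Toy scale-aware Adam variant for one tensor stored as a list of floats."""

    def __init__(self, lr=0.05, beta1=0.9, beta2=0.98, eps=1e-8,
                 scale_lr_factor=0.1, min_rms=1e-5):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.scale_lr_factor = scale_lr_factor  # hypothetical name; 0 disables scale learning
        self.min_rms = min_rms                  # hypothetical floor so tiny tensors still move
        self.t = 0
        self.m = None   # first moment of the element-wise gradient
        self.v = None   # second moment of the element-wise gradient
        self.n = 0.0    # first moment of the scalar scale gradient
        self.w = 0.0    # second moment of the scalar scale gradient

    def step(self, params, grads):
        """Return updated parameters; the inputs are not modified."""
        if self.m is None:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        c1, c2 = 1.0 - b1 ** self.t, 1.0 - b2 ** self.t  # Adam bias corrections
        r = max(rms(params), self.min_rms)

        # (1) Ordinary Adam direction; it is multiplied by r below, so the
        #     relative change of the tensor stays about the same at any scale.
        self.m = [b1 * m + (1 - b1) * g for m, g in zip(self.m, grads)]
        self.v = [b2 * v + (1 - b2) * g * g for v, g in zip(self.v, grads)]
        direction = [(m / c1) / (math.sqrt(v / c2) + self.eps)
                     for m, v in zip(self.m, self.v)]

        # (2) Gradient with respect to the log of the tensor's scale, with its
        #     own Adam moments; the step grows or shrinks the whole tensor.
        h = sum(g * p for g, p in zip(grads, params))
        self.n = b1 * self.n + (1 - b1) * h
        self.w = b2 * self.w + (1 - b2) * h * h
        scale_dir = (self.n / c1) / (math.sqrt(self.w / c2) + self.eps)
        scale_step = -self.scale_lr_factor * self.lr * scale_dir

        return [p - self.lr * r * d + scale_step * p
                for p, d in zip(params, direction)]
```

With `scale_lr_factor=0`, a first step from `[1.0, -2.0]` and from `[100.0, -200.0]` with identical gradients moves each tensor by the same fraction of its RMS, which is the "relative change about the same" behavior described in the abstract. Plain Adam would instead move both by the same absolute amount.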

This review was generated by AI and checked by a human editor.