QUICK REVIEW

[Paper Review] ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Wei Han, Zhengdong Zhang | arXiv (Cornell University) | 2020. 05. 07.

Speech Recognition and Synthesis | References: 36 | Citations: 72

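One-Line Summary

ContextNet introduces global context into a CNN-based speech encoder through squeeze-and-excitation and, within the RNN-T framework, achieves WER on LibriSpeech that approaches or surpasses the state of the art while using fewer parameters. It also shows that downsampling is an effective way to trade speed against accuracy.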

ABSTRACT

Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
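
The figures quoted in the abstract are word error rates. As a reminder of how that metric is usually computed, here is a minimal sketch: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline is not described in this review.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER of 2.1% therefore means roughly 2 word-level edits per 100 reference words.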

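Motivation and Objectives

Close the performance gap between CNN-based ASR and RNN/Transformer models by introducing global context.
Propose the ContextNet architecture, which adds squeeze-and-excitation modules to a CNN encoder.
Explore model scaling and progressive downsampling to balance accuracy and efficiency.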

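Proposed Method

Use a fully convolutional speech encoder built from depthwise separable convolutions and Swish activations.
Add a 1D squeeze-and-excitation module to each convolution block to inject global context.
Adopt an RNN-T decoder to form an end-to-end CNN-RNN-T architecture.
Apply progressive 8x downsampling along the time axis to reduce computation.
Scale the network width with a parameter alpha to trade computation against accuracy.
Train and evaluate on LibriSpeech with SpecAugment, using shallow fusion with Transformer/LSTM language models.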
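
The 1D squeeze-and-excitation step listed above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: it averages each channel over time (squeeze), passes the resulting context vector through a small two-layer bottleneck ending in a sigmoid (excitation), and rescales every frame by the per-channel gates. The function names, the weight arguments `w1`, `b1`, `w2`, `b2`, and the use of Swish inside the bottleneck are assumptions made for illustration.

```python
import math


def swish(v):
    """Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite_1d(frames, w1, b1, w2, b2):
    """Apply squeeze-and-excitation to `frames`, a list of T frames with C channels each.

    w1/b1 and w2/b2 are hypothetical bottleneck weights (C -> hidden -> C).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])

    # Squeeze: average every channel over the whole utterance (global context).
    context = [sum(frame[c] for frame in frames) / num_frames
               for c in range(num_channels)]

    # Excitation: bottleneck MLP followed by a sigmoid gate per channel.
    hidden = [swish(h) for h in dense(context, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]

    # Scale: the same per-channel gate is applied to every time step.
    return [[frame[c] * gates[c] for c in range(num_channels)]
            for frame in frames]


# Two frames, two channels, one hidden unit; all-zero weights give gates of 0.5.
frames = [[1.0, 2.0], [3.0, 4.0]]
gated = squeeze_excite_1d(frames, w1=[[0.0, 0.0]], b1=[0.0],
                          w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```

Because the gates depend on an average over all frames, each output frame is influenced by the entire utterance, which a convolution with a limited kernel size cannot achieve on its own.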

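Experimental Results

Research Questions

RQ1: Does introducing global context through squeeze-and-excitation in a CNN encoder reduce WER on LibriSpeech compared with earlier CNN models?
RQ2: How does progressive downsampling in ContextNet affect computation and accuracy?
RQ3: How does ContextNet scale as its width (alpha) increases, and how does it compare with Transformer/LSTM baselines and prior CNN models on LibriSpeech?
RQ4: How robust is ContextNet when evaluated without an external language model and on the noisier test set?
RQ5: Does it generalize beyond LibriSpeech to a larger dataset?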

Key Results

ContextNet(L) achieves 2.1% test-clean and 4.6% test-other WER on LibriSpeech without an LM, and 1.9%/4.1% with an LM (see the figures in the table).
ContextNet(M) achieves 2.4% test-clean and 5.4% test-other WER without an LM, and 2.0%/4.5% with an LM (see the figures in the table).
ContextNet(S) achieves 2.9% test-clean and 7.0% test-other WER without an LM, and 2.3%/5.5% with an LM (see the figures in the table).
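ContextNet outperforms prior CNN models such as QuartzNet on LibriSpeech and beats several Transformer/LSTM-based baselines in both WER and parameter efficiency (Table 2).
Progressive 8x downsampling substantially reduces FLOPs while having a small, or even positive, effect on accuracy (Table 4).
Increasing the model width (alpha) improves WER at larger parameter budgets (Table 5).
In large-scale experiments on YouTube-like data, ContextNet achieves better WER than the prior TDNN-based architecture with fewer parameters and fewer FLOPs (Table 6).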
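
The downsampling and width-scaling results above come down to simple arithmetic, sketched here under stated assumptions. The split of the 8x reduction into three stride-2 stages, the base width of 256 channels, the alpha values, and all function names are illustrative choices, not figures taken from this review.

```python
import math


def frames_after_downsampling(num_frames, strides):
    """Frames left after a sequence of strided layers (assuming 'same' padding)."""
    for stride in strides:
        num_frames = math.ceil(num_frames / stride)
    return num_frames


def scaled_width(base_channels, alpha):
    """Channel count after scaling a base width by alpha."""
    return int(round(base_channels * alpha))


def pointwise_conv_params(in_channels, out_channels):
    """Weights plus biases of a 1x1 (pointwise) convolution."""
    return in_channels * out_channels + out_channels


# Assumed factorization of 8x: three stride-2 stages, 1000 -> 500 -> 250 -> 125 frames.
remaining = frames_after_downsampling(1000, [2, 2, 2])

# Assumed base width of 256 channels at three illustrative alpha values.
for alpha in (0.5, 1.0, 2.0):
    width = scaled_width(256, alpha)
    print(alpha, width, pointwise_conv_params(width, width))
```

Every layer after a stride-2 stage processes half as many frames, which is where the FLOP savings come from, while the pointwise-convolution parameter count grows roughly with the square of alpha.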

This review was generated by AI and reviewed by a human editor.