QUICK REVIEW

[Paper Review] Depthwise Separable Convolutions for Neural Machine Translation

Łukasz Kaiser, Aidan N. Gomez, François Chollet | arXiv (Cornell University) | 2017-06-09

Multimodal Machine Learning Applications | References: 17 | Citations: 244

One-Line Summary

tldr: Introduces SliceNet, a convolutional sequence-to-sequence model built on depthwise separable and super-separable convolutions, which reduces the parameter count and achieves state-of-the-art neural machine translation results without dilation.

ABSTRACT

Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters required to perform at a given level (the MobileNets family of architectures). Recently, convolutional sequence-to-sequence networks have been applied to machine translation tasks with good results. In this work, we study how depthwise separable convolutions can be applied to neural machine translation. We introduce a new architecture inspired by Xception and ByteNet, called SliceNet, which enables a significant reduction of the parameter count and amount of computation needed to obtain results like ByteNet, and, with a similar parameter count, achieves new state-of-the-art results. In addition to showing that depthwise separable convolutions perform well for machine translation, we investigate the architectural changes that they enable: we observe that thanks to depthwise separability, we can increase the length of convolution windows, removing the need for filter dilation. We also introduce a new "super-separable" convolution operation that further reduces the number of parameters and computational cost for obtaining state-of-the-art results.

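Motivation and Objectives

Reduce the parameter count and computation of convolutional NMT architectures.
Explore applying depthwise separable and grouped convolutions to sequence-to-sequence models.
Evaluate the effect of removing filter dilation in favor of larger convolution windows.
Introduce and evaluate a new super-separable convolution operation.
Demonstrate state-of-the-art translation results with SliceNet under limited resources.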

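Proposed Method

Proposes SliceNet, a stack of depthwise separable convolution layers with residual connections, optionally using grouped and super-separable convolutions.
Replaces regular convolutions with depthwise separable convolutions to reduce parameter count and computation.
Uses two sub-networks to encode the inputs and the outputs, whose results feed an autoregressive decoder with attention.
Uses layer normalization and ReLU activations inside the convolution modules.
Explores and compares dilation and larger convolution windows as ways to increase the receptive field.
Provides a code reference to the TensorFlow Tensor2Tensor implementation.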
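
To make the core operation concrete, the following is a minimal pure-Python sketch of a 1D depthwise separable convolution: a per-channel (depthwise) filter followed by a 1x1 (pointwise) channel mix. It is an illustration only, not the Tensor2Tensor code. The function names, the list-of-lists data layout, and the zero "same" padding are all assumptions made for this example.

```python
def depthwise_conv1d(x, depthwise_kernels):
    """Apply one k-tap filter per channel, with zero 'same' padding (assumed).

    x: list of time steps, each a list of c channel values.
    depthwise_kernels: list of c filters, each a list of k weights (k odd).
    """
    length, channels = len(x), len(x[0])
    k = len(depthwise_kernels[0])
    half = k // 2
    out = [[0.0] * channels for _ in range(length)]
    for t in range(length):
        for ch in range(channels):
            acc = 0.0
            for j in range(k):
                src = t + j - half
                if 0 <= src < length:  # positions outside the sequence count as zero
                    acc += depthwise_kernels[ch][j] * x[src][ch]
            out[t][ch] = acc
    return out


def pointwise_conv1d(x, pointwise_weights):
    """Mix channels independently at each time step (a 1x1 convolution).

    pointwise_weights[i][o] maps input channel i to output channel o.
    """
    c_out = len(pointwise_weights[0])
    return [
        [sum(step[i] * pointwise_weights[i][o] for i in range(len(step)))
         for o in range(c_out)]
        for step in x
    ]


def separable_conv1d(x, depthwise_kernels, pointwise_weights):
    """Depthwise convolution followed by a pointwise convolution."""
    return pointwise_conv1d(depthwise_conv1d(x, depthwise_kernels), pointwise_weights)


# Toy input: 3 time steps, 2 channels.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
depthwise = [[1.0, 1.0, 1.0],   # channel 0: sum over a 3-step window
             [0.0, 1.0, 0.0]]   # channel 1: identity
pointwise = [[1.0], [1.0]]      # add the two channels into one output channel
y = separable_conv1d(x, depthwise, pointwise)
```

The depthwise step never mixes channels and the pointwise step never looks across time, which is why the two weight sets together are much smaller than a single full convolution kernel.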

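Experimental Results

Research Questions

RQ1: Do depthwise separable convolutions improve translation quality over regular convolutions in a ByteNet-like architecture?
RQ2: Can removing dilation and relying on larger convolution windows maintain or improve NMT performance?
RQ3: How do intermediate grouped (sub-separable) convolutions compare with fully depthwise separable convolutions?
RQ4: Does the proposed super-separable convolution provide further gains over standard depthwise separable convolutions?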
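
The four convolution variants compared in these questions differ mainly in how many weights they need. The sketch below counts weights for a width-k 1D convolution over c channels, derived from the usual definitions. It assumes that biases are ignored, that the grouped variant is followed by a pointwise convolution, and that the super-separable variant applies an independent separable convolution to each of g channel groups. The function name and dictionary keys are made up for this example.

```python
def conv_param_counts(k, c, g):
    """Weight counts (biases ignored) for a width-k 1D convolution over c channels.

    g is the number of channel groups used by the grouped and
    super-separable variants; c must be divisible by g.
    """
    if c % g != 0:
        raise ValueError("c must be divisible by g")
    return {
        # Full kernel: every output channel sees every input channel at every tap.
        "regular": k * c * c,
        # g grouped kernels of shape k x (c/g) x (c/g), then a c x c pointwise mix.
        "sub_separable": k * c * c // g + c * c,
        # One k-tap filter per channel, then a c x c pointwise mix.
        "separable": k * c + c * c,
        # A separable convolution on each of g groups: g * (k*c/g + (c/g)**2).
        "super_separable": k * c + c * c // g,
    }


# Hypothetical sizes, chosen only to show the ordering of the four counts.
counts = conv_param_counts(k=3, c=512, g=16)
```

For any g between 1 and c the counts shrink in the order regular, sub-separable, separable, super-separable, so RQ3 and RQ4 ask whether translation quality follows or resists that reduction.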

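Key Results

Depthwise separable convolutions give higher accuracy than regular convolutions in ByteNet-like NMT models, with fewer parameters and lower computational cost.
With depthwise separable convolutions, replacing dilation with larger convolution windows yields comparable or better results, so dilation is not needed.
Grouped convolutions (16 groups) perform worse than fully depthwise separable convolutions, suggesting that a higher degree of separation is beneficial.
Super-separable convolutions provide an incremental improvement over standard depthwise separable convolutions.
Larger SliceNet models with depthwise separable or super-separable convolutions achieve state-of-the-art BLEU scores on WMT EN-DE: the larger Super 2/3 model reaches 26.1 on newstest14, and the reported newstest14 scores range from 25.5 to 26.1 in the comparison with prior work.
SliceNet models achieve better translation quality than ByteNet with less than half the non-embedding parameters and FLOPs.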
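
The dilation result can be illustrated with a small receptive-field calculation. For stacked stride-1 convolutions the receptive field is 1 plus the sum of (kernel size - 1) x dilation over the layers. The two layer stacks and the channel count below are hypothetical, chosen so that both stacks cover 31 positions; they are not the configurations used in the paper.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 1D convolutions.

    layers: list of (kernel_size, dilation) pairs, applied in order.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


def regular_weights(layers, c):
    """Weights (biases ignored) if every layer is a regular convolution."""
    return sum(k * c * c for k, _ in layers)


def separable_weights(layers, c):
    """Weights (biases ignored) if every layer is depthwise separable."""
    return sum(k * c + c * c for k, _ in layers)


# Hypothetical stacks with the same receptive field.
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]   # small windows, growing dilation
wide = [(7, 1), (7, 1), (9, 1), (11, 1)]     # larger windows, no dilation
c = 512                                      # hypothetical channel count

rf_dilated = receptive_field(dilated)
rf_wide = receptive_field(wide)
growth_regular = regular_weights(wide, c) / regular_weights(dilated, c)
growth_separable = separable_weights(wide, c) / separable_weights(dilated, c)
```

Under these assumptions, widening the windows nearly triples the weight count for regular convolutions but adds only about one percent for separable ones, because a wider depthwise filter costs k x c weights rather than k x c x c. This is consistent with the finding that separability makes dilation unnecessary.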


This review was generated by AI and reviewed by a human editor.