QUICK REVIEW

[Paper Review] Universal Transformers

Mostafa Dehghani, Stephan Gouws | arXiv (Cornell University) | 2018-07-10

Topic Modeling | References: 15 | Citations: 396

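One-line summary

The Universal Transformer generalizes the Transformer by adding parallel-in-time recurrence and per-position dynamic halting; it reports state-of-the-art results on several algorithmic and language tasks, with improved generalization and expressivity.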

ABSTRACT

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.

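Motivation and goals

Motivates the need for a sequence model that combines parallel processing with a recurrent inductive bias.
Introduces the Universal Transformer (UT), a generalization of the Transformer that refines the representations of all positions in parallel over successive depth steps.
Shows that UTs are Turing-complete under certain assumptions, and evaluates their empirical performance on a range of tasks.
Shows that per-position dynamic halting improves accuracy on several tasks, and analyzes its effect on performance and computation.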
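
Method

UT uses an encoder-decoder in which self-attention and a recurrent transition function are shared across positions and time steps.
At each recurrent step, UT updates all representations in parallel with multi-head self-attention, then applies a transition function (a depthwise separable convolution or a position-wise feed-forward network) with residual connections.
Depth per symbol is in principle unbounded, which enables dynamic computation depth through halting in the style of Adaptive Computation Time (ACT).
Position encodings and time-step encodings are added to guide processing at each depth step.
The model is trained as a Transformer-like encoder-decoder, with teacher forcing for the decoder.
UT can be viewed as a weight-shared Transformer block unrolled over depth, so recurrence runs over depth rather than over sequence length.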
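
The recurrence over depth described in the method can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention instead of multi-head, a position-wise feed-forward transition, layer normalization without learned gain or bias, and sinusoidal position and time-step signals. All function names (`coordinate_encoding`, `ut_step`, `ut_encode`, `init_params`) and the parameter layout are hypothetical.

```python
import numpy as np


def coordinate_encoding(seq_len, d_model, step):
    """Sinusoidal position signal plus a sinusoidal time-step signal (d_model even)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    rates = 1.0 / (10000.0 ** (dims / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(pos * rates) + np.sin(step * rates)
    enc[:, 1::2] = np.cos(pos * rates) + np.cos(step * rates)
    return enc


def layer_norm(x, eps=1e-6):
    """Normalize each position's vector (no learned gain/bias in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all positions."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def ut_step(h, params, step):
    """One depth step: add coordinate signal, attend, then apply the transition."""
    x = h + coordinate_encoding(h.shape[0], h.shape[1], step)
    a = layer_norm(x + self_attention(x, params["w_q"], params["w_k"], params["w_v"]))
    ff = np.maximum(0.0, a @ params["w_1"]) @ params["w_2"]   # position-wise FFN
    return layer_norm(a + ff)


def ut_encode(h, params, num_steps):
    """Apply the SAME parameters at every depth step (weight sharing over depth)."""
    for step in range(1, num_steps + 1):
        h = ut_step(h, params, step)
    return h


def init_params(d_model, d_ff, seed=0):
    """Random weights for illustration only."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model), "w_v": (d_model, d_model),
        "w_1": (d_model, d_ff), "w_2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
```

Calling `ut_encode(h0, init_params(8, 16), num_steps=3)` on an array `h0` of shape `(5, 8)` refines all five positions in parallel three times with one set of weights. A standard Transformer would instead use a separate parameter set per layer, so the only structural differences in this sketch are the weight sharing and the time-step signal.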

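Experimental results

Research questions

RQ1: Can self-attention with a shared transition function and parallel-in-time recurrence improve generalization and expressivity over the standard Transformer?
RQ2: Does per-position dynamic halting (Adaptive Computation Time) improve performance on algorithmic and language tasks?
RQ3: Under what conditions is UT computationally universal (Turing-complete), in contrast to the Transformer?
RQ4: How does UT perform on language understanding and large-scale tasks compared with standard Transformers and LSTMs?
RQ5: What effect does recurrent depth have on tasks that require long-range reasoning and syntactic generalization?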

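Key results

UT outperforms standard Transformers and LSTMs on several algorithmic and language tasks.
UT achieves a new state of the art on the LAMBADA language modeling task.
On WMT14 English-German translation, a UT with a fully connected transition function and no ACT improves BLEU over a comparably sized Transformer (0.9 BLEU, per the abstract).
Dynamic halting (ACT) improves accuracy on several smaller tasks, spends more processing steps on the symbols that need them, and acts as a regularizer.
With adaptive depth, UT can learn to take more steps on harder inputs and fewer on easier ones while keeping computation parallel across sequence positions.
In theory, the UT framework is more powerful than a fixed-depth Transformer, and under particular parameterizations it can emulate the Neural GPU and the Neural Turing Machine.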
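
The per-position halting behavior in the results above can be illustrated with a simplified ACT-style loop. This is a sketch under assumptions, not the paper's exact procedure: it follows the general Adaptive Computation Time recipe (accumulate a sigmoid halting probability per position, stop a position once the total reaches a threshold, and give the final step the remaining weight). The names `act_recurrence`, `step_fn`, `halt_w`, and `halt_b` are hypothetical, and `step_fn` stands in for one shared UT depth step.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def act_recurrence(state, step_fn, halt_w, halt_b, max_steps=8, threshold=0.99):
    """Run step_fn repeatedly, letting each position decide when to stop.

    state:  (seq_len, d_model) array of per-position representations
    halt_w: (d_model,) weights of the halting unit; halt_b: scalar bias
    Returns the halting-weighted output and the number of steps per position.
    """
    seq_len = state.shape[0]
    halting = np.zeros(seq_len)             # accumulated halting probability
    n_updates = np.zeros(seq_len, dtype=int)
    output = np.zeros_like(state)           # weighted mean of intermediate states

    for step in range(max_steps):
        running = halting < threshold       # positions that have not halted yet
        if not running.any():
            break
        p = sigmoid(state @ halt_w + halt_b)            # (seq_len,)
        last_step = step == max_steps - 1
        halts_now = running & ((halting + p >= threshold) | last_step)
        continues = running & ~halts_now
        # Halting positions receive the remainder so their weights sum to 1.
        weights = np.where(halts_now, 1.0 - halting, np.where(continues, p, 0.0))
        new_state = step_fn(state)          # computed for all positions in parallel
        output += weights[:, None] * new_state
        halting += weights
        n_updates += running
        # Halted positions simply carry their state forward unchanged.
        state = np.where(running[:, None], new_state, state)
    return output, n_updates
```

In this sketch a position whose halting unit is confident stops after one step, while an unconfident position keeps iterating up to `max_steps`. Every call to `step_fn` still processes the whole sequence at once, which is how adaptive depth can coexist with parallelism across positions.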

This review was generated by AI and reviewed by a human editor.