QUICK REVIEW

[Paper Review] Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Zihang Dai, Guokun Lai | arXiv (Cornell University) | June 5, 2020

Topic Modeling | 30 references | 104 citations

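One-Line Summary

Funnel-Transformer progressively compresses the token sequence to reduce computation, uses a decoder to recover token-level representations when they are needed, and improves efficiency on sequence-level tasks while matching or exceeding the standard Transformer's performance.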

ABSTRACT

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer.

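Motivation and Objectives

Address the redundancy of maintaining full-length token-level representations in Transformer-based language models.
Propose a hierarchical encoder that compresses the sequence length between blocks, saving FLOPs and memory.
Reinvest the saved computation in a deeper or wider model to gain capacity.
Enable token-level predictions by decoding full-length representations from the compressed encoding, without changing the pretraining objective.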

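Proposed Method

Keep the Transformer backbone, but use an encoder that progressively halves the sequence length between blocks via pooling.
Use a pool-query-only attention design: pooling is applied to the queries, while the keys and values come from the unpooled sequence.
Apply simple strided mean pooling (window size 2, stride 2) at each block boundary to halve the sequence length.
Implement a decoder that upsamples the compressed encoder output and fuses it with the hidden states of the first block to recover the token-level representations needed by the pretraining objective.
Train with either the MLM or the ELECTRA objective to show that the approach generalizes across pretraining paradigms.
Each halving step yields a super-linear reduction in FLOPs, which allows a deeper or wider model at a comparable compute budget.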
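
The pooling, pool-query-only attention, and decoder steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it uses a single attention head with no learned projections, ignores the separate handling of the [cls] token, and assumes that upsampling repeats each compressed vector and that "fusing" means element-wise addition. The function names (`mean_pool`, `pool_query_only_attention`, `upsample`, `decoder_input`) are hypothetical.

```python
import math

def mean_pool(h, window=2, stride=2):
    """Strided mean pooling over the sequence axis; h is a list of T vectors.
    A trailing position that does not fill a whole window is dropped."""
    pooled = []
    for start in range(0, len(h) - window + 1, stride):
        chunk = h[start:start + window]
        pooled.append([sum(col) / window for col in zip(*chunk)])
    return pooled

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def pool_query_only_attention(h):
    """Queries come from the pooled sequence; keys and values come from the
    unpooled sequence. Simplification: one head, no learned projections."""
    q = mean_pool(h)   # length T // 2
    k = v = h          # length T
    d = len(h[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out

def upsample(g, factor):
    """Assumption: upsample by repeating each compressed vector `factor` times."""
    return [list(vec) for vec in g for _ in range(factor)]

def decoder_input(first_block_h, compressed):
    """Assumption: fuse by element-wise addition with the first-block states."""
    factor = len(first_block_h) // len(compressed)
    up = upsample(compressed, factor)
    return [[a + b for a, b in zip(x, y)] for x, y in zip(first_block_h, up)]

# Toy input: 8 positions, hidden size 4.
h = [[float(t + c) for c in range(4)] for t in range(8)]
g = pool_query_only_attention(h)   # compressed to 4 positions
full = decoder_input(h, g)         # back to 8 positions
print(len(h), len(g), len(full))   # 8 4 8
```

In a full model the fused sequence would then pass through further Transformer layers before the token-level prediction head; that part is omitted here.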

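Experimental Results

Research Questions

RQ1: Can progressively compressing the hidden-state sequence match or exceed the standard Transformer while using fewer FLOPs?
RQ2: How can token-level representations be recovered when the encoder operates on a shortened sequence?
RQ3: Does reinvesting the saved FLOPs in depth or width improve model capacity on sequence-level tasks?
RQ4: How does Funnel-Transformer compare with the standard Transformer on sequence-level tasks (classification, GLUE, RACE) and token-level tasks (SQuAD)?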

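Key Findings

Funnel-Transformer (F-TFM) often outperforms the standard Transformer on GLUE and text classification at comparable or lower FLOPs, especially for smaller models.
Trading sequence length for depth improves performance: the computation saved by compression buys additional capacity.
Partial parameter sharing can hurt performance; in practice, a carefully designed standard layout performs best.
With large-scale pretraining, F-TFM achieves competitive GLUE scores and strong RACE performance, generally outperforming the baselines at comparable FLOPs.
When token-level supervision is essential (SQuAD, with the decoder), even very large F-TFM models can still lag behind full-sequence Transformers, which highlights the trade-off on token-level tasks.
Ablation studies show that the pool-query-only design, the separate handling of the [cls] token, and relative positional encoding all have a significant effect on performance.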
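
The compute argument behind these findings (halving the length saves more than half the cost, so the savings can pay for extra depth) can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not the paper's FLOP accounting: it counts only the main matrix multiplications of a layer, assumes a feed-forward inner size of 4x the hidden size, and treats every layer in a block as running at that block's length (ignoring the slightly more expensive pooled-query layer at each block boundary and the decoder). The function names and the three-block, six-layers-each layout are illustrative choices.

```python
def layer_cost(t, d, ffn_mult=4):
    """Approximate multiply-accumulates for one Transformer layer
    at sequence length t and hidden size d."""
    projections = 4 * t * d * d      # Q, K, V and output projections
    attention = 2 * t * t * d        # QK^T scores plus the weighted sum over V
    ffn = 2 * ffn_mult * t * d * d   # two feed-forward matrix multiplications
    return projections + attention + ffn

def model_cost(layers_per_block, t, d):
    """Total cost when the sequence length is halved after every block."""
    total = 0
    for n_layers in layers_per_block:
        total += n_layers * layer_cost(t, d)
        t //= 2
    return total

T, D = 512, 768
baseline = model_cost([12], T, D)        # 12 layers at full length
funnel = model_cost([6, 6, 6], T, D)     # 18 layers, length 512 -> 256 -> 128

# Halving the length cuts the per-layer cost by more than half because the
# attention term is quadratic in the sequence length.
print("per-layer cost ratio after halving:", layer_cost(T // 2, D) / layer_cost(T, D))
print("18-layer funnel / 12-layer baseline:", funnel / baseline)
```

Under these assumptions the 18-layer funnel layout comes out cheaper than the 12-layer full-length baseline, which is the sense in which saved computation can be reinvested in depth.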


This review was generated by AI and reviewed by a human editor.