QUICK REVIEW

[Paper Review] Fast Structured Decoding for Sequence Models

Zhiqing Sun, Zhuohan Li | arXiv (Cornell University) | 2019. 10. 25.

Algorithms and Data Compression | References: 27 | Citations: 61

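One-Line Summary

The paper adds CRF-based structured inference modules (NART-CRF and NART-DCRF) to non-autoregressive translation models to capture co-occurrence among target tokens, reaching accuracy close to autoregressive models while keeping a large speedup.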

ABSTRACT

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce the inference time. However, these models assume that the decoding process of each token is conditionally independent of others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models could only achieve inferior accuracy compared to their autoregressive counterparts. To improve the decoding consistency and reduce the inference cost at the same time, we propose to incorporate a structured inference module into the non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.

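Motivation and Objectives

Reduce the inference latency of autoregressive sequence models without sacrificing accuracy.
Integrate a structured inference module that captures the multimodal target distribution in non-autoregressive decoding.
Develop a scalable CRF approximation suited to the large vocabularies of neural MT.
Introduce dynamic transitions so that the CRF reflects richer positional context.
Demonstrate state-of-the-art performance among non-autoregressive models on standard MT benchmarks.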

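Proposed Method

Formulates non-autoregressive translation as sequence labeling and applies a linear-chain CRF to model dependencies between adjacent tokens.
Uses a simple NART decoder input (padding tokens followed by eos) to simplify the architecture.
Introduces a low-rank approximation of the CRF transition matrix, M = E1 E2^T, built from two transition embeddings (E1, E2).
Applies a beam approximation that reduces CRF decoding complexity from O(n|V|^2) to O(n k^2), where n is the sequence length, |V| the vocabulary size, and k the beam size.
Introduces dynamic transitions of the form M^i = E1 M_dynamic^i E2^T, which enrich the positional context based on adjacent decoder states.
Combines the CRF loss with the standard NART loss during training: L = L_CRF + λ L_NAR (λ = 0.5).
Evaluates on WMT14 En-De/De-En and IWSLT14 De-En, using a Transformer teacher model for distillation and rescoring.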
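The low-rank transition and the beam approximation listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `transition_score` and `beam_viterbi`, the toy scores, and the choice of keeping the k highest unary scores per position as candidates are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def transition_score(E1, E2, prev_tok, next_tok):
    """Low-rank transition M[prev, next] = E1[prev] . E2[next].

    The full |V| x |V| matrix M = E1 E2^T is never materialised.
    """
    return dot(E1[prev_tok], E2[next_tok])


def beam_viterbi(unary, E1, E2, k):
    """Viterbi decoding restricted to k candidate tokens per position.

    unary[i][v] is the score of token v at position i.
    Assumption: candidates are the k tokens with the highest unary score.
    Returns (best_path, best_score) over the restricted lattice.
    """
    n = len(unary)
    cands = [sorted(range(len(row)), key=lambda v: -row[v])[:k] for row in unary]
    # best[j]: score of the best partial path ending in cands[i][j]
    best = [unary[0][v] for v in cands[0]]
    back = []  # back[i - 1][j]: index into cands[i - 1] preceding cands[i][j]
    for i in range(1, n):
        new_best, ptrs = [], []
        for v in cands[i]:
            scores = [best[j] + transition_score(E1, E2, u, v)
                      for j, u in enumerate(cands[i - 1])]
            j_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[j_star] + unary[i][v])
            ptrs.append(j_star)
        best = new_best
        back.append(ptrs)
    # Follow back-pointers from the best final candidate.
    j = max(range(len(best)), key=best.__getitem__)
    score = best[j]
    path = [cands[-1][j]]
    for i in range(n - 1, 0, -1):
        j = back[i - 1][j]
        path.append(cands[i - 1][j])
    path.reverse()
    return path, score


# Toy problem (hypothetical values): |V| = 4 tokens, n = 3 positions, rank 2.
unary = [
    [2.0, 1.0, 0.0, -1.0],
    [0.5, 1.5, 0.2, 0.0],
    [0.0, 0.3, 2.5, 0.1],
]
E1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
E2 = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]

print(beam_viterbi(unary, E1, E2, k=2))  # → ([0, 1, 2], 6.0)
```

Each position compares k x k candidate pairs instead of |V| x |V|, which is where the O(n k^2) cost comes from. With k equal to the vocabulary size the routine reduces to exact Viterbi decoding. Writing d for the rank (a symbol assumed here, not taken from the review), the two embeddings hold 2|V|d numbers rather than the |V|^2 entries of a full transition matrix.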

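Experimental Results

Research Questions

RQ1: Can a CRF-based structured inference module improve the decoding consistency and accuracy of non-autoregressive MT by modeling local label dependencies?
RQ2: Can the low-rank and beam approximations make CRF decoding tractable for large vocabularies without degrading performance?
RQ3: Do dynamic CRF transitions improve translation quality by incorporating positional context?
RQ4: How close do NART-CRF and NART-DCRF come to the AR baseline in BLEU while retaining a speedup over autoregressive decoding?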

Key Results

Model	En-De BLEU (gap to ART)	De-En BLEU (gap to ART)	IWSLT De-En BLEU (gap to ART)	Latency (ms)	Speedup vs. ART
NART	20.27 (7.14)	22.02 (9.27)	23.04 (10.22)	26	14.9x
NART-CRF	23.32 (4.09)	25.75 (5.54)	26.39 (6.87)	35	11.1x
NART-CRF (rescoring 9)	26.04 (1.37)	28.88 (2.41)	29.21 (4.05)	60	6.45x
NART-CRF (rescoring 19)	26.68 (0.73)	29.26 (2.03)	29.55 (3.71)	87	4.45x
NART-DCRF	23.44 (3.97)	27.22 (4.07)	27.44 (5.82)	37	10.4x
NART-DCRF (rescoring 9)	26.07 (1.34)	29.68 (1.61)	29.99 (3.27)	63	6.14x
NART-DCRF (rescoring 19)	26.80 (0.61)	30.04 (1.25)	30.36 (2.90)	88	4.39x

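NART-CRF and NART-DCRF substantially outperform the baseline non-autoregressive models across the benchmarks.
On WMT14 En-De, NART-DCRF with rescoring 19 reaches 26.80 BLEU, only 0.61 BLEU below the autoregressive Transformer in the reported setting.
NART-CRF and NART-DCRF achieve substantial speedups over ART: roughly 10 to 11x without rescoring and about 4.4x with rescoring 19.
In the beam-size experiments, k=16 already gives a strong approximation, with diminishing returns at larger k.
Dynamic transitions yield BLEU gains across the En-De, De-En, and IWSLT De-En tasks (small but consistent improvements).
With rescoring, NART-CRF and NART-DCRF retain most of the accuracy of autoregressive models while still reducing latency.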
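The arithmetic behind the results table can be checked with a short sketch. It assumes that the parenthesised number next to each BLEU score is the gap to the autoregressive Transformer, which is consistent with the abstract's 26.80 and 0.61 figures, and it copies the En-De BLEU and latency columns from the table; the variable names are illustrative.

```python
# (En-De BLEU, assumed gap to the autoregressive model, latency in ms),
# copied from the results table.
rows = {
    "NART": (20.27, 7.14, 26),
    "NART-CRF": (23.32, 4.09, 35),
    "NART-CRF (rescoring 9)": (26.04, 1.37, 60),
    "NART-CRF (rescoring 19)": (26.68, 0.73, 87),
    "NART-DCRF": (23.44, 3.97, 37),
    "NART-DCRF (rescoring 9)": (26.07, 1.34, 63),
    "NART-DCRF (rescoring 19)": (26.80, 0.61, 88),
}

# If the parenthesised value is the gap, BLEU + gap should give the same
# autoregressive baseline score for every row.
implied_ar_bleu = {name: round(bleu + gap, 2) for name, (bleu, gap, _) in rows.items()}

# Extra latency of the CRF modules over the plain NART model, without rescoring.
overhead_ms = {name: rows[name][2] - rows["NART"][2] for name in ("NART-CRF", "NART-DCRF")}

print(sorted(set(implied_ar_bleu.values())))  # → [27.41]
print(overhead_ms)  # → {'NART-CRF': 9, 'NART-DCRF': 11}
```

Every row implies the same baseline of 27.41 BLEU on En-De, and the 9 ms and 11 ms overheads fall inside the 8~14ms range quoted in the abstract.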


This review was generated by AI and reviewed by a human editor.