QUICK REVIEW

[Paper Review] N-gram Language Modeling using Recurrent Neural Network Estimation

Ciprian Chelba, Mohammad Norouzi | arXiv (Cornell University) | March 31, 2017

Topic Modeling | References: 6 | Citations: 33

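One-Line Summary

This paper proposes smoothing n-gram language models with an LSTM-based neural network instead of traditional back-off methods such as Kneser-Ney. The LSTM captures long-range dependencies well enough that at n=13 it performs on par with a fully recurrent LSTM, and its perplexity keeps improving as the n-gram order grows, so that it surpasses classical smoothing techniques at long contexts.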

ABSTRACT

We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low $n$-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short $n$-gram contexts does not provide significant advantages over classic N-gram models, it becomes effective with long contexts ($n > 5$); depending on the task and amount of data it can match fully recurrent LSTM models at about $n=13$. This may have implications when modeling short-format text, e.g. voice search/query LMs. Building LSTM $n$-gram LMs may be appealing for some practical situations: the state in a $n$-gram LM can be succinctly represented with $(n-1)*4$ bytes storing the identity of the words in the context and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.
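The state-size claim in the abstract, $(n-1)*4$ bytes for the identities of the context words, can be illustrated with a short sketch. It assumes each word is stored as a 32-bit integer id; the helper names `pack_context` and `unpack_context` and the word ids are made up for illustration and do not come from the paper.

```python
import struct

def pack_context(word_ids):
    """Pack the n-1 context word ids as little-endian 32-bit integers."""
    return struct.pack(f"<{len(word_ids)}i", *word_ids)

def unpack_context(state):
    """Recover the list of word ids from the packed state."""
    return list(struct.unpack(f"<{len(state) // 4}i", state))

n = 13
context = list(range(100, 100 + n - 1))  # 12 hypothetical word ids
state = pack_context(context)

print(len(state))                        # 48, i.e. (13 - 1) * 4 bytes
print(unpack_context(state) == context)  # True
```

The packed state holds only word identities. The LSTM encoding of the context is not part of it, which is why the abstract notes that the encoding has to be recomputed for each context.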

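Motivation and Objectives

Investigate the effective memory depth of RNN models by using them to smooth n-gram language models.
Evaluate whether LSTM-based models outperform traditional n-gram smoothing techniques such as Katz and Kneser-Ney in perplexity and scalability.
Explore the trade-offs among training efficiency, inference speed, and model quality when an LSTM encodes the n-gram context.
Determine whether neural smoothing, unlike classical back-off techniques, keeps improving as the n-gram order increases.
Assess the practical viability of LSTM n-gram models for short-format text applications such as voice search.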

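Proposed Method

An LSTM network is trained to predict the next word from a fixed-length n-gram context, replacing count-based n-gram probability estimation.
The LSTM processes the embedding of each word in the n-gram context sequentially, carrying a hidden state that summarizes the context seen so far.
Dropout is applied to the LSTM cell during training to improve generalization and reduce overfitting.
As a training variant, the model is fit to multinomial targets (soft labels) instead of the usual one-hot targets.
The n-gram state itself is only the identities of the n-1 context words, which fit in (n-1)*4 bytes; the LSTM encoding of a context is not kept and must be recomputed, although batches of n-gram contexts can be processed in parallel.
Experiments compare two settings: sentence independence (the context is reset at the start tag <S>) and contexts that may extend across sentence boundaries.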
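The two context settings in the last item can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `ngram_examples`, the choice to pad short contexts with `<S>`, and the choice to mark a crossed boundary with a single `<S>` are all assumptions. Each returned context would be fed to the LSTM from a fresh state to predict its target word.

```python
BOS = "<S>"  # sentence-start tag mentioned in the review

def ngram_examples(sentences, n, cross_sentence=False):
    """Return (context, target) pairs with exactly n-1 context words each."""
    assert n >= 2, "an n-gram needs at least one context word"
    pad = [BOS] * (n - 1)
    history = list(pad)
    examples = []
    for sent in sentences:
        if cross_sentence:
            history.append(BOS)   # mark the boundary but keep earlier words
        else:
            history = list(pad)   # reset: sentence independence assumption
        for word in sent:
            context = tuple(history[-(n - 1):])
            examples.append((context, word))
            history.append(word)
    return examples

sentences = [["a", "b"], ["c", "d"]]
independent = ngram_examples(sentences, n=3)
crossing = ngram_examples(sentences, n=3, cross_sentence=True)

print(independent[2])  # (('<S>', '<S>'), 'c')
print(crossing[2])     # (('b', '<S>'), 'c')
```

In the toy run, the first word of the second sentence sees only padding under sentence independence, while in the cross-sentence setting its context still contains the last word of the previous sentence.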

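Experimental Results

Research Questions

RQ1: What is the effective memory depth of an LSTM when it is used to smooth n-gram language models?
RQ2: How does LSTM-based n-gram smoothing perform compared with classical back-off methods such as Kneser-Ney or Katz smoothing?
RQ3: Does an n-gram model with LSTM smoothing, unlike existing smoothing techniques, improve as the n-gram order increases?
RQ4: Can an LSTM n-gram model approach the performance of a fully recurrent LSTM language model, and at which n-gram order does it do so?
RQ5: What are the training and inference efficiency trade-offs of an LSTM-based n-gram model compared with a standard recurrent LSTM?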

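Key Results

An LSTM with dropout encodes the n-gram state better than feed-forward and vanilla RNN models, reaching the lowest perplexity on UPenn Treebank.
Under the sentence independence assumption, the LSTM n-gram matches the fully recurrent LSTM LM at n=9 and slightly outperforms it at n=13.
When contexts are allowed to cross sentence boundaries, the LSTM 13-gram reaches a perplexity of 49 on the One Billion Words benchmark, close to the fully recurrent LSTM LM (48).
LSTM n-gram models improve consistently as the n-gram order increases and, unlike Kneser-Ney or Katz back-off, do not saturate at low n.
On the One Billion Words benchmark, LSTM smoothing becomes effective for n>5 and reaches fully recurrent LSTM performance at about n=13.
Training with multinomial targets instead of one-hot labels brings only a slight gain, mainly at low n-gram orders.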
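To make the two target types in the last result concrete, here is a minimal sketch. It assumes the multinomial target for a context is the empirical distribution of next words observed after that context in the training data; the review does not spell out the construction, and the function names, toy examples, and predicted probabilities below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def multinomial_targets(examples):
    """Map each context to the empirical distribution of its next words."""
    counts = defaultdict(Counter)
    for context, word in examples:
        counts[context][word] += 1
    targets = {}
    for context, word_counts in counts.items():
        total = sum(word_counts.values())
        targets[context] = {w: k / total for w, k in word_counts.items()}
    return targets

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(q * math.log(predicted[w]) for w, q in target.items())

# Toy (context, next word) pairs; the context "the" is seen three times.
examples = [(("the",), "cat"), (("the",), "cat"),
            (("the",), "dog"), (("a",), "dog")]
targets = multinomial_targets(examples)

predicted = {"cat": 0.6, "dog": 0.4}                      # hypothetical model output
one_hot_loss = cross_entropy({"cat": 1.0}, predicted)     # single observed word
soft_loss = cross_entropy(targets[("the",)], predicted)   # whole distribution

print(targets[("the",)])
print(one_hot_loss, soft_loss)
```

With one-hot targets each training example contributes the loss for the single word that followed the context, whereas a multinomial target scores the model against every continuation seen for that context at once.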


This review was generated by AI and checked by a human editor.