QUICK REVIEW

[Paper Review] TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

Wenhao Lu, Jian Jiao | arXiv (Cornell University) | 2020-02-14

Topic Modeling | References: 37 | Citations: 28

ONE-LINE SUMMARY

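TwinBERT proposes a twin-structured BERT model that decouples query encoding from document encoding, so document embeddings can be pre-computed and cached, which cuts inference time on CPUs to about 20 ms. Through knowledge distillation and efficient network design, TwinBERT performs close to BERT-Base while running 77 to 663 times faster than the BERT-3 and BERT-12 baselines, which makes it suitable for low-latency inference systems in production.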

ABSTRACT

Pre-trained language models like BERT have achieved great success in a wide variety of NLP tasks, while the superior performance comes with high demand in computational resources, which hinders the application in low-latency IR systems. We present TwinBERT model for effective and efficient retrieval, which has twin-structured BERT-like encoders to represent query and document respectively and a crossing layer to combine the embeddings and produce a similarity score. Different from BERT, where the two input sentences are concatenated and encoded together, TwinBERT decouples them during encoding and produces the embeddings for query and document independently, which allows document embeddings to be pre-computed offline and cached in memory. Thereupon, the computation left for run-time is from the query encoding and query-document crossing only. This single change can save large amount of computation time and resources, and therefore significantly improve serving efficiency. Moreover, a few well-designed network layers and training strategies are proposed to further reduce computational cost while at the same time keep the performance as remarkable as BERT model. Lastly, we develop two versions of TwinBERT for retrieval and relevance tasks correspondingly, and both of them achieve close or on-par performance to BERT-Base model. The model was trained following the teacher-student framework and evaluated with data from one of the major search engines. Experimental results showed that the inference time was significantly reduced and was firstly controlled around 20ms on CPUs while at the same time the performance gain from fine-tuned BERT-Base model was mostly retained. Integration of the models into production systems also demonstrated remarkable improvements on relevance metrics with negligible influence on latency.

MOTIVATION AND GOALS

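To address BERT's high inference latency in real-time information retrieval (IR) systems.
To enable efficient online serving of deep network models in low-latency settings such as sponsored search.
To cut computational cost substantially while maintaining strong retrieval and relevance performance.
To explore knowledge distillation techniques that improve inference efficiency while preserving performance.
To make it possible to deploy dense semantic models on CPUs without compromising relevance quality.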

PROPOSED METHOD

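TwinBERT uses two BERT-like encoders that process the query and the document separately. This decouples input encoding, unlike standard BERT, which concatenates the two inputs and encodes them together.
Document embeddings are pre-computed offline and cached in memory, which removes document encoding from inference.
A crossing layer combines the query and document embeddings, using either cosine similarity or a residual network, to compute a relevance score.
TwinBERT is trained through knowledge distillation with BERT-Base as the teacher model, which reduces model complexity while preserving performance.
ONNX Runtime is used to optimize CPU inference and minimize serving overhead in production.
Efficient network components and training strategies are designed to reduce computational cost without degrading performance.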
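
The decoupled encoding, offline caching, and cosine crossing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encode` is a hypothetical hashing stand-in for a BERT-like encoder tower, and `DIM`, `documents`, `doc_cache`, and `rank` are names invented for the example.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption, not from the paper)


def toy_encode(text: str) -> list:
    """Hypothetical stand-in for one encoder tower: hash tokens into DIM buckets."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    return vec


def cosine(u: list, v: list) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Offline step: encode every document once and keep the embeddings in memory.
documents = {
    "ad-1": "cheap running shoes",
    "ad-2": "luxury watch sale",
    "ad-3": "running shoes for trail",
}
doc_cache = {doc_id: toy_encode(text) for doc_id, text in documents.items()}


def rank(query: str, cache: dict) -> list:
    """Online step: encode the query once, then cross it with cached embeddings."""
    q = toy_encode(query)
    scores = [(doc_id, cosine(q, emb)) for doc_id, emb in cache.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank("cheap running shoes", doc_cache):
        print(doc_id, round(score, 3))
```

The point of the design is visible in `rank`: no document text is touched at query time, only the cached vectors.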

EXPERIMENTAL RESULTS

Research Questions

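RQ1: Can decoupling query and document encoding in BERT reduce inference latency while maintaining high retrieval performance?
RQ2: To what extent can a smaller, faster model trained through knowledge distillation retain BERT-level performance on IR tasks?
RQ3: How effective is pre-computing and caching document embeddings at reducing run-time computation in a retrieval system?
RQ4: Can TwinBERT match BERT-Base performance while keeping inference latency within 20 ms on CPUs?
RQ5: What impact does TwinBERT have on a production search system in terms of latency, accuracy, and deployability?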
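
RQ2 turns on the teacher-student setup. As a rough sketch, one common way to distill a relevance scorer is to train the student against the teacher's soft scores with binary cross-entropy. The review does not spell out the loss, so this formulation and the names `soft_label_bce` and `distillation_loss` are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_label_bce(student_logit: float, teacher_prob: float) -> float:
    """Binary cross-entropy of the student's score against a teacher soft label."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(sigmoid(student_logit), eps), 1.0 - eps)
    return -(teacher_prob * math.log(p) + (1.0 - teacher_prob) * math.log(1.0 - p))


def distillation_loss(student_logits: list, teacher_probs: list) -> float:
    """Mean soft-label loss over a batch of (query, document) pairs."""
    pairs = list(zip(student_logits, teacher_probs))
    return sum(soft_label_bce(s, t) for s, t in pairs) / len(pairs)


if __name__ == "__main__":
    # The teacher says the pair is relevant with probability 0.9.
    print(distillation_loss([2.0, -1.0], [0.9, 0.2]))
```

For a fixed teacher probability, this loss is smallest when the student's predicted probability equals the teacher's, which is what pushes the smaller model toward the teacher's behavior.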

Key Results

TwinBERT averaged about 20 ms of inference time on CPUs when scoring 100 documents per query, a large latency reduction compared with BERT.
Thanks to pre-computed document embeddings, the residual-network version of TwinBERT ran 77 times faster than BERT-3 and 422 times faster than BERT-12.
In production A/B testing, TwinBERT retained more than 90 percent of the performance gain of the fine-tuned BERT-12, and the bad-ad impression rate fell by more than 10 percent.
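TwinBERT models were deployed successfully in a major sponsored search system, with negligible impact on latency and high relevance quality.
The cosine-similarity version of TwinBERT was 121 times faster than BERT-3 and 663 times faster than BERT-12 under the same conditions.
Even when query embeddings are recomputed at run time, TwinBERT is faster than BERT-3, which demonstrates its efficiency advantage.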
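
A simple count of encoder forward passes helps explain where these savings come from. The sketch below is a back-of-the-envelope model, not a reproduction of the reported measurements: the function name and the three architecture labels are hypothetical, and the 100-document setting is taken from the first result above.

```python
def online_encoder_passes(num_docs: int, architecture: str) -> int:
    """Count transformer encoder forward passes needed at query time."""
    if architecture == "cross_encoder":
        # Each (query, document) pair is concatenated and encoded together.
        return num_docs
    if architecture == "twin_cached":
        # Only the query is encoded; document embeddings come from the cache.
        return 1
    if architecture == "twin_uncached":
        # The query is encoded once, plus one pass per document.
        return 1 + num_docs
    raise ValueError(f"unknown architecture: {architecture}")


if __name__ == "__main__":
    for name in ("cross_encoder", "twin_cached", "twin_uncached"):
        print(name, online_encoder_passes(100, name))
```

The pass count alone does not predict the measured speedups, which also depend on model depth, input sequence length, and the cost of the crossing layer applied to each document.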

This review was generated by AI and reviewed by a human editor.