QUICK REVIEW

[Paper Review] Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

Liqiang Yu, Bo Liu | arXiv (Cornell University) | 2024-01-06

Biomedical Text Mining and Ontologies | Citations: 30

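One-line summary

This paper proposes an ensemble of four BERT-family models and a new text preprocessing method (V3) to improve semantic similarity matching for patent phrases. The models are trained with BCELoss and evaluated on the U.S. Patent Phrase-to-Phrase Matching dataset.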

ABSTRACT

In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. Firstly, this study addresses these challenges, recognizing early CPC work while acknowledging past struggles with language barriers and document intricacy. Secondly, it underscores the persisting difficulties of CPC research. To overcome these challenges and bolster the CPC system, this paper presents two key innovations. Firstly, it introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging. Secondly, a novel text preprocessing method tailored for patent documents is introduced, featuring a distinctive input structure with token scoring that aids in capturing semantic relationships during CPC context training, utilizing BCELoss. Our experimental findings conclusively establish the effectiveness of both our Ensemble Model and novel text processing strategies when deployed on the U.S. Patent Phrase to Phrase Matching dataset.

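Motivation and objectives

Addresses the problem of semantic similarity in CPC-centered patent analysis.
Improves the accuracy and efficiency of CPC work through model ensembling and tailored text preprocessing.
Uses BCELoss-based token scores to capture semantic relationships in patent text.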

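Proposed method

Uses an ensemble of four deep learning models: two DeBERTaV3 variants (Microsoft DeBERTa-v3-large and MoritzLaurer DeBERTa-v3-large-mnli-fever-anli-ling-wanli), Anferico BERT-for-Patents, and Google ELECTRA-large-discriminator.
Combines the models' predictions by weighted averaging, with the weights chosen on validation data.
Introduces a new text preprocessing method, V3, that builds a fixed, structured input from a list of targets and scores, delimited by the [CLS], [SEP], and [TAR] tokens.
Within TrainDataset, assigns a score to each token during training and trains with BCELoss so that the predicted scores match the ground truth.
Evaluates with the Pearson correlation coefficient on the U.S. Patent Phrase-to-Phrase Matching dataset, using 4-fold cross-validation.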
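The training objective and the evaluation metric named in this list can be illustrated with a small standard-library sketch. It is not the authors' code: the function names `bce_loss` and `pearson`, the score values, and the two prediction lists are hypothetical, chosen only to show that binary cross-entropy accepts graded (soft) similarity targets and that Pearson correlation measures how well predictions follow them.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be soft labels in [0, 1]."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # correlation is undefined for constant input; report 0
    return cov / (std_x * std_y)


# Hypothetical graded similarity labels and two sets of predicted probabilities.
targets = [0.0, 0.25, 0.5, 0.75, 1.0]
close_preds = [0.1, 0.3, 0.5, 0.7, 0.9]     # follows the labels
reversed_preds = [0.9, 0.7, 0.5, 0.3, 0.1]  # ranks the pairs backwards

for name, preds in [("close", close_preds), ("reversed", reversed_preds)]:
    print(name, bce_loss(preds, targets), pearson(preds, targets))
```

In this toy data, predictions that follow the labels give a lower loss and a higher correlation than predictions that rank the pairs backwards. In a real training loop the probabilities would come from a sigmoid over the model output, and the metric would be computed per fold and averaged.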

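Experimental results

Research questions

RQ1: Can an ensemble of several BERT-related models outperform a single model on patent phrase similarity tasks?
RQ2: Does the V3 text preprocessing method improve the capture of semantic similarity in CPC-context training?
RQ3: How does token-level scoring with BCELoss affect model training and performance in patent phrase matching?
RQ4: How does the ensemble perform on the U.S. Patent Phrase-to-Phrase Matching dataset compared with the individual models?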

Key results

Of the V1, V2, and V3 preprocessing variants, V3 performs best; the DeBERTa-v3-large variant reaches a CV score of 0.8512.
The ensemble model achieved the highest CV score of all models considered, 0.8534.
Individual model scores: Microsoft/DeBERTa-v3-large (0.8512 CV), Anferico/BERT-for-Patents (0.8382 CV), Google/ELECTRA-large (0.8503 CV), and MoritzLaurer/DeBERTa-v3-large (0.8385 CV).
Overall, the ensemble outperforms every single-model variant on the target dataset in terms of Pearson correlation.
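The ensemble result rests on weighted averaging with weights tuned on validation data. The review does not say how the weights were searched, so the sketch below assumes a plain grid search over small integer weights that maximizes Pearson correlation. The helper names (`weighted_average`, `search_weights`) and the toy predictions are hypothetical.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0
    return cov / (std_x * std_y)


def weighted_average(preds_per_model, weights):
    """Blend per-model predictions using weights normalized to sum to 1."""
    total = sum(weights)
    n_samples = len(preds_per_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, preds_per_model)) / total
        for i in range(n_samples)
    ]


def search_weights(preds_per_model, targets, steps=4):
    """Assumed strategy: try every integer weight combination in 0..steps."""
    best_weights, best_score = None, -2.0
    for weights in itertools.product(range(steps + 1), repeat=len(preds_per_model)):
        if sum(weights) == 0:
            continue  # the all-zero combination cannot be normalized
        score = pearson(weighted_average(preds_per_model, weights), targets)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Hypothetical validation labels and predictions from two models.
targets = [0.0, 0.25, 0.5, 0.75, 1.0]
model_a = [0.2, 0.2, 0.6, 0.6, 0.9]
model_b = [0.1, 0.4, 0.4, 0.9, 0.8]

best_weights, best_score = search_weights([model_a, model_b], targets)
print(best_weights, best_score)
```

The grid includes the combinations that put all weight on one model, so on the validation data the best blend can never score below the best individual model. Whether that advantage carries over to unseen data is an empirical question, which is why the reported scores are cross-validated.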

This review was generated by AI and reviewed by a human editor.