QUICK REVIEW

[Paper Review] Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

Liqiang Yu, Bo Liu|arXiv (Cornell University)|Jan 6, 2024

Biomedical Text Mining and Ontologies | Cited by 30

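One-line summary

This paper proposes an ensemble of four BERT-related models and a new text preprocessing method (V3) to improve semantic similarity matching of patent phrases. The approach is evaluated on the U.S. Patent Phrase-to-Phrase Matching dataset, and the models are trained with BCELoss.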

ABSTRACT

In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. Firstly, this study addresses these challenges, recognizing early CPC work while acknowledging past struggles with language barriers and document intricacy. Secondly, it underscores the persisting difficulties of CPC research. To overcome these challenges and bolster the CPC system, this paper presents two key innovations. Firstly, it introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging. Secondly, a novel text preprocessing method tailored for patent documents is introduced, featuring a distinctive input structure with token scoring that aids in capturing semantic relationships during CPC context training, utilizing BCELoss. Our experimental findings conclusively establish the effectiveness of both our Ensemble Model and novel text processing strategies when deployed on the U.S. Patent Phrase to Phrase Matching dataset.

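Motivation and objectives

Address the difficulty of measuring semantic similarity in patent analysis, with a focus on Cooperative Patent Classification (CPC).
Improve the accuracy and efficiency of CPC-related analysis through model ensembling and text preprocessing tailored to patent documents.
Use BCELoss-based token scoring to capture semantic relationships in patent documents.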

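Proposed method

An ensemble of four BERT-related models: Microsoft DeBERTa-v3-large (DeBERTaV3), MoritzLaurer DeBERTa-v3-large-mnli-fever-anli-ling-wanli, Anferico BERT-for-Patents, and Google ELECTRA-large-discriminator.
Model predictions are combined by weighted averaging, with the weights optimized on validation data.
A new text preprocessing method, V3, groups each anchor-context pair with its targets and a score list, and builds a structured input that uses [CLS], [SEP], and [TAR] tokens.
Within TrainDataset, every token is assigned a score during training, and the model is trained with BCELoss so that predicted scores match the ground truth.
Pearson correlation is evaluated on the U.S. Patent Phrase-to-Phrase Matching dataset using 4-fold cross-validation.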
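The grouping and loss steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the review does not give the exact V3 layout, so the token order in `build_input`, the function names, and the example rows are all hypothetical. Scores are kept per target here for simplicity, whereas the method is described as assigning scores per token, which would require a real tokenizer. The `bce` function follows the standard binary cross-entropy formula, which accepts soft labels such as 0.25 or 0.75.

```python
import math
from collections import defaultdict


def group_rows(rows):
    """Group (anchor, context, target, score) rows by their (anchor, context) pair."""
    groups = defaultdict(list)
    for anchor, context, target, score in rows:
        groups[(anchor, context)].append((target, score))
    return groups


def build_input(anchor, context, targets):
    """Build one grouped input string and its score list.

    ASSUMED layout (not confirmed by the source):
        [CLS] anchor [SEP] context [SEP] target_1 [TAR] target_2 [TAR] ...
    The special tokens are written as literal strings for readability. With a
    real tokenizer, [CLS] and [SEP] are normally added automatically and a
    custom token such as [TAR] would have to be registered first.
    """
    target_part = " ".join(f"{target} [TAR]" for target, _ in targets)
    text = f"[CLS] {anchor} [SEP] {context} [SEP] {target_part}"
    scores = [score for _, score in targets]
    return text, scores


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one prediction; target may be a soft label in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def mean_bce(preds, targets):
    """Average binary cross-entropy over paired predictions and targets."""
    return sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)


# Rows invented for illustration; they are not taken from the dataset.
rows = [
    ("heat exchanger", "F28", "thermal transfer unit", 0.75),
    ("heat exchanger", "F28", "cooling fin", 0.5),
    ("heat exchanger", "H01", "battery cooler", 0.25),
]

for (anchor, context), targets in group_rows(rows).items():
    text, scores = build_input(anchor, context, targets)
    print(text, scores)
```

Grouping puts all targets for one anchor-context pair into a single sequence, so the model can compare them against each other in one forward pass. Whether that matches the paper's motivation exactly is not stated in this review.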

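Experimental results

Research questions

RQ1: Can an ensemble of several BERT-related models outperform a single model on the patent phrase similarity task?
RQ2: Does the V3 text preprocessing method improve the capture of semantic similarity during CPC-context training?
RQ3: How does token-level scoring with BCELoss affect model training and performance in patent phrase matching?
RQ4: How does the ensemble perform on the U.S. Patent Phrase-to-Phrase Matching dataset compared with the individual models?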

Key findings

Among the V1, V2, and V3 preprocessing variants, V3 performs best, reaching a CV score of 0.8512 with the DeBERTa-v3-large-based model.
The ensemble reaches a CV score of 0.8534, higher than any of its member models.
Individual model CV scores: Microsoft/DeBERTa-v3-large 0.8512, Anferico/BERT-for-Patents 0.8382, Google/ELECTRA-large 0.8503, MoritzLaurer/DeBERTa-v3-large 0.8385.
Overall, the ensemble outperforms every single-model variant in Pearson correlation on the target dataset.
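The ensemble comparison above rests on a weighted average of model predictions scored by Pearson correlation. The sketch below shows one way such weights could be chosen on validation data. The coarse grid search, the function names, and the synthetic predictions are assumptions for illustration; this review does not say how the paper optimizes its weights.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0  # constant input: define as 0


def weighted_average(preds, weights):
    """Combine per-model prediction lists using one weight per model."""
    return [
        sum(w * model[i] for w, model in zip(weights, preds))
        for i in range(len(preds[0]))
    ]


def search_weights(preds, labels, step=0.1):
    """Grid-search non-negative weights that sum to 1, maximizing Pearson."""
    ticks = round(1 / step)
    best_w, best_r = None, -2.0
    for combo in itertools.product(range(ticks + 1), repeat=len(preds) - 1):
        if sum(combo) > ticks:
            continue  # the last weight would be negative
        weights = [c / ticks for c in combo] + [(ticks - sum(combo)) / ticks]
        r = pearson(weighted_average(preds, weights), labels)
        if r > best_r:
            best_w, best_r = weights, r
    return best_w, best_r


# Synthetic validation labels and predictions, invented for illustration.
val_labels = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.75]
val_preds = [
    [0.10, 0.30, 0.40, 0.60, 0.90, 0.70, 0.20, 0.65],  # hypothetical model A
    [0.05, 0.15, 0.60, 0.80, 0.85, 0.40, 0.40, 0.70],  # hypothetical model B
    [0.20, 0.35, 0.45, 0.70, 0.80, 0.55, 0.10, 0.90],  # hypothetical model C
]

best_w, best_r = search_weights(val_preds, val_labels)
print("weights:", best_w, "pearson:", round(best_r, 4))
```

Because the grid includes the case where one model receives all the weight, the selected ensemble can never score below its best member on the data used for the search. The gain on held-out data may be smaller, which is why the weights should be fitted on validation folds rather than on the test set.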

This review was generated by AI and checked by a human editor.