QUICK REVIEW

[Paper Review] Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

Liqiang Yu, Bo Liu|arXiv (Cornell University)|Jan 6, 2024

Biomedical Text Mining and Ontologies | Cited by 30

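One-Sentence Summary

This paper proposes an ensemble of four BERT-related models together with a novel text preprocessing method (V3) to improve semantic similarity matching of patent phrases; the models are trained with BCELoss and evaluated on the U.S. Patent Phrase to Phrase Matching dataset.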

ABSTRACT

In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. Firstly, this study addresses these challenges, recognizing early CPC work while acknowledging past struggles with language barriers and document intricacy. Secondly, it underscores the persisting difficulties of CPC research. To overcome these challenges and bolster the CPC system, this paper presents two key innovations. Firstly, it introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging. Secondly, a novel text preprocessing method tailored for patent documents is introduced, featuring a distinctive input structure with token scoring that aids in capturing semantic relationships during CPC context training, utilizing BCELoss. Our experimental findings conclusively establish the effectiveness of both our Ensemble Model and novel text processing strategies when deployed on the U.S. Patent Phrase to Phrase Matching dataset.

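Research Motivation and Objectives

Address the semantic similarity challenges in CPC-focused patent analysis.
Improve accuracy and efficiency for CPC through model ensembling and tailored text preprocessing.
Use BCELoss-based token scores to capture semantic relationships in patent text.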

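Proposed Method

Use an ensemble of four deep learning models: two DeBERTaV3 models (Microsoft DeBERTa-v3-large and MoritzLaurer DeBERTa-v3-large-mnli-fever-anli-ling-wanli), Anferico BERT-for-Patents, and Google ELECTRA-large-discriminator.
Combine the model predictions by weighted averaging, with the weights optimized on validation data.
Introduce a novel text preprocessing method, V3, that groups the data by anchor-context pair into sets of targets and scores; the input structure uses the [CLS], [SEP], and [TAR] tokens.
During training, the TrainDataset assigns a score to each token, and the model is trained with BCELoss so that the predicted scores align with the ground truth.
Evaluate with 4-fold cross-validation using Pearson correlation on the U.S. Patent Phrase-to-Phrase Matching dataset.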
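
The weighted-averaging step can be illustrated with a minimal sketch. It is not the authors' implementation: the grid search, the step size, the function names (pearson, blend, search_weights), and the toy predictions are all assumptions made for illustration. The review only states that the weights are optimized on validation data and that Pearson correlation is the metric.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def blend(preds, weights):
    """Weighted average of per-model predictions, one weight per model."""
    n_samples = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(n_samples)]


def search_weights(preds, targets, step=0.25):
    """Grid search (an assumed strategy) over non-negative weights that sum
    to 1, keeping the combination with the highest validation Pearson."""
    k = round(1 / step)
    best_w, best_r = None, -2.0
    for combo in itertools.product(range(k + 1), repeat=len(preds)):
        if sum(combo) != k:
            continue  # keep only weight vectors that sum to 1
        w = [c / k for c in combo]
        r = pearson(blend(preds, w), targets)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Hypothetical validation scores and predictions from four models.
targets = [0.0, 0.25, 0.5, 0.75, 1.0]
model_preds = [
    [0.1, 0.3, 0.4, 0.8, 0.9],
    [0.2, 0.2, 0.6, 0.6, 1.0],
    [0.0, 0.4, 0.5, 0.7, 0.8],
    [0.3, 0.1, 0.5, 0.9, 0.7],
]
best_w, best_r = search_weights(model_preds, targets)
print("weights:", best_w, "validation Pearson:", round(best_r, 4))
```

Because the grid includes the weight vectors that put all mass on a single model, the selected blend can never score below the best individual model on the validation data it was tuned on. Held-out performance is a separate question.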

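Experimental Results

Research Questions

RQ1: Can an ensemble of multiple BERT-related models outperform a single model on the patent phrase similarity task?
RQ2: Does the V3 text preprocessing method improve the capture of semantic similarity in CPC-context training?
RQ3: In patent phrase matching, how does training on token-level scores with BCELoss affect model training and performance?
RQ4: How does the ensemble perform on the U.S. Patent Phrase-to-Phrase Matching dataset compared with single models?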
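
For RQ3, it may help to see what binary cross-entropy computes when the target is a graded similarity score rather than a hard 0/1 label. The sketch below is an illustration in plain Python, not the paper's training code. The function names, the sigmoid applied to raw logits, and the clipping constant eps are assumptions.

```python
import math


def sigmoid(z):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_soft_targets(logits, targets, eps=1e-7):
    """Mean binary cross-entropy between sigmoid(logit) and targets in [0, 1].

    Applying a sigmoid here is an assumption; a loss such as
    torch.nn.BCELoss expects probabilities as its input.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Clip to avoid log(0) for saturated predictions.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)


# Hypothetical per-token logits and graded similarity scores.
logits = [-2.0, -1.0, 0.0, 1.0, 2.0]
scores = [0.0, 0.25, 0.5, 0.75, 1.0]
print("loss:", bce_with_soft_targets(logits, scores))
```

With a soft target y, the loss for a single prediction is minimized when the predicted probability equals y, which is why BCE can be used to regress graded scores in [0, 1] and not only to classify.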

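Key Findings

Among the V1, V2, and V3 preprocessing variants, V3 performs best; the DeBERTa-v3-large-based variant reaches a CV score of 0.8512.
The ensemble achieves the highest CV score of all included models, 0.8534.
Individual model scores: Microsoft/DeBERTa-v3-large (0.8512 CV), Anferico/BERT-for-Patents (0.8382 CV), Google/ELECTRA-large (0.8503 CV), and MoritzLaurer/DeBERTa-v3-large (0.8385 CV).
Overall, the ensemble's Pearson correlation on the target dataset exceeds that of every single-model variant.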

This review was generated by AI and reviewed by a human editor.