QUICK REVIEW

[Paper Review] X-BERT: eXtreme Multi-label Text Classification with BERT

Wei-Cheng Chang, Hsiang-Fu Yu | arXiv (Cornell University) | May 7, 2019

Text and Document Classification Technologies | Cited by 7

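One-Sentence Summary

X-BERT is a fine-tuning-based BERT approach to extreme multi-label text classification (XMC) that jointly models document and label text to learn semantic label clusters and capture label dependencies. On a Wiki dataset with about 0.5 million labels it achieves state-of-the-art performance, with a precision@1 of 67.80%, an 11.31% relative improvement over Parabel.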

ABSTRACT

Extreme multi-label text classification (XMC) aims to tag each input text with the most relevant labels from an extremely large label set, such as those that arise in product categorization and e-commerce recommendation. Recently, pretrained language representation models such as BERT achieve remarkable state-of-the-art performance across a wide range of NLP tasks including sentence classification among small label sets (typically fewer than thousands). Indeed, there are several challenges in applying BERT to the XMC problem. The main challenges are: (i) the difficulty of capturing dependencies and correlations among labels, whose features may come from heterogeneous sources, and (ii) the tractability to scale to the extreme label setting as the model size can be very large and scale linearly with the size of the output space. To overcome these challenges, we propose X-BERT, the first feasible attempt to finetune BERT models for a scalable solution to the XMC problem. Specifically, X-BERT leverages both the label and document text to build label representations, which induces semantic label clusters in order to better model label dependencies. At the heart of X-BERT is finetuning BERT models to capture the contextual relations between input text and the induced label clusters. Finally, an ensemble of the different BERT models trained on heterogeneous label clusters leads to our best final model. Empirically, on a Wiki dataset with around 0.5 million labels, X-BERT achieves new state-of-the-art results where the precision@1 reaches 67.80%, a substantial improvement over 32.58%/60.91% of deep learning baseline fastText and competing XMC approach Parabel, respectively. This amounts to a 11.31% relative improvement over Parabel, which is indeed significant since the recent approach SLICE only has 5.53% relative improvement.

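Research Motivation and Goals

Address the challenge in extreme multi-label text classification (XMC) of modeling complex label dependencies over a very large label set.
Scale BERT-based models efficiently to the extreme-label setting, where model size grows linearly with the size of the output space.
Improve XMC performance by jointly modeling document and label text to induce semantic label clusters.
Develop a scalable fine-tuned BERT solution that outperforms existing deep learning and XMC-specific baselines.
Demonstrate significant gains on a large-scale XMC benchmark by ensembling models trained on heterogeneous label clusters.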
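The linear growth of model size with the output space can be made concrete with a little arithmetic. The sketch below counts the parameters of a dense one-vs-all output layer placed on top of an encoder. The hidden size of 768 (BERT-base) and the cluster count of 8,192 are illustrative assumptions, not values taken from the paper.

```python
def output_layer_params(hidden_size: int, num_outputs: int) -> int:
    """Weights plus biases of a dense output layer with one unit per output."""
    return hidden_size * num_outputs + num_outputs


HIDDEN = 768            # assumption: BERT-base hidden size
NUM_LABELS = 500_000    # roughly the label count of the Wiki dataset
NUM_CLUSTERS = 8_192    # hypothetical number of label clusters

per_label = output_layer_params(HIDDEN, NUM_LABELS)
per_cluster = output_layer_params(HIDDEN, NUM_CLUSTERS)

print(f"one output per label:   {per_label:,} parameters")
print(f"one output per cluster: {per_cluster:,} parameters")
print(f"reduction factor:       {per_label / per_cluster:.1f}x")
```

Under these assumptions the per-label head alone has several hundred million parameters, which is why predicting clusters of labels rather than individual labels is attractive.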

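Proposed Method

X-BERT builds label representations by jointly encoding document text and label text, so as to capture semantic relations.
BERT is fine-tuned to model the contextual interaction between the input text and the induced label clusters, which strengthens the learning of label dependencies.
Label clusters are formed from the semantic similarity given by the joint document-label representations, which gives label correlations an explicit structure.
The model ensembles several BERT variants trained on different, heterogeneous label clusterings to improve generalization and robustness.
Fine-tuning is carried out end to end on the joint representation space to optimize XMC evaluation metrics such as precision@1.
Clustering shrinks the effective label space while preserving semantic coherence, which makes efficient scaling possible.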
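The clustering steps above can be sketched with NumPy on toy data. This is a minimal illustration, not the authors' implementation: concatenating label-text features with the mean features of each label's positive documents, the plain k-means routine, and all function names are assumptions made for the example.

```python
import numpy as np


def label_embeddings(doc_feats, label_text_feats, Y):
    """Concatenate each label's text features with the mean features of its
    positive documents, then L2-normalize (an assumed way to combine them)."""
    counts = np.maximum(Y.sum(axis=0), 1)[:, None]   # positives per label
    doc_part = (Y.T @ doc_feats) / counts            # (num_labels, doc_dim)
    emb = np.hstack([label_text_feats, doc_part])
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)


def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; returns one cluster id per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = X[assign == c]
            if len(members):                         # keep old center if empty
                centers[c] = members.mean(axis=0)
    return assign


def cluster_targets(Y, assign, k):
    """A cluster is a positive target for a document if any of the
    document's labels belongs to that cluster."""
    membership = np.zeros((Y.shape[1], k), dtype=int)
    membership[np.arange(Y.shape[1]), assign] = 1
    return (Y @ membership > 0).astype(int)


# Toy data: 8 documents, 6 labels, 3 clusters.
rng = np.random.default_rng(0)
doc_feats = rng.normal(size=(8, 5))
label_text_feats = rng.normal(size=(6, 4))
Y = (rng.random((8, 6)) < 0.3).astype(int)

emb = label_embeddings(doc_feats, label_text_feats, Y)
assign = kmeans(emb, k=3)
targets = cluster_targets(Y, assign, k=3)
print("label -> cluster:", assign)
print("document-to-cluster targets shape:", targets.shape)
```

A classifier fine-tuned on `targets` only needs one output per cluster. Ranking the individual labels inside the predicted clusters is a separate step that this sketch does not cover.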

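Experimental Results

Research Questions

RQ1: Can BERT be fine-tuned effectively for extreme multi-label text classification with around 0.5 million labels?
RQ2: How can label dependencies and correlations be modeled effectively in the extreme multi-label setting?
RQ3: Does jointly encoding document and label text improve the semantic clustering of labels and the downstream classification performance?
RQ4: In XMC, how much does an ensemble of BERT models trained on heterogeneous label clusters gain over a single-model baseline?
RQ5: On a large-scale dataset, how does X-BERT compare in precision@1 with the deep learning baseline fastText and the competing XMC approach Parabel?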
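Since the comparison rests on precision@1, it may help to see the metric spelled out. The sketch below computes precision@k in the usual way: the fraction of the top-k scored labels that are relevant, averaged over documents. The function name and the toy scores are illustrative.

```python
import numpy as np


def precision_at_k(scores, Y_true, k=1):
    """Mean over documents of (relevant labels among the top-k scores) / k."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k best scores
    hits = np.take_along_axis(Y_true, topk, axis=1)  # 1 where a top-k label is relevant
    return hits.sum(axis=1).mean() / k


scores = np.array([[0.9, 0.1, 0.3],
                   [0.2, 0.8, 0.5]])
Y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])

print(precision_at_k(scores, Y_true, k=1))  # 0.5: only the first document's top label is relevant
print(precision_at_k(scores, Y_true, k=2))  # 0.5: each document has one relevant label in its top two
```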

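Key Findings

X-BERT reaches a precision@1 of 67.80% on a Wiki dataset with about 0.5 million labels, a new state of the art.
Against the strong baseline Parabel (60.91%), this is an 11.31% relative improvement; fastText reaches 32.58%.
SLICE achieves only a 5.53% relative improvement, so X-BERT's gain is more than twice as large.
Joint document-label encoding yields better semantic clustering of labels, which in turn improves the modeling of label dependencies.
The ensemble of BERT models trained on heterogeneous label clusters gives the best final model, outperforming single-model baselines.
By combining label clustering with fine-tuning, X-BERT scales BERT to the extreme-label setting and addresses the problem of model size growing linearly with the output space.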
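The relative-improvement figures can be checked directly from the precision@1 values quoted in the abstract. The snippet below is plain arithmetic; the function name is illustrative.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


X_BERT, PARABEL, FASTTEXT = 67.80, 60.91, 32.58  # precision@1 in percent
SLICE_GAIN = 5.53                                # relative improvement reported for SLICE

gain_over_parabel = relative_improvement(X_BERT, PARABEL)
gain_over_fasttext = relative_improvement(X_BERT, FASTTEXT)

print(f"over Parabel:  {gain_over_parabel:.2f}%")   # 11.31%
print(f"over fastText: {gain_over_fasttext:.2f}%")
print(f"ratio to SLICE gain: {gain_over_parabel / SLICE_GAIN:.2f}x")
```

The absolute gain over Parabel is 6.89 percentage points; dividing by Parabel's 60.91 gives the 11.31% relative figure.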


This review was generated by AI and checked by a human editor.