QUICK REVIEW

[Paper Review] DocBERT: BERT for Document Classification

Ashutosh Adhikari, Achyudh Ram|arXiv (Cornell University)|Apr 17, 2019

Text and Document Classification Technologies|References 23|Cited by 215

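One-Sentence Summary

Fine-tuned BERT achieves state-of-the-art results on four document classification datasets; the distilled KD-LSTM reg matches BERT base on several of them with roughly 30x fewer parameters and about 40x faster inference.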

ABSTRACT

We present, to our knowledge, the first application of BERT to document classification. A few characteristics of the task might lead one to think that BERT is not the most appropriate model: syntactic structures matter less for content categories, documents can often be longer than typical BERT input, and documents often have multiple labels. Nevertheless, we show that a straightforward classification model using BERT is able to achieve the state of the art across four popular datasets. To address the computational expense associated with BERT inference, we distill knowledge from BERT-large to small bidirectional LSTMs, reaching BERT-base parity on multiple datasets using 30x fewer parameters. The primary contribution of our paper is improved baselines that can provide the foundation for future work.

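Research Motivation and Goals

Show that BERT, fine-tuned on standard document classification datasets, achieves state-of-the-art results.
Examine whether BERT is practical for long, multi-label documents in the common setting of one to four labels per document.
Address BERT's computational cost by distilling its knowledge into a smaller model (KD-LSTM reg) with faster inference.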

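Proposed Method

Fine-tune BERT base and BERT large for document classification by adding a final classification layer on top of the [CLS] token representation.
Optimize with cross-entropy loss for single-label tasks and binary cross-entropy loss for multi-label tasks.
Distill the fine-tuned BERT large into a lightweight single-layer BiLSTM (LSTM reg) using KL divergence computed on a transfer set.
Train the student model (KD-LSTM reg) on a weighted sum of the classification loss and the distillation loss.
Build the transfer set with part-of-speech-guided word replacement and random masking to improve distillation.
Evaluate on Reuters, AAPD, IMDB, and Yelp 2014 using the standard splits and previously reported baselines.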
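The loss terms named in the method list can be sketched in plain Python for a single example. This is a minimal illustration, not the authors' implementation: the function names, the `weight` value, the direction of the KL term (teacher relative to student), and the absence of a softmax temperature are all assumptions made here, and only the single-label case is shown for the distillation term.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy(logits, target_index):
    """Single-label loss: negative log-probability of the gold class."""
    return -math.log(softmax(logits)[target_index])


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Multi-label loss: mean binary cross-entropy over independent labels."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one categorical distribution."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


def distillation_objective(student_logits, teacher_logits, target_index, weight=0.5):
    """Weighted sum of classification loss and distillation loss.

    `weight` is a hypothetical hyperparameter; the summary does not give its value.
    """
    classification = cross_entropy(student_logits, target_index)
    distillation = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return classification + weight * distillation


# Toy logits for a 3-class problem (made-up numbers, for illustration only).
student = [1.0, 0.2, -0.5]
teacher = [2.0, 0.1, -1.0]
loss = distillation_objective(student, teacher, target_index=0)
```

When the student's logits equal the teacher's, the KL term vanishes and the objective reduces to the ordinary classification loss, which is a quick sanity check on any implementation of this kind.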
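The transfer-set construction in the method list (part-of-speech-guided word replacement plus random masking) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the probabilities `p_mask` and `p_pos`, the number of copies per document, the toy `pos_vocab` lookup, and the pre-tagged input (a real pipeline would obtain tags from a POS tagger). The augmented texts would then be scored by the fine-tuned teacher to provide distillation targets.

```python
import random

MASK = "[MASK]"


def augment(tokens, pos_tags, pos_vocab, p_mask=0.1, p_pos=0.1, rng=None):
    """Return one perturbed copy of a POS-tagged token sequence.

    Each token is masked with probability p_mask, otherwise replaced by a
    random word sharing its POS tag with probability p_pos, otherwise kept.
    """
    rng = rng or random.Random()
    out = []
    for token, tag in zip(tokens, pos_tags):
        draw = rng.random()
        candidates = pos_vocab.get(tag, [])
        if draw < p_mask:
            out.append(MASK)
        elif draw < p_mask + p_pos and candidates:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return out


def build_transfer_set(tagged_docs, pos_vocab, copies=3, seed=0):
    """Keep each original document and add `copies` perturbed variants of it."""
    rng = random.Random(seed)
    transfer = []
    for tokens, tags in tagged_docs:
        transfer.append(list(tokens))
        for _ in range(copies):
            transfer.append(augment(tokens, tags, pos_vocab, rng=rng))
    return transfer


# Hypothetical toy data: a tiny POS-indexed vocabulary and one tagged document.
pos_vocab = {
    "NOUN": ["market", "report", "model"],
    "VERB": ["rises", "falls"],
}
doc = (["oil", "price", "rises"], ["NOUN", "NOUN", "VERB"])
transfer = build_transfer_set([doc], pos_vocab, copies=4, seed=13)
```

Seeding the generator keeps the transfer set reproducible, which matters when the teacher's outputs on it are cached and reused across student training runs.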

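Experimental Results

Research Questions

RQ1: Does fine-tuned BERT set a new state of the art on standard document classification datasets?
RQ2: Can a lightweight BiLSTM approach BERT base performance through knowledge distillation?
RQ3: What are the trade-offs among accuracy, model size, and inference time between BERT and the distilled student?
RQ4: How do different datasets (single-label vs. multi-label) affect the training dynamics and performance of BERT fine-tuning?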

Key Findings

Model	Reuters Val. F1	Reuters Test F1	AAPD Val. F1	AAPD Test F1	IMDB Val. Acc.	IMDB Test Acc.	Yelp Val. Acc.	Yelp Test Acc.
LSTM reg	89.1 ±0.8	87.0 ±0.5	73.1 ±0.4	70.5 ±0.5	53.4 ±0.2	52.8 ±0.3	69.0 ±0.1	68.7 ±0.1
BERT base	90.5	89.0	75.3	73.4	54.4	54.2	72.1	72.0
BERT large	92.3	90.7	76.6	75.2	56.0	55.6	72.6	72.5
KD-LSTM reg	91.0 ±0.2	88.9 ±0.2	75.4 ±0.2	72.9 ±0.3	54.5 ±0.1	53.7 ±0.3	69.7 ±0.1	69.4 ±0.1

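BERT large achieves state-of-the-art results on all four datasets.
BERT base also performs strongly, close behind BERT large.
KD-LSTM reg reaches parity with BERT base on Reuters, AAPD, and IMDB while providing a substantial speedup (inference at least 40x faster).
KD-LSTM reg has roughly 1–3% of the parameters of BERT base while remaining competitive in accuracy across datasets.
Relative to BERT base, the distilled model sharply reduces inference latency (roughly 40x faster on the hardware tested).
The distilled model shows that a much simpler architecture can recover most of BERT's performance with far fewer parameters.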
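The parity claim can be checked directly against the test columns of the results table. The short script below copies the test-set means from that table (the ± spreads are omitted) and computes how far KD-LSTM reg sits below BERT base and how much it gains over the undistilled LSTM reg; the dictionary layout and function names are illustrative choices, not part of the paper.

```python
# Test-set scores copied from the results table (means only).
TEST_SCORES = {
    "Reuters": {"LSTM reg": 87.0, "BERT base": 89.0, "BERT large": 90.7, "KD-LSTM reg": 88.9},
    "AAPD": {"LSTM reg": 70.5, "BERT base": 73.4, "BERT large": 75.2, "KD-LSTM reg": 72.9},
    "IMDB": {"LSTM reg": 52.8, "BERT base": 54.2, "BERT large": 55.6, "KD-LSTM reg": 53.7},
    "Yelp 2014": {"LSTM reg": 68.7, "BERT base": 72.0, "BERT large": 72.5, "KD-LSTM reg": 69.4},
}


def gap_to_bert_base(dataset):
    """Points by which KD-LSTM reg trails BERT base on the test set."""
    scores = TEST_SCORES[dataset]
    return round(scores["BERT base"] - scores["KD-LSTM reg"], 1)


def gain_from_distillation(dataset):
    """Points gained by KD-LSTM reg over the undistilled LSTM reg."""
    scores = TEST_SCORES[dataset]
    return round(scores["KD-LSTM reg"] - scores["LSTM reg"], 1)


for name in TEST_SCORES:
    print(
        f"{name}: +{gain_from_distillation(name)} over LSTM reg, "
        f"{gap_to_bert_base(name)} below BERT base"
    )
```

On Reuters, AAPD, and IMDB the student lands within about half a point of BERT base, while on Yelp 2014 the gap is 2.6 points, which is consistent with parity being claimed for three of the four datasets.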


This review was generated by AI and reviewed by human editors.