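[Paper Explained] Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
This paper proposes WKLM, a weakly supervised pretraining objective that forces entity-centric knowledge learning from unstructured text, improving entity-related question answering and fine-grained entity typing over BERT baselines. It trains on Wikipedia with entity replacement to inject real-world entity knowledge, with no extra memory or architectural changes needed downstream.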
Recent breakthroughs of pretrained language models have shown the effectiveness of self-supervised learning for a wide range of natural language processing (NLP) tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average improvement of 2.7 F1 points, and on a standard fine-grained entity typing dataset (i.e., FIGER) with an accuracy gain of 5.7 points.
Research Motivation and Objectives
- Investigate whether pretrained models implicitly capture real-world entity knowledge, and quantify the extent via a zero-shot fact completion task.
- Introduce a weakly supervised knowledge learning objective that explicitly teaches models about real-world entities from unstructured text.
- Show that knowledge-enriched pretraining improves entity-related QA datasets and fine-grained entity typing beyond standard BERT baselines.
Proposed Method
- Entity-centric pretraining with weak supervision via entity replacement: replace mentions with same-type entities and train the model to detect replacement.
- Use boundary-word representations of entities to predict P(e|C) and distinguish true vs false knowledge statements.
- Combine the knowledge-learning objective with masked language model (MLM) loss in a multi-task pretraining setup on Wikipedia and BooksCorpus.
- Keep the standard BERT architecture: downstream tasks need no extra memory or architectural changes.
- Perform ablations to compare WKLM against MLM-only and extended MLM baselines to isolate the knowledge-learning contribution.
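The entity-replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ENTITIES_BY_TYPE` inventory, the `replace_mentions` helper, the character-span mention format, and the `replace_prob` value are all assumptions made for the example. The labels it produces (1 for an original mention, 0 for a replaced one) are the kind of weak supervision a binary "was this entity replaced?" classifier could be trained on alongside the MLM loss.

```python
import random

# Hypothetical toy inventory of entities grouped by type. A real pipeline
# would build this from Wikipedia anchor links plus an entity-type resource.
ENTITIES_BY_TYPE = {
    "person": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "city": ["Paris", "London", "Warsaw"],
}


def replace_mentions(text, mentions, replace_prob=0.5, rng=None):
    """Return (new_text, labels).

    mentions: list of (start, end, entity_type) character spans in `text`,
    sorted by start and non-overlapping.
    labels[i] is 1 if mention i was kept, 0 if it was swapped for a
    different entity of the same type.
    """
    rng = rng or random.Random()
    pieces, labels, cursor = [], [], 0
    for start, end, entity_type in mentions:
        original = text[start:end]
        pieces.append(text[cursor:start])  # context before the mention
        # Same-type candidates, excluding the original entity itself.
        candidates = [e for e in ENTITIES_BY_TYPE[entity_type] if e != original]
        if candidates and rng.random() < replace_prob:
            pieces.append(rng.choice(candidates))
            labels.append(0)  # negative example: false knowledge statement
        else:
            pieces.append(original)
            labels.append(1)  # positive example: original statement
        cursor = end
    pieces.append(text[cursor:])  # trailing context
    return "".join(pieces), labels


if __name__ == "__main__":
    sentence = "Marie Curie was born in Warsaw."
    spans = [(0, 11, "person"), (24, 30, "city")]
    print(replace_mentions(sentence, spans, rng=random.Random(0)))
```

Restricting replacements to the same entity type keeps the corrupted sentence grammatical, so the model has to rely on factual knowledge rather than surface cues to tell the two cases apart.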
Experimental Results
Research Questions
- RQ1: Can large-scale pretraining encode explicit entity-level knowledge beyond standard MLM objectives?
- RQ2: Does a weakly supervised knowledge-learning objective improve entity-related tasks without external knowledge bases?
- RQ3: How does WKLM perform on zero-shot fact completion and downstream entity-centric QA and typing tasks compared to BERT and GPT-2?
- RQ4: What is the impact of the MLM ratio and, separately, the entity-replacement objective on downstream performance?
Key Findings
| Relation Name | # of Candidates | # of Answers | BERT-base | BERT-large | GPT-2 | Ours |
|---|---|---|---|---|---|---|
| HasChild (P40) | 906 | 3.8 | 9.00 | 6.00 | 20.5 | 63.5 |
| NotableWork (P800) | 901 | 5.2 | 1.88 | 2.56 | 2.39 | 4.10 |
| CapitalOf (P36) | 820 | 2.2 | 1.87 | 1.55 | 15.8 | 49.1 |
| FoundedBy (P112) | 798 | 3.7 | 2.44 | 1.93 | 8.65 | 24.2 |
| Creator (P170) | 536 | 3.6 | 4.57 | 4.57 | 7.27 | 9.84 |
| PlaceOfBirth (P19) | 497 | 1.8 | 19.2 | 30.9 | 8.95 | 23.2 |
| LocatedIn (P131) | 382 | 1.9 | 13.2 | 52.5 | 21.0 | 61.1 |
| EducatedAt (P69) | 374 | 4.1 | 9.10 | 7.93 | 11.0 | 16.9 |
| PlaceOfDeath (P20) | 313 | 1.7 | 43.0 | 42.6 | 8.83 | 26.5 |
| Occupation (P106) | 190 | 1.4 | 8.58 | 10.7 | 9.17 | 10.7 |
| Average Hits@10 | - | - | 11.3 | 16.1 | 16.3 | 28.9 |
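The table reports Hits@10 for zero-shot fact completion. As a rough sketch of how such a score can be computed, the snippet below uses one common definition: a query counts as a hit if any gold answer appears among the model's top 10 ranked candidates. The summary does not spell out the exact protocol, so this definition and the helper names `hits_at_k` and `mean_hits_at_k` are assumptions.

```python
def hits_at_k(ranked_candidates, gold_answers, k=10):
    """Return 1.0 if any gold answer is in the top-k candidates, else 0.0."""
    gold = set(gold_answers)
    return 1.0 if any(c in gold for c in ranked_candidates[:k]) else 0.0


def mean_hits_at_k(queries, k=10):
    """Average hits@k over (ranked_candidates, gold_answers) pairs, in percent."""
    if not queries:
        raise ValueError("need at least one query")
    total = sum(hits_at_k(ranked, gold, k) for ranked, gold in queries)
    return 100.0 * total / len(queries)


if __name__ == "__main__":
    # Toy queries: candidate lists are assumed to be sorted by model score.
    toy_queries = [
        (["Paris", "Lyon", "Nice"], ["Paris"]),
        (["Rome", "Milan", "Turin"], ["Naples"]),
    ]
    print(mean_hits_at_k(toy_queries, k=2))
```

Because each relation has a different number of candidates and gold answers, scores are easier to compare across models within a row than across relations.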
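- WKLM achieves the best Hits@10 on 8 of the 10 fact-completion relations in zero-shot evaluation (counting the tie with BERT-large on Occupation); BERT remains stronger on PlaceOfBirth and PlaceOfDeath.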
- On open-domain QA, WKLM outperforms BERT on entity-related datasets by an average of 2.7 F1 points when ranking scores are not used; with ranking, it attains near state-of-the-art results on three datasets.
- On fine-grained entity typing (FIGER), WKLM sets a new state-of-the-art with accuracy 60.21, Ma-F1 81.99, Mi-F1 77.00.
- Ablation shows that combining the WKLM objective with MLM yields the best downstream performance; using too high an MLM masking ratio (15%) can hurt knowledge learning.
- WKLM requires no additional data processing or memory during fine-tuning and works with the original BERT architecture.
- Compared to ERNIE, WKLM provides larger absolute gains on FIGER, suggesting text-based knowledge extraction is effective without external KBs.
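The three FIGER numbers above (accuracy, Ma-F1, Mi-F1) measure different things. The sketch below uses the strict accuracy, loose macro-F1, and loose micro-F1 definitions that are conventional for fine-grained entity typing; the summary does not state which variant the paper uses, so treat the definitions and the `typing_metrics` helper as assumptions.

```python
def _f1(precision, recall):
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def typing_metrics(gold_sets, pred_sets):
    """Return (strict_accuracy, loose_macro_f1, loose_micro_f1).

    gold_sets, pred_sets: equal-length lists of sets of type labels,
    one set per entity mention.
    """
    n = len(gold_sets)
    if n == 0 or n != len(pred_sets):
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(gold_sets, pred_sets))

    # Strict accuracy: the predicted set must match the gold set exactly.
    strict = sum(g == p for g, p in pairs) / n

    # Loose macro: average per-mention precision and recall, then take F1.
    macro_p = sum(len(g & p) / len(p) if p else 0.0 for g, p in pairs) / n
    macro_r = sum(len(g & p) / len(g) if g else 0.0 for g, p in pairs) / n

    # Loose micro: pool label counts over all mentions before dividing.
    overlap = sum(len(g & p) for g, p in pairs)
    n_pred = sum(len(p) for _, p in pairs)
    n_gold = sum(len(g) for g, _ in pairs)
    micro_p = overlap / n_pred if n_pred else 0.0
    micro_r = overlap / n_gold if n_gold else 0.0

    return strict, _f1(macro_p, macro_r), _f1(micro_p, micro_r)


if __name__ == "__main__":
    gold = [{"/person", "/person/artist"}, {"/location"}]
    pred = [{"/person"}, {"/location"}]
    print(typing_metrics(gold, pred))
```

Strict accuracy gives no credit for partially correct label sets, which is one reason it tends to be noticeably lower than the two F1 scores on the same predictions.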
This summary was generated by AI and reviewed by human editors.