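[Paper Review] TENER: Adapting Transformer Encoder for Named Entity Recognition
TENER strengthens the attention mechanism for NER with direction- and distance-aware relative positional encoding, uses un-scaled attention, and adds a Transformer-based character encoder, reaching state-of-the-art results on six datasets without relying on contextual pre-trained embeddings.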
The Bidirectional long short-term memory networks (BiLSTM) have been widely used as an encoder in models solving the named entity recognition (NER) task. Recently, the Transformer is broadly adopted in various Natural Language Processing (NLP) tasks owing to its parallelism and advantageous performance. Nevertheless, the performance of the Transformer in NER is not as good as it is in other NLP tasks. In this paper, we propose TENER, a NER architecture adopting adapted Transformer Encoder to model the character-level features and word-level features. By incorporating the direction and relative distance aware attention and the un-scaled attention, we prove the Transformer-like encoder is just as effective for NER as other NLP tasks.
Research Motivation and Objectives
- Motivate the use of Transformer-based encoders for NER and identify why vanilla Transformers underperform in NER.
- Propose adaptations: direction- and distance-aware relative positional encoding and un-scaled attention for NER.
- Integrate a Transformer-based character encoder with a word-level Transformer encoder for robust word representations.
- Evaluate the adapted Transformer (AdaTrans) on multiple English and Chinese NER datasets and compare to BiLSTM-based models.
Proposed Method
- Use an adapted Transformer encoder with direction- and distance-aware attention based on relative positional encodings.
- Replace the classic scaled dot-product attention with un-scaled, sharper attention to induce sparsity in context selection.
- Incorporate relative positional encoding R_{t-j} and learnable biases (u, v) to capture distance and direction in attention.
- Apply the Transformer encoder to both word-level and character-level representations (AdaTrans for both).
- Concatenate character features from the encoder with pre-trained word embeddings to form word representations.
- Use a CRF layer on top to model label dependencies and decode with Viterbi.
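The attention described in the list above can be sketched for a single head as follows. This is a minimal NumPy illustration, not the authors' implementation: the sinusoidal form of R_{t-j}, the choice to leave the keys unprojected, and all names (`relative_sinusoid`, `relative_unscaled_attention`, `W_q`, `W_v`) are assumptions made for the example. The score for query position t and key position j is the sum of four terms, Q_t·K_j + Q_t·R_{t-j} + u·K_j + v·R_{t-j}, and the softmax is applied without dividing by sqrt(d).

```python
import numpy as np

def relative_sinusoid(offsets, d):
    """Embed signed offsets t - j as [sin | cos] features of width d (d even).

    The sin half is odd in the offset, so +k and -k get different vectors,
    which is what lets the attention tell left context from right context.
    """
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    angles = offsets[..., None] * inv_freq                # (..., d/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_unscaled_attention(H, W_q, W_v, u, v):
    """Single-head attention with relative positions and no 1/sqrt(d) scaling."""
    n, d = H.shape
    Q = H @ W_q
    K = H                                   # assumption: keys are not projected
    V = H @ W_v
    t = np.arange(n)
    offsets = t[:, None] - t[None, :]       # offsets[t, j] = t - j
    R = relative_sinusoid(offsets, d)       # (n, n, d)
    content = (Q + u) @ K.T                 # Q_t.K_j + u.K_j
    position = np.einsum("td,tjd->tj", Q + v, R)  # Q_t.R_{t-j} + v.R_{t-j}
    weights = softmax(content + position, axis=-1)  # scores used un-scaled
    return weights @ V, weights

# Toy usage with random inputs (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d)) * 0.1
W_v = rng.normal(size=(d, d)) * 0.1
u = rng.normal(size=d) * 0.1
v = rng.normal(size=d) * 0.1
out, weights = relative_unscaled_attention(H, W_q, W_v, u, v)
print(out.shape, weights.shape)
```

In a full model, `u` and `v` would be learned parameters, the computation would run over several heads, and the encoder output would feed the CRF layer; none of that is shown here.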
Experimental Results
Research Questions
- RQ1: Can the Transformer encoder be adapted to reach NER performance comparable to or exceeding that of BiLSTM-based encoders?
- RQ2: Does direction- and distance-aware relative positional encoding improve NER performance over the vanilla Transformer in multiple languages?
- RQ3: Does un-scaled dot-product attention yield sharper, more effective attention for NER tasks?
- RQ4: Is a Transformer-based character encoder beneficial for capturing subword patterns and alleviating the OOV problem in NER?
- RQ5: How does AdaTrans perform across English and Chinese NER datasets compared to previous state-of-the-art models?
Key Findings
| Model | CoNLL2003 F1 | OntoNotes 5.0 F1 |
|---|---|---|
| BiLSTM-CRF (baseline) | 88.83 | - |
| Transformer | 89.57 | 86.73 |
| TENER (Ours) | 91.33 | 88.43 |
| TENER w/ scaled attention | 91.06 | 87.94 |
| TENER w/ CNN character encoder | 91.45 | 88.25 |
| TENER with ELMo | 92.62 | 89.78 |
- TENER significantly boosts Transformer performance for NER over vanilla Transformer and can surpass BiLSTM-based models on several datasets.
- Using direction- and distance-aware relative positional encoding plus un-scaled attention yields substantial gains; scaled attention degrades performance.
- AdaTrans improves both character- and word-level encoding, achieving state-of-the-art results on six datasets without contextual pre-trained embeddings. On English CoNLL2003 and OntoNotes 5.0, TENER attains 91.33 and 88.43 F1 respectively (non-contextual embeddings).
- Among setups without contextual embeddings, un-scaled attention performs best: the CNN character encoder variant scores highest on CoNLL2003 (91.45), while the Transformer character encoder scores highest on OntoNotes 5.0 (88.43). Adding the scaling factor lowers F1 on both datasets (91.06 and 87.94).
- With ELMo embeddings, TENER further improves to 92.62 (CoNLL2003) and 89.78 (OntoNotes 5.0).
- TENER converges as fast as BiLSTM on the OntoNotes 5.0 development set and outperforms the vanilla Transformer in convergence speed.
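The "sharper attention" effect behind the scaling result can be illustrated with a small numerical sketch. It uses random vectors rather than anything from the paper's experiments, and the dimension `d_k = 64` and the ten candidate keys are assumptions chosen for the example. Dividing the scores by sqrt(d_k) acts like raising the softmax temperature, so the scaled distribution is flatter (higher entropy) and the un-scaled one puts more mass on its top-scoring position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; the small constant guards against log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
d_k = 64                          # hypothetical head dimension
q = rng.normal(size=d_k)          # one query vector
K = rng.normal(size=(10, d_k))    # ten candidate key vectors

scores = K @ q
p_unscaled = softmax(scores)                 # un-scaled attention weights
p_scaled = softmax(scores / np.sqrt(d_k))    # classic scaled attention weights

print("entropy un-scaled:", entropy(p_unscaled))
print("entropy scaled:   ", entropy(p_scaled))
print("max weight un-scaled:", p_unscaled.max())
print("max weight scaled:   ", p_scaled.max())
```

This only shows why removing the scale concentrates attention on fewer positions; whether that helps a given task is an empirical question, which the table above answers for the two English datasets.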
This review was generated by AI and checked by a human editor.