QUICK REVIEW

[Paper Review] Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

Nils Reimers, Iryna Gurevych | arXiv (Cornell University) | Jul 21, 2017

Topic Modeling | References: 32 | Cited by: 264

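One-Sentence Summary

This paper systematically analyzes the hyperparameters of BiLSTM-based sequence labeling across five NLP tasks, identifies which settings matter most, and gives recommendations for a robust configuration.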

ABSTRACT

Selecting optimal parameters for a neural network architecture can often make the difference between mediocre and state-of-the-art performance. However, little is published which parameters and design choices should be evaluated or selected making the correct hyperparameter optimization often a "black art that requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate the importance of different network design choices and hyperparameters for five common linguistic sequence tagging tasks (POS, Chunking, NER, Entity Recognition, and Event Detection). We evaluated over 50.000 different setups and found, that some parameters, like the pre-trained word embeddings or the last layer of the network, have a large impact on the performance, while other parameters, for example the number of LSTM layers or the number of recurrent units, are of minor importance. We give a recommendation on a configuration that performs well among different tasks.

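Motivation and Objectives

Determine which hyperparameters and architectural extensions have the largest effect on BiLSTM-based sequence labeling performance.
Quantify the impact of design choices across five tasks (POS, Chunking, NER, Entities, Events).
Provide practical recommendations for configuring BiLSTM-CRF models that hold up across tasks.
Assess robustness to random seeds and to multi-task learning setups.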

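Proposed Method

Evaluate more than 50,000 BiLSTM network configurations across the five sequence labeling tasks.
Compare the BiLSTM-CRF, BiLSTM-CNN-CRF, and BiLSTM-LSTM-CRF architectures.
Systematically vary hyperparameters, including word embeddings, character representation, optimizer, gradient handling, tagging scheme, dropout, number of layers, and number of units.
Use random sampling to assess robustness, and compare options with statistical tests.
Report results through descriptive statistics, violin plots, and median/difference analyses.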
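
The random-sampling step listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `SEARCH_SPACE` dictionary, its option values, and the helper names `sample_config` and `sample_configs` are all assumptions introduced here, and only the option categories come from the summary.

```python
import random

# Hypothetical search space. The categories follow the summary above;
# the concrete option values are illustrative assumptions.
SEARCH_SPACE = {
    "classifier": ["softmax", "crf"],
    "char_representation": ["none", "cnn", "lstm"],
    "optimizer": ["sgd", "adagrad", "adadelta", "rmsprop", "adam", "nadam"],
    "tagging_scheme": ["iob", "bio", "iobes"],
    "dropout": ["none", "naive", "variational"],
    "num_layers": [1, 2, 3],
    "units_per_layer": [25, 50, 75, 100, 125],
}


def sample_config(rng):
    """Draw one configuration, sampling every option independently and uniformly."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def sample_configs(n, seed=0):
    """Draw n configurations from a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n)]


configs = sample_configs(5, seed=42)
for config in configs:
    print(config)
```

Each sampled configuration would then be trained and scored. Comparing the score distributions of two options for one hyperparameter, with every other setting left random, is what lets the study separate the effect of a single choice from the rest of the setup.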

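Experimental Results

Research Questions

RQ1: Which hyperparameters have the largest effect on performance in common sequence labeling tasks?
RQ2: Do architectural extensions (CRF classifier, character representations) consistently improve performance, and under what conditions?
RQ3: Which practical rules of thumb emerge for configuring BiLSTM-based sequence labeling that is robust across domains and languages?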

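Key Findings

Pre-trained word embeddings consistently yield the best performance across tasks, and the choice of embedding has a significant effect on results (for example, a median difference of about 4.97 percentage points on POS when comparing options).
Two stacked BiLSTM layers usually perform best when the total number of recurrent units is kept moderate; the number of units itself has only a minor effect.
Nadam, the Adam variant with Nesterov momentum, usually gives the highest performance and the fastest convergence; SGD frequently fails to converge.
Gradient normalization with a threshold of about 1 significantly improves test performance, whereas gradient clipping brings no consistent benefit.
On tasks with strong label dependencies, a CRF classifier as the last layer usually improves test performance over Softmax; BIO tagging outperforms IOB, and IOBES offers no clear general advantage.
Variational dropout applied to both the output and the recurrent units works better than no dropout or naive dropout; about 100 recurrent units per LSTM network is a practical rule of thumb.
Multi-task learning helps mainly when the tasks are linguistically similar; otherwise single-task setups are usually better, and in some cases task-specific LSTM layers can be beneficial.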
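
The difference between gradient normalization and gradient clipping in the findings above is easy to miss, so a small sketch may help. It assumes the usual definitions: clipping caps each gradient component separately, while normalization rescales the whole vector when its L2 norm exceeds the threshold. The function names `clip_gradient` and `normalize_gradient` are hypothetical and belong to no particular library.

```python
import math


def clip_gradient(grad, threshold):
    """Gradient clipping: cap each component to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]


def normalize_gradient(grad, threshold):
    """Gradient normalization: rescale the vector if its L2 norm exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)  # small gradients pass through unchanged
    scale = threshold / norm
    return [g * scale for g in grad]


grad = [3.0, 4.0]  # L2 norm is 5.0

clipped = clip_gradient(grad, 1.0)
normalized = normalize_gradient(grad, 1.0)

print(clipped)     # [1.0, 1.0]: the direction of the update has changed
print(normalized)  # roughly [0.6, 0.8]: same direction, norm reduced to 1
```

Under these definitions, normalization keeps the direction of the update and only bounds its length, whereas element-wise clipping can change the direction. That is one plausible reading of why the two behave differently in the reported results.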

This review was generated by AI and reviewed by human editors.