QUICK REVIEW

[Paper Review] Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Jie Yang, Yue Zhang|arXiv (Cornell University)|Oct 30, 2018

Natural Language Processing Techniques | Cited by 36

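One-Sentence Summary

This paper proposes a lattice long short-term memory (LSTM) network with subword encoding for Chinese word segmentation, combining character-level features with subword- or word-level subsequences through gated shortcut paths. Experiments show that subword encoding performs comparably to word encoding without relying on an external segmentor, and that in controlled ablation experiments the lexicon contributes more than the pretrained embeddings.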

ABSTRACT

We investigate a lattice LSTM network for Chinese word segmentation (CWS) to utilize words or subwords. It integrates the character sequence features with all subsequences information matched from a lexicon. The matched subsequences serve as information shortcut tunnels which link their start and end characters directly. Gated units are used to control the contribution of multiple input links. Through formula derivation and comparison, we show that the lattice LSTM is an extension of the standard LSTM with the ability to take multiple inputs. Previous lattice LSTM model takes word embeddings as the lexicon input, we prove that subword encoding can give the comparable performance and has the benefit of not relying on any external segmentor. The contribution of lattice LSTM comes from both lexicon and pretrained embeddings information, we find that the lexicon information contributes more than the pretrained embeddings information through controlled experiments. Our experiments show that the lattice structure with subword encoding gives competitive or better results with previous state-of-the-art methods on four segmentation benchmarks. Detailed analyses are conducted to compare the performance of word encoding and subword encoding in lattice LSTM. We also investigate the performance of lattice LSTM structure under different circumstances and when this model works or fails.

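Motivation and Objectives

Investigate how effective subword encoding is in a lattice LSTM for Chinese word segmentation (CWS), avoiding any dependence on an external segmentor.
Compare lattice LSTMs based on subword encoding with those based on conventional word embeddings in terms of performance and robustness.
Analyze the relative contributions of lexicon information and pretrained embeddings in the lattice LSTM model.
Evaluate how subword/word coverage affects model performance across datasets.
Identify failure cases and analyze the limitations of the lattice LSTM gating mechanism.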

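Proposed Method

The lattice LSTM structure extends the standard LSTM with gated shortcut paths that link the start and end characters of each subsequence (word or subword) matched in the lexicon.
The byte pair encoding (BPE) algorithm is used to generate the subwords and their embeddings, removing the dependence on a pre-segmented corpus.
The final hidden state is computed as a weighted sum of the character LSTM output and all gated shortcut paths, with the gates controlling each path's contribution.
The model is trained end to end on character sequences; the lattice paths are built dynamically by matching the input sentence against a subword- or word-level lexicon.
Controlled experiments isolate the contributions of the lexicon and of the pretrained embeddings by training models with and without each component.
Case studies analyze the failure modes of the word-based and subword-based lattice models to assess the robustness of the gating mechanism.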
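
The lexicon matching and gated combination described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `match_lexicon` and `gated_merge`, the toy lexicon, the `max_len` limit, the use of scalars instead of vectors, and the softmax normalization of the gate scores are all assumptions made for illustration.

```python
import math


def match_lexicon(sentence, lexicon, max_len=4):
    """Return (start, end, piece) for every multi-character substring in the lexicon.

    `end` is inclusive, so each span is a shortcut path linking the character
    at `start` directly to the character at `end`.
    """
    spans = []
    for start in range(len(sentence)):
        for length in range(2, max_len + 1):
            end = start + length - 1
            if end >= len(sentence):
                break
            piece = sentence[start:end + 1]
            if piece in lexicon:
                spans.append((start, end, piece))
    return spans


def gated_merge(char_cell, path_cells, char_gate, path_gates):
    """Merge the character-level cell value with cells arriving via shortcut paths.

    Assumption: raw gate scores are normalized with a softmax so that the
    weights over all inputs sum to 1. Scalars stand in for vectors here.
    """
    scores = [char_gate] + list(path_gates)
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    cells = [char_cell] + list(path_cells)
    return sum(w * c for w, c in zip(weights, cells))


# Hypothetical toy lexicon and sentence.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
sentence = "南京市长江大桥"
spans = match_lexicon(sentence, lexicon)

# Shortcut paths that end at the last character and must be merged there.
paths_into_last = [s for s in spans if s[1] == len(sentence) - 1]

# With equal gate scores the merge reduces to a plain average of the inputs.
merged = gated_merge(1.0, [2.0, 3.0], 0.0, [0.0, 0.0])
```

In this toy sentence two paths (`长江大桥` and `大桥`) end at the final character, so that position receives three inputs: its own character-level candidate and one cell per path. The noisy match `市长` also appears among the spans, which is the kind of input the gates are expected to down-weight.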

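Experimental Results

Research Questions

RQ1: Can subword encoding in a lattice LSTM match the performance of word encoding on Chinese word segmentation?
RQ2: Does subword encoding remove the dependence on an external segmentor when building the lattice LSTM lexicon?
RQ3: How do the contributions of lexicon information and pretrained embeddings compare in a lattice LSTM for Chinese word segmentation?
RQ4: To what extent does subword/word coverage affect the performance gains of the lattice LSTM model?
RQ5: In which scenarios does the gated lattice LSTM fail, and why?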

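Key Findings

On four Chinese word segmentation benchmarks, subword encoding in the lattice LSTM achieves state-of-the-art or competitive performance, matching or exceeding the word-based model.
Despite lower word coverage, the lattice LSTM with subword encoding still outperforms the word-based model on the MSR and Weibo datasets, suggesting that subword embeddings are more robust in low-coverage settings.
Controlled experiments show that lexicon information contributes more to model performance than pretrained embeddings, highlighting the importance of domain-specific lexicons.
Higher subword/word coverage consistently yields larger performance gains; on the PKU and MSR datasets, coverage above 90% markedly reduces the error rate.
Case studies show that the gating mechanism is effective but not infallible: the word-based model fails when it encounters noisy matches (such as "性日"), while the subword model fails when key subwords are missing or ambiguous.
The lattice LSTM structure performs best when both coverage and embedding quality are high, and it shows strong potential for cross-domain sequence labeling tasks when a domain-specific lexicon is used.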
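
To make the notion of coverage in the findings above concrete, here is a minimal Python sketch. The definition used here (the fraction of multi-character gold words that appear in the lexicon) is an assumption for illustration; the paper may define coverage differently. The function name `lexicon_coverage` and the toy data are hypothetical.

```python
def lexicon_coverage(gold_words, lexicon):
    """Fraction of multi-character gold words that are present in the lexicon.

    Assumption: single-character words are skipped, since they need no
    shortcut path in a lattice built over characters.
    """
    multi = [w for w in gold_words if len(w) > 1]
    if not multi:
        return 0.0
    hits = sum(1 for w in multi if w in lexicon)
    return hits / len(multi)


# Hypothetical gold segmentation and lexicon.
gold = ["南京市", "长江大桥", "的", "设计师"]
lexicon = {"南京市", "长江大桥", "大桥"}
coverage = lexicon_coverage(gold, lexicon)
```

Here two of the three multi-character gold words are in the lexicon, so the coverage is 2/3 under this definition.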


This review was generated by AI and reviewed by human editors.