QUICK REVIEW

[Paper Review] Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

Claude Coulombe|arXiv (Cornell University)|Dec 5, 2018

Natural Language Processing Techniques|References: 7|Cited by: 36

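One-Sentence Summary

This paper proposes a practical, scalable text data augmentation framework that uses NLP Cloud APIs to overcome the "Big Data Wall" in natural language processing, particularly in low-resource settings. By applying techniques such as back-translation, syntactic tree transformation, and lexical replacement, the approach raised model accuracy by 4.3% to 21.6% on a text polarity classification task, even with an amplification factor of only 5.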

ABSTRACT

In practice, it is common to find oneself with far too little text data to train a deep neural network. This "Big Data Wall" represents a challenge for minority language communities on the Internet, organizations, laboratories and companies that compete with the GAFAM (Google, Amazon, Facebook, Apple, Microsoft). While most of the research effort in text data augmentation aims at the long-term goal of finding end-to-end learning solutions, which is equivalent to "using neural networks to feed neural networks", this engineering work focuses on the use of practical, robust, scalable and easy-to-implement data augmentation pre-processing techniques similar to those that are successful in computer vision. Several text augmentation techniques have been experimented with. Some existing ones, such as noise injection or the use of regular expressions, have been tested for comparison purposes. Others are modified or improved techniques like lexical replacement. Finally, more innovative ones, such as the generation of paraphrases using back-translation or by the transformation of syntactic trees, are based on robust, scalable, and easy-to-use NLP Cloud APIs. All the text augmentation techniques studied, with an amplification factor of only 5, increased the accuracy of the results in a range of 4.3% to 21.6%, with significant statistical fluctuations, on a standardized task of text polarity prediction. Some standard deep neural network architectures were tested: the multilayer perceptron (MLP), the long short-term memory recurrent network (LSTM) and the bidirectional LSTM (biLSTM). The classical XGBoost algorithm has also been tested, with improvements of up to 2.5%.

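Research Motivation and Objectives

Address the shortage of training data in natural language processing, especially for minority and low-resource languages.
Overcome the "Big Data Wall" that limits the performance of deep neural networks in low-data settings.
Develop a practical, scalable, and easy-to-implement text data augmentation pipeline built on external NLP APIs.
Evaluate the effectiveness of several text augmentation techniques on a standard text classification benchmark.
Show that API-based augmentation can substantially improve model accuracy without complex end-to-end training.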

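Proposed Method

Implement text augmentation techniques on top of NLP Cloud APIs to ensure robustness and scalability.
Apply back-translation through multilingual models to generate paraphrased sentences.
Use syntactic tree transformations to generate sentences that are semantically similar but structurally different.
Perform lexical replacement via word embeddings, substituting synonyms for the original words.
Include noise injection and regular-expression-based transformations as baselines for comparison.
Apply every augmentation technique to the training data with a consistent amplification factor of 5.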
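
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `translate` stands for any translation service wrapped as a plain function (its signature is hypothetical), the noise model is a simple adjacent-letter swap, and an amplification factor of 5 is read here as "each original example plus four generated variants".

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature for a wrapped translation service:
# translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]
Example = Tuple[str, int]  # (text, polarity label)


def back_translate(text: str, translate: Translator,
                   pivot: str = "fr", source: str = "en") -> str:
    """Paraphrase by translating to a pivot language and back."""
    return translate(translate(text, source, pivot), pivot, source)


def inject_noise(text: str, rng: random.Random, rate: float = 0.05) -> str:
    """Swap adjacent letters with probability `rate` (a simple typo model)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def augment(dataset: Sequence[Example],
            transforms: Sequence[Callable[[str], str]],
            factor: int = 5) -> List[Example]:
    """Keep each example and add `factor - 1` variants with the same label.

    Transforms are applied in round-robin order; labels are never changed.
    """
    if factor < 1 or not transforms:
        raise ValueError("need factor >= 1 and at least one transform")
    augmented: List[Example] = []
    for text, label in dataset:
        augmented.append((text, label))
        for k in range(factor - 1):
            transform = transforms[k % len(transforms)]
            augmented.append((transform(text), label))
    return augmented
```

In use, each technique is wrapped as a `str -> str` callable (for example `lambda t: back_translate(t, my_translate)`, where `my_translate` is whatever client function is available) and passed to `augment` together with the training set. Only the training split should be augmented, so that evaluation still measures accuracy on untouched text.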

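Experimental Results

Research Questions

RQ1: Can NLP Cloud APIs deliver text data augmentation for low-resource NLP tasks that is effective, scalable, and easy to deploy?
RQ2: How do different text augmentation techniques compare in improving model accuracy on a standard text classification task?
RQ3: With an amplification factor of only 5, how much does data augmentation improve deep learning model performance in low-data settings?
RQ4: Which combination of augmentation techniques yields the most consistent and significant accuracy gains?
RQ5: Can API-based augmentation outperform traditional methods such as noise injection or regular-expression transformations?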

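Key Findings

Text augmentation with NLP Cloud APIs raised model accuracy by 4.3% to 21.6% on a text polarity prediction task, even with an amplification factor of only 5.
Back-translation and syntactic tree transformation showed particularly marked improvements, indicating that they generate high-quality paraphrases.
Even simple techniques such as lexical replacement and noise injection produced measurable gains, although smaller than those of the more advanced methods.
The multilayer perceptron (MLP), LSTM, and bidirectional LSTM models all benefited from augmentation, with the biLSTM performing especially well.
When trained on augmented data, XGBoost also improved its accuracy by up to 2.5%, indicating that the approach applies across model families.
Statistical fluctuations in the gains point to differences in augmentation quality and task sensitivity, but the overall gains are significant and consistent.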
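
Because the reported gains come with significant statistical fluctuations, a gain is easier to interpret when it is summarized over several training runs instead of a single one. The sketch below is one way to do that, under assumptions that are not taken from the paper: accuracies are fractions in [0, 1], the baseline and augmented runs are paired (same seed or split), and the gain is expressed in percentage points. The function name `accuracy_gain` is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy_gain(baseline: Sequence[float],
                  augmented: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample std) of per-run gains in percentage points.

    baseline[i] and augmented[i] are accuracies from the same run i,
    trained without and with augmentation respectively.
    """
    if len(baseline) != len(augmented):
        raise ValueError("runs must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("need at least two runs to estimate spread")
    gains = [(a - b) * 100.0 for a, b in zip(augmented, baseline)]
    return mean(gains), stdev(gains)
```

A mean gain that is small relative to its standard deviation suggests the improvement may be noise for that model and technique; a larger number of runs narrows that uncertainty.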

This review was generated by AI and checked by human editors.