QUICK REVIEW

[Paper Review] Efficient Vector Representation for Documents through Corruption

Minmin Chen|arXiv (Cornell University)|Jul 8, 2017

Topic Modeling|References: 29|Cited by: 78

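One-Sentence Summary

Doc2VecC represents a document as the average of its word embeddings, which are learned under a corruption-based regularization. This yields fast, scalable document representations that perform well on sentiment analysis, classification, and semantic relatedness.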

ABSTRACT

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.

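Motivation and Objectives

Develop an efficient document representation method that improves on BoW and earlier neural approaches.
Propose a simple averaging-based document vector combined with a corruption mechanism.
Show that the corruption mechanism acts as a data-dependent regularizer that favors informative words.
Demonstrate competitive or better performance on sentiment analysis, classification, and semantic relatedness tasks.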

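Proposed Method

Represent each document as the average of its word embeddings, which are learned jointly with the local context.
Introduce a corruption (dropout) mechanism that randomly removes words during learning and rescales the remaining components so that the corrupted representation stays unbiased.
Model the probability of the target word as P(w|c, x̃), conditioning on both the local context c and the corrupted global document context x̃, and optimize it with negative sampling.
Apply a Taylor expansion around the mean of the corruption distribution to derive a data-dependent regularization term that suppresses common, uninformative words.
Train the projection matrices U and V in a Word2Vec-like fashion, which keeps both training and test-time inference efficient.
Represent an unseen document by simply averaging the learned embeddings of its words.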
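
The training-time steps listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the author's implementation: the corruption rate q, the document length T, the (dim, vocab_size) layout of U and V, the function names, and the toy data are all introduced here for illustration. It shows a word-count vector being corrupted and rescaled by 1/(1-q), the hidden vector formed from the local context plus the averaged corrupted document context, and a negative-sampling loss for one target word.

```python
import numpy as np

def corrupt(x, q, rng):
    """Zero each entry of the count vector x with probability q (assumed rate)
    and scale the survivors by 1/(1-q), so that E[corrupt(x)] equals x."""
    keep = rng.random(x.shape) >= q  # each entry survives with probability 1 - q
    return np.where(keep, x / (1.0 - q), 0.0)

def hidden_vector(U, context_ids, x_tilde, T):
    """Local context plus global document context: U c + (1/T) U x_tilde."""
    local_part = U[:, context_ids].sum(axis=1)   # sum of context word embeddings
    global_part = (U @ x_tilde) / T              # corrupted document average
    return local_part + global_part

def negative_sampling_loss(V, h, target_id, negative_ids):
    """Negative-sampling loss for one target word and a few sampled negatives."""
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)            # numerically stable log(sigmoid(z))
    pos = log_sigmoid(V[:, target_id] @ h)
    neg = log_sigmoid(-(V[:, negative_ids].T @ h)).sum()
    return -(pos + neg)

# Toy setup (hypothetical sizes and values).
rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U = rng.normal(size=(dim, vocab_size))  # input projection: one column per word
V = rng.normal(size=(dim, vocab_size))  # output projection: one column per word
x = np.array([2.0, 0.0, 1.0, 1.0, 0.0, 3.0])  # word counts of one toy document
T = x.sum()                                    # document length

x_tilde = corrupt(x, q=0.5, rng=rng)
h = hidden_vector(U, context_ids=[0, 2], x_tilde=x_tilde, T=T)
loss = negative_sampling_loss(V, h, target_id=3, negative_ids=[1, 4])
print("corrupted counts:", x_tilde)
print("loss:", loss)
```

Because each surviving count is divided by 1-q, averaging many corrupted copies of x recovers x itself, which is the sense in which the corruption is unbiased.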

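Experimental Results

Research Questions

RQ1: Does a simple average of word embeddings learned with a corruption-based objective produce high-quality document representations?
RQ2: Does the corruption mechanism act as a data-dependent regularizer that improves performance and training speed?
RQ3: How does Doc2VecC compare with state-of-the-art document representations on sentiment analysis, classification, and semantic relatedness?
RQ4: Is representation generation at test time efficient when averaged word embeddings are used?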

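Key Findings

Doc2VecC is competitive with or better than Paragraph Vectors and other baselines on sentiment analysis, classification, and semantic relatedness.
Training is fast and scales to large corpora, and a test-time representation requires only averaging word embeddings.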
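The corruption mechanism acts as a data-dependent regularizer that penalizes the embeddings of common but non-discriminative words, and it reduces the amount of computation during training.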
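Empirically, the word embeddings produced by Doc2VecC are no longer dominated by stop words and are more informative for downstream tasks.
Word analogy and semantic relatedness tasks show that Doc2VecC embeddings outperform Word2Vec embeddings in many settings, especially on larger corpora.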
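
Test-time inference by averaging, as noted in the findings above, can be sketched as follows. The vocabulary, the embedding values, and the helper name embed_document are hypothetical, and skipping out-of-vocabulary tokens is an assumption of this sketch rather than something the summary specifies. The near-zero column for "the" imitates a common word whose embedding has been shrunk toward zero.

```python
import numpy as np

def embed_document(tokens, word_to_id, U):
    """Average the learned embeddings of the in-vocabulary tokens.
    Repeated tokens are counted once per occurrence."""
    ids = [word_to_id[t] for t in tokens if t in word_to_id]
    if not ids:
        return np.zeros(U.shape[0])   # no known words: fall back to a zero vector
    return U[:, ids].mean(axis=1)

# Hypothetical 2-dimensional embeddings, one column per word.
word_to_id = {"the": 0, "movie": 1, "great": 2, "awful": 3}
U = np.array([
    [0.01, 0.30, 1.00, -1.00],
    [-0.02, 0.10, 0.80, -0.70],
])

# "was" is not in the toy vocabulary and is skipped.
doc_vec = embed_document(["the", "movie", "was", "great"], word_to_id, U)
print("document vector:", doc_vec)
```

No gradient steps or per-document parameters are involved here, so the cost of embedding an unseen document is a single pass over its tokens.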


This review was generated by AI and reviewed by human editors.