QUICK REVIEW

[Paper Review] Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus

Peter D. Turney, Michael L. Littman | ArXiv.org | Dec 8, 2002

Natural Language Processing Techniques | References: 7 | Cited by: 361

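One-Sentence Summary

This paper presents an unsupervised algorithm that learns the semantic orientation of words (positive or negative sentiment) from a Web corpus of roughly 100 billion words. By querying a search engine and applying pointwise mutual information (PMI) to the results, the method reaches 80% accuracy on 3,596 diverse words (adjectives, adverbs, nouns, and verbs). That accuracy is comparable to an earlier supervised method, while the approach covers a broader vocabulary and requires no manual annotation.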

ABSTRACT

The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. The method involves issuing queries to a Web search engine and using pointwise mutual information to analyse the results. The algorithm is empirically evaluated using a training corpus of approximately one hundred billion words -- the subset of the Web that is indexed by the chosen search engine. Tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80%. The 3,596 test words include adjectives, adverbs, nouns, and verbs. The accuracy is comparable with the results achieved by Hatzivassiloglou and McKeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives.

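Research Motivation and Goals

Develop a scalable unsupervised method that determines the semantic orientation of words (positive or negative sentiment) without labeled training data.
Extend semantic orientation detection beyond adjectives to nouns, verbs, and adverbs, broadening the coverage of earlier work.
Evaluate the method with a simple, efficient algorithm on a large Web corpus of roughly 100 billion words.
Show that high accuracy is achievable without supervised learning or complex feature engineering.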

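Proposed Method

The method issues Web search engine queries that pair a target word with a set of positive or negative anchor words (e.g., "excellent" or "poor").
From the co-occurrence frequencies in the search results, it computes the pointwise mutual information (PMI) between the target word and each anchor word.
Semantic orientation is given by the sign and magnitude of a score: the target word's PMI with the positive anchors minus its PMI with the negative anchors. A positive score indicates positive orientation and a negative score indicates negative orientation.
The algorithm aggregates PMI scores over multiple anchor words to improve robustness and reduce noise.
The method relies only on search engine query results (hit counts); it needs no manual annotation or linguistic preprocessing.
The method is applied to a corpus of roughly 100 billion words, namely the portion of the Web indexed by the chosen search engine.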
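
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the anchor lists, the smoothing constant, the `semantic_orientation` helper, and every hit count below are made-up assumptions, and a dictionary lookup stands in for real search engine queries.

```python
import math

# Hypothetical anchor words; the summary names only "excellent" and "poor".
POSITIVE_ANCHORS = ["excellent", "good"]
NEGATIVE_ANCHORS = ["poor", "bad"]

# Assumed smoothing constant so that a zero co-occurrence count
# does not produce log2(0).
SMOOTHING = 0.01


def pmi(cooccur_hits, word_hits, anchor_hits, total_docs):
    """Estimate PMI from hit counts: log2(p(w, a) / (p(w) * p(a)))."""
    joint = (cooccur_hits + SMOOTHING) / total_docs
    independent = (word_hits / total_docs) * (anchor_hits / total_docs)
    return math.log2(joint / independent)


def semantic_orientation(word, hits, cooccur, total_docs):
    """Sum of PMI with positive anchors minus sum of PMI with negative anchors."""
    def total_pmi(anchors):
        return sum(
            pmi(cooccur.get((word, a), 0), hits[word], hits[a], total_docs)
            for a in anchors
        )
    return total_pmi(POSITIVE_ANCHORS) - total_pmi(NEGATIVE_ANCHORS)


# Toy hit counts (invented for illustration, not real search results).
TOTAL_DOCS = 1_000_000
HITS = {
    "honest": 1000, "disturbing": 800,
    "excellent": 5000, "good": 20000, "poor": 4000, "bad": 15000,
}
COOCCUR = {
    ("honest", "excellent"): 60, ("honest", "good"): 200,
    ("honest", "poor"): 10, ("honest", "bad"): 40,
    ("disturbing", "excellent"): 5, ("disturbing", "good"): 30,
    ("disturbing", "poor"): 40, ("disturbing", "bad"): 150,
}

for w in ("honest", "disturbing"):
    score = semantic_orientation(w, HITS, COOCCUR, TOTAL_DOCS)
    label = "positive" if score > 0 else "negative"
    print(f"{w}: {score:+.2f} ({label})")
```

With these toy counts, "honest" receives a score above zero and "disturbing" a score below zero. Summing over several anchors means one noisy co-occurrence count has less influence on the final sign.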

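Experimental Results

Research Questions

RQ1: Can semantic orientation be learned accurately, without supervision, from a large-scale Web corpus?
RQ2: Does the method generalize beyond adjectives to other parts of speech, such as nouns, verbs, and adverbs?
RQ3: How does its performance compare with supervised methods that need extensive feature engineering and labeled data?
RQ4: Can pointwise mutual information capture sentiment polarity effectively using only search engine query results?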

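Key Findings

The algorithm reaches 80% accuracy on a test set of 3,596 words, of which 1,614 are positive and 1,982 are negative.
The method identifies semantic orientation across several parts of speech: adjectives, adverbs, nouns, and verbs.
Its accuracy is comparable to that of the complex four-stage supervised algorithm of Hatzivassiloglou and McKeown (1997), which is restricted to adjectives.
Applying PMI to search engine results offers a robust, scalable unsupervised alternative for sentiment analysis.
The method shows that large-scale unsupervised sentiment learning is feasible using only Web-scale query results.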
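
For context on the 80% figure, the class counts reported above imply a majority-class baseline, which the following arithmetic sketch works out. The baseline is derived here from the reported counts rather than quoted from the paper, and because the reported accuracy is rounded, the number of correctly classified words is only approximate.

```python
# Class counts as reported for the test set.
positive, negative = 1614, 1982
total = positive + negative  # 3,596 test words

# Accuracy of always guessing the larger class (negative).
majority_baseline = max(positive, negative) / total

# The reported accuracy is rounded, so this count is approximate.
reported_accuracy = 0.80
approx_correct = round(reported_accuracy * total)

print(f"total words: {total}")
print(f"majority-class baseline: {majority_baseline:.1%}")
print(f"approx. words classified correctly: {approx_correct}")
```

Always predicting "negative" would be right for about 55% of the words, so the reported accuracy sits roughly 25 percentage points above that trivial strategy.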


This review was generated by AI and reviewed by human editors.