QUICK REVIEW

[Paper Review] Sentiment Analysis of German Twitter

Wladimir Sidorenko|arXiv (Cornell University)|Jan 1, 2019
Sentiment Analysis and Opinion Mining | References: 155 | Cited by: 4
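One-Sentence Summary

This thesis introduces a large-scale, manually annotated sentiment corpus of German tweets and proposes new methods for sentiment analysis of German social media. It advances sentiment lexicon generation, fine-grained opinion mining with enhanced conditional random fields (CRFs), message-level classification with lexicon-based attention, and discourse-aware analysis with latent-marginalized CRFs and the Recursive Dirichlet Process, achieving state-of-the-art performance on German Twitter sentiment analysis tasks.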

ABSTRACT

The immense popularity of online communication services in the last decade has not only upended our lives (with news spreading like wildfire on the Web, presidents announcing their decisions on Twitter, and the outcome of political elections being determined on Facebook) but also dramatically increased the amount of data exchanged on these platforms. Therefore, if we wish to understand the needs of modern society better and want to protect it from new threats, we urgently need more robust, higher-quality natural language processing (NLP) applications that can recognize such necessities and menaces automatically, by analyzing uncensored texts. Unfortunately, most NLP programs today have been created for standard language, as we know it from newspapers, or, in the best case, adapted to the specifics of English social media. This thesis reduces the existing deficit by entering the new frontier of German online communication and addressing one of its most prolific forms—users’ conversations on Twitter. In particular, it explores the ways and means by how people express their opinions on this service, examines current approaches to automatic mining of these feelings, and proposes novel methods, which outperform state-of-the-art techniques. For this purpose, I introduce a new corpus of German tweets that have been manually annotated with sentiments, their targets and holders, as well as lexical polarity items and their contextual modifiers. Using these data, I explore four major areas of sentiment research: (i) generation of sentiment lexicons, (ii) fine-grained opinion mining, (iii) message-level polarity classification, and (iv) discourse-aware sentiment analysis. In the first task, I compare three popular groups of lexicon generation methods: dictionary-, corpus-, and word-embedding–based ones, finding that dictionary-based systems generally yield better polarity lists than the last two groups. 
Apart from this, I propose a linear projection algorithm, whose results surpass many existing automatically-generated lexicons. Afterwards, in the second task, I examine two common approaches to automatic prediction of sentiment spans, their sources, and targets: conditional random fields (CRFs) and recurrent neural networks, obtaining higher scores with the former model and improving these results even further by redefining the structure of CRF graphs. When dealing with message-level polarity classification, I juxtapose three major sentiment paradigms: lexicon-, machine-learning–, and deep-learning–based systems, and try to unite the first and last of these method groups by introducing a bidirectional neural network with lexicon-based attention. Finally, in order to make the new classifier aware of microblogs' discourse structure, I let it separately analyze the elementary discourse units of each tweet and infer the overall polarity of a message from the scores of its EDUs with the help of two new approaches: latent-marginalized CRFs and Recursive Dirichlet Process.
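
The abstract names a linear projection algorithm for lexicon generation but does not spell it out. As a rough illustration only, one simple way to read "linear projection" is to project word vectors onto an axis running from the centroid of negative seed words to the centroid of positive seed words. The sketch below assumes exactly that; the 2-D vectors, the seed lists, the threshold, and all function names are hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical 2-D word vectors; real embeddings would have far more dimensions.
EMBEDDINGS = {
    "gut": [0.9, 0.2],
    "super": [0.8, 0.4],
    "schlecht": [-0.7, 0.3],
    "furchtbar": [-0.9, 0.1],
    "toll": [0.7, 0.1],
    "mies": [-0.6, 0.2],
    "Tisch": [0.05, 0.9],
}
POSITIVE_SEEDS = ["gut", "super"]          # assumed seed words
NEGATIVE_SEEDS = ["schlecht", "furchtbar"]  # assumed seed words


def centroid(words):
    """Component-wise mean of the vectors of the given words."""
    vectors = [EMBEDDINGS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def polarity_axis(positive, negative):
    """Return a unit vector pointing from the negative to the positive
    seed centroid, plus the midpoint between the two centroids."""
    pos, neg = centroid(positive), centroid(negative)
    diff = [p - n for p, n in zip(pos, neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    axis = [d / norm for d in diff]
    midpoint = [(p + n) / 2 for p, n in zip(pos, neg)]
    return axis, midpoint


def polarity_score(word, axis, midpoint):
    """Signed length of the word's projection onto the polarity axis,
    measured from the midpoint (positive values lean positive)."""
    vector = EMBEDDINGS[word]
    return sum((v - m) * a for v, m, a in zip(vector, midpoint, axis))


def build_lexicon(threshold=0.3):
    """Label every non-seed word whose projection is far enough from zero."""
    axis, midpoint = polarity_axis(POSITIVE_SEEDS, NEGATIVE_SEEDS)
    seeds = set(POSITIVE_SEEDS) | set(NEGATIVE_SEEDS)
    lexicon = {}
    for word in EMBEDDINGS:
        if word in seeds:
            continue
        score = polarity_score(word, axis, midpoint)
        if abs(score) >= threshold:
            lexicon[word] = "positive" if score > 0 else "negative"
    return lexicon


print(build_lexicon())
```

With these toy vectors, "toll" and "mies" land on opposite sides of the axis, while the neutral noun "Tisch" projects close to the midpoint and is left out of the lexicon.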

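Research Motivation and Objectives

  • Address the lack of high-quality, manually annotated data for sentiment analysis of German social media.
  • Develop and evaluate new sentiment analysis methods for German Twitter, focusing on lexicon generation, opinion mining, message-level classification, and discourse-aware analysis.
  • Create a comprehensive resource for training and evaluating German natural language processing systems in a low-resource, informal-language setting.
  • Improve performance on sentiment analysis tasks by integrating linguistic structure, contextual modifiers, and discourse-aware modeling.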

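Proposed Methods

  • Introduces a new, manually annotated corpus of German tweets with sentiment labels, targets, holders, and lexical polarity items.
  • Compares dictionary-based, corpus-based, and word-embedding-based lexicon generation methods, finds in favor of the dictionary-based ones, and proposes a linear projection algorithm.
  • Uses conditional random fields (CRFs) with a restructured graph topology to improve fine-grained opinion mining.
  • Introduces a bidirectional neural network with lexicon-based attention for message-level sentiment classification.
  • Applies latent-marginalized CRFs and the Recursive Dirichlet Process to model discourse structure and infer the overall polarity of a tweet from its elementary discourse units (EDUs).
  • Performs inference in linear-chain, semi-Markov, and tree-structured CRFs with belief propagation and Viterbi decoding, with modifications to the computation of the α and β scores.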
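
For readers unfamiliar with the inference step in the last item, the following is a minimal sketch of textbook Viterbi decoding and the forward (α) recursion for a linear-chain model, written in log space. It covers only the standard linear-chain case, not the semi-Markov or tree-structured variants, and not the thesis's modified α and β computations. The two labels and every score below are made-up toy values.

```python
import math

LABELS = ["O", "SENT"]  # hypothetical tag set: outside / inside a sentiment span

# Toy log-scores (assumed values): EMISSIONS[t][y] scores label y at token t,
# TRANSITIONS[p][y] scores moving from label p to label y.
EMISSIONS = [[1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
TRANSITIONS = [[0.5, 0.0], [0.0, 0.5]]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def viterbi(emissions, transitions):
    """Return (score, path) of the highest-scoring label sequence."""
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for y in range(n_labels):
            # Best previous label for ending in y at this position.
            best_prev = max(range(n_labels),
                            key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emit[y])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda y: scores[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


def log_partition(emissions, transitions):
    """Forward (alpha) recursion: log of the total score of all label paths."""
    n_labels = len(transitions)
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [
            logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)]) + emit[y]
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


best_score, best_path = viterbi(EMISSIONS, TRANSITIONS)
print([LABELS[y] for y in best_path])  # ['O', 'SENT', 'SENT']
# Log-probability of the best path under the model.
print(best_score - log_partition(EMISSIONS, TRANSITIONS))
```

Viterbi replaces the sum in the forward recursion with a max, which is why the two functions share the same loop structure; the difference between the best path's score and the log-partition value gives that path's log-probability.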

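Experimental Results

Research Questions

  • RQ1: Which of the dictionary-based, corpus-based, and word-embedding-based methods produces the most reliable sentiment lexicon for German Twitter?
  • RQ2: Can CRF models with a restructured graph topology improve fine-grained opinion mining on German tweets?
  • RQ3: How does integrating lexicon-based attention into a bidirectional neural network affect message-level sentiment classification?
  • RQ4: To what extent does modeling discourse structure improve sentiment classification of microblogs?
  • RQ5: Can latent-marginalized CRFs and the Recursive Dirichlet Process effectively model discourse-aware sentiment inference in tweets?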

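Key Findings

  • Dictionary-based lexicon generation methods yield better polarity lists than corpus-based and word-embedding-based methods.
  • The proposed linear projection algorithm surpasses many existing automatically generated lexicons.
  • CRF models with a restructured graph topology outperform standard CRFs and RNNs on fine-grained opinion mining.
  • A bidirectional neural network with lexicon-based attention improves message-level sentiment classification by combining the strengths of lexicon-based and deep-learning approaches.
  • Latent-marginalized CRFs and the Recursive Dirichlet Process strengthen discourse-aware sentiment analysis by modeling elementary discourse units and their hierarchical relations.
  • The proposed methods achieve state-of-the-art performance on all four sentiment analysis tasks on the new German Twitter corpus.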
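
To make the idea of lexicon-based attention more concrete, here is a deliberately simplified sketch: each token's attention weight comes from a softmax over its absolute lexicon polarity, and the weights are used to pool per-token hidden states into one message vector. This is an assumption-laden illustration, not the thesis's architecture, in which the network is trained end to end. The lexicon entries, the `sharpness` parameter, and the hidden states (stand-ins for bidirectional RNN outputs) are all hypothetical.

```python
import math

# Hypothetical polarity lexicon: word -> score in [-1, 1].
LEXICON = {"toll": 1.0, "super": 0.8, "schlecht": -1.0}


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def lexicon_attention(tokens, hidden_states, sharpness=2.0):
    """Pool hidden states with weights driven by lexicon polarity.

    tokens        : list of words in the message
    hidden_states : one vector per token (assumed to come from a
                    bidirectional encoder; plain lists of floats here)
    sharpness     : assumed scaling factor; larger values focus the
                    attention more strongly on polar words
    """
    logits = [sharpness * abs(LEXICON.get(tok.lower(), 0.0)) for tok in tokens]
    weights = softmax(logits)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled


# Toy message with made-up 2-D hidden states.
TOKENS = ["Der", "Film", "war", "toll"]
HIDDEN = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.1], [0.9, 0.8]]

weights, message_vector = lexicon_attention(TOKENS, HIDDEN)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in message_vector])
```

In this toy message the polar word "toll" receives the largest weight, so the pooled vector is dominated by its hidden state; tokens missing from the lexicon share the remaining weight equally.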


This review was generated by AI and reviewed by human editors.