QUICK REVIEW

[Paper Review] Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android Users

Zhenpeng Chen, Xuan Lü|arXiv (Cornell University)|May 16, 2017

Digital Communication and Language | References: 59 | Cited by: 66

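ONE-SENTENCE SUMMARY

This paper analyzes gender differences in emoji usage with a large-scale Android dataset and shows that a user's gender can be inferred with high accuracy from emoji usage alone, across multiple languages.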

ABSTRACT

Based on a large data set of emoji using behavior collected from smartphone users over the world, this paper investigates gender-specific usage of emojis. We present various interesting findings that evidence a considerable difference in emoji usage by female and male users. Such a difference is significant not just in a statistical sense; it is sufficient for a machine learning algorithm to accurately infer the gender of a user purely based on the emojis used in their messages. In real world scenarios where gender inference is a necessity, models based on emojis have unique advantages over existing models that are based on textual or contextual information. Emojis not only provide language-independent indicators, but also alleviate the risk of leaking private user information through the analysis of text and metadata.

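MOTIVATION AND OBJECTIVES

Understand how gender shapes emoji usage in a global, multilingual setting.
Provide empirical evidence of gender-specific differences in emoji usage frequency, emoji preference, and sentiment expression.
Demonstrate that gender can be inferred from emojis alone, without textual or contextual data.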

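PROPOSED METHOD

Compile a large dataset of 134,419 anonymized Android users of the Kika Keyboard app, with self-reported gender and 401 million messages covering 58 languages.
Compute emoji usage statistics, including frequency, the most used emojis, discriminative emojis identified by mutual information (MI), and co-usage patterns measured by pointwise mutual information (PMI).
Label each emoji as male or female based on the conditional probabilities p(Male|e) and p(Female|e).
Build an emoji-based feature set (frequency, preference, sentiment) totaling 1,370 features per user.
Train several classifiers (Ridge, Random Forest, Gradient Boosting, and linear-kernel SVM) to infer gender from emoji usage alone.
Evaluate with accuracy and per-gender precision (Precision_M, Precision_F), and compare against text-based baselines across languages.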
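
The conditional-probability labeling and the PMI co-usage statistic in the steps above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy messages, the emoji short names, and the function names are hypothetical. It also assumes that probabilities are estimated from message-level counts, that an emoji takes the label of the larger conditional probability, and that PMI uses log base 2.

```python
import math
from collections import Counter

# Hypothetical toy data: (self-reported gender, emojis used in one message).
MESSAGES = [
    ("F", ["joy", "heart"]),
    ("F", ["joy", "kiss"]),
    ("F", ["kiss", "heart"]),
    ("M", ["joy", "thumbsup"]),
    ("M", ["heart", "thumbsup"]),
    ("M", ["thumbsup"]),
]


def gender_given_emoji(messages):
    """Estimate p(Male|e) and p(Female|e) from message-level counts."""
    per_emoji = {}
    for gender, emojis in messages:
        for e in set(emojis):  # count each emoji once per message
            per_emoji.setdefault(e, Counter())[gender] += 1
    return {
        e: {g: c[g] / (c["M"] + c["F"]) for g in ("M", "F")}
        for e, c in per_emoji.items()
    }


def label_emoji(probs):
    """Assumed rule: take the gender with the larger conditional probability."""
    if probs["M"] == probs["F"]:
        return None  # no label when the two probabilities tie
    return "M" if probs["M"] > probs["F"] else "F"


def pmi(messages, a, b):
    """PMI(a, b) = log2(p(a, b) / (p(a) * p(b))), probabilities over messages."""
    n = len(messages)
    sets = [set(emojis) for _, emojis in messages]
    n_a = sum(a in s for s in sets)
    n_b = sum(b in s for s in sets)
    n_ab = sum(a in s and b in s for s in sets)
    if n_ab == 0:
        return float("-inf")  # the pair never co-occurs in a message
    return math.log2((n_ab / n) / ((n_a / n) * (n_b / n)))


probs = gender_given_emoji(MESSAGES)
for emoji in sorted(probs):
    print(emoji, probs[emoji], label_emoji(probs[emoji]))
print("PMI(kiss, heart) =", pmi(MESSAGES, "kiss", "heart"))
```

With real data the two gender groups differ in size, so the raw counts would likely need normalizing by group size before comparing p(Male|e) and p(Female|e). The sketch omits that step.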

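EXPERIMENTAL RESULTS

Research Questions

RQ1: Do female and male users differ in how often they use emojis (%emoji-msg) and in which emojis they prefer?
RQ2: Can emoji usage patterns alone predict a user's gender with high accuracy, across languages and without text data?
RQ3: Which emojis discriminate most strongly between genders, and how do co-usage patterns differ by gender?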

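Key Findings

Female users are more likely than male users to use emojis (7.96% versus 7.02% of messages contain emojis).
Emoji preferences differ by gender, and some emojis carry more gender information than others (for example, the discriminative emojis identified by MI).
In the PMI-based network, emoji co-usage patterns form gender-specific communities.
Female users more often use face-related emojis, while male users more often use heart-related emojis, indicating subtle differences in emotional expression.
Emoji-based models outperform the text-based baselines, reaching a best accuracy of 0.811 with Gradient Boosting, and generalize across languages.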
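
To make the reported metrics concrete, here is a small sketch of how accuracy and per-gender precision (Precision_M, Precision_F) are conventionally computed from predicted labels. The label lists are made up for illustration and do not come from the paper, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of users whose predicted gender matches the self-reported one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, label):
    """Of the users predicted as `label`, the fraction that truly are `label`."""
    truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not truths_for_predicted:
        return 0.0  # convention chosen here when nothing is predicted as `label`
    return sum(t == label for t in truths_for_predicted) / len(truths_for_predicted)


# Hypothetical labels for eight users (not data from the paper).
y_true = ["M", "M", "M", "F", "F", "F", "F", "M"]
y_pred = ["M", "F", "M", "F", "F", "F", "F", "M"]

print("accuracy    =", accuracy(y_true, y_pred))
print("Precision_M =", precision(y_true, y_pred, "M"))
print("Precision_F =", precision(y_true, y_pred, "F"))
```

Reporting precision separately for each gender shows whether a classifier is reliable for both groups, which a single accuracy figure can hide when one group is larger than the other.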

This review was generated by AI and reviewed by human editors.