QUICK REVIEW

[Paper Review] ClassiNet -- Predicting Missing Features for Short-Text Classification

Danushka Bollegala, Vincent Atanasov|arXiv (Cornell University)|Jan 1, 2018

Topic Modeling | References: 1 | Cited by: 1

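One-Sentence Summary

ClassiNet is a directed weighted graph of binary feature predictors that predicts features missing from a short text by modeling conditional co-occurrence probabilities between features. By training the feature predictors on unlabeled data and expanding features through graph-based propagation, ClassiNet significantly improves short-text classification accuracy without external resources, outperforming methods such as Skip-thought and FastSent on benchmark datasets.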

ABSTRACT

The fundamental problem in short-text classification is \emph{feature sparseness} -- the lack of feature overlap between a trained model and a test instance to be classified. We propose \emph{ClassiNet} -- a network of classifiers trained for predicting missing features in a given instance, to overcome the feature sparseness problem. Using a set of unlabeled training instances, we first learn binary classifiers as feature predictors for predicting whether a particular feature occurs in a given instance. Next, each feature predictor is represented as a vertex $v_i$ in the ClassiNet where a one-to-one correspondence exists between feature predictors and vertices. The weight of the directed edge $e_{ij}$ connecting a vertex $v_i$ to a vertex $v_j$ represents the conditional probability that given $v_i$ exists in an instance, $v_j$ also exists in the same instance. We show that ClassiNets generalize word co-occurrence graphs by considering implicit co-occurrences between features. We extract numerous features from the trained ClassiNet to overcome feature sparseness. In particular, for a given instance $\vec{x}$, we find similar features from ClassiNet that did not appear in $\vec{x}$, and append those features in the representation of $\vec{x}$. Moreover, we propose a method based on graph propagation to find features that are indirectly related to a given short-text. We evaluate ClassiNets on several benchmark datasets for short-text classification. Our experimental results show that by using ClassiNet, we can statistically significantly improve the accuracy in short-text classification tasks, without having to use any external resources such as thesauri for finding related features.
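The abstract states that ClassiNets generalize word co-occurrence graphs. As a point of reference, the explicit special case can be sketched in a few lines: each directed edge weight is the fraction of instances containing feature i that also contain feature j. This is only the co-occurrence baseline, not ClassiNet itself, which would also capture implicit co-occurrences through trained predictors. The function name `cooccurrence_graph` and the toy instances are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_graph(instances):
    """Return {i: {j: p(j | i)}} estimated from explicit co-occurrence counts.

    Illustrative baseline only: ClassiNet replaces these observed counts
    with the outputs of trained feature predictors.
    """
    occurrences = Counter()   # number of instances containing each feature
    pair_counts = Counter()   # number of instances containing the ordered pair (i, j)
    for instance in instances:
        features = set(instance)
        occurrences.update(features)
        pair_counts.update(permutations(features, 2))

    graph = defaultdict(dict)
    for (i, j), n_ij in pair_counts.items():
        graph[i][j] = n_ij / occurrences[i]
    return dict(graph)


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["iphone", "screen", "battery"],
    ["iphone", "battery"],
    ["screen", "cracked"],
]
graph = cooccurrence_graph(docs)
```

The weights are asymmetric, which is why the graph is directed: in this toy corpus `graph["cracked"]["screen"]` differs from `graph["screen"]["cracked"]` because "cracked" never appears without "screen", while "screen" appears without "cracked".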

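Research Motivation and Goals

Address the feature sparseness problem in short-text classification, where limited vocabulary overlap between training and test instances degrades model performance.
Develop a method that predicts missing but relevant features in short texts without relying on external knowledge sources such as thesauri.
Model implicit feature co-occurrences that go beyond direct word co-occurrence, using conditional probabilities derived from unlabeled data.
Expand feature representations through local and global graph propagation over the learned network of feature predictors, thereby improving classification accuracy.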

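Proposed Method

Train a binary classifier (feature predictor) for each feature, using positive instances (the feature occurs) and negative instances (the feature does not occur) selected from unlabeled data.
Build a directed weighted graph (ClassiNet) in which each vertex represents a feature predictor and each edge weight represents the conditional probability that one feature occurs given that another occurs.
Use locality-sensitive hashing to approximate neighbourhood computation efficiently, avoiding computation of the full pairwise confusion matrix.
Apply two feature expansion strategies: (1) All Neighbour Expansion, which appends the neighbours of every active feature; (2) Global Feature Expansion, which performs multi-hop propagation controlled by a damping factor.
Use the damping factor γ to control the influence of distant neighbours in global propagation; experiments show the best performance at γ = 0.8.
Merge the expanded features into the original feature vector to enrich the sparse representation before classification.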
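The first step above, training one feature predictor from unlabeled text, can be sketched as follows. This is a minimal illustration under several assumptions that this summary does not confirm: the target feature is removed from positive instances before training, the classifier is a plain logistic regression fitted by SGD, and the names `make_training_set`, `train_predictor`, and `predict` are hypothetical. How predictor outputs are aggregated into edge weights is not spelled out here, so the sketch stops at the predictor.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_training_set(instances, target):
    """Label each unlabeled instance by whether it contains `target`.

    Assumption: `target` is removed from the input so the predictor must
    rely on the remaining context features.
    """
    data = []
    for instance in instances:
        features = set(instance)
        label = 1 if target in features else 0
        data.append((features - {target}, label))
    return data


def train_predictor(data, epochs=200, lr=0.5):
    """Fit a bag-of-features logistic regression with plain SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in data:
            z = bias + sum(weights.get(f, 0.0) for f in features)
            grad = sigmoid(z) - label      # gradient of the log loss w.r.t. z
            bias -= lr * grad
            for f in features:
                weights[f] = weights.get(f, 0.0) - lr * grad
    return weights, bias


def predict(model, features):
    """Predicted probability that the target feature occurs with `features`."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in features))


# Hypothetical unlabeled corpus; no class labels are needed.
docs = [
    ["iphone", "battery", "charger"],
    ["iphone", "battery"],
    ["battery", "charger"],
    ["pizza", "cheese"],
    ["pizza", "oven"],
    ["cheese", "oven"],
]
battery_predictor = train_predictor(make_training_set(docs, "battery"))
p_given_iphone = predict(battery_predictor, {"iphone"})
p_given_pizza = predict(battery_predictor, {"pizza"})
```

On this toy corpus the predictor assigns a higher probability to "battery" for a text mentioning "iphone" than for one mentioning "pizza", even though neither test text contains "battery" itself. That is the sense in which a predictor can suggest a missing feature.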

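Experimental Results

Research Questions

RQ1: Can the learned network of feature predictors effectively predict missing features in short texts and thereby alleviate feature sparseness?
RQ2: Does modeling implicit co-occurrence through conditional probabilities improve classification performance compared with explicit co-occurrence or word-embedding methods?
RQ3: How does Global Feature Expansion, which accounts for indirect relations through multi-hop propagation, compare with local expansion in accuracy and robustness?
RQ4: Can ClassiNet improve classification accuracy without external resources such as thesauri or pre-trained embeddings?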

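Key Findings

ClassiNet significantly improves classification accuracy on short-text datasets, with Global Feature Expansion outperforming local expansion as well as baselines including SCL, FTS, Skip-thought, FastSent, and Paragraph2Vec.
In Global Feature Expansion, accuracy peaks at the optimal damping factor γ = 0.8; values that are too high or too low both reduce performance.
Global Feature Expansion enlarges feature vectors by a factor of 25–30 on average, whereas All Neighbour Expansion enlarges them by only 1.5–2.5, indicating broader feature discovery.
The ClassiNet graph is highly connected: the average out-degree is 263.35, and most vertices connect to 240–300 other vertices, forming a dense graph.
Feature expansion with ClassiNet successfully identifies semantically related features that do not appear in the original text, for example suggesting 'iPhone 6 plus' for a review that mentions 'iPhone 6'.
The method achieves statistically significant accuracy gains on multiple benchmark datasets without external knowledge or pre-trained embeddings.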
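To make the contrast between the two expansion strategies concrete, here is a small sketch over a toy graph. The scoring rule is an assumption, not the paper's exact formulation: a feature reached after k hops receives the product of the edge weights along the path multiplied by γ^k, summed over paths. The function names, the `max_hops` cutoff, and the toy edge weights are all hypothetical.

```python
def all_neighbour_expansion(graph, active):
    """Score only the direct neighbours of the active features."""
    scores = {}
    for i in active:
        for j, w in graph.get(i, {}).items():
            if j not in active:
                scores[j] = scores.get(j, 0.0) + w
    return scores


def global_expansion(graph, active, gamma=0.8, max_hops=2):
    """Propagate scores over multiple hops, damping each hop by `gamma`.

    Assumed rule: mass arriving at hop k is scaled by gamma**k, so
    distant features contribute less than direct neighbours.
    """
    scores = {}
    frontier = {i: 1.0 for i in active}
    for _ in range(max_hops):
        reached = {}
        for i, mass in frontier.items():
            for j, w in graph.get(i, {}).items():
                reached[j] = reached.get(j, 0.0) + gamma * mass * w
        for j, mass in reached.items():
            if j not in active:
                scores[j] = scores.get(j, 0.0) + mass
        frontier = reached
    return scores


def expansion_ratio(active, scores):
    """Size of the expanded feature set relative to the original one."""
    return (len(active) + len(scores)) / len(active)


# Hypothetical toy graph: {source: {target: edge weight}}.
toy_graph = {
    "iphone": {"battery": 0.6, "screen": 0.4},
    "battery": {"charger": 0.5},
    "screen": {"cracked": 0.5},
}
active = {"iphone"}
local_scores = all_neighbour_expansion(toy_graph, active)
global_scores = global_expansion(toy_graph, active, gamma=0.8, max_hops=2)
local_ratio = expansion_ratio(active, local_scores)
global_ratio = expansion_ratio(active, global_scores)
```

Even on this tiny graph the global variant reaches features two hops away ("charger", "cracked") that the local variant cannot see, which mirrors the direction of the gap in expansion factors reported above. The damping factor trades reach against noise: a small γ suppresses distant features almost entirely, while γ close to 1 lets weakly related features score nearly as high as direct neighbours.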


This review was generated by AI and reviewed by human editors.