QUICK REVIEW

[Paper Digest] Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch

Partha Talukdar, William W. Cohen|arXiv (Cornell University)|Oct 10, 2013

Text and Document Classification Technologies | References: 22 | Cited by: 18

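One-Sentence Summary

The paper proposes MAD-Sketch, a graph-based semi-supervised learning algorithm that uses a Count-min Sketch to store each node's label distribution compactly, reducing per-node space complexity from O(m) to O(log m) under mild conditions and delivering speedups of up to 10x. It matches the performance of existing state-of-the-art graph-based SSL algorithms and scales to datasets with one million labels, where conventional methods fail because of memory limits.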

ABSTRACT

Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(log m), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.

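Motivation and Objectives

Address the scalability bottleneck of graph-based semi-supervised learning when the number of labels (m) is very large.
Reduce the O(m) per-node space and time complexity inherent in conventional label propagation methods.
Make graph-based SSL practical on datasets with hundreds of thousands to millions of labels.
Provide theoretical and empirical support for using Count-min Sketch in SSL under realistic data conditions such as label skew and community structure.
Show that sketch-based storage preserves accuracy while substantially reducing memory footprint and running time.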

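Proposed Method

Represent each node's label score vector with a Count-min Sketch instead of storing it explicitly as a dense vector.
Exploit the linearity of Count-min Sketch to perform efficient, approximate updates during iterative label propagation.
Combine Modified Adsorption (MAD) with the sketch-based representation to propagate label scores over the graph.
Use random hash functions and a w × d counter matrix to approximate label scores probabilistically with bounded error.
Set the sketch parameters (width w and depth d) from theoretical bounds so that overestimation error is controlled with high probability.
Extend MAD to operate on sketches, yielding MAD-Sketch, which supports iterative updates until convergence.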
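The sketch representation and its linearity can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `CountMinSketch`, the methods `add`, `estimate`, and `combine`, and the helper `sketch_size` are hypothetical names introduced here, the salted BLAKE2b hash is only a stand-in for a pairwise-independent hash family, and `combine` shows a generic weighted sum of two sketches rather than the full MAD update rule. The sizing helper uses the standard Count-min bound (w = ceil(e / epsilon), d = ceil(ln(1 / delta))), which may differ from the tighter bounds derived in the paper.

```python
import hashlib
import math


class CountMinSketch:
    """Minimal Count-min Sketch holding non-negative label scores (illustrative)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket(self, row, label):
        # Assumption: a salted BLAKE2b digest stands in for the
        # pairwise-independent hash functions used in the analysis.
        # Every sketch uses the same hashes, which linearity requires.
        digest = hashlib.blake2b(f"{row}:{label}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, label, score):
        # Add the score to one counter in each of the `depth` rows.
        for row in range(self.depth):
            self.table[row][self._bucket(row, label)] += score

    def estimate(self, label):
        # Minimum over rows; never an underestimate when all scores are >= 0.
        return min(
            self.table[row][self._bucket(row, label)] for row in range(self.depth)
        )

    def combine(self, other, weight=1.0):
        """Return the sketch of (self + weight * other), computed cell by cell."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        out = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            for c in range(self.width):
                out.table[r][c] = self.table[r][c] + weight * other.table[r][c]
        return out


def sketch_size(epsilon, delta):
    """Standard Count-min sizing: w = ceil(e / epsilon), d = ceil(ln(1 / delta))."""
    return math.ceil(math.e / epsilon), math.ceil(math.log(1.0 / delta))


# Usage: merge a neighbor's label scores into a node's sketch.
node = CountMinSketch(width=109, depth=8)
neighbor = CountMinSketch(width=109, depth=8)
node.add("person", 0.6)
neighbor.add("person", 0.2)
neighbor.add("location", 0.8)
merged = node.combine(neighbor, weight=0.5)
print(merged.estimate("person"))    # at least 0.6 + 0.5 * 0.2
print(merged.estimate("location"))  # at least 0.5 * 0.8
```

Because the merge touches only the w × d counters, its cost does not depend on the number of distinct labels m, which is the property that lets propagation run on sketches directly.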

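Experimental Results

Research Questions

RQ1: Can Count-min Sketch compactly represent label distributions in graph-based SSL without a significant loss in accuracy?
RQ2: Under what conditions does sketching reduce space and time complexity from O(m) to O(log m)?
RQ3: How does MAD-Sketch compare with exact label propagation (MAD-Exact) in accuracy, memory usage, and speed on large datasets?
RQ4: Can sketch-based SSL scale to datasets with one million labels, where exact methods fail by running out of memory?
RQ5: Does the presence of label skew or community structure in real graphs make sketching more effective for SSL?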

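Key Findings

MAD-Sketch achieves the same mean reciprocal rank (MRR) as MAD-Exact on the Freebase and Flickr-10k datasets, indicating no significant loss in accuracy.
On Freebase, MAD-Sketch with w=109 and d=8 runs 4.7 times faster than MAD-Exact and uses less memory per iteration.
On Flickr-1m, which contains one million labels, MAD-Exact runs out of memory and crashes in the third iteration, whereas MAD-Sketch with w=55 and d=17 completes 20 iterations with stable, very low memory usage.
Theoretical analysis shows that, under mild conditions such as label skew and community structure, sketching reduces space complexity from O(m) to O(log m).
Experiments confirm that label score distributions in real-world datasets follow a power law, which supports the theoretical assumptions and makes sketching effective.
Performance remains good even with sketches smaller than the theoretically predicted size, suggesting that label skew and graph community structure may provide additional structural benefits.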
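The sketch sizes reported above can be put in perspective with some quick arithmetic. The snippet below is only an illustration: the helper `counters_per_node` is a hypothetical name, and the dense baseline assumes the worst case in which a node stores a score for every one of the one million Flickr-1m labels, which an exact implementation with sparse vectors may not always reach.

```python
def counters_per_node(width, depth):
    """Number of counters in one w x d Count-min Sketch."""
    return width * depth


# Sketch configurations reported in the findings above.
freebase = counters_per_node(109, 8)
flickr_1m = counters_per_node(55, 17)

# Assumed worst case for exact storage on Flickr-1m: one entry per label.
dense_entries = 1_000_000
ratio = dense_entries / flickr_1m

print(freebase)      # 872
print(flickr_1m)     # 935
print(round(ratio))  # 1070
```

Under that assumption, each Flickr-1m node holds roughly a thousand times fewer entries with a sketch than with a dense label vector.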


This digest was generated by AI and reviewed by a human editor.