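[Paper Summary] Efficient and trustworthy methods for knowledge discovery
This work proposes span-core decomposition for temporal networks, a method that identifies groups of densely connected vertices together with their associated time spans (span-cores). By exploiting containment properties and maximal-core detection, the authors design efficient algorithms for computing span-cores and apply them to temporal community search via dynamic programming, obtaining a polynomial-time solution that runs substantially faster when maximal span-cores are used. The method is validated on real-world face-to-face human contact networks, demonstrating its scalability and practical utility for analyzing social dynamics and improving graph-embedding performance.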
Data are building blocks of information and, subsequently, they are vital input to knowledge. Today, in the midst of the digital era, vast quantities of highly complex data are being collected and processed at an unprecedented scale. This abundance of data has highlighted the importance of efficient and effective knowledge-discovery algorithms to identify patterns hidden in the data, with the ultimate aim of uncovering valuable knowledge and shaping our understanding of the world around us. To capitalize on the opportunities offered by massive amounts of data as well as modern computing power, for many years, research in knowledge discovery and related areas has introduced algorithms that are increasingly efficient and effective, but also more and more opaque and unpredictable. Recently, growing interest in the ethical dimensions of algorithms has drawn attention to the limitations of opaque algorithms and has emphasized a need for trustworthy algorithms, particularly when such algorithms are used to support high-stakes decision making. In order to be trustworthy, algorithms should solve a clearly defined problem via a clear sequence of instructions, they should not be utterly unsuccessful in any particular case, and they should be easy for humans to understand and interpret so that no harmful biases can be hidden. In this thesis, we pursue the goal of developing novel knowledge-discovery algorithmic methods that are not only highly efficient to face the challenges and opportunities posed by modern data, but also trustworthy. In particular, we propose efficient and trustworthy methods for a collection of popular knowledge-discovery tasks. First, we consider tasks of exact inference in Bayesian networks and hidden Markov models. Trustworthy approaches for such tasks exist. However, their applicability may be severely limited by time or memory requirements.
Therefore, we propose novel methods to reduce the time or memory resources that are needed by existing approaches for the considered exact inference tasks. Besides exact inference tasks, we also consider two different knowledge-discovery tasks that arise naturally in modern data: multi-label classification and community search in temporal graphs. Regarding multi-label classification, we propose an efficient and accurate rule-based multi-label classifier that drastically improves upon the interpretability of existing solutions. For community search in temporal graphs, we formalise the task for the first time, and we propose a solution that guarantees high efficiency and interpretability. In designing knowledge-discovery methods, we often rely on existing database-management and probabilistic methods. Methods for database management are valuable to address the large dimension and high complexity of modern data, while probabilistic methods are essential to methodologically handle uncertainty in the data.
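Research Motivation and Objectives
- Address the challenge of identifying dense, temporally coherent subgraphs in temporal networks; such subgraphs are essential for analyzing social dynamics and detecting anomalies.
- Formalize a novel notion of temporal core decomposition, the span-core, in which each core is characterized by its order (density) and its span (the time interval over which it exists).
- Design efficient algorithms for computing all span-cores and, more efficiently, only the maximal span-cores (those not dominated in both order and span), exploiting theoretical containment properties.
- Formalize the temporal community search problem and solve it as a polynomial-time dynamic-programming task, using maximal span-cores to speed up the computation.
- Demonstrate the practical relevance of span-cores in applications including anomaly detection, data-quality assessment, and improved classification with graph embeddings.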
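Proposed Methods
- Introduce span-core decomposition as a temporal extension of core decomposition, in which each core is a set of vertices whose minimum degree is at least k within a contiguous time interval Δ.
- Design efficient algorithms for computing all span-cores that exploit the containment hierarchy among cores to prune the search, mitigating the quadratic growth in the number of time intervals.
- Design a dedicated algorithm that checks the maximality conditions directly and extracts only the maximal span-cores, avoiding full enumeration.
- Establish a theoretical connection between temporal community search and maximal span-cores, which supports a dynamic-programming formulation that guarantees full coverage of the temporal domain.
- Propose a speed-up technique that uses maximal span-cores as building blocks for temporal community search, substantially reducing computation time compared with naive dynamic programming.
- Use node2vec and DeepWalk, with hyperparameters tuned via grid search, and evaluate classification performance with penalized logistic regression on scaled embeddings.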
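To make the span-core notion above concrete, here is a minimal, naive sketch in Python. It is not the authors' algorithm: it simply enumerates every interval and peels a k-core, without the pruning described above. It assumes that an edge counts for an interval Δ only if it is present at every timestamp of Δ, and that the temporal network is given as a list of edge sets, one per timestamp. The names `k_core`, `span_cores`, and `snapshots` are hypothetical.

```python
from collections import defaultdict


def k_core(edges, k):
    """Vertex set of the k-core of an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    removed = set()
    # Repeatedly peel vertices whose current degree is below k.
    stack = [v for v in adj if len(adj[v]) < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                adj[w].discard(v)
                if len(adj[w]) < k:
                    stack.append(w)
    return frozenset(v for v in adj if v not in removed)


def span_cores(snapshots):
    """Naively enumerate span-cores as {(k, (ts, te)): vertex set}.

    Assumption: an edge belongs to interval [ts, te] only if it appears
    in every snapshot of that interval.
    """
    result = {}
    n = len(snapshots)
    for ts in range(n):
        common = set(snapshots[ts])
        for te in range(ts, n):
            common &= snapshots[te]  # edges present throughout [ts, te]
            if not common:
                break  # longer intervals starting at ts are empty as well
            k = 1
            while True:
                core = k_core(common, k)
                if not core:
                    break
                result[(k, (ts, te))] = core
                k += 1
    return result


# Toy temporal network with three timestamps (made-up data).
snapshots = [
    {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("a", "c"), ("b", "c")},
    {("a", "b")},
]
cores = span_cores(snapshots)
```

On this toy input the span-core of order 1 over the interval (0, 2) is contained in the one over (0, 1), which illustrates the kind of containment that an efficient algorithm could use to shrink the graph before processing a longer interval.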
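Experimental Results
Research Questions
- RQ1: How can dense, temporally coherent subgraphs (span-cores) in temporal networks be discovered efficiently, with minimal computational overhead?
- RQ2: What is the theoretical structure of span-cores, and how can maximal span-cores (those not dominated in both order and span) be computed without enumerating all possible cores?
- RQ3: Can the temporal community search problem, that is, finding communities that cover the entire temporal domain, be solved efficiently, and how do span-cores improve performance there?
- RQ4: To what extent do span-cores improve the quality of graph embeddings on real temporal networks, particularly for classifying vertex roles or detecting anomalies?
- RQ5: How do maximal span-cores contribute in practice to anomaly detection, data validation, and the visualization of dynamic contact networks?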
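The maximality notion in RQ2 can be illustrated with a small Python sketch. It assumes that a span-core with order k and span Δ is dominated when another span-core exists with order at least k and a span that contains Δ. The function below is a naive post-filter over an already enumerated collection, so it does not reflect the direct, enumeration-free approach the summary attributes to the authors; the name `maximal_span_cores` and the toy data are hypothetical.

```python
def maximal_span_cores(cores):
    """Keep only non-dominated entries of {(k, (ts, te)): vertex set}."""

    def dominated(key, other):
        k, (ts, te) = key
        k2, (ts2, te2) = other
        # `other` dominates `key` if its order is at least as high
        # and its span contains the span of `key`.
        return other != key and k <= k2 and ts2 <= ts and te <= te2

    return {
        key: vertices
        for key, vertices in cores.items()
        if not any(dominated(key, other) for other in cores)
    }


# Made-up span-cores over timestamps 0..2, keyed by (order, (start, end)).
toy_cores = {
    (1, (0, 0)): {"a", "b", "c", "d"},
    (2, (0, 0)): {"a", "b", "c"},
    (1, (0, 1)): {"a", "b", "c"},
    (2, (0, 1)): {"a", "b", "c"},
    (1, (0, 2)): {"a", "b"},
    (1, (1, 1)): {"a", "b", "c"},
    (2, (1, 1)): {"a", "b", "c"},
    (1, (1, 2)): {"a", "b"},
    (1, (2, 2)): {"a", "b"},
}
maximal = maximal_span_cores(toy_cores)
```

In this toy collection only two entries survive: the order-2 core over (0, 1) and the order-1 core over (0, 2). Every other entry is covered by one of them in both order and span.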
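Key Findings
- The proposed algorithm for computing all span-cores achieves its efficiency by exploiting containment properties, avoiding a quadratic blow-up in the number of time intervals.
- The algorithm that extracts only the maximal span-cores is substantially faster than computing all cores, because checking the maximality conditions directly avoids redundant computation.
- Temporal community search can be solved in polynomial time via dynamic programming, and integrating maximal span-cores substantially reduces running time compared with the baseline approach.
- On the PrimarySchool dataset, TCS embeddings reach a Macro F1 score close to 1 for embedding dimension h ≥ 200, outperform the baselines at high dimensions, and match the baselines at h = |T|.
- On the HighSchool dataset, TCS performs on par with the best method for h ≥ 200, demonstrating scalability and effectiveness as temporal resolution increases.
- Using span-cores improves classification performance with graph embeddings, supports anomaly detection in contact networks, and offers a new way to visualize large time-varying graphs.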
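The summary does not spell out the dynamic program used for temporal community search, so the following Python sketch only shows the general shape such a polynomial-time solution could take. It assumes the task can be cast as splitting the timestamps 0..n-1 into at most h contiguous segments that together cover the whole temporal domain, maximizing the sum of a per-segment quality score. The objective, the function `best_segmentation`, and the score table are all hypothetical and not taken from the source.

```python
def best_segmentation(n, h, score):
    """Split timestamps 0..n-1 into at most h contiguous segments.

    `score(ts, te)` is a user-supplied quality of the segment [ts, te]
    (a hypothetical stand-in for a community-density measure).
    Returns (total score, list of (ts, te) segments).
    """
    neg_inf = float("-inf")
    # best[j][i]: best total for the first i timestamps using exactly j segments.
    best = [[neg_inf] * (n + 1) for _ in range(h + 1)]
    back = [[None] * (n + 1) for _ in range(h + 1)]
    best[0][0] = 0.0
    for j in range(1, h + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):  # last segment covers s..i-1
                if best[j - 1][s] == neg_inf:
                    continue
                candidate = best[j - 1][s] + score(s, i - 1)
                if candidate > best[j][i]:
                    best[j][i] = candidate
                    back[j][i] = s
    # Choose the best number of segments, then walk the back-pointers.
    j = max(range(1, h + 1), key=lambda m: best[m][n])
    total = best[j][n]
    segments = []
    i = n
    while j > 0:
        s = back[j][i]
        segments.append((s, i - 1))
        i, j = s, j - 1
    return total, segments[::-1]


# Made-up scores: highest core order on the interval times its length.
toy_scores = {
    (0, 0): 2, (0, 1): 4, (0, 2): 3,
    (1, 1): 2, (1, 2): 2, (2, 2): 1,
}
total, segments = best_segmentation(3, 2, lambda ts, te: toy_scores[(ts, te)])
```

The triple loop costs O(h · n²) score evaluations, which is polynomial in the number of timestamps. A speed-up in the spirit of the findings above would restrict or precompute the scores using maximal span-cores rather than recomputing a core for every interval.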
This summary was generated by AI and reviewed by human editors.