QUICK REVIEW

[Paper Review] Survey on Models and Techniques for Root-Cause Analysis

Marc Solé, Víctor Muntés-Mulero | arXiv (Cornell University) | Jan 30, 2017

Software System Performance and Reliability | References: 168 | Cited by: 80

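One-Sentence Summary

This survey reviews root-cause analysis (RCA) models and their learning and inference techniques, focusing on performance and scalability in IoT and cloud IT systems, and offers guidance on choosing an RCA strategy.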

ABSTRACT

Automation and computer intelligence to support complex human decisions become essential to manage large and distributed systems in the Cloud and IoT era. Understanding the root cause of an observed symptom in a complex system has been a major problem for decades. As industry dives into the IoT world and the amount of data generated per year grows at an amazing speed, an important question is how to find appropriate mechanisms to determine root causes that can handle huge amounts of data or may provide valuable feedback in real-time. While many survey papers aim at summarizing the landscape of techniques for modelling system behavior and inferring the root cause of a problem based on the resulting models, none of those focuses on analyzing how the different techniques in the literature fit growing requirements in terms of performance and scalability. In this survey, we provide a review of root-cause analysis, focusing on these particular aspects. We also provide guidance to choose the best root-cause analysis strategy depending on the requirements of a particular system and application.

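Motivation and Objectives

Motivate the need for advanced root-cause analysis in the IoT/cloud era and in large distributed systems.
Classify and compare RCA models (deterministic vs. probabilistic) and their learning and inference methods.
Analyze how model generation (from domain knowledge, system knowledge, and observed data) affects performance and scalability.
Provide guidance on choosing an RCA strategy based on system requirements (real-time or not, post-mortem analysis, data volume, update behavior).
Discuss the trade-offs among manual, assisted, and data-driven model construction.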

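Proposed Approach

Classify RCA models into deterministic and probabilistic families and map their subtypes (e.g., logic, Bayesian networks, automata, Petri nets).
Describe how models are obtained: expert-driven, assisted generation from sub-models, or fully data-driven learning.
Review learning algorithms for automatic model construction across model families (see Table II).
Explain inference and abductive reasoning (abduction) techniques and how they produce different outputs (root causes, explanations) (see Tables III/IV).
Discuss model updating and the handling of changes in system knowledge, including incremental updates vs. full reconstruction.
Highlight the implications for performance, scalability, and real-time vs. post-mortem diagnosis.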
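
To make the idea of abduction over a deterministic model concrete, the following is a minimal sketch, not a method taken from the survey. It assumes a hand-written fault model in which each candidate cause maps to the set of symptoms it produces, and it searches for the smallest sets of causes that explain every observed symptom. The cause names, symptom names, and the `abduce` function are all hypothetical.

```python
from itertools import combinations

# Hypothetical expert-written fault model: cause -> symptoms it produces.
CAUSES = {
    "db_down": {"api_errors", "slow_queries"},
    "network_partition": {"api_errors", "timeouts"},
    "disk_full": {"slow_queries", "write_failures"},
}


def abduce(observed, causes=CAUSES, max_size=2):
    """Return the smallest cause sets whose symptoms cover all observations.

    Tries single causes first, then pairs, up to max_size causes at once.
    Returns an empty list if no combination explains the observations.
    """
    observed = set(observed)
    for size in range(1, max_size + 1):
        explanations = [
            set(combo)
            for combo in combinations(sorted(causes), size)
            if observed <= set().union(*(causes[c] for c in combo))
        ]
        if explanations:
            return explanations
    return []


print(abduce({"api_errors", "slow_queries"}))
print(abduce({"timeouts", "write_failures"}))
```

The exhaustive search over cause combinations grows combinatorially with `max_size` and the number of causes, which is one simple illustration of why scalability matters when choosing a model and inference technique.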

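Results

Research Questions

RQ1: Which RCA models and learning techniques best meet performance and scalability requirements in large IT/IoT/cloud systems?
RQ2: How can RCA models be generated from domain knowledge, system topology, and observed data, and what are the trade-offs?
RQ3: Which inference strategies provide useful explanations with acceptable latency under real-time constraints?
RQ4: How do model structure and inference length affect scalability and diagnostic accuracy?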

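Key Findings

RCA models span deterministic and probabilistic families, and the different subtypes offer different trade-offs among speed, accuracy, and interpretability.
Model generation can be expert-driven, assisted (partly knowledge-based), or fully data-driven, which affects accuracy and update efficiency.
Inference techniques differ in whether they provide exact results or anytime/approximate answers, which affects their suitability for real-time diagnosis.
Compilation and arithmetic-circuit representations can speed up diagnosis, at the cost of offline model construction.
Learning algorithms vary in complexity and scalability; some support incremental updates to cope with evolving systems.
The survey provides guidance on choosing an RCA strategy based on system requirements (volume of observed data, number of components, update dynamics, etc.).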
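
As a rough illustration of the probabilistic family, here is a minimal sketch that ranks candidate causes by posterior probability. It is not taken from the survey: it assumes a single fault at a time and symptoms that are conditionally independent given the cause (a naive-Bayes structure, the simplest kind of Bayesian network). Every number, name, and the `LEAK` default below is invented for the example.

```python
# Hypothetical prior probability of each root cause (single-fault assumption).
PRIOR = {"db_down": 0.2, "network_partition": 0.1, "disk_full": 0.7}

# Hypothetical P(symptom present | cause); unlisted pairs fall back to LEAK.
LIKELIHOOD = {
    "db_down": {"api_errors": 0.9, "slow_queries": 0.8},
    "network_partition": {"api_errors": 0.9, "timeouts": 0.8},
    "disk_full": {"slow_queries": 0.6, "write_failures": 0.9},
}
LEAK = 0.05  # assumed chance of seeing a symptom the cause does not explain


def posterior(evidence):
    """Return P(cause | evidence) for each cause via Bayes' rule.

    evidence maps a symptom name to True (observed) or False (absent).
    """
    scores = {}
    for cause, prior in PRIOR.items():
        p = prior
        for symptom, present in evidence.items():
            p_symptom = LIKELIHOOD[cause].get(symptom, LEAK)
            p *= p_symptom if present else 1.0 - p_symptom
        scores[cause] = p
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


ranking = posterior({"api_errors": True, "timeouts": True})
print(max(ranking, key=ranking.get))
```

Unlike a purely deterministic model, this kind of output is a ranking with degrees of belief, so a cause with a high prior can be overridden when the evidence points elsewhere.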

This review was generated by AI and reviewed by human editors.