QUICK REVIEW

[Paper Review] A Directed Acyclic Graph Approach to Online Log Parsing

Pinjia He, Jieming Zhu|arXiv (Cornell University)|Jun 12, 2018

Software System Performance and Reliability | References: 18 | Cited by: 23

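One-Sentence Summary

This paper proposes Drain, an online log parsing method based on a directed acyclic graph (DAG) that automatically initializes and dynamically updates its parsing rules without manual parameter tuning. Drain achieves the highest accuracy on all 11 real-world log datasets and improves running time over state-of-the-art online parsers by up to 97.14%.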

ABSTRACT

Logs are widely used in modern software system management because they are often the only data accessible that record system events at runtime. In recent years, because of the ever-increasing log size, data mining techniques are often utilized to help developers and operators conduct system reliability management. A typical log-based system reliability management procedure is to first parse log messages because of their unstructured format; and apply data mining techniques on the parsed logs to obtain critical system behavior information. Most of existing research studies focus on offline log parsing, which need to parse logs in batch mode. However, software systems, especially distributed systems, require online monitoring and maintenance. Thus, a log parser that can parse log messages in a streaming manner is highly in demand. To address this problem, we propose an online log parsing method, namely Drain, based on directed acyclic graph, which encodes specially designed rules for parsing. Drain can automatically generate a directed acyclic graph for a new system and update the graph according to the incoming log messages. Besides, Drain frees developers from the burden of parameter tuning by allowing them use Drain with no pre-defined parameters. To evaluate the performance of Drain, we collect 11 log datasets generated by real-world systems, ranging from distributed systems, Web applications, supercomputers, operating systems, to standalone software. The experimental results show that Drain has the highest accuracy on all 11 datasets. Moreover, Drain obtains 37.15%~97.14% improvement in the running time over the state-of-the-art online parsers. We also conduct a case study on a log-based anomaly detection task using Drain in the parsing step, which determines its effectiveness in system reliability management.

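Research Motivation and Objectives

Address the limitations of offline log parsing for real-time system monitoring, particularly in large-scale distributed systems.
Eliminate the need for manual parameter tuning in log parsing through automatic initialization and dynamic rule updates.
Develop an online log parser that significantly improves parsing efficiency while maintaining high accuracy.
Evaluate the parser on diverse real-world systems, including distributed systems, Web applications, and supercomputers.
Demonstrate the parser's effectiveness in end-to-end system reliability tasks such as anomaly detection.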

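Proposed Method

Drain uses a directed acyclic graph (DAG) to encode heuristic rules for parsing log messages, enabling efficient pattern matching over streaming logs.
The parser initializes its DAG structure automatically from the statistical properties of incoming log messages, with no pre-defined parameters.
As new log messages arrive, the DAG is updated dynamically, so the parser adapts to evolving log patterns without retraining from scratch.
The DAG organizes log templates in layers according to rules, which reduces the number of comparisons and speeds up the search for a matching log group.
Drain groups log messages into templates with a similarity-based matching strategy, with an emphasis on preserving parsing accuracy in high-throughput streaming settings.
The method is designed to be memory-efficient and scalable, making it suitable for deployment in large-scale real-time monitoring pipelines.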
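
To make the layered search and similarity-based grouping more concrete, here is a minimal sketch of a streaming parser in that spirit. It is not the authors' implementation: the two routing layers (token count, then first token), the `<*>` wildcard convention, the names `SketchParser` and `LogGroup`, and the fixed `threshold` of 0.5 are all assumptions made for illustration. In particular, the paper states that Drain needs no pre-defined parameters, so the hard-coded threshold is a simplification of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WILDCARD = "<*>"  # assumed placeholder for variable tokens


@dataclass
class LogGroup:
    """One log template plus the ids of the messages assigned to it."""
    template: List[str]
    log_ids: List[int] = field(default_factory=list)


def similarity(template: List[str], tokens: List[str]) -> float:
    """Fraction of positions where the template and the message agree."""
    same = sum(1 for t, tok in zip(template, tokens) if t == tok)
    return same / len(tokens)


def merge(template: List[str], tokens: List[str]) -> List[str]:
    """Keep tokens that agree; replace differing positions with a wildcard."""
    return [t if t == tok else WILDCARD for t, tok in zip(template, tokens)]


class SketchParser:
    """Hypothetical layered parser: length -> first token -> log groups."""

    def __init__(self, threshold: float = 0.5) -> None:
        # Assumption: a fixed threshold; the paper describes a parameter-free method.
        self.threshold = threshold
        self.layers: Dict[int, Dict[str, List[LogGroup]]] = {}

    def parse(self, log_id: int, message: str) -> LogGroup:
        tokens = message.split() or ["<EMPTY>"]
        # Layers 1 and 2: route by token count, then by first token,
        # so only a small set of candidate groups is compared.
        candidates = self.layers.setdefault(len(tokens), {}).setdefault(tokens[0], [])
        best = max(candidates, key=lambda g: similarity(g.template, tokens), default=None)
        if best is not None and similarity(best.template, tokens) >= self.threshold:
            # Dynamic update: generalize the existing template in place.
            best.template = merge(best.template, tokens)
        else:
            # No sufficiently similar group: start a new template.
            best = LogGroup(template=list(tokens))
            candidates.append(best)
        best.log_ids.append(log_id)
        return best


parser = SketchParser()
for i, line in enumerate([
    "Connected to 10.0.0.1 port 80",
    "Connected to 10.0.0.2 port 443",
    "Disk usage at 91 percent",
]):
    group = parser.parse(i, line)
    print(i, " ".join(group.template))
```

Each message is handled as it arrives, and the structure grows or generalizes incrementally, which is the property that distinguishes online parsing from batch parsing. Routing on the first token is only one possible heuristic; it works poorly when messages start with a variable value, which a production parser would need to handle.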

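Experimental Results

Research Questions

RQ1: Can a log parser achieve high accuracy in online log parsing without manual parameter tuning?
RQ2: How does a DAG-based online log parser compare with state-of-the-art online and offline log parsers in accuracy and speed?
RQ3: When used as a preprocessing step, can the proposed parser effectively support downstream system reliability tasks such as anomaly detection?
RQ4: How well does the parser generalize across diverse real-world systems with different log formats and data volumes?
RQ5: How do automatic rule initialization and dynamic updating affect parsing efficiency and accuracy?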

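Key Findings

Drain achieves the highest parsing accuracy on all 11 real-world log datasets, collected from distributed systems, Web applications, supercomputers, operating systems, and standalone software.
Compared with state-of-the-art online log parsers, Drain improves running time by 37.15% to 97.14%.
The parser requires no manual parameter tuning; its rules are initialized automatically from the incoming log stream and updated dynamically.
In the case study, Drain supports effective log-based anomaly detection when used in the parsing step, confirming its practical value for system reliability management.
The DAG-based structure enables efficient log group search and scalable parsing, outperforming tree-based and clustering-based methods in both speed and accuracy.
The source code and all 11 datasets have been publicly released to support reproducibility and future research.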

This review was generated by AI and reviewed by human editors.