[Paper Review] LogFold: Compressing Logs with Structured Tokens and Hybrid Encoding
LogFold introduces a skeleton-aware structured-token analysis and a type-aware hybrid encoding pipeline to compress logs, outperforming state-of-the-art baselines across 16 public datasets.
Logs are essential for diagnosing failures and conducting retrospective studies, leading many software organizations to retain log messages for a long time. Nevertheless, the volume of generated log data grows rapidly as software systems grow, necessitating an effective compression method. Apart from general-purpose compressors (e.g., Gzip, Bzip2), many recent studies have developed log-specific compression algorithms, but these offer suboptimal performance because they (1) overlook redundancies within certain complex tokens, and (2) lack a fine-grained encoding strategy for diverse token types. This work uncovers a new redundancy pattern in structured tokens and proposes a new type-aware encoding strategy to improve log compression. Building on this insight, we introduce LogFold, a novel log compression method consisting of four components: a token analyzer that classifies tokens as structured, unstructured, or static types; a processor that mines recurring patterns within structured tokens based on their delimiter skeletons; a hybrid encoder that tailors data representation according to token types; and a packer that compresses the output into an archive file. Extensive experiments on 16 public log datasets demonstrate that LogFold surpasses state-of-the-art baselines, improving the average compression ratio by 11.11% at a compression speed of 9.842 MB/s. Ablation studies further indicate the importance of each component. We also conduct sensitivity analyses to verify LogFold's robustness and stability across various internal settings.
Motivation & Objective
- Identify redundancies in structured tokens within logs to improve compression.
- Propose a four-component pipeline (token analyzer, structured token processor, hybrid encoder, packer) for efficient log compression.
- Develop a type-aware encoding strategy that tailors encoding to numeric, string, and mixed-type tokens.
- Evaluate LogFold on diverse public log datasets and compare with state-of-the-art log compressors and general-purpose compressors.
Proposed method
- Token Analyzer classifies each token in a log entry as structured, unstructured, or static.
- Structured Token Processor performs Delimiter Skeleton-aware Grouping and Pattern Mining to extract intra-token redundancies.
- Hybrid Encoder applies optimized numeric encoding, dictionary encoding, and mixed-type encoding tailored to token types.
- Packer aggregates intermediate outputs and applies a general-purpose compressor to produce the final archive.
- Decompressor reverses the pipeline to ensure lossless recovery.
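The grouping and encoding steps listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the review does not give the exact definitions, so the field regex, the representation of a skeleton as a tuple of delimiters, and the function names `skeleton`, `group_by_skeleton`, `encode_column`, and `decode_column` are all assumptions made for illustration. Plain delta encoding and a sorted dictionary stand in for the paper's "optimized numeric encoding" and "dictionary encoding".

```python
import re
from collections import defaultdict

# Assumption: a "field" is any run of ASCII letters/digits; everything else is a delimiter.
FIELD = re.compile(r"[A-Za-z0-9]+")

def skeleton(token):
    """Return the delimiters around and between fields, e.g. 'a.b:c' -> ('', '.', ':', '')."""
    return tuple(FIELD.split(token))

def group_by_skeleton(tokens):
    """Group tokens by skeleton and store each field position as its own column."""
    rows = defaultdict(list)
    for tok in tokens:
        rows[skeleton(tok)].append(FIELD.findall(tok))
    # Tokens with the same skeleton have the same field count, so zip() is safe.
    return {sk: [list(col) for col in zip(*group)] for sk, group in rows.items()}

def encode_column(col):
    """Illustrative type-aware encoding: deltas for numeric columns, a dictionary otherwise."""
    # Values with leading zeros are kept as strings so decoding stays lossless.
    if col and all(v.isdigit() and (v == "0" or v[0] != "0") for v in col):
        nums = [int(v) for v in col]
        return ("delta", [nums[0]] + [b - a for a, b in zip(nums, nums[1:])])
    vocab = sorted(set(col))
    index = {v: i for i, v in enumerate(vocab)}
    return ("dict", vocab, [index[v] for v in col])

def decode_column(enc):
    """Invert encode_column."""
    if enc[0] == "delta":
        out, total = [], 0
        for d in enc[1]:
            total += d
            out.append(str(total))
        return out
    _, vocab, ids = enc
    return [vocab[i] for i in ids]

# Hypothetical structured tokens, not taken from the paper's datasets.
tokens = ["10.0.0.1:8080", "10.0.0.2:8081", "blk_123", "10.0.0.5:8085", "blk_130"]
for sk, columns in group_by_skeleton(tokens).items():
    print(sk, [encode_column(c) for c in columns])
```

Storing each field position as a column is what exposes the intra-token redundancy: in this toy input the first three IP octets collapse to constant columns, and the port column becomes a short run of small deltas.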

Experimental results
Research questions
- RQ1: How well does LogFold improve log compression?
- RQ2: How do different components contribute to LogFold’s effectiveness?
- RQ3: How sensitive is LogFold to its internal parameter settings?
- RQ4: How generalizable is LogFold across different zip tools with different compression levels?
- RQ5: How does LogFold perform in log decompression?
Key findings
- LogFold achieves an average compression ratio improvement of 11.11% over state-of-the-art baselines on 16 public datasets.
- LogFold achieves a compression speed of 9.842 MB/s.
- LogFold outperforms nine baseline compressors across the evaluation datasets and attains the best compression on 12 of 16 datasets.
- Ablation studies show the contribution of each component (token analyzer, structured token processor, hybrid encoder, packer).
- Sensitivity analyses confirm LogFold’s robustness and stability across internal settings.
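The headline numbers above rest on three simple quantities. The sketch below spells out the definitions commonly used in log compression work; the review does not confirm that the paper uses exactly these formulas, so treat them as assumptions. In particular, the sketch assumes 1 MB = 10^6 bytes and that "improvement" means relative change in compression ratio over a baseline. The inputs in the usage lines are made up for illustration and are not results from the paper.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size; higher is better."""
    return original_bytes / compressed_bytes

def speed_mb_per_s(original_bytes, seconds):
    """Throughput in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return original_bytes / 1_000_000 / seconds

def relative_improvement(new_ratio, baseline_ratio):
    """Fractional gain of new_ratio over baseline_ratio (0.10 means 10%)."""
    return (new_ratio - baseline_ratio) / baseline_ratio

# Made-up inputs, purely to show how the functions are called.
baseline = compression_ratio(1_000_000, 50_000)
candidate = compression_ratio(1_000_000, 40_000)
print(f"baseline CR {baseline:.1f}, candidate CR {candidate:.1f}")
print(f"relative improvement {relative_improvement(candidate, baseline):.1%}")
print(f"speed {speed_mb_per_s(5_000_000, 2.0):.2f} MB/s")
```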

This review was created by AI and reviewed by human editors.