QUICK REVIEW

[Paper Review] FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation.

Rong Zhu, Zi‐Niu Wu|arXiv (Cornell University)|Jan 1, 2020

Data Management and Algorithms|References: 43|Cited by: 2

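One-Sentence Summary

FLAT is a fast, lightweight, and accurate cardinality estimation method built on FSPN, a novel unsupervised graphical model that combines independent and conditional factorization to model attribute correlations adaptively. It computes probabilities in near-linear time, cuts storage cost by 1 to 2 orders of magnitude, and improves estimation accuracy by 1 to 5 orders of magnitude over existing methods; integrated into Postgres, it improves query execution time by 12.9% on the benchmark workload.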

ABSTRACT

Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either oversimplify the model by using only independent factorization, which leads to inaccurate estimates, or overcomplicate it with lossless conditional factorization without any independence assumption, which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size, and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model called FSPN. It uses both independent and conditional factorization to adaptively model different levels of attribute correlation, and thus combines their advantages. FLAT supports efficient online probability computation in near-linear time on the underlying FSPN model, provides effective offline model construction, and enables incremental model updates. It can estimate cardinality for both single-table queries and multi-table join queries. An extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods on the well-known IMDB benchmarks: FLAT achieves 1 to 5 orders of magnitude better accuracy, 1 to 3 orders of magnitude faster probability computation, and 1 to 2 orders of magnitude lower storage cost. We also integrate FLAT into Postgres to perform an end-to-end test. It improves query execution time by 12.9% on the benchmark workload, which is very close to the optimal result of 14.2% obtained with the true cardinality.
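The two factorizations contrasted in the abstract can be written out as a short worked formula. The notation here is an assumption made for illustration and is not taken from the paper: T is a table, Q a query over attributes A_1, ..., A_n, and P the joint distribution of those attributes.

```latex
\begin{align*}
\text{Card}(Q) &= |T| \cdot P(Q) \\
\text{independent:}\quad P(A_1, \dots, A_n) &\approx \prod_{i=1}^{n} P(A_i) \\
\text{conditional:}\quad P(A_1, \dots, A_n) &= \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1})
\end{align*}
```

The independent form is cheap but only approximate when attributes are correlated. The conditional form is exact by the chain rule, but each factor may depend on many attributes, which is what makes it expensive to store and evaluate.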

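Motivation and Objectives

Resolve the trade-off among model accuracy, computation speed, and storage cost in cardinality estimation.
Develop a method that models complex attribute correlations efficiently without relying entirely on either independence assumptions or lossless conditional factorization.
Enable fast online probability computation and incremental model updates to suit dynamic database environments.
Support both single-table and multi-table join queries while keeping cardinality estimates highly accurate.
Integrate with a real database system and demonstrate end-to-end performance gains.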

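Proposed Method

FLAT introduces FSPN, a novel unsupervised graphical model that combines independent and conditional factorization to adaptively capture different levels of attribute correlation.
It uses a hybrid factorization strategy: independent factorization for weakly correlated attributes and conditional factorization for strongly correlated ones, balancing accuracy against efficiency.
An inference algorithm tailored to the FSPN structure computes probabilities in near-linear time.
Offline model construction uses unsupervised learning over database statistics, which makes initialization efficient.
Incremental updates let the model adapt to schema or data changes without retraining from scratch.
The method has been integrated into Postgres for end-to-end evaluation, demonstrating that it is practical to deploy.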
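The hybrid strategy described above can be illustrated with a minimal Python sketch on a toy two-column table. This is not the FSPN model or FLAT's actual algorithm, whose structure this review does not describe. The toy data, the total-variation dependence score, the `threshold` value, and all function names are hypothetical choices made only to show why a strongly correlated attribute pair needs conditional rather than independent factorization.

```python
from collections import Counter
from itertools import product

# Hypothetical toy table: columns X and Y are perfectly correlated.
rows = [("a", 1)] * 50 + [("b", 2)] * 50
n = len(rows)


def true_card(x, y):
    """Exact number of rows with X = x and Y = y."""
    return sum(1 for r in rows if r == (x, y))


def independent_estimate(x, y):
    """Estimate |T| * P(X=x) * P(Y=y), assuming X and Y are independent."""
    px = sum(1 for r in rows if r[0] == x) / n
    py = sum(1 for r in rows if r[1] == y) / n
    return n * px * py


def conditional_estimate(x, y):
    """Estimate |T| * P(X=x) * P(Y=y | X=x), which is exact for two columns."""
    matching_x = [r for r in rows if r[0] == x]
    if not matching_x:
        return 0.0
    px = len(matching_x) / n
    py_given_x = sum(1 for r in matching_x if r[1] == y) / len(matching_x)
    return n * px * py_given_x


def dependence():
    """Total variation distance between the joint and the product of marginals.

    An assumed stand-in for a correlation test; 0 means independent.
    """
    joint = Counter(rows)
    mx = Counter(r[0] for r in rows)
    my = Counter(r[1] for r in rows)
    return 0.5 * sum(
        abs(joint[(x, y)] / n - (mx[x] / n) * (my[y] / n))
        for x, y in product(mx, my)
    )


def hybrid_estimate(x, y, threshold=0.1):
    """Use the cheap independent form only when the columns look independent."""
    if dependence() < threshold:
        return independent_estimate(x, y)
    return conditional_estimate(x, y)


print(true_card("a", 1), independent_estimate("a", 1), hybrid_estimate("a", 1))
```

On this table the independent estimate for X = "a" and Y = 1 is half the true count, while the conditional and hybrid estimates match it, because the dependence score is far above the assumed threshold.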

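Experimental Results

Research Questions

RQ1: Can a hybrid model that combines independent and conditional factorization be more accurate than purely independent or fully conditional models?
RQ2: Can such a model reduce storage overhead while keeping probability computation fast?
RQ3: How do the method's accuracy, speed, and memory usage scale across different query workloads?
RQ4: How much does FLAT improve query execution time when integrated into a real DBMS such as Postgres?
RQ5: How does the resulting plan quality compare with that obtained from true cardinalities?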

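Key Findings

On the IMDB benchmarks, FLAT's cardinality estimates are 1 to 5 orders of magnitude more accurate than those of existing methods.
Probability computation is 1 to 3 orders of magnitude faster than with prior methods.
Storage cost is 1 to 2 orders of magnitude lower, making the model more lightweight.
With FLAT integrated into Postgres, query execution time improves by 12.9%, close to the optimal 14.2% achieved with true cardinalities.
The method delivers consistent gains on both single-table and multi-table join queries.
The incremental update mechanism lets the model adapt efficiently to data and schema changes without retraining from scratch.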
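To put the reported figures in perspective, the short Python sketch below does the arithmetic. It uses only the numbers quoted above; the variable and function names are illustrative, and reading the two percentages as a ratio is an interpretation rather than a result stated in the paper.

```python
# Reported end-to-end improvements in query execution time (percent).
flat_gain = 12.9      # with FLAT's estimates
optimal_gain = 14.2   # with true cardinalities

# Share of the achievable improvement that FLAT captures.
share = flat_gain / optimal_gain
print(f"{share:.1%}")


def orders_to_factor(k):
    """Convert k orders of magnitude into a multiplicative factor."""
    return 10 ** k


# "1 to 5 orders of magnitude" spans factors from 10x to 100,000x.
print(orders_to_factor(1), orders_to_factor(5))
```

Read this way, FLAT recovers roughly nine tenths of the improvement that perfect cardinalities would give on this workload.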

This review was generated by AI and checked by a human editor.