QUICK REVIEW

[Paper Review] Database Learning: Toward a Database that Becomes Smarter Every Time

Yongjoo Park, Ahmad Shahab Tajik|arXiv (Cornell University)|Mar 16, 2017

Advanced Database Systems and Queries|References 76|Cited by 35

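One-Sentence Summary

This paper proposes Database Learning (DBL), a framework that learns from past approximate query answers so that a database becomes faster and more accurate over time. Based on the principle of maximum entropy, DBL builds and continually refines a statistical model of the underlying data distribution, allowing Verdict, a query engine built on Spark SQL, to answer new queries more efficiently from smaller samples, with speedups of up to 23.0× over existing AQP systems at the same accuracy level.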

ABSTRACT

In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.

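Motivation and Objectives

To address the inefficiency of traditional databases, where the work done for past queries is discarded rather than reused.
To let the database learn from approximate query answers and continually refine its understanding of the underlying data distribution.
To reduce the response time of future queries by using statistical models derived from past query answers.
To provide systematic, accurate, and fast approximate answers for a broad range of analytical SQL queries.
To build a system whose accuracy and performance keep improving as it processes more queries.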

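Proposed Method

The system applies the principle of maximum entropy to infer the most likely data distribution consistent with past query answers, so that its answers are, in expectation, more accurate than sample-based approximations alone.
Each query is expressed as a set of linear constraints on the underlying data distribution, which allows analytical inference without an iterative solver.
Overlapping query ranges in multidimensional space are modeled with O(n) variables, avoiding the exponential blow-up of conventional approaches.
Verdict, a query engine implemented on top of Spark SQL, integrates these models and answers new queries by combining model-based inference with minimal sampling.
The system updates its model after every query, progressively tightening confidence intervals and improving accuracy over time.
It supports complex multi-column aggregates and overlapping range queries without requiring strict query containment or materialized views.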
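
The idea of answering a new query from a model plus a small sample can be sketched with a toy example. The code below is not Verdict's implementation. It assumes, purely for illustration, that the true answers of one past query and one new query are jointly Gaussian (the maximum-entropy distribution for a given mean and covariance), and that each raw sample-based answer equals the true answer plus independent sampling noise. The function name `infer_new_answer` and all numbers are hypothetical.

```python
def infer_new_answer(mu, cov, raw, raw_var):
    """Closed-form Gaussian conditioning for two queries (illustrative only).

    mu      -- (prior mean of past answer, prior mean of new answer)
    cov     -- 2x2 prior covariance of the two TRUE answers
    raw     -- (raw sample-based past answer, raw sample-based new answer)
    raw_var -- sampling-error variance of each raw answer
    Returns the improved (mean, variance) for the new query's true answer.
    """
    # Covariance of the observed raw answers: prior covariance plus
    # independent sampling noise on the diagonal.
    a = cov[0][0] + raw_var[0]
    b = cov[0][1]
    d = cov[1][1] + raw_var[1]
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]

    # Covariance between the true new answer and each raw observation.
    k = (cov[0][1], cov[1][1])

    # Weights w = k^T * inv: how much each residual shifts the estimate.
    w = (k[0] * inv[0][0] + k[1] * inv[1][0],
         k[0] * inv[0][1] + k[1] * inv[1][1])

    resid = (raw[0] - mu[0], raw[1] - mu[1])
    mean = mu[1] + w[0] * resid[0] + w[1] * resid[1]
    var = cov[1][1] - (w[0] * k[0] + w[1] * k[1])
    return mean, var


if __name__ == "__main__":
    # Hypothetical numbers: the past query was answered precisely (variance 1),
    # the new query only from a small sample (variance 16).
    mean, var = infer_new_answer(
        mu=(100.0, 100.0),
        cov=[[25.0, 20.0], [20.0, 25.0]],
        raw=(110.0, 104.0),
        raw_var=(1.0, 16.0),
    )
    print("improved answer:", mean, "variance:", var)
```

Under these assumptions the improved variance is never larger than the raw variance of the new query, and no iterative solver is involved: the whole update is a small linear-algebra computation.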

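Experimental Results

Research Questions

RQ1: Can past approximate query answers be used systematically to improve the performance and accuracy of future queries?
RQ2: How can a database learn from query answers so that it relies less on full table scans over time?
RQ3: Which statistical framework enables efficient, scalable, and accurate model refinement from overlapping query answers?
RQ4: How far can a model-based approach outperform conventional sampling-based AQP in speed and accuracy?
RQ5: Can a database system achieve continuous performance improvement by learning incrementally from its query workload?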

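Key Findings

Verdict supports 73.7% of the real-world analytical queries in the query log of a large enterprise customer.
At the same accuracy level, Verdict achieves speedups of up to 23.0× over an AQP system based on online aggregation.
As new queries are processed, the accuracy of the system's model keeps improving, which gradually reduces its reliance on data sampling.
The maximum-entropy approach enables analytical inference with O(n) variables, avoiding the O(2^n) complexity of prior approaches.
The framework captures correlations in the data naturally, without assuming smoothness or structure in advance.
The approach outperforms existing techniques such as materialized views and COSMOS, supporting overlapping multi-column queries without requiring strict containment.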
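
Why a better model means less sampling can be seen with some back-of-the-envelope arithmetic. The sketch below is a simplification, not the paper's analysis. It assumes that the model's estimate and the sample mean are independent, so their precisions (1/variance) add, and that every past query contributes the same fixed amount of precision. Real overlapping queries are correlated and would not behave this cleanly. The names `rows_needed` and `rows_needed_over_time` are hypothetical.

```python
import math


def rows_needed(target_var, row_var, model_precision=0.0):
    """Rows to sample so the combined estimate reaches target_var.

    A sample mean over n rows has variance row_var / n, i.e. precision
    n / row_var. With an independent model estimate, precisions add, so
    we need n / row_var + model_precision >= 1 / target_var.
    """
    missing = 1.0 / target_var - model_precision
    return max(0, math.ceil(row_var * missing))


def rows_needed_over_time(target_var, row_var, prior_precision,
                          per_query_precision, num_queries):
    """Rows needed after 0, 1, ..., num_queries past queries were learned."""
    return [
        rows_needed(target_var, row_var,
                    prior_precision + k * per_query_precision)
        for k in range(num_queries + 1)
    ]


if __name__ == "__main__":
    # Hypothetical workload: target variance 0.25, per-row variance 400.
    for k, n in enumerate(rows_needed_over_time(0.25, 400.0, 0.25, 0.25, 15)):
        print(f"after {k:2d} past queries: sample {n} rows")
```

In this toy setting the required sample shrinks steadily as past answers accumulate, and reaches zero once the model alone meets the accuracy target.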

This review was generated by AI and checked by a human editor.