QUICK REVIEW

[Paper Digest] Semantic Operators: A Declarative Model for Rich, AI-based Data Processing

Liana Patel, S. N. Jha|arXiv (Cornell University)|Jul 16, 2024

Semantic Web and Ontologies | Cited by 9

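One-Sentence Summary

Semantic Operators introduces a declarative programming model, implemented in LOTUS, that extends the relational model with AI-based semantic operators for bulk processing of structured and unstructured data. It is demonstrated on fact-checking, extreme multi-label classification, and search tasks.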

ABSTRACT

The semantic capabilities of large language models (LLMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems either empirically optimize expensive LLM-powered operations with no performance guarantees, or serve a limited set of row-wise LLM operations, providing limited robustness, expressiveness and usability. We introduce semantic operators, the first formalism for declarative and general-purpose AI-based transformations based on natural language specifications (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator opens a rich space for execution plans, similar to relational operators. Our model specifies the expected behavior of each operator with a high-quality gold algorithm, and we develop an optimization framework that reduces cost, while providing accuracy guarantees with respect to a gold algorithm. Using this approach, we propose several novel optimizations to accelerate semantic filtering, joining, group-by and top-k operations by up to $1,000\times$. We implement semantic operators in the LOTUS system and demonstrate LOTUS' effectiveness on real, bulk-semantic processing applications, including fact-checking, biomedical multi-label classification, search, and topic analysis. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that match or exceed quality of recent LLM-based analytic systems by up to $170\%$, while offering accuracy guarantees. Overall, LOTUS programs match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to $3.6\times$ faster than the highest-quality baselines. LOTUS is publicly available at https://github.com/lotus-data/lotus.

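Research Motivation and Goals

Motivation: address the need for bulk semantic processing that goes beyond conventional RAG and LM-UDF approaches.
Define a declarative programming interface (semantic operators) that extends the relational model to support AI-driven data tasks.
Demonstrate the expressiveness and optimization capabilities of LOTUS across several applications (fact-checking, multi-label classification, search).
Show that semantic operators enable high-quality pipelines and greater efficiency while lowering development cost.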

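Proposed Method

Introduce semantic operators (sem_filter, sem_join, sem_sim_join, sem_agg, sem_topk, sem_map, sem_extract, sem_cluster_by, sem_search, sem_index, load_sem_index) as extensible, language-based primitives that run over structured and unstructured data.
Provide a Pandas-like API implementation in LOTUS, and describe how parameterized natural language expressions (langex) specify AI-driven predicates, aggregations, and projections.
Describe the optimizer and execution strategies, which use parallel batched inference, model cascades, semantic indexing, and algorithmic approximations for expensive operators.
Explain the data model for tables that contain both structured fields and natural-language text, and the use of semantic similarity indexes for efficient querying.
Outline integration with existing LM tooling (vLLM, FAISS) and the ability to reuse or adapt prompts across operators.
Give example programs showing how multiple operators compose into complex AI-driven pipelines.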
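The two ideas above that are easiest to misread are the langex (a natural-language template whose placeholders name table columns) and the model cascade (a cheap model handles confident cases and defers the rest to an expensive one). The following is a minimal, standard-library sketch of both, not the LOTUS implementation: `render_langex`, `toy_sem_filter`, the keyword-based `proxy` and `oracle` stand-ins, and the fixed thresholds `low` and `high` are all hypothetical. In particular, the abstract says LOTUS's optimizations come with accuracy guarantees relative to a gold algorithm, whereas the thresholds here are simply hard-coded.

```python
from string import Formatter


def render_langex(langex, row):
    """Fill the {column} placeholders of a langex with values from one row."""
    fields = [name for _, name, _, _ in Formatter().parse(langex) if name]
    return langex.format(**{field: row[field] for field in fields})


def toy_sem_filter(rows, langex, proxy, oracle, low=0.2, high=0.8):
    """Keep rows for which the langex holds, using a two-model cascade.

    proxy(prompt)  -> float in [0, 1], a cheap confidence score (hypothetical).
    oracle(prompt) -> bool, an expensive but trusted judgement (hypothetical).
    Returns the kept rows and the number of oracle calls that were needed.
    """
    kept, oracle_calls = [], 0
    for row in rows:
        prompt = render_langex(langex, row)
        score = proxy(prompt)
        if score >= high:
            kept.append(row)            # proxy is confident: accept
        elif score > low:
            oracle_calls += 1           # proxy is unsure: defer to the oracle
            if oracle(prompt):
                kept.append(row)
        # score <= low: proxy is confident the predicate fails, so reject


    return kept, oracle_calls


# Keyword-based stand-ins so the sketch runs without any language model.
def proxy(prompt):
    text = prompt.lower()
    if "transformer" in text:
        return 0.9
    if "inference" in text:
        return 0.5
    return 0.1


def oracle(prompt):
    return "inference" in prompt.lower()


rows = [
    {"title": "A", "abstract": "We speed up transformer inference with paged attention."},
    {"title": "B", "abstract": "A study of batching strategies for inference servers."},
    {"title": "C", "abstract": "Soil moisture measurements in alpine meadows."},
]
langex = "The abstract '{abstract}' is about serving neural networks."

kept, oracle_calls = toy_sem_filter(rows, langex, proxy, oracle)
print([row["title"] for row in kept], oracle_calls)
```

In this toy run only row B falls between the two thresholds, so the expensive oracle is called once instead of three times. That saving is the point of a cascade; how the thresholds are chosen is where a real system has to earn its accuracy guarantee.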

Figure 1. Accuracy versus execution time (log-scale) for 3 short LOTUS programs, shown as Program A, B, and C, which implement distinct query pipelines for fact-checking on the FEVER (Thorne et al., 2018) dataset. The blue circles show the performance of these un-optimized programs, and the blue

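Experimental Results

Research Questions

RQ1: Can semantic operators offer a scalable and expressive alternative for bulk semantic processing that goes beyond ad hoc RAG pipelines?
RQ2: How does the declarative LOTUS model enable efficient composition of AI-driven operations over mixed data types?
RQ3: Which optimizations and algorithms best balance accuracy against execution time for semantic operators?
RQ4: To what extent can LOTUS reproduce or surpass state-of-the-art pipelines on fact-checking, extreme multi-label classification, and search tasks?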

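Key Findings

LOTUS reproduces and improves on a state-of-the-art fact-checking pipeline (FEVER) in fewer lines of code, with substantially lower execution time.
On FEVER, optimized LOTUS programs achieve higher accuracy than their unoptimized counterparts and FacTool, with speedups of 7–34×.
LOTUS's join-based algorithm for extreme multi-label classification runs up to 800× faster than a naive join while matching state-of-the-art result quality.
For search and ranking, LOTUS compositions score 5.9–49.4% higher on nDCG@10 than plain retrieval and re-ranking setups, with execution time 1.67–10× lower than LM-based ranking methods.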
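For readers unfamiliar with the ranking metric quoted above, nDCG@10 compares the discounted gain of the top 10 returned results against the gain of the best possible ordering. The sketch below uses the common linear-gain form with a log2 rank discount. This digest does not say which gain variant the paper uses, so treat the exact formula as an assumption; the function names are illustrative only.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    `relevances` lists graded relevance labels in the order the system
    returned the documents; the ideal ordering is taken over this same list.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1])   # already in the ideal order
swapped = ndcg_at_k([1, 3])      # the best document is ranked second
print(perfect, round(swapped, 3))
```

A score of 1.0 means the returned order is already ideal; placing a highly relevant document lower in the list pulls the score below 1.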

Figure 3. Table schema of ArXiv papers.


This digest was generated by AI and reviewed by a human editor.