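[Paper Walkthrough] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG
Summary: The paper proposes Adaptive Chunking, which uses intrinsic metrics to select the best chunking strategy for each document. Without changing models or prompts, this raises RAG answer correctness to 72% (from 62-64%) and increases the number of successfully answered questions from 49 to 65.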
The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at https://github.com/ekimetrics/adaptive-chunking.
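Motivation and Goals
- Motivate the need for document-aware chunking in Retrieval-Augmented Generation (RAG).
- Define intrinsic metrics that evaluate chunking quality independently of any downstream task.
- Propose a framework that adaptively selects a chunking strategy for each document.
- Develop new chunking methods and post-processing techniques that support adaptive chunking.
- Demonstrate improved RAG performance on a cross-domain dataset.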
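Proposed Method
- Introduce five intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC).
- Develop two new chunkers: an LLM-regex splitter and a split-then-merge recursive splitter.
- Use these metrics to guide the choice of chunking strategy for each document (the adaptive framework).
- Apply targeted post-processing techniques to improve chunk quality.
- Evaluate on a diverse corpus spanning legal, technical, and social science domains.
- Report downstream RAG improvements obtained without changing models or prompts.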
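The per-document selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two candidate chunkers, the `size_compliance` function (a toy stand-in for SC), the length bounds, and the choice to aggregate metric scores with an unweighted mean are all assumptions made for the example. The paper's actual metric definitions and selection rule are in its repository.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[str], List[str]]
Metric = Callable[[str, List[str]], float]  # (document, chunks) -> score in [0, 1]


def fixed_size_chunker(text: str, size: int = 200) -> List[str]:
    """Hypothetical baseline candidate: cut the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunker(text: str) -> List[str]:
    """Hypothetical second candidate: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def size_compliance(text: str, chunks: List[str], lo: int = 50, hi: int = 400) -> float:
    """Toy stand-in for SC (assumed definition): share of chunks with length in [lo, hi]."""
    if not chunks:
        return 0.0
    return sum(lo <= len(c) <= hi for c in chunks) / len(chunks)


def select_chunker(
    text: str,
    chunkers: Dict[str, Chunker],
    metrics: Dict[str, Metric],
) -> Tuple[str, Dict[str, float]]:
    """Chunk one document with every candidate, score each result, return the best name and all scores."""
    scores: Dict[str, float] = {}
    for name, chunker in chunkers.items():
        chunks = chunker(text)
        # Assumption: metrics are combined with an unweighted mean.
        scores[name] = mean(metric(text, chunks) for metric in metrics.values())
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `select_chunker(document, {"fixed": fixed_size_chunker, "paragraph": paragraph_chunker}, {"SC": size_compliance})` returns the name of the highest-scoring chunker for that document along with every candidate's score. Further metrics can be added to the `metrics` dictionary without changing the selection logic.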
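Experimental Results
Research Questions
- RQ1: Can intrinsic, document-based chunking metrics reliably guide chunking-strategy selection for RAG?
- RQ2: Does an adaptive, metric-guided chunking method improve RAG correctness and question-answering performance across domains?
- RQ3: How do the new chunkers (the LLM-regex splitter and the split-then-merge recursive splitter) affect chunking quality and downstream retrieval?
- RQ4: How well do the proposed metrics correlate with downstream RAG performance?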
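For readers unfamiliar with the split-then-merge idea named in RQ3, the sketch below shows one generic way such a splitter can work: split recursively on progressively finer separators until every piece fits a length budget, then greedily merge adjacent pieces back together up to that budget. The separator list, the `max_len` default, and the greedy merge rule are assumptions for illustration and may differ from the paper's chunker.

```python
from typing import List, Sequence


def split_then_merge(
    text: str,
    max_len: int = 300,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Recursively split on coarse-to-fine separators, then greedily merge small pieces."""

    def split(piece: str, seps: Sequence[str]) -> List[str]:
        if len(piece) <= max_len:
            return [piece]
        if not seps:
            # No separator left: fall back to a hard cut every max_len characters.
            return [piece[i:i + max_len] for i in range(0, len(piece), max_len)]
        head, rest = seps[0], seps[1:]
        parts = piece.split(head)
        # Re-attach the separator to every part but the last so no text is lost.
        parts = [p + head for p in parts[:-1]] + [parts[-1]]
        out: List[str] = []
        for p in parts:
            if p:
                out.extend(split(p, rest))
        return out

    merged: List[str] = []
    for piece in split(text, separators):
        # Merge with the previous chunk while the combined length stays within budget.
        if merged and len(merged[-1]) + len(piece) <= max_len:
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged
```

Because separators are re-attached during splitting, concatenating the returned chunks reproduces the input text exactly, and no chunk exceeds `max_len` characters.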
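Main Findings
- RAG performance improves when the metric-guided adaptive chunking framework is used.
- With adaptive chunking, answer correctness rises to 72%, compared with 62-64% without it.
- The number of successfully answered questions increases by over 30% (65 vs. 49).
- The improvement holds across legal, technical, and social science texts.
- Two new chunkers, combined with targeted post-processing, are introduced to support adaptive chunking.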
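The headline figures can be checked with simple arithmetic. The snippet below uses only the numbers reported in the abstract; the variable names are illustrative.

```python
answered_adaptive = 65  # questions answered with adaptive chunking (from the abstract)
answered_baseline = 49  # questions answered by the baseline (from the abstract)

# Relative increase in successfully answered questions.
relative_gain = (answered_adaptive - answered_baseline) / answered_baseline
print(f"{relative_gain:.1%}")  # prints 32.7%

# Answer correctness: 72% versus a 62-64% baseline range, in percentage points.
correctness_gain_points = (72 - 64, 72 - 62)
print(correctness_gain_points)  # prints (8, 10)
```

The gain of 16 questions over a baseline of 49 is about 32.7%, consistent with the "over 30%" claim, and the correctness improvement is 8 to 10 percentage points.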
This walkthrough was generated by AI and reviewed by a human editor.