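[Paper Digest] Reducing hallucination in structured outputs via Retrieval-Augmented Generation
This paper applies retrieval-augmented generation (RAG) to a structured-output task (natural language to JSON workflows) in order to reduce hallucination and to allow deployment with a smaller LLM and a very small retriever. The study shows that RAG significantly reduces the number of hallucinated steps and tables and supports out-of-domain generalization.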
A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLMs) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.
Motivation and Goals
- Demonstrate that RAG can reduce hallucination in structured output generation (natural language to workflow JSON).
- Show that a small, well-trained retriever with a modest LLM can achieve competitive performance.
- Illustrate deployment benefits, including reduced model size and modular architecture for enterprise use.
Proposed Method
- Train a domain-specific retriever encoder to map natural language to existing workflow steps and database tables.
- Build a FAISS index of steps and tables and use cosine similarity to retrieve top candidates.
- Fine-tune a small encoder and LLM separately (LoRA) in a RAG setup where the retriever output is prepended to the LLM prompt.
- Use contrastive loss to train the retriever with positive and negative pairs (including BM25/ANCE negatives).
- Evaluate with Trigger Exact Match, Bag of Steps, and Hallucination metrics on in-domain and out-of-domain splits.
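The retrieve-then-prepend flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a brute-force cosine-similarity search in plain Python in place of a FAISS index, the two-dimensional embeddings are toy values standing in for the output of a trained encoder, and the step names, table names, and prompt layout are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_vec, index, k):
    """Return the k item names whose embeddings are most similar to the query.

    `index` maps an item name (a step or a table) to its embedding.
    A real system would query an approximate-nearest-neighbour index instead
    of sorting every entry.
    """
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

def build_prompt(requirement, steps, tables):
    """Prepend the retrieved candidates to the user requirement.

    The exact prompt layout is an assumption made for this sketch.
    """
    return "\n".join([
        "Suggested steps: " + ", ".join(steps),
        "Suggested tables: " + ", ".join(tables),
        "Requirement: " + requirement,
    ])

# Toy embeddings (hypothetical); a trained retriever encoder would produce these.
step_index = {
    "send_email": [0.9, 0.1],
    "create_record": [0.2, 0.8],
    "look_up_record": [0.6, 0.6],
}
table_index = {
    "incident": [0.3, 0.9],
    "user": [0.8, 0.2],
}

query = [1.0, 0.2]  # stand-in for the encoded natural language requirement
steps = retrieve_top_k(query, step_index, k=2)
tables = retrieve_top_k(query, table_index, k=1)
prompt = build_prompt("Email the assignee when an incident is created", steps, tables)
```

Because the LLM only sees candidates that actually exist in the index, it has less reason to invent step or table names, which is the mechanism the paper credits for the drop in hallucination.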
Experimental Results
Research Questions
- RQ1: Does retrieval-augmented generation reduce hallucinations in generating structured workflow JSON?
- RQ2: Can a small retriever plus a modest LLM match or exceed larger models without retrieval in this task?
- RQ3: How well does the RAG approach generalize to out-of-domain (OOD) deployments without retraining?
- RQ4: What is the impact of different negative sampling strategies on retriever performance?
- RQ5: What are the practical deployment considerations (latency, scalability) for a production system using RAG?
Key Findings
| Model | Trigger EM | Bag of Steps (BofS) | Hallucinated Steps (HS) | Hallucinated Tables (HT) |
|---|---|---|---|---|
| No retriever, StarCoderBase-1B | 0.580 | 0.645 | 0.157 | 0.192 |
| No retriever, StarCoderBase-3B | 0.551 | 0.648 | 0.140 | 0.214 |
| No retriever, StarCoderBase-7B | 0.547 | 0.669 | 0.137 | 0.206 |
| No retriever, StarCoderBase (15.5B) | 0.632 | 0.662 | 0.160 | 0.194 |
| With retriever, StarCoderBase-1B | 0.591 | 0.619 | 0.072 | 0.044 |
| With retriever, StarCoderBase-3B | 0.615 | 0.641 | 0.017 | 0.030 |
| With retriever, StarCoderBase-7B | 0.664 | 0.672 | 0.019 | 0.042 |
| With retriever, StarCoderBase (15.5B) | 0.667 | 0.667 | 0.040 | 0.016 |
| With retriever, CodeLlama-7B | 0.623 | 0.617 | 0.039 | 0.108 |
| With retriever, Mistral-7B-v0.1 | 0.596 | 0.617 | 0.049 | 0.045 |
- RAG reduces hallucinated steps to below 7.5% and hallucinated tables to below 4.5% on the Human Eval split with StarCoderBase variants.
- Without a retriever, hallucination reaches about 16% for steps and about 21% for tables, indicating a strong benefit from retrieval.
- A 7B-parameter RAG model offers the best trade-off between performance and compute: the 15.5B model adds only marginal gains (Trigger EM 0.667 vs. 0.664) while being more expensive to deploy.
- Fine-tuning the smallest encoder (110M) is insufficient; however, all-mpnet-base-v2 with proper fine-tuning achieves strong recall in retrieval (Recall@15 for steps up to 0.743 and Recall@10 for tables up to 0.766).
- RAG-equipped StarCoderBase-7B matches or surpasses several larger LLMs in Trigger EM and Bag of Steps while maintaining lower hallucination.
- In OOD evaluation, average performance with the retriever is comparable to in-domain results, indicating good generalization without retraining.
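To make the reported numbers easier to interpret, the metrics can be sketched as small functions. The definitions below are one plausible reading of the metric names, not the paper's exact formulas: in particular, the Bag of Steps normalization (overlap divided by the longer of the two step lists) is an assumption, and all function names and sample values are hypothetical.

```python
from collections import Counter

def trigger_exact_match(generated_trigger, expected_trigger):
    """1.0 if the generated trigger equals the expected one, else 0.0."""
    return 1.0 if generated_trigger == expected_trigger else 0.0

def bag_of_steps(generated_steps, expected_steps):
    """Order-agnostic overlap between generated and expected steps.

    Assumption: the multiset overlap is divided by the longer list, so both
    missing and extra steps lower the score.
    """
    if not generated_steps and not expected_steps:
        return 1.0
    overlap = Counter(generated_steps) & Counter(expected_steps)
    return sum(overlap.values()) / max(len(generated_steps), len(expected_steps))

def hallucination_rate(generated_items, known_items):
    """Fraction of generated steps (or tables) that do not exist in the catalog."""
    if not generated_items:
        return 0.0
    missing = sum(1 for item in generated_items if item not in known_items)
    return missing / len(generated_items)

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear among the top-k retrieved items."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

# Hypothetical example: one generated workflow scored against a reference.
known_steps = {"send_email", "create_record", "look_up_record"}
generated = ["look_up_record", "send_email", "notify_slack"]  # last step is invented
expected = ["send_email", "look_up_record", "create_record"]

em = trigger_exact_match("record_created", "record_created")
bofs = bag_of_steps(generated, expected)
hs = hallucination_rate(generated, known_steps)
r_at_2 = recall_at_k(["send_email", "create_record", "look_up_record"], set(expected), k=2)
```

Under this reading, a Hallucinated Steps value of 0.019 means that roughly 2 in every 100 generated steps name something that does not exist in the deployment's catalog.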
This digest was generated by AI and reviewed by human editors.