QUICK REVIEW

[Paper Review] RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

Zihao Wang, Anji Liu|arXiv (Cornell University)|Mar 8, 2024

Context-Aware Activity Recognition Systems | Cited by 10

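One-Sentence Summary

RAT iteratively improves long-horizon reasoning by retrieving relevant information to revise each chain-of-thought step in turn, improving factuality and performance in code generation, mathematics, embodied planning, and creative writing.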

ABSTRACT

We explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while hugely mitigating hallucination. In particular, the proposed method -- *retrieval-augmented thoughts* (RAT) -- revises each thought step one by one with retrieved information relevant to the task query, the current and the past thought steps, after the initial zero-shot CoT is generated. Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performances on various long-horizon generation tasks; on average of relatively increasing rating scores by 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning. The demo page can be found at https://craftjarvis.github.io/RAT

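Research Motivation and Objectives

Reduce hallucination in long-horizon generation by combining retrieval with iterative thought revision.
Develop a zero-shot prompting pipeline (RAT) that uses retrieved information to revise each thought step.
Evaluate RAT on diverse tasks (code generation, mathematical reasoning, embodied planning, creative writing) and on multiple base large language models.
Analyze ablations to understand how the retrieval strategy, and causal versus non-causal reasoning, affect performance.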

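Proposed Method

Generate initial zero-shot step-by-step thoughts from the task prompt.
Iteratively revise each thought step using passages retrieved from an external knowledge base.
Build the retrieval query from the task prompt, the current thought step, and the previously revised steps.
Revise the current thought with the retrieved information and supplement the next thought, until every step has been revised.
Use task-specific knowledge sources (e.g., code datasets, the Minecraft Wiki, web retrieval) and embeddings (text-embedding-ada-002) to support retrieval.
Operate in a causal, progressive manner, revising thoughts step by step to improve accuracy without making large changes to earlier steps.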
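The revision loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rat`, `overlap_retrieve`, `stub_draft`, and `stub_revise` are hypothetical, the word-overlap retriever stands in for embedding search, and the stub functions stand in for LLM calls. The sketch only revises the current step and does not regenerate later steps.

```python
from typing import Callable, List


def overlap_retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query.

    A stand-in for embedding-based search; it is not the paper's retriever.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rat(
    task: str,
    draft: Callable[[str], List[str]],
    revise: Callable[[str, List[str], str, List[str]], str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Revise a drafted chain of thoughts one step at a time."""
    thoughts = draft(task)  # initial zero-shot step-by-step draft
    revised: List[str] = []
    for step in thoughts:
        # Causal query: the task, the steps already revised, and the current step.
        query = " ".join([task, *revised, step])
        passages = retrieve(query)
        # Earlier revised steps are passed as context and are not modified.
        revised.append(revise(task, revised, step, passages))
    return revised


# Hypothetical corpus and stubs, used only to exercise the loop.
corpus = [
    "Crafting a pickaxe needs sticks and planks",
    "Bread is made from wheat",
]


def stub_draft(task: str) -> List[str]:
    return ["get wood", "craft pickaxe"]


def stub_revise(task: str, earlier: List[str], step: str, passages: List[str]) -> str:
    # A real system would prompt an LLM here; the stub just attaches the evidence.
    return f"{step} (evidence: {passages[0]})"


revised_steps = rat(
    "make pickaxe",
    stub_draft,
    stub_revise,
    lambda q: overlap_retrieve(q, corpus, k=1),
)
for line in revised_steps:
    print(line)
```

Because each query is built only from the task and the steps revised so far, a correction made at an early step can influence what is retrieved for later steps, which is one way to read the "causal" behaviour described above.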

Figure 1: Pipeline of RAT. Given a task prompt (denoted as $\mathit{I}$ in the figure), RAT starts from initial step-by-step thoughts ($T_{1},T_{2},\cdots,T_{n}$) produced by an LLM in zero-shot (“let’s think step by step”). Some thought steps (such as $T_{1}$ in the figure) may be flawed due to

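Experimental Results

Research Questions

RQ1: Do retrieval-augmented thoughts improve factuality and reduce hallucination in long-horizon generation?
RQ2: How does iterative, step-by-step retrieval affect the quality of intermediate reasoning and of the final output?
RQ3: Are RAT's gains consistent across code generation, mathematical reasoning, embodied planning, and creative writing, and across different base large language models?
RQ4: What is the effect of causal versus non-causal retrieval-guided reasoning in RAT?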

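Key Findings

RAT achieves substantial average relative improvements across tasks: 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning.
RAT reaches a new state of the art on several benchmarks, surpassing standard CoT and standard RAG baselines.
Ablation studies show that iterative retrieval and causal reasoning are effective at improving performance.
RAT is robust across models (GPT-3.5, GPT-4, CodeLLaMA-7b) and tasks, with especially notable gains on GPT-4.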
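The abstract describes these percentages as relative increases in rating scores, so they are best read as gains over a baseline score rather than as percentage-point differences. The snippet below illustrates that arithmetic only; the function name `relative_gain` and the scores are hypothetical, and how the paper averages across models and benchmarks is not stated in this summary.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
print(round(relative_gain(50.0, 60.0), 2))  # prints 20.0
```

With these made-up scores, a move from 50 to 60 is a 10-point absolute change but a 20% relative gain, which is why the two kinds of figure should not be compared directly.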

Figure 2: Top: An example of different LLM reasoning methods on creative generation tasks. Red text indicates errors or illusions in the text generated by LLM, while green text represents correct generation. Methods without RAG often generate incorrect information with hallucination, classical RAG


This review was generated by AI and reviewed by human editors.