Skip to main content
QUICK REVIEW

[论文解读] Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents

Corby Rosset, Ho-Lam Chung|arXiv (Cornell University)|Feb 27, 2024
Digital Rights Management and Security被引用 5
一句话总结

这篇论文引入 Researchy Questions,一个来自真实搜索日志的大规模非事实性、分解性、多观点问题数据集,用于研究大模型网页代理在处理不清晰信息需求时的表现,包含 96k 个问题及相关的分解计划和点击证据。

ABSTRACT

Existing question answering (QA) datasets are no longer challenging to most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study ``known unknowns'' with clear indications of both what information is missing, and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e. ``unknown uknowns''. We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. We present Researchy Questions, a dataset of search engine queries tediously filtered to be non-factoid, ``decompositional'' and multi-perspective. We show that users spend a lot of ``effort'' on these questions in terms of signals like clicks and session length, and that they are also challenging for GPT-4. We also show that ``slow thinking'' answering techniques, like decomposition into sub-questions shows benefit over answering directly. We release $\sim$ 100k Researchy Questions, along with the Clueweb22 URLs that were clicked.

研究动机与目标

  • 激发对复杂的、非事实性问题的需求,这些问题揭示现实世界用户信息需求中的未知因素。
  • 描述将搜索日志查询挖掘、筛选和去重为分解型问答数据集的构建流程。
  • 提供分层次的分解和证据信号(点击的 URL)以支撑并评估 LLM 代理的行为。
  • 提供基线评估,展示分解式回答的好处以及对 Researchy Questions 的用户搜索行为洞见。

提出的方法

  • 挖掘现实世界的英文搜索日志,收集至少出现 50 次且有多个点击 URL 的候选问题。
  • 按阶段筛选,以使用分类器和 GPT-4 标注来提取适合分解式回答的非事实性、开放领域问题。
  • 使用基于 ANCE 的嵌入和聚类进行查询意图的聚合去重,生成组的代表性头部。
  • 应用最终基于 GPT-4 的质量检查,筛除模糊、不完整、假设性或不安全的问题,产出 96k 个问题。
  • 为每个问题提供 2 级分层分解,并发布相应的 ClueWeb22 点击 URL 作为证据信号。
Figure 1: Qualitative comparison of how Researchy Questions differs from other Question Answering datasets. Researchy Questions involve a greater deal of complexity and “unknown unknowns” than other QA datasets.
Figure 1: Qualitative comparison of how Researchy Questions differs from other Question Answering datasets. Researchy Questions involve a greater deal of complexity and “unknown unknowns” than other QA datasets.

实验结果

研究问题

  • RQ1What are Researchy Questions, and how do they differ from traditional factoid QA benchmarks?
  • RQ2Can non-factoid, decompositional questions mined from search logs effectively reveal unknown unknowns for LLM web agents?
  • RQ3Do hierarchical decompositions improve retrieval and synthesis for complex, multi-document queries compared to direct answering?
  • RQ4What behavioral signals (e.g., clicks, session length) indicate greater effort and complexity in Researchy Questions?
  • RQ5How do decompositional answering techniques compare to direct answering on long-form, multi-perspective questions?

主要发现

  • Researchy Questions are non-factoid, decompositional, and multi-perspective, requiring substantial research effort beyond a paragraph answer.
  • Users spend more time and clicks on Researchy Questions, indicating higher information-seeking effort and diverse evidence needs.
  • GPT-4-based decomposition approaches (especially factored decomposition) outperform direct closed-book answering on long-form, multi-faceted questions.
  • Decompositional techniques yield accuracy and quality gains on long-form questions, with notable improvements for datasets like Wikihow and Researchy Questions.
  • Approximately 96k questions are released, each with a 2-level hierarchical plan and a vector-based deduplication head, plus the associated ClueWeb22 clicked URLs.
Figure 2: (Right) Histogram of number of documents clicked per question for Researchy Questions which is much higher than for general web search queries. (Left) number of queries associated with each document. The fact that not very many queries are associated with each document validates the effect
Figure 2: (Right) Histogram of number of documents clicked per question for Researchy Questions which is much higher than for general web search queries. (Left) number of queries associated with each document. The fact that not very many queries are associated with each document validates the effect

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。