QUICK REVIEW

[Paper Review] Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies

Chirag Shah, Ryen W. White | arXiv (Cornell University) | Sep 14, 2023

Semantic Web and Ontologies | Cited by 11

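One-Sentence Summary

The paper proposes an end-to-end human-in-the-loop pipeline that uses LLMs to generate, validate, and apply user intent taxonomies for log analysis. The pipeline is demonstrated on Bing chat and search data and shows strong inter-annotator agreement.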

ABSTRACT

Log data can reveal valuable information about how users interact with Web search services, what they want, and how satisfied they are. However, analyzing user intents in log data is not easy, especially for emerging forms of Web search such as AI-driven chat. To understand user intents from log data, we need a way to label them with meaningful categories that capture their diversity and dynamics. Existing methods rely on manual or machine-learned labeling, which are either expensive or inflexible for large and dynamic datasets. We propose a novel solution using large language models (LLMs), which can generate rich and relevant concepts, descriptions, and examples for user intents. However, using LLMs to generate a user intent taxonomy and apply it for log analysis can be problematic for two main reasons: (1) such a taxonomy is not externally validated; and (2) there may be an undesirable feedback loop. To address this, we propose a new methodology with human experts and assessors to verify the quality of the LLM-generated taxonomy. We also present an end-to-end pipeline that uses an LLM with human-in-the-loop to produce, refine, and apply labels for user intent analysis in log data. We demonstrate its effectiveness by uncovering new insights into user intents from search and chat logs from the Microsoft Bing commercial search engine. The proposed work's novelty stems from the method for generating purpose-driven user intent taxonomies with strong validation. This method not only helps remove methodological and practical bottlenecks from intent-focused research, but also provides a new framework for generating, validating, and applying other kinds of taxonomies in a scalable and adaptable way with reasonable human effort.

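Motivation and Objectives

Address the need to label user intents in logs from modern AI-driven search and chat.
Develop a bottom-up approach that uses LLMs to generate user intent taxonomies.
Validate the LLM-generated taxonomy with human assessors to ensure quality.
Apply the taxonomy to label log data and assess its reliability relative to human assessors.
Demonstrate the method on Microsoft Bing search and chat logs, and assess how well it generalizes using open-source LLMs.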

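Proposed Method

Generate an initial taxonomy with GPT-4 (Phase 1).
Validate the quality of the taxonomy with two human assessors and refine it iteratively (Phase 2).
Apply the taxonomy to test data with GPT-4 and human coders, and measure inter-coder agreement (Phase 3).
Measure the taxonomy's completeness, consistency, clarity, accuracy, and conciseness against predefined criteria.
Check agreement between LLMs and humans to verify reliability (including open-source LLMs in replication experiments).
Explore single-level and multi-level taxonomy generation, and use bootstrap sampling across multiple LLMs to assess robustness.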
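The Phase 3 labeling step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `IntentCategory` structure, the prompt wording, the `UNLABELED` fallback, and the `keyword_stub_llm` stand-in are all hypothetical, and the category names used with it are placeholders rather than the paper's taxonomy. Any callable that maps a prompt string to a reply string could be passed in place of the stub.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IntentCategory:
    """One taxonomy entry: a name, a description, and example utterances."""
    name: str
    description: str
    examples: list[str] = field(default_factory=list)


# Hypothetical fallback label for replies that match no category name.
UNLABELED = "UNLABELED"


def build_labeling_prompt(taxonomy: list[IntentCategory], utterance: str) -> str:
    """Render the taxonomy and one utterance into a single-label prompt."""
    lines = ["Assign exactly one category name to the user utterance.", "", "Categories:"]
    for cat in taxonomy:
        lines.append(f"- {cat.name}: {cat.description}")
        for ex in cat.examples:
            lines.append(f"    example: {ex}")
    lines += ["", f"Utterance: {utterance}", "Category:"]
    return "\n".join(lines)


def apply_taxonomy(
    taxonomy: list[IntentCategory],
    utterances: list[str],
    llm: Callable[[str], str],
) -> list[str]:
    """Label each utterance, rejecting replies that are not taxonomy names."""
    valid = {cat.name for cat in taxonomy}
    labels = []
    for utterance in utterances:
        answer = llm(build_labeling_prompt(taxonomy, utterance)).strip()
        labels.append(answer if answer in valid else UNLABELED)
    return labels


def keyword_stub_llm(prompt: str) -> str:
    """Stand-in for a real model call: decides by a keyword in the utterance line."""
    utterance = prompt.split("Utterance: ", 1)[1].split("\n", 1)[0].lower()
    return "Content creation" if "write" in utterance else "Learning"
```

Constraining the output to the set of known category names keeps the model's labels directly comparable with human codes, which is what the agreement measurements in Phase 3 require.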

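Experimental Results

Research Questions

RQ1: Can LLMs reliably generate taxonomies for analyzing user intents in log data?
RQ2: Can an LLM correctly apply a user intent taxonomy to label logs?
RQ3: Under what conditions does an LLM perform on par with or better than human annotators?
RQ4: Does the proposed human-in-the-loop approach generalize to other taxonomies and data sources?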

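Key Findings

In Phase 3, GPT-4's application of the generated taxonomy agreed closely with the human annotators.
Inter-coder agreement between the two human annotators (Cohen’s kappa) was 0.7620.
Cohen’s kappa between GPT-4 and the majority human annotation was 0.7212.
Open-source LLMs (Mistral, Hermes) produced comparable taxonomies under bootstrap sampling, indicating robustness across models.
Fleiss’ kappa across five GPT-4 runs showed high agreement (0.8516).
Across three open-source models, agreement between LLMs and humans ranged from 0.5732 to 0.6772 (pairwise Cohen’s kappa).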
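For readers unfamiliar with the two agreement statistics reported above, the following is a minimal pure-Python sketch of their standard definitions. It is not the authors' evaluation code; the function names and the input layout (parallel label lists for Cohen's kappa, one list of rater labels per item for Fleiss' kappa) are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels all raters gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_items == 0 or n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("need at least one item and two raters per item")
    totals: Counter = Counter()
    per_item_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        per_item_sum += (sum(v * v for v in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    mean_agreement = per_item_sum / n_items
    # Chance agreement from the overall share of each label.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if expected == 1.0:
        return 1.0
    return (mean_agreement - expected) / (1.0 - expected)
```

Both statistics subtract the agreement expected by chance, so a value such as 0.7620 reflects agreement well beyond what the label frequencies alone would produce. Cohen's kappa compares two raters (for example, GPT-4 against a human label), while Fleiss' kappa handles any fixed number of raters per item (for example, five repeated runs).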


This review was generated by AI and reviewed by a human editor.