QUICK REVIEW

[Paper Review] Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies

Chirag Shah, Ryen W. White|arXiv (Cornell University)|Sep 14, 2023

Semantic Web and Ontologies|Citations: 11

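One-Line Summary

This paper proposes an end-to-end, human-in-the-loop pipeline that uses LLMs to generate, validate, and apply user intent taxonomies. It demonstrates the pipeline on Bing chat and search data and reports strong inter-rater agreement.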

ABSTRACT

Log data can reveal valuable information about how users interact with Web search services, what they want, and how satisfied they are. However, analyzing user intents in log data is not easy, especially for emerging forms of Web search such as AI-driven chat. To understand user intents from log data, we need a way to label them with meaningful categories that capture their diversity and dynamics. Existing methods rely on manual or machine-learned labeling, which are either expensive or inflexible for large and dynamic datasets. We propose a novel solution using large language models (LLMs), which can generate rich and relevant concepts, descriptions, and examples for user intents. However, using LLMs to generate a user intent taxonomy and apply it for log analysis can be problematic for two main reasons: (1) such a taxonomy is not externally validated; and (2) there may be an undesirable feedback loop. To address this, we propose a new methodology with human experts and assessors to verify the quality of the LLM-generated taxonomy. We also present an end-to-end pipeline that uses an LLM with human-in-the-loop to produce, refine, and apply labels for user intent analysis in log data. We demonstrate its effectiveness by uncovering new insights into user intents from search and chat logs from the Microsoft Bing commercial search engine. The proposed work's novelty stems from the method for generating purpose-driven user intent taxonomies with strong validation. This method not only helps remove methodological and practical bottlenecks from intent-focused research, but also provides a new framework for generating, validating, and applying other kinds of taxonomies in a scalable and adaptable way with reasonable human effort.

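Motivation and Objectives

Motivate the need to label user intents in modern AI-driven search and chat logs.
Develop a bottom-up method that uses LLMs to generate a user intent taxonomy.
Validate the quality of the LLM-generated taxonomy with human assessors.
Apply the taxonomy to log data and assess the reliability of the resulting labels against human assessors.
Demonstrate the approach on Microsoft Bing search and chat logs, and assess generalizability with open-source LLMs.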

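Proposed Method

Generate an initial taxonomy with GPT-4 (Phase 1).
Have two human assessors validate the taxonomy's quality and refine it iteratively (Phase 2).
Apply the taxonomy to test data with GPT-4 and human coders, and measure inter-coder reliability (Phase 3).
Measure the taxonomy's comprehensiveness, consistency, clarity, accuracy, and conciseness against predefined criteria.
Examine reproducibility, including with open-source LLMs, by checking agreement across multiple LLMs and with humans.
Explore single-level and multi-level taxonomy generation, and bootstrap across multiple LLMs to assess robustness.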
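
As a rough illustration of the Phase 3 step, the sketch below shows one way a taxonomy of categories with descriptions and examples could be turned into a labeling prompt, and how a model's free-text answer could be mapped back to a category. Everything here is an assumption for illustration: the `IntentCategory` class, the `build_labeling_prompt` and `parse_label` helpers, the prompt wording, and the two sample categories are hypothetical and are not taken from the paper. No LLM is called; the model response is passed in as a plain string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IntentCategory:
    """One taxonomy entry: a name, a description, and example inputs."""
    name: str
    description: str
    examples: List[str] = field(default_factory=list)


def build_labeling_prompt(taxonomy: List[IntentCategory], utterance: str) -> str:
    """Render the taxonomy and one user input as a single-label prompt.

    The wording is hypothetical, not the prompt used in the paper.
    """
    lines = [
        "Assign exactly one intent category to the user input below.",
        "",
        "Categories:",
    ]
    for cat in taxonomy:
        lines.append(f"- {cat.name}: {cat.description}")
        for example in cat.examples:
            lines.append(f"    e.g. {example}")
    lines += ["", f"User input: {utterance}", "Answer with the category name only."]
    return "\n".join(lines)


def parse_label(response: str, taxonomy: List[IntentCategory]) -> Optional[str]:
    """Map a model response to a known category name, or None if unrecognized."""
    cleaned = response.strip().strip(".").lower()
    for cat in taxonomy:
        if cat.name.lower() == cleaned:
            return cat.name
    return None


# Hypothetical two-category taxonomy, for illustration only.
demo_taxonomy = [
    IntentCategory(
        "Information Seeking",
        "The user wants to find facts or resources.",
        ["weather in Seattle tomorrow"],
    ),
    IntentCategory(
        "Content Creation",
        "The user wants new text or media to be produced.",
        ["write a haiku about autumn"],
    ),
]

demo_prompt = build_labeling_prompt(demo_taxonomy, "draft an email to my landlord")
```

Returning `None` for unrecognized answers, instead of guessing the nearest category, keeps malformed model output visible so that it can be routed to a human reviewer.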

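Experimental Results

Research Questions

RQ1: Can LLMs reliably generate taxonomies for analyzing log data?
RQ2: Can an LLM apply a user intent taxonomy to annotate logs?
RQ3: Under what conditions do LLMs perform as well as or better than human assessors on this task?
RQ4: Does the proposed human-in-the-loop method generalize to other taxonomies and data sources?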

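Key Findings

The GPT-4-generated taxonomy achieved high agreement with human assessors when applied in Phase 3.
Inter-coder reliability between the two human coders: Cohen's κ = 0.7620.
Cohen's κ between GPT-4 and the majority human annotation: 0.7212.
In bootstrapping, open-source LLMs (Mistral, Hermes) generated comparable taxonomies, suggesting robustness across models.
Fleiss' κ across five GPT-4 runs was 0.8516, indicating high consistency.
Agreement between LLMs and humans for the three open-source models ranged from 0.5732 to 0.6772 (pairwise Cohen's κ).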
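
The agreement statistics reported above are standard chance-corrected measures. The sketch below is a minimal pure-Python implementation of Cohen's κ (two raters) and Fleiss' κ (a fixed number of raters per item), using the textbook definitions. It is not the authors' evaluation code, the function names are hypothetical, and the toy labels are made up and unrelated to the paper's data.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings: List[Sequence[Hashable]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels given to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: share of rater pairs that agree.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # only one label was ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy labels (made up): two raters, four items.
rater_1 = ["A", "A", "B", "B"]
rater_2 = ["A", "A", "B", "A"]
toy_cohen = cohen_kappa(rater_1, rater_2)

# Toy labels (made up): three repeated runs that fully agree on two items.
toy_fleiss = fleiss_kappa([["A", "A", "A"], ["B", "B", "B"]])
```

In the setting described above, Cohen's κ would compare two label sequences, such as GPT-4 against the majority human label, while Fleiss' κ would compare several runs of the same model over the same items.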

This review was generated by AI and checked by a human editor.