QUICK REVIEW

[Paper Review] Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies

Chirag Shah, Ryen W. White | arXiv (Cornell University) | 2023-09-14

Semantic Web and Ontologies | Citations: 11

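One-Line Summary

This paper proposes an end-to-end, human-in-the-loop pipeline that uses LLMs to generate, validate, and apply a user intent taxonomy, and reports strong inter-annotator agreement on Bing chat and search data.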

ABSTRACT

Log data can reveal valuable information about how users interact with Web search services, what they want, and how satisfied they are. However, analyzing user intents in log data is not easy, especially for emerging forms of Web search such as AI-driven chat. To understand user intents from log data, we need a way to label them with meaningful categories that capture their diversity and dynamics. Existing methods rely on manual or machine-learned labeling, which are either expensive or inflexible for large and dynamic datasets. We propose a novel solution using large language models (LLMs), which can generate rich and relevant concepts, descriptions, and examples for user intents. However, using LLMs to generate a user intent taxonomy and apply it for log analysis can be problematic for two main reasons: (1) such a taxonomy is not externally validated; and (2) there may be an undesirable feedback loop. To address this, we propose a new methodology with human experts and assessors to verify the quality of the LLM-generated taxonomy. We also present an end-to-end pipeline that uses an LLM with human-in-the-loop to produce, refine, and apply labels for user intent analysis in log data. We demonstrate its effectiveness by uncovering new insights into user intents from search and chat logs from the Microsoft Bing commercial search engine. The proposed work's novelty stems from the method for generating purpose-driven user intent taxonomies with strong validation. This method not only helps remove methodological and practical bottlenecks from intent-focused research, but also provides a new framework for generating, validating, and applying other kinds of taxonomies in a scalable and adaptable way with reasonable human effort.

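Research Motivation and Goals

Establish the need to label user intents in logs from modern AI-driven search and chat.
Develop a bottom-up method for generating a user intent taxonomy with LLMs.
Validate the LLM-generated taxonomy with human assessors to ensure quality.
Apply the taxonomy to annotate log data and assess its reliability against human assessors.
Demonstrate the approach on Microsoft Bing search and chat logs, and assess generalizability with open-source LLMs.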

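Proposed Method

Generate an initial taxonomy with GPT-4 (Phase 1).
Validate the quality of the taxonomy with two human assessors and refine it iteratively (Phase 2).
Apply the taxonomy to test data with GPT-4 and human coders, and evaluate inter-coder reliability (Phase 3).
Measure the taxonomy's comprehensiveness, consistency, clarity, accuracy, and conciseness against predefined criteria.
Verify reliability, including reproducibility, by examining agreement across LLMs (including open-source models such as Mistral and Hermes) and with humans.
Explore single-level and multi-level taxonomy generation, and assess robustness by bootstrapping across multiple LLMs.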
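
The three phases listed above can be sketched as a small control loop. This is only an illustrative skeleton, not the authors' implementation: the `Category` fields follow the abstract's mention of concepts, descriptions, and examples, while the callables `generate`, `review`, `revise`, and `llm_label`, the `max_rounds` limit, and the stopping rule (stop when reviewers raise no issues) are all hypothetical stand-ins for the LLM calls and human review steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Category:
    """One taxonomy entry: a name, a description, and example utterances."""
    name: str
    description: str
    examples: List[str]


Taxonomy = List[Category]


def run_pipeline(
    sample_logs: List[str],
    test_logs: List[str],
    generate: Callable[[List[str]], Taxonomy],          # hypothetical LLM call
    review: Callable[[Taxonomy], List[str]],            # hypothetical human review
    revise: Callable[[Taxonomy, List[str]], Taxonomy],  # hypothetical refinement step
    llm_label: Callable[[Taxonomy, str], str],          # hypothetical LLM labeler
    max_rounds: int = 3,                                # assumed iteration cap
) -> Tuple[Taxonomy, Dict[str, str]]:
    # Phase 1: an LLM drafts an initial taxonomy from sample logs.
    taxonomy = generate(sample_logs)

    # Phase 2: human assessors list issues; refine until none remain
    # or the (assumed) round limit is reached.
    for _ in range(max_rounds):
        issues = review(taxonomy)
        if not issues:
            break
        taxonomy = revise(taxonomy, issues)

    # Phase 3: apply the validated taxonomy to held-out test logs.
    labels = {log: llm_label(taxonomy, log) for log in test_logs}
    return taxonomy, labels
```

In practice the Phase 3 labels would then be compared with independent human labels using an agreement statistic; the skeleton leaves that comparison to the caller.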

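Experimental Results

Research Questions

RQ1: Can LLMs reliably generate a taxonomy for analyzing user intents in log data?
RQ2: Can an LLM apply a user intent taxonomy to annotate logs?
RQ3: Under what conditions do LLMs perform comparably to, or better than, human annotators on this task?
RQ4: Does the proposed human-in-the-loop method generalize to other taxonomies and data sources?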

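Key Results

The GPT-4-generated taxonomy achieved high agreement with human annotators (Phase 3).
Inter-coder reliability between the two human coders (Cohen's kappa) was 0.7620.
Cohen's kappa between GPT-4 and the majority human annotation was 0.7212.
Open-source LLMs (Mistral, Hermes) produced similar taxonomies under bootstrapping, suggesting robustness across models.
Five GPT-4 runs showed high consistency, with a Fleiss' kappa of 0.8516.
Across three open-source models, LLM-human agreement ranged from 0.5732 to 0.6772 (pairwise Cohen's kappas).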
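
For readers unfamiliar with the agreement statistics reported above, the following is a minimal sketch of Cohen's kappa (two raters) and Fleiss' kappa (a fixed number of raters per item) using the standard textbook formulas. The function names and the toy intent labels are made up for illustration; they are not the paper's data or code, and the sketch assumes every rater labels every item.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa; ratings[i] holds all raters' labels for item i."""
    if not ratings:
        raise ValueError("need at least one item")
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals: Counter = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Per-item agreement: share of rater pairs that agree on this item.
        agreement_sum += (sum(v * v for v in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items
    # Chance agreement from the overall label distribution.
    p_e = sum((v / (n_items * n_raters)) ** 2 for v in totals.values())
    if p_e == 1.0:
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy example with hypothetical intent labels (not from the paper).
human = ["seek", "seek", "create", "create"]
model = ["seek", "seek", "create", "seek"]
print(cohen_kappa(human, model))
```

Both statistics correct raw agreement for the agreement expected by chance, which is why a kappa of 0.72 or 0.76 indicates substantially more than 72% or 76% matching labels would in isolation.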


This review was generated by AI and reviewed by a human editor.