QUICK REVIEW

[Paper Review] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Anna Martin, William Humphreys|arXiv (Cornell University)|2026. 02. 24.

Topic Modeling | Citations: 0

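One-Line Summary

This paper develops and validates an expert-driven schema of 20 LLM error patterns across seven categories for scholarly QA. It shows that structured evaluation by domain experts surfaces errors that automated metrics miss, and that such a schema opens the way to personalized, schema-driven evaluation tools.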

ABSTRACT

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

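Motivation and Objectives

Identify how domain experts evaluate LLM outputs in scholarly QA tasks.
Develop an expert-derived taxonomy of LLM errors relevant to scholarly QA.
Validate the taxonomy through contextual inquiries with additional domain experts.
Demonstrate how a structured schema helps detect subtle or overlooked errors.
Discuss the implications for personalized, schema-driven evaluation tools in scholarly QA.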

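Proposed Method

Conduct a two-phase qualitative study with domain experts to derive and validate error patterns.
Apply open coding and axial coding to expert feedback to produce a schema of 20 patterns in seven categories.
Implement a small open-source Retrieval-Augmented Generation (RAG) system to generate the scholarly QA outputs used for evaluation.
Process documents with hybrid preprocessing and sentence embeddings to enable semantic search.
Apply iterative query expansion and KeyBERT-based keyphrase augmentation for retrieval.
Validate the schema with experts through contextual inquiries and think-aloud interviews.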
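
The retrieval steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it replaces sentence embeddings with a bag-of-words cosine similarity and replaces KeyBERT with a simple term-frequency keyphrase picker, so that it runs on the standard library alone. The function names (`embed`, `top_terms`, `retrieve_with_expansion`), the stopword list, and the toy corpus are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on", "with", "by"}

def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def embed(text):
    """Stand-in for a sentence embedding: a sparse term-count vector."""
    return Counter(tokenize(text))

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def search(query, docs, k):
    """Return the indices of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, embed(docs[i])), reverse=True)
    return ranked[:k]

def top_terms(texts, n, exclude):
    """Stand-in for KeyBERT: the n most frequent terms not already in the query."""
    counts = Counter(t for text in texts for t in tokenize(text) if t not in exclude)
    return [term for term, _ in counts.most_common(n)]

def retrieve_with_expansion(query, docs, k=2, rounds=2, n_terms=2):
    """Iteratively expand the query with key terms drawn from the current top hits."""
    expanded = query
    for _ in range(rounds):
        hits = search(expanded, docs, k)
        new_terms = top_terms([docs[i] for i in hits], n_terms, set(tokenize(expanded)))
        if not new_terms:
            break  # nothing left to add, so stop expanding
        expanded = expanded + " " + " ".join(new_terms)
    return search(expanded, docs, k), expanded

# Hypothetical toy corpus, not taken from the paper.
DOCS = [
    "Vibration testing of the spacecraft precedes thermal vacuum testing.",
    "Thermal vacuum testing verifies spacecraft performance in orbit conditions.",
    "The cafeteria menu changes every week.",
]
hits, expanded_query = retrieve_with_expansion("spacecraft testing sequence", DOCS)
```

In a real pipeline the `embed` and `top_terms` stand-ins would be replaced by an actual sentence-embedding model and a keyphrase extractor; the loop structure is the part this sketch is meant to show.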

Figure 1. Errors identified by domain experts and model developers with entity tags for anonymity. The expert recognized a chronological error about test sequences that the developer missed, showing how domain expertise can yield more precise error analysis.

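Experimental Results

Research Questions

RQ1: Which error patterns do domain experts naturally identify when evaluating LLM outputs for scholarly QA?
RQ2: Can a structured, expert-derived schema capture domain-specific errors beyond what automated metrics detect?
RQ3: Does a formalized schema help experts detect errors they might miss in open-ended evaluation?
RQ4: How do experts' evaluation strategies unfold when assessing LLM outputs in scholarly contexts?
RQ5: What are the design and tooling implications of schema-based evaluation?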

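Key Findings

The expert-driven analysis yielded a 20-item error schema organized into seven categories.
Contextual inquiries confirmed that experts identify errors beyond answer correctness, including subtle hallucinations and citation problems.
The structured schema helped experts detect previously overlooked issues that they had missed when evaluating without it.
Experts use systematic evaluation strategies such as technical precision testing and meta-evaluation of their own evaluation practices.
Eleven question types were identified across 188 expert questions, and error patterns were mapped to these question types.
Differences among experts point to the potential of personalized, schema-driven evaluation tools.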
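
One way to picture a schema-driven evaluation tool is as a fixed vocabulary of error patterns grouped into categories, against which each expert annotation is checked and then tallied by question type. The sketch below uses a hypothetical two-category excerpt. The category names, pattern names, question types, and the `Annotation` record are placeholders chosen for illustration; they are not the paper's actual schema of 20 patterns in seven categories.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical excerpt with placeholder names, not the paper's actual schema.
SCHEMA = {
    "citation": ["missing_citation", "wrong_source"],
    "factual": ["hallucinated_detail", "chronological_error"],
}

# Reverse lookup from each error pattern to its category.
PATTERN_TO_CATEGORY = {p: c for c, patterns in SCHEMA.items() for p in patterns}

@dataclass(frozen=True)
class Annotation:
    """One expert judgment: which error pattern appeared for which question type."""
    question_type: str
    pattern: str

    def __post_init__(self):
        # Reject labels outside the schema so tallies stay comparable across experts.
        if self.pattern not in PATTERN_TO_CATEGORY:
            raise ValueError(f"unknown error pattern: {self.pattern!r}")

def tally(annotations):
    """Count annotations per (question_type, category) pair."""
    return Counter((a.question_type, PATTERN_TO_CATEGORY[a.pattern]) for a in annotations)

# Hypothetical annotations for illustration.
annotations = [
    Annotation("definition", "missing_citation"),
    Annotation("definition", "hallucinated_detail"),
    Annotation("procedure", "chronological_error"),
    Annotation("procedure", "hallucinated_detail"),
]
counts = tally(annotations)
```

Keeping the schema as plain data rather than hard-coded logic is a deliberate choice here: a personalized tool could swap in a per-expert subset of patterns without changing the tallying code.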

Figure 2. Distribution of error types across question categories. Each row is normalized to sum to 1, showing the proportion of errors within each question type. Column labels indicate the total occurrences of each error type across all questions ($n$). Question types are sorted by total error fr
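
The row normalization described in the Figure 2 caption can be illustrated with a short sketch. The counts below are made-up placeholders, not data from the paper, and all names are assumptions. Sorting question types by total error count in descending order is also an assumption, since the caption is cut off before it states the sort key in full.

```python
# Hypothetical counts: rows are question types, columns are error types.
ERROR_TYPES = ["citation", "factual", "omission"]
COUNTS = {
    "definition": [1, 3, 0],
    "procedure": [2, 2, 4],
    "comparison": [0, 0, 0],
}

def normalize_rows(counts):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    normalized = {}
    for qtype, row in counts.items():
        total = sum(row)
        # An all-zero row stays all zero instead of dividing by zero.
        normalized[qtype] = [c / total if total else 0.0 for c in row]
    return normalized

def column_totals(counts, n_cols):
    """Total occurrences of each error type across all question types (the n labels)."""
    return [sum(row[j] for row in counts.values()) for j in range(n_cols)]

def sort_by_total(counts):
    """Question types ordered by total error count, largest first (assumed direction)."""
    return sorted(counts, key=lambda q: sum(counts[q]), reverse=True)

proportions = normalize_rows(COUNTS)
totals = column_totals(COUNTS, len(ERROR_TYPES))
order = sort_by_total(COUNTS)
```

Because each row is scaled independently, a question type with few errors can look as prominent as one with many, which is why the raw column totals are reported alongside the proportions.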

This review was generated by AI and reviewed by a human editor.