QUICK REVIEW

[Paper Review] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Anna Martin, William Humphreys | arXiv (Cornell University) | Feb 24, 2026

Topic Modeling | Citations: 0

One-Line Summary

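The paper develops and validates an expert-driven schema of 20 LLM error patterns across seven categories for scholarly QA. It shows that structured evaluation by domain experts reveals errors that automated metrics miss, and that such a schema enables personalized, schema-driven evaluation tools.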

ABSTRACT

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

Motivation and Objectives

Identify how domain experts evaluate LLM outputs in scholarly QA tasks.
Develop an expert-derived taxonomy of LLM errors relevant to scholarly QA.
Validate the taxonomy through contextual inquiries with additional domain experts.
Demonstrate how structured schemas aid detection of subtle or overlooked errors.
Discuss implications for personalized, schema-driven evaluation tools in scholarly QA.

Proposed Method

Conduct a two-phase qualitative study with domain experts to derive and validate error patterns.
Use open and axial coding on expert feedback to generate a 20-pattern, seven-category schema.
Implement a small, open-source retrieval-augmented generation (RAG) system to produce scholarly QA outputs for evaluation.
Process documents with hybrid preprocessing and sentence embeddings, enabling semantic retrieval.
Employ iterative query expansion and KeyBERT-based keyphrase augmentation for retrieval.
Validate schema via contextual inquiries and think-aloud interviews with experts.
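The retrieval steps listed above can be illustrated with a minimal sketch. This is not the authors' system: a bag-of-words count vector stands in for the sentence embeddings, and a hand-supplied term list stands in for the KeyBERT keyphrases. The function names (`embed`, `cosine`, `expand_query`, `retrieve`) and the example passages are hypothetical and exist only to show how query expansion can change a similarity ranking.

```python
import math
import re
from collections import Counter

# Hypothetical passages, invented for illustration only.
PASSAGES = [
    "The detector was calibrated before the beam test.",
    "Funding was provided by the national science agency.",
    "Calibration of the sensor array followed the second beam test.",
]


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_query(query, extra_terms):
    """Append keyphrases to the query (assumed to come from a keyphrase extractor)."""
    return query + " " + " ".join(extra_terms)


def retrieve(query, passages, top_k=2):
    """Return the top_k (score, passage) pairs ranked by cosine similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `retrieve("When was the detector calibrated?", PASSAGES)` ranks the first passage highest. Expanding the query with `["calibration", "sensor"]` moves the third passage into second place, which is the kind of effect keyphrase augmentation is meant to have. A real system would replace `embed` with a sentence-embedding model.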

Figure 1. Errors identified by domain experts and model developers with entity tags for anonymity. The expert recognized a chronological error about test sequences that the developer missed, showing how domain expertise can yield more precise error analysis.

Experimental Results

Research Questions

RQ1: What error patterns do domain experts naturally identify when evaluating LLM outputs for scholarly QA?
RQ2: Can a structured expert-derived schema capture domain-specific errors beyond automated metrics?
RQ3: Does a formalized schema help experts detect errors they might overlook in open-ended evaluation?
RQ4: How do experts' evaluation strategies unfold when assessing LLM outputs in scholarly contexts?
RQ5: What are the potential design and tooling implications of a schema-driven evaluation approach?

Key Findings

A 20-item error schema across seven categories emerged from expert-driven analysis.
Contextual inquiries showed experts identify errors beyond correctness, including subtle hallucinations and citation issues.
The structured schema helped experts detect issues they had overlooked in unaided evaluation.
Experts employ systematic evaluation strategies like technical precision testing and meta-evaluation of their practices.
Across 188 expert questions, 11 question types were identified, and error patterns were mapped to these question types.
Variation across experts suggests potential for personalized, schema-driven evaluation tools.

Figure 2. Distribution of error types across question categories. Each row is normalized to sum to 1, showing the proportion of errors within each question type. Column labels indicate the total occurrences of each error type across all questions ($n$). Question types are sorted by total error fr
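The row normalization described in the Figure 2 caption can be sketched as follows. The question types, error labels, and counts in `TOY_COUNTS` are invented placeholders, not data from the paper, and the function names are hypothetical. The sketch only shows how raw counts become per-row proportions and per-column totals.

```python
# Placeholder counts: {question_type: {error_type: count}}.
# These labels and numbers are invented for illustration, not taken from the paper.
TOY_COUNTS = {
    "definition": {"citation": 1, "hallucination": 3},
    "comparison": {"citation": 2, "hallucination": 2},
}


def normalize_rows(counts):
    """Divide each row by its total so every non-empty row sums to 1."""
    normalized = {}
    for question_type, row in counts.items():
        total = sum(row.values())
        normalized[question_type] = {
            error: (count / total if total else 0.0)
            for error, count in row.items()
        }
    return normalized


def column_totals(counts):
    """Total occurrences of each error type across all question types (the n labels)."""
    totals = {}
    for row in counts.values():
        for error, count in row.items():
            totals[error] = totals.get(error, 0) + count
    return totals
```

Row normalization makes question types with different numbers of errors comparable, while the column totals preserve how common each error type is overall.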


This review was generated by AI and checked by a human editor.