QUICK REVIEW

[Paper Review] The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Heimo Müller, Dominik Steiger|arXiv (Cornell University)|Feb 13, 2026

Mental Health via Writing | Citations: 0

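One-Sentence Summary

Summary: This paper proposes the System Hallucination Scale (SHS), a 10-item, 5-point Likert instrument that evaluates hallucination-related behavior in LLM outputs along five dimensions. It has been validated for reliability and construct validity, and a reference implementation is provided.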

ABSTRACT

We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.

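Motivation and Objectives

Provide a lightweight, domain-agnostic instrument for evaluating hallucination-related behavior from the user's perspective.
Ensure that SHS is interpretable, scalable, and compatible with interactive evaluation workflows.
Establish psychometric soundness (reliability and construct validity) and demonstrate usability in real-world settings.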

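Proposed Method

Define 10 items across five dimensions, as pairs of positively and negatively worded items.
Encode responses on a 5-point Likert scale and compute each dimension score as (positive − negative)/4.
Compute the overall SHS score as the mean of the five dimension scores.
Provide the standard scoring formula and a reference Python implementation in the supplementary material.
Optionally provide a rescaling of SHS to 0–100 to improve comparability.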
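
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation. The function names, the (positive, negative) pair layout, and the linear 0–100 mapping are all assumptions made here for the example. Only the (positive − negative)/4 dimension formula and the mean over five dimensions come from the summary above.

```python
from statistics import mean


def dimension_score(positive: int, negative: int) -> float:
    """Score one dimension from its paired items, each rated 1..5."""
    for rating in (positive, negative):
        if not 1 <= rating <= 5:
            raise ValueError("Likert ratings must be in 1..5")
    # (positive - negative) / 4 lies in [-1, 1] for 5-point ratings.
    return (positive - negative) / 4


def shs_score(pairs: list[tuple[int, int]]) -> float:
    """Overall score: mean of the five dimension scores.

    `pairs` holds one (positive, negative) rating tuple per dimension;
    this layout is a hypothetical convention for the sketch.
    """
    if len(pairs) != 5:
        raise ValueError("expected one rating pair per dimension (5 total)")
    return mean(dimension_score(pos, neg) for pos, neg in pairs)


def rescale_0_100(score: float) -> float:
    """Assumed linear map of [-1, 1] onto [0, 100]; the paper's exact
    rescaling is not given in this review."""
    return 50 * (score + 1)
```

For instance, a respondent who answers (5, 1) on every dimension gets the maximum overall score of 1.0, which this assumed rescaling maps to 100. A respondent whose positive and negative ratings cancel out across dimensions lands at 0.0, or 50 after rescaling.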

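Experimental Results

Research Questions

RQ1: Can a short, human-centered instrument reliably capture five distinct dimensions of hallucination-related behavior in LLM outputs?
RQ2: Does the paired item structure (positive/negative) yield high internal consistency and a useful diagnostic signal?
RQ3: Can SHS be administered in realistic interaction settings, and is it interpretable for both expert and non-expert raters?
RQ4: How does SHS relate to the existing usability and causability scales (SUS, SCS) in terms of measurement properties and complementarity?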

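Key Findings

SHS showed high internal consistency (Cronbach's alpha = 0.87, 95% CI [0.84, 0.90]).
Inter-dimension correlations were moderate to strong (r = 0.42–0.72) and statistically significant (p < 0.001), supporting the multidimensional structure.
Within-dimension item correlations after polarity reversal were strong across all five dimensions (r = 0.65–0.79, p < 0.001), validating the bipolar item design.
A chi-square test showed non-uniform, meaningful use of the Likert scale (χ²(4) = 187.3, p < 0.001).
Mean completion time was 4.2 minutes (SD = 1.8), and participants rated the instrument as clear, relevant, and unobtrusive.
SHS provides diagnostic insight into distinct hallucination-related failure modes that automated metrics alone cannot provide.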
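
To make the internal-consistency figure above more concrete, the sketch below computes Cronbach's alpha with the standard textbook formula, after reversing negatively worded items. It is an illustration under stated assumptions, not the authors' analysis code. The helper names, the `6 - rating` reversal for a 5-point scale, and the `negative_columns` argument are introduced here for the example.

```python
from statistics import variance


def reverse_negative_items(rows, negative_columns):
    """Flip negatively worded items on a 1..5 scale (assumed: 6 - rating)."""
    return [
        [6 - r if i in negative_columns else r for i, r in enumerate(row)]
        for row in rows
    ]


def cronbach_alpha(rows):
    """Standard Cronbach's alpha.

    `rows` is a list of respondents, each a list of k item ratings that
    already share the same polarity.
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

With perfectly consistent answers, for example raw rows `[[5, 1], [4, 2], [3, 3]]` where the second column is a negative item, reversal makes both columns identical and alpha reaches its maximum of 1.0.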

This review was generated by AI and checked by a human editor.