QUICK REVIEW

[Paper Review] SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts

Qingsong Zou, Zhi Yan | arXiv (Cornell University) | Feb 24, 2026

Anomaly Detection Techniques and Applications | Citations: 0

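One-line summary

SmartBench is the first LLM-focused benchmark for detecting and explaining smart home anomalies. Current models struggle with anomaly detection, localization, and attribution in both context-independent and context-dependent settings.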

ABSTRACT

Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next-generation LLM-based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state-of-the-art models cannot achieve good anomaly detection performance. For example, Claude-Sonnet-4.5 achieves only 66.1% detection accuracy on context-independent anomaly categories, and performs even worse on context-dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next-generation LLM-based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at https://github.com/horizonsinzqs/SmartBench.

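Motivation and objectives

Motivate the need for anomaly-aware smart home assistants that can detect and explain anomalous environment states.
Introduce SmartBench, a dataset built for LLM evaluation that covers normal and anomalous device states as well as normal and anomalous state-transition contexts.
Evaluate mainstream LLMs on context-independent and context-dependent anomaly detection tasks.
Provide metrics and analysis to guide the development of safer, more reliable smart home assistants.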

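Proposed method

Define two anomaly types: context-independent (a single state snapshot) and context-dependent (a sequence of state transitions).
Build a dataset pipeline that takes normal samples from real data, generates anomalous samples with GPT-5, and applies a compression strategy to long sequences.
Implement compliance validation and semantic checks to ensure that samples are realistic and consistent.
Evaluate 13 LLMs (open- and closed-source) at a fixed temperature of 0 with adjusted token limits.
Use F1, FPR, the Anomaly Localization Score (AL Score), and the Attribution Consistency Score (AC Score) to evaluate detection, localization, and explanation.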
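The detection metrics named above (precision, recall, F1, FPR) follow from a standard binary confusion matrix in which "anomalous" is the positive class. The sketch below is a minimal illustration under that assumption. The `Confusion` class, the `detection_metrics` function, and the example counts are hypothetical and do not come from the paper. The AL Score and AC Score are not sketched, because this review does not give their definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary anomaly detector (positive class = anomalous)."""
    tp: int  # anomalous samples flagged as anomalous
    fp: int  # normal samples flagged as anomalous (false alarms)
    tn: int  # normal samples left unflagged
    fn: int  # anomalous samples that were missed


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: Confusion) -> dict:
    """Compute precision, recall, F1, and false positive rate from counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    fpr = _ratio(c.fp, c.fp + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formulas.
    example = Confusion(tp=80, fp=20, tn=80, fn=20)
    for name, value in detection_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Reporting FPR next to F1 matters for a home assistant: a model can reach a high recall simply by flagging most samples, which shows up as a high false-alarm rate rather than as a low F1.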

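Experimental results

Research questions

RQ1: How well can LLMs detect anomalous states in a smart home?
RQ2: Can LLMs analyze the root cause of an anomaly?
RQ3: How does model size affect anomaly detection performance?
RQ4: How does context compression affect model performance?
RQ5: Does few-shot learning improve anomaly detection?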

Key findings

Model	Context-Independent Precision	Context-Independent Recall	Context-Independent F1	Context-Independent FPR	Context-Independent AL Score	Context-Dependent Precision	Context-Dependent Recall	Context-Dependent F1	Context-Dependent FPR	Context-Dependent AL Score
gemini-3	74.2%	85.2%	79.3%	29.7%	0.491	57.4%	79.8%	66.8%	59.2%	0.347
gemini-2.5	64.5%	85.6%	73.5%	47.2%	0.397	53.8%	91.0%	67.6%	78.2%	0.365
claude-4.5	63.9%	74.0%	68.6%	41.8%	0.319	59.6%	59.0%	59.3%	40.0%	0.257
claude-4	73.8%	50.7%	60.1%	18.0%	0.232	67.3%	44.5%	53.6%	21.7%	0.247
deepseek-r1	75.8%	68.5%	72.0%	21.9%	0.365	52.2%	83.7%	64.3%	76.5%	0.261
deepseek-v3	83.4%	37.1%	51.3%	7.4%	0.179	53.9%	51.3%	52.6%	43.8%	0.170
gpt-5	92.6%	68.9%	79.0%	5.5%	0.416	68.8%	48.8%	57.1%	22.2%	0.251
gpt-5-mini	68.5%	76.9%	72.5%	35.3%	0.363	60.9%	68.8%	64.6%	44.2%	0.252
qwen-3-32b	53.1%	83.1%	64.8%	73.3%	0.189	51.0%	80.0%	62.3%	77.0%	0.185
qwen-3-8b	52.4%	41.3%	46.2%	37.5%	0.052	53.3%	61.7%	57.2%	54.0%	0.105

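Most models fail to detect anomalies effectively: F1 averages about 66.7% on context-independent (CI) anomalies and about 60.5% on context-dependent (CD) anomalies.
Anomaly localization is poor, with average AL Scores of 0.300 (CI) and 0.221 (CD).
Attribution explanations are weak overall: even the top model reaches only about 74% attribution on CI anomalies, and substantially less on CD anomalies.
Larger models generally perform better, with a size effect visible across the Qwen and LLaMA families, though it is not universal.
The GPT-5 series achieves high precision and, in some cases, a very low FPR (gpt-5: 92.6% precision at a 5.5% FPR on CI anomalies), but the consistency of its anomaly signals remains a problem.
Across the evaluated models, context-dependent detection remains harder than context-independent detection.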
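The averages quoted in these findings can be checked against the ten rows listed in the table above. The sketch below copies the F1 and AL Score columns from that table and computes the column means and the per-model F1 gap between the two settings. The names `SCORES`, `column_means`, and `f1_gaps` are illustrative, and the paper evaluates 13 models, so means over these ten rows need not match the quoted figures exactly.

```python
from statistics import mean

# model: (CI F1 %, CI AL Score, CD F1 %, CD AL Score), copied from the table.
SCORES = {
    "gemini-3":    (79.3, 0.491, 66.8, 0.347),
    "gemini-2.5":  (73.5, 0.397, 67.6, 0.365),
    "claude-4.5":  (68.6, 0.319, 59.3, 0.257),
    "claude-4":    (60.1, 0.232, 53.6, 0.247),
    "deepseek-r1": (72.0, 0.365, 64.3, 0.261),
    "deepseek-v3": (51.3, 0.179, 52.6, 0.170),
    "gpt-5":       (79.0, 0.416, 57.1, 0.251),
    "gpt-5-mini":  (72.5, 0.363, 64.6, 0.252),
    "qwen-3-32b":  (64.8, 0.189, 62.3, 0.185),
    "qwen-3-8b":   (46.2, 0.052, 57.2, 0.105),
}


def column_means(scores: dict) -> tuple:
    """Mean of each column: (CI F1, CI AL, CD F1, CD AL)."""
    columns = zip(*scores.values())
    return tuple(mean(column) for column in columns)


def f1_gaps(scores: dict) -> dict:
    """CI F1 minus CD F1 per model; a positive gap means CD is harder."""
    return {model: row[0] - row[2] for model, row in scores.items()}


if __name__ == "__main__":
    ci_f1, ci_al, cd_f1, cd_al = column_means(SCORES)
    print(f"mean F1: CI {ci_f1:.1f}%, CD {cd_f1:.1f}%")
    print(f"mean AL: CI {ci_al:.3f}, CD {cd_al:.3f}")
    for model, gap in f1_gaps(SCORES).items():
        print(f"{model}: F1 gap {gap:+.1f} points")
```

On these ten rows the mean F1 values (about 66.7% and 60.5%) and the CI AL mean (0.300) agree with the summary, while the CD AL mean comes out near 0.244 rather than 0.221; the quoted figure may be computed over all 13 evaluated models. The per-model gaps also show two exceptions to the overall trend: deepseek-v3 and qwen-3-8b score a higher F1 on context-dependent anomalies than on context-independent ones.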

This review was generated by AI and checked by a human editor.