QUICK REVIEW

[Paper Review] SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts

Qingsong Zou, Zhi Yan|arXiv (Cornell University)|Feb 24, 2026

Anomaly Detection Techniques and Applications|Cited by 0

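ONE-SENTENCE SUMMARY

In brief: SmartBench provides the first smart home anomaly detection and explanation benchmark aimed at large language models, and shows that current models still struggle with anomaly detection, localization, and attribution in both context-independent and context-dependent scenarios.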

ABSTRACT

Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next-generation LLM-based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state-of-the-art models cannot achieve good anomaly detection performance. For example, Claude-Sonnet-4.5 achieves only 66.1% detection accuracy on context-independent anomaly categories, and performs even worse on context-dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next-generation LLM-based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at https://github.com/horizonsinzqs/SmartBench.

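RESEARCH MOTIVATION AND GOALS

Smart home assistants need anomaly awareness: the ability to detect and explain anomalous environment states.
Introduce SmartBench, a dedicated dataset of normal and anomalous device states and state-transition contexts for evaluating LLMs.
Evaluate mainstream LLMs on context-independent and context-dependent anomaly detection tasks.
Provide metrics and analysis to support the development of safer, more reliable smart home assistants.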

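PROPOSED METHOD

Define two anomaly types: context-independent (a snapshot of device states) and context-dependent (a sequence of state transitions).
Build a dataset pipeline that draws normal samples from real smart home data and generates anomalous samples with GPT-5, applying a compression strategy to long sequences.
Apply compliance validation and semantic checks to keep samples realistic and consistent.
Evaluate 13 LLMs (open-source and closed-source) at a fixed temperature of 0 with customized token limits.
Use F1, FPR, the Anomaly Localization Score (AL Score), and the Attribution Consistency Score (AC Score) to assess detection, localization, and explanation.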
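The two anomaly types can be pictured as two sample shapes. The following is a minimal sketch under stated assumptions, not the SmartBench data format: the class names, field names, device names, prompt wording, and the `max_tokens` value are all hypothetical, and only the temperature of 0 comes from the summary above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Snapshot:
    """Context-independent sample: device states at a single moment (hypothetical shape)."""
    states: Dict[str, str]  # hypothetical mapping: device name -> state
    is_anomalous: bool = False


@dataclass
class TransitionSequence:
    """Context-dependent sample: an ordered list of state changes (hypothetical shape)."""
    steps: List[Dict[str, str]] = field(default_factory=list)
    is_anomalous: bool = False


# Only temperature 0 is stated in the summary; max_tokens is a placeholder value.
EVAL_CONFIG = {"temperature": 0.0, "max_tokens": 1024}


def render_prompt(sample: Union[Snapshot, TransitionSequence]) -> str:
    """Flatten a sample into plain text that could be sent to a model."""
    if isinstance(sample, Snapshot):
        body = ", ".join(f"{dev}={st}" for dev, st in sorted(sample.states.items()))
        return f"Current device states: {body}. Is anything anomalous?"
    lines = []
    for i, step in enumerate(sample.steps, start=1):
        changes = ", ".join(f"{dev}={st}" for dev, st in sorted(step.items()))
        lines.append(f"{i}. {changes}")
    return "State transitions:\n" + "\n".join(lines) + "\nIs anything anomalous?"


if __name__ == "__main__":
    snap = Snapshot({"oven": "on", "front_door": "open"}, is_anomalous=True)
    seq = TransitionSequence([{"front_door": "unlocked"}, {"alarm": "armed"}])
    print(render_prompt(snap))
    print(render_prompt(seq))
```

The point of the split is that a snapshot can be judged from the states alone, while a transition sequence can only be judged by reading the steps in order.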

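EXPERIMENTAL RESULTS

Research Questions

RQ1: How well can LLMs detect anomalous states in smart homes?
RQ2: Can LLMs analyze the underlying causes of anomalies?
RQ3: How does model scale affect anomaly detection performance?
RQ4: How does context compression affect model performance?
RQ5: Does few-shot learning help improve anomaly detection?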

Key Findings

Model	Context-Independent Precision	Context-Independent Recall	Context-Independent F1	Context-Independent FPR	Context-Independent AL Score	Context-Dependent Precision	Context-Dependent Recall	Context-Dependent F1	Context-Dependent FPR	Context-Dependent AL Score
gemini-3	74.2%	85.2%	79.3%	29.7%	0.491	57.4%	79.8%	66.8%	59.2%	0.347
gemini-2.5	64.5%	85.6%	73.5%	47.2%	0.397	53.8%	91.0%	67.6%	78.2%	0.365
claude-4.5	63.9%	74.0%	68.6%	41.8%	0.319	59.6%	59.0%	59.3%	40.0%	0.257
claude-4	73.8%	50.7%	60.1%	18.0%	0.232	67.3%	44.5%	53.6%	21.7%	0.247
deepseek-r1	75.8%	68.5%	72.0%	21.9%	0.365	52.2%	83.7%	64.3%	76.5%	0.261
deepseek-v3	83.4%	37.1%	51.3%	7.4%	0.179	53.9%	51.3%	52.6%	43.8%	0.170
gpt-5	92.6%	68.9%	79.0%	5.5%	0.416	68.8%	48.8%	57.1%	22.2%	0.251
gpt-5-mini	68.5%	76.9%	72.5%	35.3%	0.363	60.9%	68.8%	64.6%	44.2%	0.252
qwen-3-32b	53.1%	83.1%	64.8%	73.3%	0.189	51.0%	80.0%	62.3%	77.0%	0.185
qwen-3-8b	52.4%	41.3%	46.2%	37.5%	0.052	53.3%	61.7%	57.2%	54.0%	0.105
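For readers who want to relate the columns to one another, the sketch below uses the standard definitions of precision, recall, F1, and FPR, treating "anomalous" as the positive class. That the paper computes them in exactly this way is an assumption, and the function names are illustrative. The AL Score is not reproduced because the summary does not define it.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-detection metrics from confusion counts (anomalous = positive)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of normal samples flagged as anomalous
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; works on fractions or percentages."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Context-independent precision and recall copied from two table rows.
    print(round(f1_from(74.2, 85.2), 1))  # gemini-3
    print(round(f1_from(92.6, 68.9), 1))  # gpt-5
```

Under this definition the F1 column follows from the precision and recall columns, which gives a quick consistency check on any row.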

Most models detect anomalies poorly: F1 averages about 66.7% on context-independent (CI) anomalies and about 60.5% on context-dependent (CD) anomalies.
Anomaly localization is weak: across the ten models tabulated above, the AL Score averages 0.300 (CI) and 0.244 (CD).
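Attribution explanations are generally weak: even the top models score only about 74% on attribution for CI anomalies and fall well below that on CD anomalies.
Larger models usually perform better, with scale-related gains within the Qwen and LLaMA families, though the trend is not universal.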
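Within the GPT-5 series, gpt-5 pairs high precision (92.6% CI) with a very low FPR (5.5% CI) at the cost of recall (68.9% CI, 48.8% CD), while gpt-5-mini reaches higher recall with a much higher FPR (35.3% CI), so anomaly signalling is inconsistent across the family.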
Context-dependent detection remains harder on average than context-independent detection across the evaluated models.
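The aggregate figures in these findings can be recomputed from the table. The sketch below copies the F1 and AL Score columns for the ten listed models and averages them; the variable and function names are illustrative, and the averages cover only the rows shown here, not all 13 models evaluated in the paper.

```python
from statistics import mean

# model: (CI F1 %, CI AL Score, CD F1 %, CD AL Score), copied from the table above
ROWS = {
    "gemini-3":    (79.3, 0.491, 66.8, 0.347),
    "gemini-2.5":  (73.5, 0.397, 67.6, 0.365),
    "claude-4.5":  (68.6, 0.319, 59.3, 0.257),
    "claude-4":    (60.1, 0.232, 53.6, 0.247),
    "deepseek-r1": (72.0, 0.365, 64.3, 0.261),
    "deepseek-v3": (51.3, 0.179, 52.6, 0.170),
    "gpt-5":       (79.0, 0.416, 57.1, 0.251),
    "gpt-5-mini":  (72.5, 0.363, 64.6, 0.252),
    "qwen-3-32b":  (64.8, 0.189, 62.3, 0.185),
    "qwen-3-8b":   (46.2, 0.052, 57.2, 0.105),
}


def column_mean(index: int) -> float:
    """Average one column of ROWS across all listed models."""
    return mean(row[index] for row in ROWS.values())


def models_where_cd_is_harder() -> list:
    """Models whose context-dependent F1 is below their context-independent F1."""
    return [name for name, (ci_f1, _, cd_f1, _) in ROWS.items() if cd_f1 < ci_f1]


if __name__ == "__main__":
    print("CI F1 mean:", round(column_mean(0), 1))
    print("CD F1 mean:", round(column_mean(2), 1))
    print("CI AL mean:", round(column_mean(1), 3))
    print("CD AL mean:", round(column_mean(3), 3))
    print("CD harder for:", len(models_where_cd_is_harder()), "of", len(ROWS))
```

Comparing F1 per model, rather than only on average, shows which models run against the overall trend.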

This review was generated by AI and reviewed by a human editor.