QUICK REVIEW

[Paper Digest] Cracking IoT Security: Can LLMs Outsmart Static Analysis Tools?

Jason Quantrill, N. Khajehnouri|arXiv (Cornell University)|Jan 2, 2026

Adversarial Robustness in Machine Learning|Cited by 0

One-Sentence Summary

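This paper evaluates how well large language models (LLMs) detect rule interaction threats (RITs) in openHAB TAC rules, compares them against symbolic static analysis, and proposes a hybrid workflow that improves precision while preserving recall.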

ABSTRACT

Smart home IoT platforms such as openHAB rely on Trigger Action Condition (TAC) rules to automate device behavior, but the interplay among these rules can give rise to interaction threats, unintended or unsafe behaviors emerging from implicit dependencies, conflicting triggers, or overlapping conditions. Identifying these threats requires semantic understanding and structural reasoning that traditionally depend on symbolic, constraint-driven static analysis. This work presents the first comprehensive evaluation of Large Language Models (LLMs) across a multi-category interaction threat taxonomy, assessing their performance on both the original openHAB (oHC/IoTB) dataset and a structurally challenging Mutation dataset designed to test robustness under rule transformations. We benchmark Llama 3.1 8B, Llama 70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1 across zero-, one-, and two-shot settings, comparing their results against oHIT's manually validated ground truth. Our findings show that while LLMs exhibit promising semantic understanding, particularly on action- and condition-related threats, their accuracy degrades significantly for threats requiring cross-rule structural reasoning, especially under mutated rule forms. Model performance varies widely across threat categories and prompt settings, with no model providing consistent reliability. In contrast, the symbolic reasoning baseline maintains stable detection across both datasets, unaffected by rule rewrites or structural perturbations. These results underscore that LLMs alone are not yet dependable for safety critical interaction-threat detection in IoT environments. We discuss the implications for tool design and highlight the potential of hybrid architectures that combine symbolic analysis with LLM-based semantic interpretation to reduce false positives while maintaining structural rigor.

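Research Motivation and Objectives

Assess the baseline ability of LLMs to validate and classify RITs on real-world openHAB datasets.
Determine how model size and prompting affect contextual reasoning and reliability.
Test scalability and generalization on a mutation dataset containing engineered interactions.
Evaluate a reconciliation-based hybrid workflow that combines symbolic analysis with LLM validation to reduce false positives.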

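Proposed Method

Evaluate multiple LLMs (Llama 3.1 8B/70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1) under zero-shot, one-shot, and two-shot prompting.
Use oHIT as the symbolic static analysis baseline to generate RIT candidates.
Introduce a hybrid Reconciliation & Validation pipeline that filters, classifies, and validates threats through LLM contextual checks.
Use two datasets (openHAB Community and IoTBench) plus a Mutation dataset with engineered interactions to stress-test robustness.
Apply prompt-based extraction to classify RITs into categories (WAC, SAC, WTC, STC, WCC, SCC), and evaluate with micro accuracy and per-category recall.
Analyze experiments under multiple-response and single-response conditions to assess the precision-recall trade-off.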
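The two metrics named in the method list can be illustrated with a minimal sketch. It assumes each RIT carries exactly one ground-truth category and one predicted category from the six listed above. The function names and the sample labels are hypothetical and are not taken from the paper, whose exact metric definitions may differ.

```python
from collections import Counter

CATEGORIES = ["WAC", "SAC", "WTC", "STC", "WCC", "SCC"]


def micro_accuracy(truth, pred):
    """Fraction of all RITs whose predicted category equals the ground truth."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    if not truth:
        return 0.0
    correct = sum(t == p for t, p in zip(truth, pred))
    return correct / len(truth)


def per_category_recall(truth, pred):
    """Per category: correctly labelled RITs / RITs that truly belong to it.

    Categories with no ground-truth instances map to None (recall undefined).
    """
    support = Counter(truth)
    hits = Counter(t for t, p in zip(truth, pred) if t == p)
    return {
        c: (hits[c] / support[c]) if support[c] else None
        for c in CATEGORIES
    }


# Hypothetical labels for illustration only (not data from the paper).
truth = ["WAC", "WAC", "STC", "SCC", "WTC", "STC"]
pred = ["WAC", "SAC", "STC", "SCC", "STC", "STC"]

print(micro_accuracy(truth, pred))
print(per_category_recall(truth, pred))
```

Reporting recall per category alongside a single pooled accuracy matters here because a model can score well overall while missing most instances of a rare category.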

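Experimental Results

Research Questions

RQ1 Baseline capability: On real-world openHAB data, how effective are pretrained LLMs at validating and classifying RITs?
RQ2 Model scale effect: How does LLM size affect the accuracy of contextual RIT validation and the consistency of reasoning?
RQ3 Scalability and generalization: Does the approach maintain its performance on a mutation dataset containing real vulnerabilities?
RQ4 Hybrid effectiveness: Compared with the symbolic and LLM-only approaches, does the hybrid workflow improve precision and reduce false positives?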

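Key Findings

LLMs show promising semantic understanding of action-related and condition-related threats, but struggle with cross-rule structural reasoning.
Accuracy drops when a threat requires complex multi-rule reasoning or appears in a mutated rule form.
The symbolic reasoning baseline maintains stable detection across datasets and is unaffected by rule rewrites.
The reconciliation-based hybrid workflow substantially improves precision (for example, on challenging cases) while preserving the high recall of symbolic analysis.
Performance varies widely across threat categories and prompt settings, and no single model delivers consistent reliability on its own.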
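The precision and recall effect described for the hybrid workflow can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's pipeline: the rule-pair identifiers, the candidate set, and the verdict table standing in for an LLM contextual check are all hypothetical, and no real oHIT or LLM API is called.

```python
from typing import Callable, Iterable, Set, Tuple


def precision_recall(reported: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a reported threat set against ground truth."""
    tp = len(reported & truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


def reconcile(candidates: Iterable[str], validator: Callable[[str], bool]) -> Set[str]:
    """Keep only the symbolic candidates that the validator confirms."""
    return {c for c in candidates if validator(c)}


# Hypothetical rule-pair identifiers (not data from the paper).
ground_truth = {"r1-r2", "r3-r4", "r5-r6"}

# Assumed symbolic output: every true threat plus two false positives.
symbolic_candidates = {"r1-r2", "r3-r4", "r5-r6", "r7-r8", "r9-r10"}

# Stand-in for an LLM contextual check: a fixed table of hypothetical verdicts.
llm_verdicts = {
    "r1-r2": True,
    "r3-r4": True,
    "r5-r6": True,
    "r7-r8": False,  # false positive rejected by the validator
    "r9-r10": True,  # false positive the validator fails to reject
}


def stub_validator(candidate: str) -> bool:
    # Unknown candidates are kept by default so recall is not sacrificed.
    return llm_verdicts.get(candidate, True)


hybrid = reconcile(symbolic_candidates, stub_validator)

print("symbolic only:", precision_recall(symbolic_candidates, ground_truth))
print("hybrid:       ", precision_recall(hybrid, ground_truth))
```

In this toy setup the validator can only remove candidates, so recall can never exceed that of the symbolic stage, and precision improves only to the extent that the validator rejects false positives without rejecting true threats.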

This digest was generated by AI and reviewed by a human editor.