QUICK REVIEW

[Paper Review] Cracking IoT Security: Can LLMs Outsmart Static Analysis Tools?

Jason Quantrill, N. Khajehnouri|arXiv (Cornell University)|2026. 01. 02.

Adversarial Robustness in Machine Learning | Citations: 0

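One-Line Summary

This paper evaluates large language models (LLMs) for detecting Rule Interaction Threats (RITs) in openHAB TAC rules, compares them with symbolic static analysis, and proposes a hybrid workflow that raises precision while preserving recall.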

ABSTRACT

Smart home IoT platforms such as openHAB rely on Trigger Action Condition (TAC) rules to automate device behavior, but the interplay among these rules can give rise to interaction threats, unintended or unsafe behaviors emerging from implicit dependencies, conflicting triggers, or overlapping conditions. Identifying these threats requires semantic understanding and structural reasoning that traditionally depend on symbolic, constraint-driven static analysis. This work presents the first comprehensive evaluation of Large Language Models (LLMs) across a multi-category interaction threat taxonomy, assessing their performance on both the original openHAB (oHC/IoTB) dataset and a structurally challenging Mutation dataset designed to test robustness under rule transformations. We benchmark Llama 3.1 8B, Llama 70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1 across zero-, one-, and two-shot settings, comparing their results against oHIT's manually validated ground truth. Our findings show that while LLMs exhibit promising semantic understanding, particularly on action- and condition-related threats, their accuracy degrades significantly for threats requiring cross-rule structural reasoning, especially under mutated rule forms. Model performance varies widely across threat categories and prompt settings, with no model providing consistent reliability. In contrast, the symbolic reasoning baseline maintains stable detection across both datasets, unaffected by rule rewrites or structural perturbations. These results underscore that LLMs alone are not yet dependable for safety critical interaction-threat detection in IoT environments. We discuss the implications for tool design and highlight the potential of hybrid architectures that combine symbolic analysis with LLM-based semantic interpretation to reduce false positives while maintaining structural rigor.

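Research Motivation and Objectives

Evaluate the baseline ability of LLMs to validate and classify RITs on real-world openHAB datasets.
Determine how model size and prompting affect contextual reasoning and reliability.
Test scalability and generalizability on a dataset built by mutating vulnerable cases.
Evaluate a reconciliation-based hybrid workflow that combines symbolic analysis with LLM validation to reduce false positives.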

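Proposed Method

Evaluate multiple LLMs, including Llama 3.1 8B/70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1, under zero-shot, one-shot, and two-shot prompting.
Use oHIT as the symbolic static-analysis baseline to generate RIT candidates.
Introduce a hybrid reconciliation and verification pipeline in which the LLM's contextual checks filter, classify, and validate threats.
Stress-test robustness using the two source datasets (openHAB Community and IoTBench) together with a Mutation dataset containing engineered interactions.
Use prompt-based elicitation to classify RITs into six categories (WAC, SAC, WTC, STC, WCC, SCC), and evaluate with micro accuracy and per-class recall.
Analyze experiments under multi-answer versus single-answer conditions to assess the precision-recall trade-off.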
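
The two metrics named above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names and the four example labels are hypothetical, and only the six category abbreviations come from the review. Micro accuracy is taken here as the fraction of all RIT instances whose predicted category matches the ground truth, and per-class recall as that same fraction computed within each true category.

```python
from collections import Counter

# Category abbreviations as listed in the review; everything else is illustrative.
CATEGORIES = ["WAC", "SAC", "WTC", "STC", "WCC", "SCC"]

def micro_accuracy(y_true, y_pred):
    """Fraction of all instances whose predicted category equals the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred, categories=CATEGORIES):
    """Recall per category; None when a category has no ground-truth instances."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: (hits[c] / support[c] if support[c] else None) for c in categories}

# Hypothetical labels, invented for illustration only.
y_true = ["WAC", "WAC", "STC", "SCC"]
y_pred = ["WAC", "SAC", "STC", "SCC"]

accuracy = micro_accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
print(accuracy)
print(recall)
```

Reporting recall per category alongside the pooled accuracy matters when categories are imbalanced, since a model can score well overall while missing most instances of a rare threat type.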

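Experimental Results

Research Questions

RQ1 Baseline capability: How effectively can pre-trained LLMs validate and classify RITs on real-world openHAB data?
RQ2 Model scaling effect: How does LLM size affect contextual validation accuracy and reasoning consistency for RITs?
RQ3 Scalability and generalizability: Does the approach maintain its performance on a mutation-based dataset containing real vulnerabilities?
RQ4 Hybrid effectiveness: Compared with symbolic-only and LLM-only approaches, does the hybrid workflow improve precision and reduce false positives?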

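Key Findings

LLMs show promising semantic understanding of action- and condition-related threats but struggle with cross-rule structural reasoning.
Accuracy drops for threats that require complex multi-rule reasoning or that appear in mutated rule forms.
The symbolic reasoning baseline provides stable detection across datasets and is unaffected by rule rewrites.
The reconciliation-based hybrid workflow substantially improves precision (for example, on challenging cases) while retaining the high recall of the symbolic analysis.
Performance varies widely across threat categories and prompt settings, and no model is consistently reliable on its own.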
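
The precision and recall behavior described for the hybrid workflow can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: the candidate tuple shape, the rule names, and the `stub_verifier` function (standing in for an LLM verdict) are all hypothetical. A symbolic analyzer over-approximates the set of threats, and a second-stage verifier discards candidates it judges to be false positives.

```python
from typing import Callable, Set, Tuple

# Hypothetical candidate shape: (first rule, second rule, threat category).
Candidate = Tuple[str, str, str]

def precision_recall(predicted: Set[Candidate], truth: Set[Candidate]) -> Tuple[float, float]:
    """Precision and recall of a predicted threat set against ground truth."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

def hybrid_filter(candidates: Set[Candidate],
                  verifier: Callable[[Candidate], bool]) -> Set[Candidate]:
    """Keep only the symbolic candidates that the verifier accepts."""
    return {c for c in candidates if verifier(c)}

# Invented data for illustration only.
truth = {("r1", "r2", "WAC"), ("r3", "r4", "STC")}
symbolic = truth | {("r5", "r6", "SAC"), ("r7", "r8", "WCC")}

def stub_verifier(candidate: Candidate) -> bool:
    # Stand-in for an LLM call: rejects one spurious candidate, keeps the rest.
    return candidate != ("r5", "r6", "SAC")

symbolic_scores = precision_recall(symbolic, truth)
hybrid_scores = precision_recall(hybrid_filter(symbolic, stub_verifier), truth)
print(symbolic_scores)
print(hybrid_scores)
```

In this toy setup recall is unchanged only because the stub never rejects a true threat. A real verifier that wrongly discards true positives would trade recall for precision, which is why the filtering stage can only be as safe as the verifier's false-negative rate.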

This review was generated by AI and reviewed by a human editor.