QUICK REVIEW

[Paper Digest] IoT Device Labeling Using Large Language Models

Bar Meyuhas, Anat Bremler-Barr | arXiv (Cornell University) | Mar 3, 2024

Big Data and Digital Economy | Cited by 5

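One-Sentence Summary

This paper proposes a passive IoT labeling method that uses rich textual features extracted from network traffic and LLM-driven zero-shot classification to identify the vendor and function of unseen IoT devices, with an automatic update mechanism for its catalogs.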

ABSTRACT

The IoT market is diverse and characterized by a multitude of vendors that support different device functions (e.g., speaker, camera, vacuum cleaner, etc.). Within this market, IoT security and observability systems use real-time identification techniques to manage these devices effectively. Most existing IoT identification solutions employ machine learning techniques that assume the IoT device, labeled by both its vendor and function, was observed during their training phase. We tackle a key challenge in IoT labeling: how can an AI solution label an IoT device that has never been seen before and whose label is unknown? Our solution extracts textual features such as domain names and hostnames from network traffic, and then enriches these features using Google search data alongside catalogs of vendors and device functions. The solution also integrates an auto-update mechanism that uses Large Language Models (LLMs) to update these catalogs with emerging device types. Based on the information gathered, the device's vendor is identified through string matching with the enriched features. The function is then deduced by LLMs and zero-shot classification from a predefined catalog of IoT functions. In an evaluation of our solution on 97 unique IoT devices, our function labeling approach achieved HIT1 and HIT2 scores of 0.7 and 0.77, respectively. As far as we know, this is the first research to tackle AI-automated IoT labeling.

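Research Motivation and Objectives

Address the challenge of labeling unseen IoT devices in real-time security and observability settings.
Use textual features from traffic (domain names, hostnames, TLS issuers, OUIs, user agents), enriched with search results.
Identify the vendor by string matching on the enriched features, and infer the function with zero-shot LLM classification.
Enable catalog updates for new device types without retraining the model.
Provide explanations for labeling decisions to support human verification.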

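Proposed Method

Extract textual features (domain names, hostnames, TLS issuers, OUIs, user agents) from IoT device network traffic.
Enrich the features by querying Google search results through SerpAPI to obtain the top-k descriptions for each feature value.
Identify the vendor by string matching the enriched features against the vendor catalog.
Identify the function with a zero-shot classification LLM (RoBERTa), using either a vendor-specific or the full function catalog.
Select the final label, and provide a justification, through a weighted aggregation of per-feature confidence scores (weighted by feature type).
Use an offline, passive labeling pipeline that can be updated through catalog updates as new device types emerge.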
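
The vendor matching and weighted aggregation steps listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the catalog entries, the feature-type weights, the enriched texts, and the per-feature function scores are all made-up assumptions, and the function names (`match_vendors`, `aggregate_labels`, `label_vendor`) are hypothetical. In a real pipeline the per-feature function scores would presumably come from a zero-shot classifier; here they are supplied by hand.

```python
from collections import defaultdict

# Hypothetical vendor catalog and feature-type weights. The paper's actual
# catalog and weights are not given in this digest.
VENDOR_CATALOG = ["samsung", "amazon", "google"]
FEATURE_WEIGHTS = {"domain": 1.0, "hostname": 1.0, "tls_issuer": 0.8,
                   "user_agent": 0.8, "oui": 0.5}


def match_vendors(enriched_text, catalog=VENDOR_CATALOG):
    """Return catalog vendors whose name appears in the enriched text."""
    text = enriched_text.lower()
    return [vendor for vendor in catalog if vendor in text]


def aggregate_labels(per_feature_scores, weights=FEATURE_WEIGHTS):
    """Weighted sum of per-feature label scores, ranked best first.

    per_feature_scores: list of (feature_type, {label: confidence}) pairs.
    """
    totals = defaultdict(float)
    for feature_type, scores in per_feature_scores:
        weight = weights.get(feature_type, 0.0)
        for label, confidence in scores.items():
            totals[label] += weight * confidence
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def label_vendor(enriched_features):
    """enriched_features: list of (feature_type, enriched_text) pairs."""
    per_feature = [
        (ftype, {vendor: 1.0 for vendor in match_vendors(text)})
        for ftype, text in enriched_features
    ]
    return aggregate_labels(per_feature)


# Illustrative enriched features (invented text, not real search results).
features = [
    ("domain", "api.smartthings.com: SmartThings is a Samsung smart home platform"),
    ("oui", "Samsung Electronics Co., Ltd"),
    ("hostname", "hub: generic network hub"),
]
vendor_ranking = label_vendor(features)

# Hand-written stand-ins for per-feature zero-shot function scores.
function_ranking = aggregate_labels([
    ("domain", {"hub": 0.6, "camera": 0.3}),
    ("user_agent", {"hub": 0.5, "speaker": 0.4}),
])
```

The same `aggregate_labels` helper serves both the vendor and the function ranking, which mirrors the idea that each feature contributes a confidence that is then weighted by its type. The contributing features and their scores could also be kept alongside the ranking to serve as the justification shown to a human reviewer.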

Figure 1 : Example of Features for the SmartThing Hub: First, we present the features derived from the traffic, followed by a sample of the enriched features (the color correlates between the feature and the enriched feature). Words relevant to the vendor label decision are highlighted in bold, and

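Experimental Results

Research Questions

RQ1: How effective is labeling unseen IoT devices by vendor and function using enriched textual features?
RQ2: How do different feature types and their degree of enrichment affect labeling accuracy?
RQ3: Can zero-shot LLM classification robustly map enriched features to IoT functions?
RQ4: How effective are catalog updates at maintaining labeling accuracy for new device types?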

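Key Findings

Vendor labeling via enriched features and string matching achieves HIT1 = 0.86 and HIT2 = 0.89.
Function labeling with RoBERTa on enriched features achieves HIT1 = 0.70 and HIT2 = 0.77.
OUI-based vendor identification remains less accurate (HIT1 = 0.64); richer textual enrichment improves labeling.
The evaluation covers 97 unique devices from 55 vendors and 21 functions, all treated as unseen devices.
Zero-shot classification makes it possible to update the function catalog without retraining the model.
The enriched features (Domains, Hostname, TLS, User-Agents, OUI) contribute differently to accuracy; the combination Domains+Hostname+TLS+User-Agents+OUI gives the best vendor labeling results.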
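
This digest does not define the HIT1 and HIT2 scores. Assuming they follow the usual top-k hit rate (the fraction of devices whose true label appears among the k highest-ranked predictions), a minimal sketch of the computation might look like this. The function name `hit_at_k` and the toy data are illustrative assumptions, not taken from the paper.

```python
def hit_at_k(ranked_predictions, true_labels, k):
    """Fraction of devices whose true label is among the top-k predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need one ranking per true label")
    if not true_labels:
        return 0.0
    hits = sum(
        1 for ranking, truth in zip(ranked_predictions, true_labels)
        if truth in ranking[:k]
    )
    return hits / len(true_labels)


# Toy data, made up for illustration (not results from the paper).
predictions = [
    ["camera", "doorbell"],
    ["speaker", "hub"],
    ["plug", "light bulb"],
    ["hub", "camera"],
]
truth = ["camera", "hub", "thermostat", "hub"]

hit1 = hit_at_k(predictions, truth, 1)  # 2 of 4 devices correct at rank 1
hit2 = hit_at_k(predictions, truth, 2)  # 3 of 4 devices correct within top 2
```

Under this reading, HIT2 can never be lower than HIT1, which is consistent with the reported pairs (0.86 and 0.89 for vendors, 0.70 and 0.77 for functions).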

Figure 2 : A schematic illustration of our IoT labeling solution. First, features are extracted and then enriched. Second, the vendor and function models perform the labeling. The system’s output is a label, confidence, and justification for each device.

This digest was generated by AI and reviewed by human editors.