QUICK REVIEW

[Paper Review] Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Eyal Hadad, Mordechai Guri|arXiv (Cornell University)|Mar 26, 2026

Security and Verification in Computing | Cited by 0

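One-Sentence Summary

The paper shows that local Vision-Language Models (VLMs) with dynamic preprocessing leak input-dependent information through a dual-layer side channel: timing signals reveal an image's geometry (aspect ratio) and cache signals reveal its semantic content. It also discusses mitigations and design recommendations.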

ABSTRACT

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

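Research Motivation and Goals

Show that Dynamic High-Resolution preprocessing (AnyRes) makes the workload of a local VLM depend on its input, which creates a side-channel leak.
Demonstrate a two-tier attack: Tier 1 infers image geometry (aspect ratio) from timing, and Tier 2 resolves semantic content through LLC contention.
Evaluate the leak across models (LLaVA-NeXT, Qwen2-VL) and hardware to assess the privacy risk.
Analyze the security trade-offs of mitigations and propose practical design recommendations for secure Edge AI.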
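
To make the input-dependent workload concrete, here is a minimal sketch of AnyRes-style tiling. It is loosely modeled on best-fit grid selection and is not taken from the paper: the tile size, the candidate grid list, the extra overview tile, and the function names are all assumptions chosen for illustration.

```python
# Simplified, hypothetical sketch of AnyRes-style tiling (not the paper's code).
TILE = 336  # assumed tile edge in pixels
CANDIDATE_GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed (cols, rows)


def fit_inside(width, height, box_w, box_h):
    """Scale (width, height) to fit inside the box, keeping the aspect ratio."""
    if box_w * height <= box_h * width:  # width is the limiting side
        return box_w, height * box_w // width
    return width * box_h // height, box_h


def select_grid(width, height):
    """Pick the grid that keeps the most image pixels, then wastes the least area."""
    def score(grid):
        cols, rows = grid
        box_w, box_h = cols * TILE, rows * TILE
        fit_w, fit_h = fit_inside(width, height, box_w, box_h)
        effective = min(fit_w * fit_h, width * height)
        wasted = box_w * box_h - effective
        return (effective, -wasted)
    return max(CANDIDATE_GRIDS, key=score)


def tile_count(width, height):
    """Number of tiles the vision encoder would process for this image."""
    cols, rows = select_grid(width, height)
    return cols * rows + 1  # grid tiles plus one downscaled overview tile (assumed)
```

Under these assumptions, `tile_count(1000, 1000)` yields 5 tiles while `tile_count(500, 1000)` yields 3, so a 1:1 image costs more encoder work than a 1:2 image. That difference in work is what a timing observer could pick up.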

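Proposed Method

Analysis of the architecture of local VLMs and of the AnyRes dynamic preprocessing pipeline.
Two-tier attack framework: Tier 1 uses coarse timing, available without privileges, to infer image geometry (aspect ratio).
Tier 2 profiles LLC contention to infer the semantic density of the image content.
Experimental setup on Intel and AMD hardware, running llama.cpp with measurements collected via perf.
Dataset design with a geometric benchmark (1:1 vs. 1:2) and a semantic benchmark (dense vs. sparse content).
Two-dimensional feature modeling that classifies content from execution time and LLC miss rate.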
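
The two-tier idea can be sketched as a small nearest-mean classifier over the two features named above. This is an illustrative sketch, not the authors' classifier: the profiling numbers are invented placeholders, and `PROFILE` and `classify` are hypothetical names.

```python
from statistics import mean

# Hypothetical profiling data, invented for illustration only:
# (geometry, density) -> [(exec_time_s, llc_miss_rate), ...]
PROFILE = {
    ("1:1", "dense"): [(9.8, 0.42), (10.1, 0.45)],
    ("1:1", "sparse"): [(9.9, 0.21), (10.2, 0.19)],
    ("1:2", "dense"): [(6.1, 0.40), (6.3, 0.43)],
    ("1:2", "sparse"): [(6.0, 0.20), (6.2, 0.22)],
}


def _mean_time(profile, geometry):
    """Mean execution time over all profiled samples with this geometry."""
    return mean(t for (g, _), samples in profile.items()
                if g == geometry for t, _ in samples)


def _mean_miss(profile, label):
    """Mean LLC miss rate for one (geometry, density) class."""
    return mean(m for _, m in profile[label])


def classify(exec_time, miss_rate, profile=PROFILE):
    # Tier 1: execution time alone selects the geometry.
    geometries = sorted({g for g, _ in profile})
    geometry = min(geometries,
                   key=lambda g: abs(exec_time - _mean_time(profile, g)))
    # Tier 2: within that geometry, the LLC miss rate selects the density.
    densities = sorted(d for g, d in profile if g == geometry)
    density = min(densities,
                  key=lambda d: abs(miss_rate - _mean_miss(profile, (geometry, d))))
    return geometry, density
```

With the placeholder profile, `classify(9.7, 0.44)` returns `("1:1", "dense")` and `classify(6.4, 0.18)` returns `("1:2", "sparse")`. Splitting the decision into two stages mirrors the tiered framing: timing resolves geometry first, and the cache feature is only compared among classes that share that geometry.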

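Experimental Results

Research Questions

RQ1: Can the dynamic preprocessing of local VLMs be exploited as an algorithmic side channel?
RQ2: How well can an unprivileged, co-located attacker infer input geometry from timing signals?
RQ3: Within a single geometry, can microarchitectural signals (LLC misses) reveal semantic content?
RQ4: How effective is the combined timing-and-cache attack across models and architectures?
RQ5: What overhead do mitigations incur, and which design recommendations improve the security of Edge AI deployments?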

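Key Findings

Dynamic preprocessing introduces a deterministic timing signal that separates inputs by aspect ratio (geometry).
Within a single geometry, LLC misses correlate with visual density, which lets Tier 2 infer semantic content.
The combined attack reaches 84.0% overall accuracy, with perfect recall on encrypted data (1.00) and near-perfect recall on chest X-rays (0.93).
Cross-model results show that the timing-based geometry leak persists across LLaVA v1.6, LLaVA v1.5, and Qwen2-VL, which points to dynamic preprocessing rather than model weights as the root cause.
Cross-architecture results show that the geometric signal persists on both Intel and AMD platforms, whereas the cache-based semantic signal varies with LLC size (it is weaker on AMD).
The attack exposes a privacy risk in local VLMs, and the paper highlights the substantial performance overhead of some mitigations, such as constant-work padding.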
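
As a rough illustration of why constant-work padding is costly, the sketch below pads every input to a fixed maximum tile count and compares total work against the unpadded case. It assumes cost grows linearly with tile count, which is a simplification; the function name and the tile counts in the usage note are hypothetical and are not measurements from the paper.

```python
def padding_overhead(tile_counts, max_tiles):
    """Relative extra work when every input is padded to max_tiles.

    Assumes per-image cost is proportional to the number of tiles processed.
    Returns 0.0 for no overhead, 1.0 for twice the work, and so on.
    """
    if not tile_counts:
        raise ValueError("tile_counts must not be empty")
    if any(n < 1 or n > max_tiles for n in tile_counts):
        raise ValueError("each tile count must be between 1 and max_tiles")
    actual_work = sum(tile_counts)
    padded_work = max_tiles * len(tile_counts)  # every input does the same work
    return padded_work / actual_work - 1.0
```

For example, a workload with tile counts `[3, 5, 4, 3]` padded to 5 tiles does 20 units of work instead of 15, an overhead of about 33%. Padding removes the tile-count difference that timing exposes, but every small input pays the cost of the largest one.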


This review was generated by AI and checked by a human editor.