QUICK REVIEW

[Paper Review] Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Eyal Hadad, Mordechai Guri | arXiv (Cornell University) | 2026-03-26

Security and Verification in Computing | Citations: 0

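One-Line Summary

The paper demonstrates multi-layer, input-dependent side-channel leakage in on-device vision-language models that use dynamic preprocessing, shows that timing and cache signals allow an attacker to infer both geometry (aspect ratio) and semantic content, and discusses mitigations and design recommendations.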

ABSTRACT

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

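Motivation and Objectives

Dynamic high-resolution preprocessing (AnyRes) creates input-dependent workloads in local VLMs, which makes side-channel leakage possible.
Demonstrate a dual-layer attack that first infers image geometry (aspect ratio) from timing and then resolves semantic content through LLC contention.
Evaluate the leakage across models (LLaVA-NeXT, Qwen2-VL) and hardware to assess the privacy risk.
Analyze the security trade-offs of mitigations and present practical design recommendations for secure Edge AI.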
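
To make the input-dependent workload in the first point concrete, the following is a minimal Python sketch of how an AnyRes-style preprocessor could map aspect ratio to tile count. It is an illustration under stated assumptions, not the code of any model evaluated in the paper: the 336-pixel tile size, the `CANDIDATES` list, and the helper names `select_canvas` and `tile_count` are all hypothetical.

```python
# Assumed tile edge and candidate canvas sizes (width, height).
# These values are illustrative assumptions, not taken from the paper.
TILE = 336
CANDIDATES = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]


def select_canvas(width, height, candidates=CANDIDATES):
    """Pick the canvas that preserves the most image pixels, then wastes the least."""
    best, best_eff, best_waste = None, -1, float("inf")
    for cw, ch in candidates:
        # Largest uniform scale at which the image still fits inside the canvas.
        scale = min(cw / width, ch / height)
        # Pixels of real image content kept on this canvas (never more than the original).
        eff = min(int(width * scale) * int(height * scale), width * height)
        waste = cw * ch - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (cw, ch), eff, waste
    return best


def tile_count(width, height):
    """Number of tiles sent to the vision encoder for one image."""
    cw, ch = select_canvas(width, height)
    # Grid tiles plus one downscaled overview tile (an assumption of this sketch).
    return (cw // TILE) * (ch // TILE) + 1


if __name__ == "__main__":
    print("1:1 image ->", tile_count(1000, 1000), "tiles")
    print("1:2 image ->", tile_count(500, 1000), "tiles")
```

Under these assumptions a square image and a 1:2 image produce different tile counts, so the encoder does a different amount of work for each. That difference in work is what a timing observer would see.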

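Proposed Method

Analysis of local VLM model architectures and the AnyRes dynamic preprocessing pipeline.
Dual-layer attack framework: Tier 1 uses coarse timing, without privileges, to infer image geometry (aspect ratio).
Tier 2 uses LLC contention profiling to estimate the semantic density of the image content.
Experimental setup on Intel and AMD hardware using llama.cpp and perf-based measurement.
Dataset design with a geometry benchmark (1:1 vs. 1:2) and a semantic benchmark (dense vs. sparse content).
Content classification using a two-dimensional feature model that combines execution time and LLC misses.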
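
The two-dimensional feature idea in the last point can be sketched as follows. This is a hedged illustration only: the review does not say which classifier the authors use, so a nearest-centroid rule is swapped in here, and every number in `TRAIN` is synthetic and invented for the example rather than measured. The names `fit_centroids` and `classify` are hypothetical.

```python
from statistics import mean

# Synthetic profiling samples: (execution_time_s, llc_misses, label).
# All values are invented for illustration; they are NOT results from the paper.
TRAIN = [
    (2.1, 1.0e6, "1:2 sparse"), (2.2, 1.1e6, "1:2 sparse"),
    (2.1, 3.0e6, "1:2 dense"), (2.3, 3.2e6, "1:2 dense"),
    (4.0, 2.0e6, "1:1 sparse"), (4.1, 2.2e6, "1:1 sparse"),
    (4.2, 6.0e6, "1:1 dense"), (4.0, 6.3e6, "1:1 dense"),
]


def fit_centroids(samples):
    """Scale both features to [0, 1] by their maxima and average them per label."""
    t_max = max(s[0] for s in samples)
    m_max = max(s[1] for s in samples)
    groups = {}
    for t, m, label in samples:
        groups.setdefault(label, []).append((t / t_max, m / m_max))
    centroids = {
        label: (mean(p[0] for p in pts), mean(p[1] for p in pts))
        for label, pts in groups.items()
    }
    return centroids, (t_max, m_max)


def classify(model, exec_time, llc_misses):
    """Return the label whose centroid is closest in the scaled 2D feature space."""
    centroids, (t_max, m_max) = model
    x, y = exec_time / t_max, llc_misses / m_max
    return min(
        centroids,
        key=lambda k: (centroids[k][0] - x) ** 2 + (centroids[k][1] - y) ** 2,
    )


if __name__ == "__main__":
    model = fit_centroids(TRAIN)
    print(classify(model, 2.15, 2.9e6))
    print(classify(model, 4.05, 2.1e6))
```

In this toy setup the execution-time axis separates the two geometries, and the LLC-miss axis separates dense from sparse content within one geometry, which mirrors the Tier 1 and Tier 2 split described above.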

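Experimental Results

Research Questions

RQ1: Can dynamic preprocessing in local VLMs be exploited as an algorithmic side channel?
RQ2: To what extent can an unprivileged, co-located attacker infer input geometry (aspect ratio) from timing signals?
RQ3: Within the same geometry, can microarchitectural signals (LLC misses) reveal semantic content?
RQ4: How effective is the combined timing and cache attack across different models and architectures?
RQ5: Which mitigations incur overhead, and which design recommendations improve secure Edge AI deployment?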

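Key Findings

Dynamic preprocessing introduces a deterministic timing signal that separates inputs by aspect ratio (geometry).
Within the same geometry, LLC misses correlate with visual density, which enables semantic inference in the second tier.
The combined attack achieves 84.0% overall accuracy, with perfect or near-perfect recall for encrypted data (1.00) and chest X-rays (0.93).
Cross-model results show that timing-based geometry leakage persists across LLaVA v1.6, LLaVA v1.5, and Qwen2-VL, which points to dynamic preprocessing rather than model weights as the root cause.
Cross-architecture results show that the geometry signal holds on both Intel and AMD platforms, while the cache-based semantic signal varies with LLC size (the semantic signal is weaker on AMD).
The attack exposes privacy risks for on-device VLMs and highlights the substantial performance cost of certain mitigations such as constant-work padding.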
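
To show why constant-work padding is costly, here is a minimal Python sketch of the idea: every image is padded to the same worst-case tile count so that the work no longer depends on geometry. The function names, the blank-tile placeholder, and the tile counts (3 and 5) are assumptions made for illustration. The overhead figure the sketch computes is simple arithmetic, not the overhead measured in the paper.

```python
def pad_to_constant_work(tiles, max_tiles, blank_tile):
    """Return a tile list of exactly max_tiles entries by appending blank tiles."""
    if len(tiles) > max_tiles:
        raise ValueError("image produced more tiles than the assumed worst case")
    return list(tiles) + [blank_tile] * (max_tiles - len(tiles))


def padding_overhead(actual_tiles, max_tiles):
    """Fraction of extra encoder work relative to the unpadded input."""
    return max_tiles / actual_tiles - 1.0


if __name__ == "__main__":
    # Hypothetical example: a 3-tile image padded to a 5-tile worst case.
    padded = pad_to_constant_work(["t0", "t1", "t2"], 5, "blank")
    print(len(padded), "tiles after padding")
    print(f"extra work: {padding_overhead(3, 5):.0%}")
```

The trade-off is direct: the larger the gap between the typical and the worst-case tile count, the more wasted computation every small input pays to hide its geometry.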


This review was generated by AI and reviewed by a human editor.