QUICK REVIEW

[Paper Review] Entangled Watermarks as a Defense against Model Extraction

Hengrui Jia, Christopher A. Choquette-Choo|arXiv (Cornell University)|2020. 02. 27.

Adversarial Robustness in Machine Learning | References: 54 | Citations: 46

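One-Line Summary

Introduces Entangled Watermarking Embeddings (EWE), which use the soft nearest neighbor loss (SNNL) to entangle the watermark signal with the task representations, enabling a robust defense against model extraction.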

ABSTRACT

Machine learning involves expensive data collection and training procedures. Model owners may be concerned that valuable intellectual property can be leaked if adversaries mount model extraction attacks. As it is difficult to defend against model extraction without sacrificing significant prediction accuracy, watermarking instead leverages unused model capacity to have the model overfit to outlier input-output pairs. Such pairs are watermarks, which are not sampled from the task distribution and are only known to the defender. The defender then demonstrates knowledge of the input-output pairs to claim ownership of the model at inference. The effectiveness of watermarks remains limited because they are distinct from the task distribution and can thus be easily removed through compression or other forms of knowledge transfer. We introduce Entangled Watermarking Embeddings (EWE). Our approach encourages the model to learn features for classifying data that is sampled from the task distribution and data that encodes watermarks. An adversary attempting to remove watermarks that are entangled with legitimate data is also forced to sacrifice performance on legitimate data. Experiments on MNIST, Fashion-MNIST, CIFAR-10, and Speech Commands validate that the defender can claim model ownership with 95% confidence with less than 100 queries to the stolen copy, at a modest cost below 0.81 percentage points on average in the defended model's performance.

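Motivation and Objectives

Identify the limitation of existing watermarking: the watermark is learned separately from the task.
Propose Entangled Watermarking Embeddings (EWE) to entangle the watermark with the task representations.
Quantify the trade-off between model utility and watermark robustness across datasets.
Demonstrate the watermark's robustness to model extraction and backdoor defenses across vision and audio tasks.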

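Proposed Method

Use the soft nearest neighbor loss (SNNL) to measure and enforce entanglement between task data and watermark data.
Choose a watermark distribution and a trigger to generate watermarked inputs, then perturb those inputs to optimize semantic separation and strengthen entanglement.
Train with the combined loss: L = L_CE - kappa * sum_l SNNL([X_w^(l), X_{c_T}^(l)], Y', T^(l)), where l indexes layers and T^(l) is the temperature at layer l.
During training, interleave batches of standard task data with batches of watermark data.
Adjust the temperature schedule T^(l) during training to control the strength of entanglement.
Evaluate ownership verification via hypothesis testing, and quantify watermark robustness under extraction and retraining.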
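
The combined loss above can be illustrated with a small NumPy sketch. It assumes the standard SNNL definition (for each point, the negative log of the share of exponentially weighted neighbor mass that falls on points of the same group), and it reads X_w^(l) and X_{c_T}^(l) as layer-l activations of watermarked inputs and of legitimate inputs from a target class c_T, with Y' marking which group each sample belongs to. The function names `snnl` and `ewe_loss` and the toy data are hypothetical, and a real training loop would need a differentiable framework rather than NumPy.

```python
import numpy as np


def snnl(x, y, temperature):
    """Soft nearest neighbor loss of points x (N, D) with group labels y (N,).

    Low values mean the groups are well separated; high values mean
    they are entangled (each point's close neighbors are mostly from
    the other group).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    weights = np.exp(-sq_dist / temperature)
    np.fill_diagonal(weights, 0.0)  # a point is not its own neighbor
    same_group = y[:, None] == y[None, :]
    eps = 1e-12  # guards against log(0) and division by zero
    numer = np.sum(weights * same_group, axis=1)
    denom = np.sum(weights, axis=1)
    return -np.mean(np.log((numer + eps) / (denom + eps)))


def ewe_loss(cross_entropy, layer_activations, group_labels, temperatures, kappa):
    """L = L_CE - kappa * sum_l SNNL(activations^(l), Y', T^(l)).

    Subtracting the SNNL term means that minimizing L pushes the SNNL
    up, i.e. pushes watermark and legitimate activations together.
    """
    penalty = sum(
        snnl(acts, group_labels, temp)
        for acts, temp in zip(layer_activations, temperatures)
    )
    return cross_entropy - kappa * penalty


# Toy activations (assumed, not from the paper): 4 legitimate, 4 watermark.
legit = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Y': 0 = legitimate, 1 = watermark
separated = np.vstack([legit, legit + 10.0])  # watermark far from the task data
entangled = np.vstack([legit, legit + 0.5])   # watermark mixed into the task data

print("SNNL separated:", snnl(separated, labels, 1.0))
print("SNNL entangled:", snnl(entangled, labels, 1.0))
print("EWE loss:", ewe_loss(1.0, [entangled], labels, [1.0], kappa=1.0))
```

The entangled layout yields the larger SNNL, which is why the loss subtracts the term. A lower temperature makes each point's nearest neighbors dominate the sum, which is one way to see how a temperature schedule can control entanglement strength.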

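Experimental Results

Research Questions

RQ1: How well can a watermark survive model extraction when it is entangled with the task manifold?
RQ2: Does entangling the watermark with task data via SNNL allow ownership to be verified with fewer queries?
RQ3: What impact does EWE have on model utility on standard benchmarks?
RQ4: How well does EWE scale to deeper architectures and to different modalities (vision and audio)?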

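Key Results

EWE improves watermark robustness: after extraction, the watermark success rate is higher than the baseline's on every dataset.
With EWE, fewer queries are needed to claim ownership with 95% confidence (typically about 30–100 queries, depending on the setting).
EWE's watermark success rate averages 38.39% (range 18.74%–60%), whereas the baseline's ranges from 0.3% to 9% (average 5.77%).
The watermark stays robust with minimal loss of validation accuracy (about 0.81 percentage points on average; at most about 3 percentage points).
Entanglement increases the representational similarity between watermark data and legitimate data (higher CKA) and makes their activation patterns overlap, so the watermark is hard to separate out.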
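
To make the query-count result more concrete, here is a minimal sketch of ownership verification as a hypothesis test. It uses a one-sided exact binomial test, which is a simple stand-in and not necessarily the test used in the paper. The null hypothesis is that the suspect model was trained independently and therefore assigns the watermark label to watermarked inputs only at some small baseline rate. The function names and the rates and counts in the usage lines are illustrative assumptions, not results from the paper.

```python
from math import comb


def binomial_tail(n, k, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def can_claim_ownership(n_queries, n_hits, null_rate, alpha=0.05):
    """One-sided exact binomial test for a watermark claim.

    n_queries: watermarked inputs sent to the suspect model
    n_hits:    how many were classified with the watermark label
    null_rate: assumed hit rate of an independently trained model
    alpha:     significance level (0.05 corresponds to 95% confidence)
    Returns (reject_null, p_value).
    """
    if not 0 <= n_hits <= n_queries:
        raise ValueError("n_hits must be between 0 and n_queries")
    p_value = binomial_tail(n_queries, n_hits, null_rate)
    return p_value < alpha, p_value


# Illustrative numbers only: 100 queries, an assumed 6% baseline rate.
print(can_claim_ownership(100, 38, 0.06))  # many hits: strong evidence
print(can_claim_ownership(100, 6, 0.06))   # hits at the baseline rate: no evidence
```

The wider the gap between the watermark success rate on the stolen copy and the baseline rate, the fewer queries are needed before the p-value falls below alpha, which matches the intuition behind the lower query counts reported for EWE.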

This review was generated by AI and reviewed by a human editor.