QUICK REVIEW

[Paper Review] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Ignacy Kolton, Kacper Marzol|arXiv (Cornell University)|2026. 01. 30.

Adversarial Robustness in Machine Learning|Citations: 0

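One-Line Summary

ReLAPSe reformulates concept restoration in unlearned diffusion models as a reinforcement learning problem with verifiable rewards, yielding efficient, transferable adversarial prompts that recover erased concepts.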

ABSTRACT

Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe

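Motivation and Goals

Motivate rigorous evaluation of concept erasure in text-to-image diffusion models that goes beyond standard prompting.
Align prompts with the model's residual representations in order to recover latent visual information.
Propose a policy-based framework that yields adversarial prompts transferable across targets and unlearning methods.
Provide a scalable tool for language-based diagnostic red-teaming, to evaluate and improve the robustness of unlearning.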

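Proposed Method

Formalize concept restoration as a reinforcement learning task in which a policy generates adversarial prompts conditioned on context.
Use Reinforcement Learning with Verifiable Rewards (RLVR), taking model-intrinsic feedback from the diffusion model's noise prediction loss.
Use Group Relative Policy Optimization (GRPO) to update the prompt-generation policy based on group-relative advantages within a group of prompts.
Define the reward as the average improvement in noise prediction accuracy over an unconditional-prompt baseline, across multiple diffusion timesteps.
Introduce two training settings, Single-Prompt Optimization and Multi-Prompt Optimization, to provide policies that are shared across targets and transferable.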
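
The reward and the group-relative advantage described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The function names, the use of a plain loss difference as the "improvement", and the standard-deviation normalization (the commonly used GRPO form) are all assumptions made for illustration, and the per-timestep losses are made-up numbers standing in for values that a frozen, unlearned diffusion model would produce.

```python
from statistics import mean, pstdev


def noise_prediction_reward(cond_losses, uncond_losses):
    """Average improvement over the unconditional baseline.

    cond_losses[i]   : noise-prediction loss with the candidate prompt at timestep i
    uncond_losses[i] : loss with the unconditional (empty) prompt at the same timestep
    Assumption: "improvement" is the plain difference uncond - cond, so a
    positive reward means the prompt helps the model predict the noise.
    """
    if not cond_losses or len(cond_losses) != len(uncond_losses):
        raise ValueError("need one conditional and one baseline loss per timestep")
    return mean(u - c for c, u in zip(cond_losses, uncond_losses))


def group_relative_advantages(rewards, eps=1e-8):
    """Score each prompt relative to the other prompts sampled in its group.

    Assumption: the usual GRPO normalization (reward - group mean) / group std.
    eps avoids division by zero when every prompt in the group ties.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical losses at three diffusion timesteps (illustrative numbers only).
baseline = [0.90, 0.80, 0.70]
group = [
    [0.60, 0.55, 0.50],  # candidate prompt 1: clearly better than the baseline
    [0.85, 0.80, 0.72],  # candidate prompt 2: roughly no better than the baseline
    [0.75, 0.70, 0.60],  # candidate prompt 3: somewhat better
]
rewards = [noise_prediction_reward(losses, baseline) for losses in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the prompt with the largest loss reduction receives the largest positive advantage, and a prompt that merely matches the baseline is pushed down relative to its group. The policy update would then raise the likelihood of high-advantage prompts.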

Figure 2 : Overview of our prompt optimization framework. A frozen, unlearned text-to-image diffusion model is probed by an LLM that generates candidate prompts. For each prompt, we measure the improvement in noise prediction accuracy relative to an unconditional baseline across multiple diffusion t

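Experimental Results

Research Questions

RQ1: Can policy-based prompt search restore erased concepts in unlearned diffusion models more efficiently than per-instance optimization?
RQ2: Does RLVR provide a verifiable feedback signal that aligns text prompts with latent visual residuals?
RQ3: Does the multi-prompt (global) policy generalize across diverse concepts and unlearning methods as well as single-prompt optimization does?
RQ4: To what extent can ReLAPSe quantify and stress-test the robustness of concept erasure across state-of-the-art unlearning techniques?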

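Key Results

ReLAPSe achieves attack success rates that are competitive with or superior to state-of-the-art attacks across multiple unlearning methods and concept categories.
The single-prompt setting gives the strongest restoration for individual instances and adapts closely to a specific target.
The multi-prompt setting produces scalable, transfer-friendly prompts without per-target optimization, making it suitable for broad red-teaming.
ReLAPSe exposes latent representations that persist after unlearning, highlighting the limits of current unlearning techniques.
Qualitative results show that the adversarial prompts recover fine-grained identities and styles across the Nudity, Object, and Style categories.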

Figure 3 : Qualitative comparison of nudity reconstruction across different methods. See Appendix B for full generation prompts.

This review was generated by AI and reviewed by a human editor.