QUICK REVIEW

[Paper Review] Two-Stage Data Synthesization: A Statistics-Driven Restricted Trade-off between Privacy and Prediction

Xiaotong Liu, Shao-Bo Lin | arXiv (Cornell University) | 2026. 02. 09.

Privacy-Preserving Technologies in Data | Citations: 0

One-Line Summary

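The paper proposes a two-stage synthetic data generation framework: a synthesis-then-hybrid stage first preserves the distribution of the inputs, and a kernel ridge regression stage then reconstructs the responses, yielding a statistics-driven restricted privacy-prediction trade-off.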

ABSTRACT

Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on maintaining statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it challenging to balance the privacy requirements that necessitate significant perturbations and the prediction performance that is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy, which involves a synthesis operation to generate pure synthetic data, followed by a hybrid operation that fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, where a KRR model is first trained on the original data and then used to generate synthetic outputs based on the synthetic inputs produced in the first stage. By leveraging the theoretical strengths of KRR and the covariant distribution retention achieved in the first stage, our proposed two-stage synthesis strategy enables a statistics-driven restricted privacy--prediction trade-off and guarantees optimal prediction performance. We validate our approach and demonstrate its characteristics of being statistics-driven and restricted in achieving the privacy--prediction trade-off both theoretically and numerically. Additionally, we showcase its generalizability through applications to a marketing problem and five real-world datasets.

Motivation and Objectives

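Motivate the need for privacy-preserving data sharing that still supports accurate downstream prediction.
Introduce a two-stage synthetic data generation (SDG) framework that balances privacy and prediction rather than focusing only on statistical fidelity.
Guarantee covariant distribution retention in the first stage to support reliable prediction in the second stage.
Guarantee prediction performance under distribution shift and mismatch through a model-based synthesis stage.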

Proposed Method

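Stage 1 uses a synthesis-then-hybrid strategy to generate synthetic inputs that retain the covariant distribution, controlled by a hybrid parameter alpha.
Stage 2 trains a kernel ridge regression (KRR) model on the original data and uses it to generate synthetic outputs from the synthetic inputs, thereby reconstructing the responses.
The first stage can use various strategies, such as Latin Hypercube Sampling (LHS), GANs, or diffusion models; the paper instantiates it with the LHS-H approach.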
The KRR-based second stage leverages the stability and distribution-mismatch robustness of kernel methods to preserve prediction performance.
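The combined LHS-H-KRR pipeline aims to realize a statistics-driven restricted privacy-prediction trade-off by integrating prediction into data synthesis.
The theoretical justification links covariant distribution retention to optimal prediction guarantees under distribution shift.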
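
The two stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the review does not describe the hybrid operation, so the per-row convex blend with weight `alpha` used here is a stand-in, and the quantile-mapped LHS, the Gaussian kernel, and the values of `lam` and `gamma` are likewise hypothetical choices. The toy data are simulated for the example only.

```python
import numpy as np

def lhs_unit(n, d, rng):
    """Latin Hypercube sample of n points in the unit cube: one point per stratum per dimension."""
    # Each column is a random permutation of the n strata, jittered inside its stratum.
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (strata + rng.random((n, d))) / n

def synthesize_inputs(X, alpha, rng):
    """Stage 1 (assumed form): LHS pushed through empirical marginal quantiles, then a blend."""
    n, d = X.shape
    u = lhs_unit(n, d, rng)
    X_pure = np.column_stack([np.quantile(X[:, j], u[:, j]) for j in range(d)])
    # Hypothetical hybrid rule: alpha = 0 keeps the pure synthetic rows,
    # alpha = 1 returns the original rows unchanged.
    return alpha * X + (1.0 - alpha) * X_pure

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def krr_fit_predict(X, y, X_new, lam=1e-3, gamma=1.0):
    """Stage 2: fit KRR on the original data, then generate outputs for the synthetic inputs."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    # Standard KRR closed form: solve (K + lam * n * I) c = y.
    coef = np.linalg.solve(K + lam * n * np.eye(n), y)
    return gaussian_kernel(X_new, X, gamma) @ coef

# Simulated original data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)

X_syn = synthesize_inputs(X, alpha=0.3, rng=rng)  # stage 1: synthetic inputs
y_syn = krr_fit_predict(X, y, X_syn)              # stage 2: synthetic outputs
```

The released dataset in this sketch would be the pair `(X_syn, y_syn)`; the original responses `y` are used only to fit the KRR model and never appear in the output.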

Experimental Results

Research Questions

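RQ1: Can a two-stage SDG design provide a better-controlled privacy-prediction trade-off than single-stage methods?
RQ2: How does covariant distribution retention in the first stage affect downstream prediction when the KRR-based second stage is used?
RQ3: Does the KRR-based generator reliably reconstruct the original regression relationship on anonymized data under distribution shift?
RQ4: How does replacing the first-stage synthesis method affect privacy and prediction outcomes?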

Key Findings

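The two-stage design (LHS-H-KRR) achieves a privacy-prediction trade-off by explicitly integrating prediction into data synthesis.
The synthesis-then-hybrid stage retains the covariant distribution, enabling robust prediction even under distribution differences.
The KRR-based second stage provides stable prediction performance and robustness to distribution mismatch.
LHS-based synthesis offers efficiency and interpretability advantages over GANs and diffusion models while preserving key statistics.
The framework demonstrates generalizability across a marketing task and five real-world datasets.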
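
One way to probe a privacy-prediction trade-off of this kind is to sweep the hybrid parameter and record a privacy proxy next to a downstream error. The sketch below is an assumption-laden illustration, not the paper's evaluation protocol: the privacy proxy (mean distance from each synthetic row to its closest original row), the convex-blend hybrid rule, the uniform stand-in for pure synthetic inputs, and the simulated data are all hypothetical, and no numbers from the paper are reproduced here.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def krr(X, y, X_new, lam=1e-3):
    """Fit kernel ridge regression on (X, y) and predict at X_new."""
    n = len(X)
    coef = np.linalg.solve(rbf(X, X) + lam * n * np.eye(n), y)
    return rbf(X_new, X) @ coef

def mean_closest_distance(X_syn, X):
    """Privacy proxy (assumption): average distance to the closest original row."""
    dist = np.sqrt(((X_syn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    return dist.min(axis=1).mean()

def target(X):
    """Noise-free regression function used to simulate data and score predictions."""
    return np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (150, 2))
y = target(X) + 0.1 * rng.standard_normal(150)
X_test = rng.uniform(-1, 1, (300, 2))
X_pure = rng.uniform(-1, 1, (150, 2))  # stand-in for pure synthetic inputs

results = {}
for alpha in (0.0, 0.5, 1.0):
    X_syn = alpha * X + (1 - alpha) * X_pure  # hypothetical hybrid rule
    y_syn = krr(X, y, X_syn)                  # outputs from KRR fit on the originals
    y_hat = krr(X_syn, y_syn, X_test)         # downstream model sees synthetic data only
    mse = np.mean((y_hat - target(X_test)) ** 2)
    results[alpha] = (mean_closest_distance(X_syn, X), mse)

for alpha, (distance, mse) in results.items():
    print(f"alpha={alpha:.1f}  closest-distance={distance:.4f}  test MSE={mse:.4f}")
```

In this sketch, `alpha = 1` reproduces the original inputs, so the distance proxy is exactly zero there; smaller values of `alpha` move the released inputs away from the originals, and the printed test error shows what that costs a downstream model.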

This review was generated by AI and reviewed by a human editor.