[Paper Review] Evaluating Voice Conversion-based Privacy Protection against Informed Attackers
The paper evaluates how well voice-conversion-based anonymization protects speakers against attackers with different levels of knowledge. Attackers with full knowledge of the scheme (Informed) largely defeat the protection; attackers with partial knowledge (Semi-Informed) can be mitigated by certain target-selection strategies; against Ignorant attackers, unlinkability is strong.
Speech data conveys sensitive speaker attributes like identity or accent. With a small amount of found data, such attributes can be inferred and exploited for malicious purposes: voice cloning, spoofing, etc. Anonymization aims to make the data unlinkable, i.e., ensure that no utterance can be linked to its original speaker. In this paper, we investigate anonymization methods based on voice conversion. In contrast to prior work, we argue that various linkage attacks can be designed depending on the attackers' knowledge about the anonymization scheme. We compare two frequency warping-based conversion methods and a deep learning based method in three attack scenarios. The utility of converted speech is measured via the word error rate achieved by automatic speech recognition, while privacy protection is assessed by the increase in equal error rate achieved by state-of-the-art i-vector or x-vector based speaker verification. Our results show that voice conversion schemes are unable to effectively protect against an attacker that has extensive knowledge of the type of conversion and how it has been applied, but may provide some protection against less knowledgeable attackers.
Motivation & Objective
- Assess unlinkability of voice-conversion (VC) anonymization under different attacker knowledge levels.
- Compare three VC methods (VoiceMask, VTLN-based VC, disentangled-representation VC) under varied target-selection strategies.
- Quantify privacy vs. utility by measuring speaker-verification EER and ASR WER on converted speech.
- Formalize threat models and provide guidance for privacy-preserving speech processing design.
Proposed method
- Evaluate three non-parallel, many-to-many, source- and language-independent VC methods: VoiceMask, VTLN-based VC, and disentangled representation VC.
- Define three target-selection strategies: const (fixed target), perm (random target per user), random (random target per utterance).
- Define three attacker knowledge levels (Ignorant, Semi-Informed, Informed) according to what the attacker knows about the VC method and its parameters.
- Assess unlinkability via EER from i-vector/x-vector based speaker verification on converted data and ASR WER on converted data.
- Train x-vector and i-vector systems on LibriSpeech; evaluate ASR with a hybrid CTC/Attention model trained on converted data.
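The three target-selection strategies listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name `assign_targets`, the `(speaker_id, utterance_id)` input format, and the choice of the first pool entry as the fixed `const` target are all hypothetical. The review describes `perm` as a random target per user, so the sketch samples one target per speaker with replacement; whether the paper enforces distinct targets per speaker is not stated here.

```python
import random


def assign_targets(utterances, targets, strategy, seed=0):
    """Map each utterance to a conversion target.

    utterances: list of (speaker_id, utterance_id) pairs (hypothetical format).
    targets:    list of candidate target identifiers (hypothetical pool).
    strategy:   "const", "perm", or "random", as named in the review.
    """
    rng = random.Random(seed)

    if strategy == "const":
        # One fixed target for every utterance of every speaker.
        fixed = targets[0]  # assumption: which target is fixed is arbitrary
        return {utt: fixed for _, utt in utterances}

    if strategy == "perm":
        # One randomly drawn target per speaker, reused for all of
        # that speaker's utterances.
        per_speaker = {}
        assignment = {}
        for spk, utt in utterances:
            if spk not in per_speaker:
                per_speaker[spk] = rng.choice(targets)
            assignment[utt] = per_speaker[spk]
        return assignment

    if strategy == "random":
        # A fresh randomly drawn target for every utterance.
        return {utt: rng.choice(targets) for _, utt in utterances}

    raise ValueError(f"unknown strategy: {strategy!r}")
```

The difference between `perm` and `random` is what a Semi-Informed attacker has to guess: under `perm` all utterances of a speaker share one unknown target, whereas under `random` each utterance may be converted toward a different one.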
Experimental results
Research questions
- RQ1: How does unlinkability vary with attacker knowledge (Ignorant, Semi-Informed, Informed) across VC methods and target-selection strategies?
- RQ2: Which target-selection strategy best protects privacy under realistic attacker knowledge levels?
- RQ3: What is the impact of VC on downstream ASR performance (WER) and speaker-verification metrics (EER) for each method?
Key findings
- Informed attackers achieve EERs similar to or lower than the baselines for some VC methods. Since a lower EER means utterances are easier to link to their speaker, this indicates limited privacy protection when the attacker has full knowledge of the VC scheme and targets.
- Against Semi-Informed attackers, VC provides substantial privacy protection, with the permutation strategy (perm) often giving the strongest unlinkability among the strategies.
- Against Ignorant attackers, unlinkability is strong: protection is much higher because the attacker is unaware that VC has been applied.
- VTLN-based VC with an appropriate target-selection strategy offers reasonable privacy protection against linkage attacks by attackers with partial knowledge, while VoiceMask is more vulnerable to Informed attackers.
- Disentangled-representation VC yields large WER increases, indicating poor utility under the evaluated setups, though its privacy profile varies with attacker knowledge and target strategy.
- Baseline EERs on untransformed data: i-vector 4.61% and x-vector 4.31%; ASR WER baseline 9.4%.
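To make the two metrics quoted above concrete, here is a minimal sketch of how EER and WER are commonly computed. It is not the paper's evaluation pipeline: the function names, the plain-list score format, and the threshold sweep over observed scores are assumptions made for illustration, and production toolkits typically interpolate the EER more carefully.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores:    verification scores for same-speaker trials.
    nontarget_scores: verification scores for different-speaker trials.
    Returns the error rate at the threshold where the false-acceptance
    and false-rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programme for the Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For privacy, a higher EER after conversion is better, because the verifier confuses speakers more often; for utility, a lower WER is better, because the converted speech remains intelligible to the recognizer.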
This review was created by AI and reviewed by human editors.