QUICK REVIEW

[Paper Review] Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

Nikita Kuzmin, Songting Liu|arXiv (Cornell University)|Jan 20, 2026

Speech Recognition and Synthesis | Cited by 0

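One-Sentence Summary

Stream-Voice-Anon applies a streaming architecture built on a neural audio codec and language-model conditioning to real-time speaker anonymization. It achieves higher intelligibility and emotion preservation than the prior streaming method while maintaining privacy protection against lazy-informed attackers and comparable latency.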

ABSTRACT

Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codec (NAC) provides superior speaker feature disentanglement and linguistic fidelity. NAC can also be used with causal language models (LM) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, a speaker embedding mixing and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% UAR relative) compared to the previous state-of-the-art streaming method DarkStream while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.

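Motivation and Goals

Motivate real-time speaker anonymization (SA) for streaming applications with strict latency constraints.
Use neural audio codec (NAC) representations and causal language models to disentangle content from speaker identity.
Introduce anonymization techniques, including pseudo-speaker sampling, speaker embedding mixing, and prompt-based LM conditioning, to strengthen privacy.
Explore the latency-privacy-utility trade-off under dynamic delay in a streaming architecture.
Evaluate on the VoicePrivacy 2024 benchmark and compare utility, privacy, and latency against existing methods.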

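Proposed Method

Use a causal streaming content encoder with a VQ bottleneck to extract speaker-invariant content tokens from HuBERT-derived features.
Adopt a two-stage autoregressive voice conversion (ARVC) model with Slow AR and Fast AR decoders that generates multiple acoustic codebooks per frame.
Condition ARVC on a global speaker embedding and prompt-derived acoustic context; use a dynamic per-utterance delay d to balance latency and quality.
Anonymize at inference time through prompt pooling and speaker embedding mixing, including averaging prompt embeddings and sampling Gaussian-anonymized speaker embeddings.
Train with an interleaved frame-level AR factorization to match streaming I/O, and use two-stage decoding to handle the multiple codebooks in each frame.
Evaluate under VoicePrivacy 2024 using EER (privacy), WER (intelligibility), and UAR (emotion preservation).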
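
The inference-time anonymization step (prompt pooling plus speaker embedding mixing) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight alpha, the L2 normalization, and the standard-normal sampling are all assumptions made for illustration, and the 8-dimensional toy embeddings stand in for real speaker embeddings.

```python
import math
import random

def mean_embedding(embeddings):
    """Average a pool of prompt speaker embeddings element-wise (prompt pooling)."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sample_gaussian_embedding(dim, rng):
    """Draw a pseudo-speaker embedding from a standard normal (assumed prior)."""
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

def anonymize_embedding(prompt_embeddings, alpha, rng):
    """Mix the pooled prompt embedding with a sampled pseudo-speaker embedding.

    alpha is a hypothetical mixing weight: 1.0 keeps only the pooled prompt
    embedding, 0.0 keeps only the Gaussian sample.
    """
    pooled = l2_normalize(mean_embedding(prompt_embeddings))
    pseudo = sample_gaussian_embedding(len(pooled), rng)
    mixed = [alpha * p + (1.0 - alpha) * g for p, g in zip(pooled, pseudo)]
    return l2_normalize(mixed)

# Toy usage: four random 8-dimensional "prompt" embeddings.
rng = random.Random(0)
prompts = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
anon = anonymize_embedding(prompts, alpha=0.5, rng=rng)
```

In a sketch like this, the resulting vector would replace the source speaker's global embedding when conditioning the decoder, so the generated voice does not correspond to any single real prompt speaker.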

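Experimental Results

Research Questions

RQ1: Can a streaming NAC-based architecture provide competitive privacy protection while preserving linguistic content and emotion in real time?
RQ2: What is the privacy-utility-latency trade-off under dynamic delay in streaming SA?
RQ3: How do prompt diversity and speaker embedding mixing affect attacker success and downstream task performance?
RQ4: Can online SA methods approach offline baselines in privacy and intelligibility?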

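Key Findings

Outperforms the previous streaming state of the art (DarkStream) in intelligibility and emotion preservation, at comparable latency and privacy.
Achieves up to a 46% relative WER reduction under a similar latency budget.
With prompt conditioning, emotion preservation improves by up to about 28% relative UAR over DarkStream.
Privacy against lazy-informed attackers is comparable to DarkStream (EER ~47.3%); privacy against semi-informed attackers is somewhat lower (EER ~18.6–21.8%).
Dynamic delay enables a latency-quality trade-off without retraining; fixed delay yields only a small ASR gain and no privacy benefit.
Prompt diversity (e.g., vctk-1fix, vctk-1rnd, vctk-4rnd, cross-ds-4rnd) raises EER against semi-informed attackers by hindering attacker adaptation.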
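
For readers less familiar with the metrics, the sketch below shows how a relative reduction and an equal error rate (EER) are typically computed. The helper names and all numeric inputs are made up for illustration and are not values from the paper. The EER routine is a simple threshold sweep, not the official VoicePrivacy scoring tool. In the privacy setting, the EER is measured on the attacker's speaker verification system, so a higher EER (closer to 50%) indicates stronger anonymization.

```python
def relative_reduction(baseline, system):
    """Relative reduction of an error metric such as WER: (baseline - system) / baseline."""
    return (baseline - system) / baseline

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Illustrative values only: a WER falling from 8.0% to 6.0%.
wer_gain = relative_reduction(8.0, 6.0)

# Illustrative attacker scores for same-speaker and different-speaker trials.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy inputs, one same-speaker trial scores below one different-speaker trial, so the attacker cannot separate the two sets perfectly and the EER is nonzero.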


This review was generated by AI and reviewed by a human editor.