QUICK REVIEW

[Paper Review] S$^2$Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion

Ziqian Wang, Xianjun Xia|arXiv (Cornell University)|Jan 20, 2026

Music and Audio Processing | Cited by 0

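One-Sentence Summary

S$^2$Voice adds style conditioning to the autoregressive large language model and global speaker conditioning to the flow-matching decoder. Combined with a large curated corpus and SFT+DPO training, it took first place in both the in-domain and zero-shot tracks of SVCC 2025.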

ABSTRACT

We present S$^2$Voice, the winning system of the Singing Voice Conversion Challenge (SVCC) 2025 for both the in-domain and zero-shot singing style conversion tracks. Built on the strong two-stage Vevo baseline, S$^2$Voice advances style control and robustness through several contributions. First, we integrate style embeddings into the autoregressive large language model (AR LLM) via a FiLM-style layer-norm conditioning and a style-aware cross-attention for enhanced fine-grained style modeling. Second, we introduce a global speaker embedding into the flow-matching transformer to improve timbre similarity. Third, we curate a large, high-quality singing corpus via an automated pipeline for web harvesting, vocal separation, and transcript refinement. Finally, we employ a multi-stage training strategy combining supervised fine-tuning (SFT) and direct preference optimization (DPO). Subjective listening tests confirm our system's superior performance: leading in style similarity and singer similarity for Task 1, and across naturalness, style similarity, and singer similarity for Task 2. Ablation studies demonstrate the effectiveness of our contributions in enhancing style fidelity, timbre preservation, and generalization. Audio samples are available at https://honee-w.github.io/SVC-Challenge-Demo/.

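Motivation and Objectives

Advance robust singing style conversion (SSC) by better disentangling style from timbre and by generalizing better to unseen singers.
Improve fine-grained style modeling through explicit style conditioning in the autoregressive content-style model.
Improve timbre preservation with a global speaker embedding in the acoustic decoder.
Assemble a high-quality singing corpus and adopt a multi-stage training strategy to improve stability and zero-shot performance.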

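Proposed Method

Build on Vevo with a two-stage framework: an autoregressive content-style model followed by a flow-matching acoustic decoder.
Introduce FiLM-style layer normalization and style-aware cross-attention to inject global and local style information into the autoregressive LLM.
Condition the acoustic decoder on a global speaker embedding obtained from a pretrained speaker verification network to preserve timbre.
Curate a large singing corpus of about 500 hours through web harvesting, vocal separation, transcript refinement, and quality filtering.
Train with supervised fine-tuning (SFT) followed by direct preference optimization (DPO) to improve perceptual quality and stability.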
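The style-aware cross-attention mentioned in the list above can be illustrated with a small, dependency-free sketch. This is not the paper's implementation: it assumes a single attention head, no learned projections, a residual connection, and frame-level style features used as both keys and values. The names `cross_attention` and `style_aware_block` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections (assumption)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One similarity score per style frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def style_aware_block(hidden, style_frames):
    """Residual cross-attention: each AR position attends over local style frames."""
    attended = cross_attention(hidden, style_frames, style_frames)
    return [[h + a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(hidden, attended)]

# Two AR positions, one style frame: the attention weight is 1 for that frame.
hidden = [[1.0, 0.0], [0.0, 1.0]]
style_frames = [[0.5, 0.5]]
out = style_aware_block(hidden, style_frames)
```

With a single style frame every position receives that frame unchanged, which makes the residual path easy to check by hand. A real system would add learned query, key, and value projections and multiple heads.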

Fig. 1: Autoregressive transformer block. (a) Original AR block with standard self-attention and feed-forward layers using conventional LayerNorm. (b) Modified AR block used in our AR-LLM: FiLM-style layer-norm modulation injects global style scale and shift ($\gamma, \beta$) produced by the style
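As a rough illustration of the FiLM-style layer-norm modulation in panel (b), the sketch below normalizes a hidden vector and then applies a scale $\gamma$ and shift $\beta$ predicted from a style vector. Several details are assumptions rather than facts from the paper: the linear projections, the `(1 + gamma)` parameterization (chosen so that zero projections reduce to plain LayerNorm), and the names `film_layer_norm`, `w_gamma`, and `w_beta`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(vec, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def film_layer_norm(x, style, w_gamma, b_gamma, w_beta, b_beta):
    """LayerNorm whose scale and shift are predicted from a style vector.

    The (1 + gamma) form is an assumed convention, not taken from the paper.
    """
    gamma = linear(style, w_gamma, b_gamma)  # per-feature scale offset
    beta = linear(style, w_beta, b_beta)     # per-feature shift
    normed = layer_norm(x)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]

# Toy usage: 4 hidden features, 2 style features, zero-initialized projections.
hidden = [1.0, 2.0, 3.0, 4.0]
style = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in hidden]
zero_b = [0.0] * len(hidden)
plain = film_layer_norm(hidden, style, zero_w, zero_b, zero_w, zero_b)
shifted = film_layer_norm(hidden, style, zero_w, zero_b, zero_w, [1.0] * len(hidden))
```

With all projections at zero, `plain` equals ordinary layer normalization, and setting the shift bias to 1 moves every feature up by 1. Zero initialization of this kind is a common way to start fine-tuning from the unconditioned baseline, though the source does not say whether S$^2$Voice does this.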

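Experimental Results

Research Questions

RQ1: Can style embeddings be injected effectively into the autoregressive LLM to give fine-grained control over singing style?
RQ2: Does a global speaker embedding in the acoustic decoder improve timbre similarity in zero-shot SSC?
RQ3: How do the large curated singing corpus and multi-stage training (SFT+DPO) affect naturalness and style/singer similarity in SSC?
RQ4: How much does each ablated component (FiLM, style-aware cross-attention, global speaker embedding, DPO) contribute to style fidelity, timbre preservation, and generation stability?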

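Key Findings

S$^2$Voice won both SVCC 2025 tracks. In subjective listening tests it led in style similarity and singer similarity on Task 1, and in naturalness, style similarity, and singer similarity on Task 2.
Gains in style similarity come mainly from FiLM and style-aware cross-attention in the autoregressive LLM.
The global speaker embedding improves singer (timbre) similarity in the acoustic model.
The curated singing corpus of about 500 hours, together with SFT+DPO, improves stability and zero-shot generalization.
Ablations show that every component contributes positively to style fidelity, timbre preservation, and generation stability; DPO changes the metrics only slightly but still helps reduce low-quality outlier samples.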
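To make the DPO stage referenced in these findings more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The source does not describe how preference pairs were built or which hyperparameters were used, so the `beta=0.1` default and the idea that each argument is a sequence-level log-probability summed over generated tokens are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    where each term is a sequence log-probability (assumed: summed over tokens).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow in exp.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# When the policy equals the reference, the loss is log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# The policy raises the chosen sample and lowers the rejected one: lower loss.
aligned = dpo_loss(-1.0, -5.0, -3.0, -3.0)
# The policy prefers the rejected sample: higher loss.
misaligned = dpo_loss(-5.0, -1.0, -3.0, -3.0)
```

The loss only rewards widening the gap between preferred and rejected outputs relative to the frozen reference model. That is consistent with the finding that DPO mainly suppresses low-quality outliers while changing average metrics only slightly.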


This review was generated by AI and reviewed by a human editor.