QUICK REVIEW

[Paper Review] Training-Free Multi-Step Inference for Target Speaker Extraction

Zhenghai You, Ying Shi | arXiv (Cornell University) | Mar 11, 2026

Speech Recognition and Synthesis | Citations: 0

One-Sentence Summary

The paper proposes a training-free multi-step inference framework that refines target speaker extraction by interpolating between the mixture and previous estimates using a frozen model, guided by a joint non-intrusive quality and speaker similarity score.

ABSTRACT

Target speaker extraction (TSE) aims to recover a target speaker's speech from a mixture using a reference utterance as a cue. Most TSE systems adopt conditional auto-encoder architectures with one-step inference. Inspired by test-time scaling, we propose a training-free multi-step inference method that enables iterative refinement with a frozen pretrained model. At each step, new candidates are generated by interpolating the original mixture and the previous estimate, and the best candidate is selected for further refinement until convergence. Experiments show that, when ground-truth target speech is available, optimizing an intrusive metric (SI-SDRi) yields consistent gains across multiple evaluation metrics. Without ground truth, optimizing non-intrusive metrics (UTMOS or SpkSim) improves the corresponding metric but may hurt others. We therefore introduce joint metric optimization to balance these objectives, enabling controllable extraction preferences for practical deployment.

Motivation and Objectives

Motivate target speaker extraction (TSE) for multi-speaker mixtures where a reference utterance of the target speaker is available.
Introduce a training-free, iterative refinement procedure that uses a frozen TSE model at test time.
Balance perceptual quality and target-speaker consistency without retraining via a joint scoring function.
Demonstrate gains over one-step inference across multiple backbones and analyze reliability of the approach.

Proposed Method

Generate multiple candidate inputs at each step by interpolating between the original mixture and the current estimate.
Run the frozen pretrained TSE model on each candidate input and keep the best output per iteration according to a scoring function R.
Option 1: use oracle SI-SDRi as the selector to establish upper-bound headroom.
Option 2: use deployable selectors based on non-intrusive metrics (UTMOS or SpkSim), or a joint score combining both (Eq. 5).
Analyze the non-decreasing property and error bounds of greedy selection to characterize its reliability.
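
The steps above can be sketched as a short greedy search loop. This is only an illustration under stated assumptions, not the authors' implementation: `multi_step_extract`, `model`, `score`, `alphas`, and `joint_score` are hypothetical names, the interpolation weights are arbitrary, and the weighted-sum form of the joint score is a guess because Eq. 5 is not reproduced in this review. The sketch also assumes the previous estimate is kept whenever no candidate scores higher, which makes the selector score non-decreasing by construction.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def joint_score(utmos: float, spksim: float, weight: float = 0.5) -> float:
    """Hypothetical joint selector: weighted sum of rescaled UTMOS and SpkSim.

    Assumption: UTMOS is on a 1-5 scale and is mapped to [0, 1]. The paper's
    Eq. 5 may combine the two metrics differently.
    """
    return weight * (utmos - 1.0) / 4.0 + (1.0 - weight) * spksim


def multi_step_extract(
    mixture: np.ndarray,
    reference: np.ndarray,
    model: Callable[[np.ndarray, np.ndarray], np.ndarray],
    score: Callable[[np.ndarray], float],
    alphas: Sequence[float] = (0.25, 0.5, 0.75),  # assumed interpolation weights
    max_steps: int = 5,
) -> Tuple[np.ndarray, List[float]]:
    """Greedy multi-step refinement with a frozen model (illustrative only)."""
    # Step 0: ordinary one-step inference on the raw mixture.
    best = model(mixture, reference)
    best_score = score(best)
    history = [best_score]

    for _ in range(max_steps):
        # Candidate inputs interpolate the mixture and the current estimate.
        candidates = [
            model(a * mixture + (1.0 - a) * best, reference) for a in alphas
        ]
        scores = [score(c) for c in candidates]
        top = int(np.argmax(scores))
        # Assumed stopping rule: quit when no candidate beats the incumbent.
        if scores[top] <= best_score:
            break
        best, best_score = candidates[top], scores[top]
        history.append(best_score)

    return best, history


# Toy check with stand-ins (not a real TSE model or metric).
t = np.linspace(0.0, 1.0, 800)
target = np.sin(2 * np.pi * 5 * t)
mixture = target + np.sin(2 * np.pi * 11 * t)
toy_model = lambda x, ref: 0.5 * x + 0.5 * ref  # pulls the input halfway to the cue
toy_score = lambda est: -float(np.mean((est - target) ** 2))  # oracle-style selector
# The toy passes the clean target as the cue, which a real system would not have.
best, history = multi_step_extract(mixture, target, toy_model, toy_score)
```

In a deployable setting, `score` would wrap non-intrusive estimators, for example by computing UTMOS and SpkSim for each candidate and passing them to `joint_score`. A larger `weight` would favor perceptual quality and a smaller one would favor speaker similarity.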

Experimental Results

Research Questions

RQ1: Can inference-time search with interpolation-based candidates improve TSE without retraining?
RQ2: How do deployable non-intrusive metrics (UTMOS, SpkSim) fare in guiding multi-step refinement?
RQ3: Does a joint metric balancing perceptual quality and speaker similarity provide more stable improvements than single metrics?
RQ4: What is the reliability of greedy selection given imperfect scoring in a training-free setup?

Key Findings

Selector	Step	DPRNN SI-SDRi (dB)	DPRNN UTMOS	DPRNN SpkSim	SpEx+ SI-SDRi (dB)	SpEx+ UTMOS	SpEx+ SpkSim
Baseline	0	14.422	3.058	0.671	13.729	2.863	0.629
SI-SDRi (oracle)	1	15.369	3.107	0.672	14.380	2.935	0.633
SI-SDRi (oracle)	3	15.241	3.107	0.671	14.387	2.932	0.632
SI-SDRi (oracle)	5	15.241	3.111	0.672	14.404	2.931	0.631
UTMOS	1	14.287	3.206	0.673	13.693	3.019	0.629
UTMOS	3	14.037	3.242	0.674	13.596	3.036	0.626
UTMOS	5	13.904	3.246	0.674	13.536	3.033	0.624
SpkSim	1	13.845	3.064	0.692	13.897	2.867	0.649
SpkSim	3	13.215	3.057	0.697	13.701	2.849	0.652
SpkSim	5	12.897	3.049	0.698	13.627	2.839	0.652
Joint	1	14.311	3.204	0.677	13.876	3.013	0.634
Joint	3	14.181	3.238	0.679	13.772	3.028	0.634
Joint	5	14.144	3.242	0.679	13.728	3.025	0.634

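Oracle SI-SDRi selection improves SI-SDRi over one-step inference on both backbones: DPRNN rises from 14.422 dB to 15.369 dB after one step, and SpEx+ from 13.729 dB to 14.404 dB after five steps.
Each deployable single-metric selector improves its own metric but degrades others. With UTMOS selection on DPRNN, UTMOS rises from 3.058 to 3.246 by step 5 while SI-SDRi falls from 14.422 dB to 13.904 dB; with SpkSim selection, SpkSim rises from 0.671 to 0.698 while SI-SDRi falls to 12.897 dB.
Joint scoring (UTMOS + SpkSim) improves both perceptual quality and target-speaker consistency on both backbones (DPRNN at step 5: UTMOS 3.242, SpkSim 0.679), and at step 5 it retains more SI-SDRi than either single-metric selector.
Under oracle selection, SpEx+ keeps improving with more steps (14.380, 14.387, and 14.404 dB at steps 1, 3, and 5), while DPRNN peaks at step 1 (15.369 dB) and settles at 15.241 dB, reflecting backbone-specific dynamics.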
The approach demonstrates non-decreasing performance relative to the initial one-step output under the chosen selector and provides an interpretable stability bound when selectors are imperfect.
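
For readers unfamiliar with the SI-SDRi columns in the table above, the following is a minimal sketch of the standard scale-invariant SDR definition and its improvement over the unprocessed mixture. The function names, the `eps` stabilizer, and the synthetic signals are assumptions for illustration; the paper's evaluation code may differ in details such as mean removal.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    # Remove the mean so constant offsets do not affect the ratio.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the optimally scaled target.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))


def si_sdr_improvement(
    estimate: np.ndarray, mixture: np.ndarray, target: np.ndarray
) -> float:
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Synthetic example: the "estimate" keeps one tenth of the interferer amplitude.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = target + 0.1 * interferer
improvement = si_sdr_improvement(estimate, mixture, target)
```

Because this metric needs the clean target signal, it is intrusive and can only serve as an oracle selector, which is why the review treats it as an upper bound rather than a deployable option.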

This review was generated by AI and checked by a human editor.