QUICK REVIEW

[Paper Review] Input-Adaptive Spectral Feature Compression by Sequence Modeling for Source Separation

Kohei Saijo, Yoshiaki Bando | arXiv (Cornell University) | 2026. 02. 09.

Speech and Audio Processing | Citations: 0

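One-Line Summary

This paper introduces Spectral Feature Compression (SFC), an input-adaptive and parameter-efficient alternative for compressing frequency information in TF-domain source separation. It presents two variants (SFC-CA and SFC-Mamba) and evaluates them on MSS and CASS.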

ABSTRACT

Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.

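Research Motivation and Goals

Reduce the computational cost of TF-domain dual-path source separation without sacrificing accuracy.
Replace the BS module, which is not input-adaptive and relies on a dedicated sub-encoder per subband, with a single sequence modeling module.
Design two SFC variants with psychoacoustically motivated inductive biases (cross-attention in SFC-CA and Mamba-based recurrence in SFC-Mamba).
Show that SFC is parameter-efficient and adapts to input frequency patterns.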
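
To make the two BS limitations concrete, the following is a minimal NumPy sketch of a band-split style encoder, not the authors' implementation. The quadratic band layout, the plain linear sub-encoders, and all names (`make_band_edges`, `init_band_split`, `band_split_encode`) are assumptions chosen for illustration. It shows that each subband owns its own weights, so the parameter count grows with the number of bands, and that the band boundaries are fixed regardless of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_band_edges(num_bins, num_bands):
    # Hypothetical layout: quadratic spacing gives narrower bands at low frequencies.
    edges = np.round(num_bins * np.linspace(0.0, 1.0, num_bands + 1) ** 2).astype(int)
    assert np.all(np.diff(edges) > 0), "every band needs at least one bin"
    return edges

def init_band_split(edges, dim):
    # One dedicated (weight, bias) pair per subband: this is where parameters add up.
    return [
        (rng.standard_normal((hi - lo, dim)) * 0.1, np.zeros(dim))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

def band_split_encode(x, edges, params):
    # x: (frames, bins) -> (frames, bands, dim); the band boundaries never depend on x.
    bands = zip(edges[:-1], edges[1:])
    feats = [x[:, lo:hi] @ w + b for (lo, hi), (w, b) in zip(bands, params)]
    return np.stack(feats, axis=1)

num_frames, num_bins, num_bands, dim = 10, 64, 8, 16
edges = make_band_edges(num_bins, num_bands)
params = init_band_split(edges, dim)
x = rng.standard_normal((num_frames, num_bins))
z = band_split_encode(x, edges, params)
num_params = sum(w.size + b.size for w, b in params)
print(z.shape, num_params)
```

In this toy setting, the 64 frequency bins are compressed to 8 band features per frame, and every added band brings another weight matrix with it.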

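Proposed Method

SFC encodes the TF spectrogram with a single sequence modeling module that uses K learnable queries.
SFC-CA adds a psychoacoustically motivated inductive bias: a frequency-band-aware positional bias in the cross-attention.
SFC-Mamba uses a bidirectional, interleaved Mamba together with a carefully chosen query insertion strategy that imposes a band-wise inductive bias.
The encoder and decoder are symmetric, and the QS (Queries) mechanism enables adaptive compression without per-band sub-encoders.
The band configuration follows a musical scale with predefined bands G_k, which biases frequency processing toward low frequencies.
The model is trained end-to-end with a TF-Locoformer separator and compared against BS on the MSS and CASS tasks.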
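
The idea of compressing frequency bins with learnable queries and a band-aware attention bias can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the SFC-CA implementation from the paper: the quadratic band layout, the distance-based form of the bias, the `sharpness` parameter, and the function names are all hypothetical, and the Mamba variant is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def band_aware_bias(edges, num_bins, sharpness=2.0):
    # Hypothetical bias: penalize attention to bins far from each query's band center,
    # with the distance measured in units of that band's width.
    centers = 0.5 * (edges[:-1] + edges[1:])   # (K,)
    widths = np.diff(edges)                    # (K,)
    bins = np.arange(num_bins) + 0.5           # (F,)
    dist = np.abs(bins[None, :] - centers[:, None]) / widths[:, None]
    return -sharpness * dist                   # (K, F)

def compress(x, queries, wq, wk, wv, bias):
    # x: (F, D) features of one time frame; queries: (K, D) learnable vectors.
    q, k, v = queries @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias  # (K, F)
    attn = softmax(scores, axis=-1)                 # each query distributes weight over bins
    return attn @ v, attn                           # (K, D), (K, F)

num_bins, num_queries, dim = 64, 8, 16
edges = np.round(num_bins * np.linspace(0.0, 1.0, num_queries + 1) ** 2).astype(int)
bias = band_aware_bias(edges, num_bins)
queries = rng.standard_normal((num_queries, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

x_a = rng.standard_normal((num_bins, dim))
x_b = rng.standard_normal((num_bins, dim))
z_a, attn_a = compress(x_a, queries, wq, wk, wv, bias)
z_b, attn_b = compress(x_b, queries, wq, wk, wv, bias)
print(z_a.shape, attn_a.shape)
```

Two points are worth noting in this sketch. One set of projection matrices is shared by all K queries, so the parameter count does not grow with one module per band. The attention weights also depend on the input: `attn_a` and `attn_b` differ even though the queries and the bias are identical.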

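Experimental Results

Research Questions

RQ1: Does SFC consistently outperform BS across separator sizes (small/medium) and compression ratios?
RQ2: How do the inductive biases (the frequency-aware positional bias or the query insertion strategy) affect performance and the receptive field?
RQ3: Can SFC adaptively capture input frequency patterns, as indicated by the attention/weight analysis?
RQ4: Do the SFC variants maintain or improve separation quality relative to BS while requiring fewer parameters?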

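Key Results

SFC outperforms the BS module on the MSS and CASS tasks across separator sizes and compression ratios.
SFC adaptively captures frequency patterns from the input, as shown by the analyzed attention weights.
SFC achieves comparable or better performance than BS-based encoders/decoders with far fewer parameters.
Two viable variants exist: SFC-CA (cross-attention with a positional bias) and SFC-Mamba (recurrence with interleaving and a band-based strategy).
The psychoacoustically inspired, band-based inductive bias (musical scale) is crucial for effective spectral compression.
The study includes ablation studies and visualizations that support SFC's adaptivity and effectiveness.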

This review was generated by AI and reviewed by a human editor.