QUICK REVIEW

[Paper Review] Input-Adaptive Spectral Feature Compression by Sequence Modeling for Source Separation

Kohei Saijo, Yoshiaki Bando|arXiv (Cornell University)|Feb 9, 2026

Speech and Audio Processing | Citations: 0

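One-line summary

Summary: This paper proposes Spectral Feature Compression (SFC), an input-adaptive, parameter-efficient method for compressing frequency information. SFC serves as a replacement for the band-split (BS) module in TF-domain dual-path separation. It comes in two variants, SFC-CA and SFC-Mamba, and is evaluated on MSS and CASS.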

ABSTRACT

Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.

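Motivation and objectives

Reduce the computational cost of TF-domain dual-path source separation while maintaining accuracy.
Replace the BS module, which is not input-adaptive and relies on multiple sub-encoders, with a single sequence modeling module.
Design two SFC variants (cross-attention and Mamba-based recurrence) with psychoacoustic inductive biases.
Show that SFC is parameter-efficient and adapts to the frequency patterns of the input.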

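Proposed method

SFC encodes the TF spectrogram with a single sequence modeling module, using K learnable queries.
SFC-CA adds a psychoacoustically motivated bias to cross-attention, in the form of a frequency-band-aware positional bias.
SFC-Mamba imposes a band-wise inductive bias using bidirectional Mamba with a sparse, interleaved query-insertion strategy.
The encoder and decoder are symmetric; the QS (queries) mechanism achieves adaptive compression without a dedicated sub-encoder for each band.
The band layout follows the musical scale: predefined bands G_k bias the processing toward low frequencies.
The model is trained end-to-end together with a TF-Locoformer separator and compared against BS on MSS and CASS tasks.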
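The query-based compression described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`make_bands`, `compress_frame`), the geometric band widths standing in for the musical-scale layout, the `outside_penalty` form of the band-aware bias, and the omission of key/value projections are all assumptions made for illustration.

```python
import math
import random


def make_bands(num_bins, num_bands):
    """Split bins 0..num_bins-1 into contiguous bands with geometrically
    growing widths (an assumed stand-in for a musical-scale layout)."""
    edges = [round((num_bins + 1) ** (k / num_bands)) - 1
             for k in range(num_bands + 1)]
    # Forward pass: make edges strictly increasing so no band is empty.
    for k in range(1, num_bands + 1):
        edges[k] = max(edges[k], edges[k - 1] + 1)
    # Backward pass: pin the last edge to num_bins and keep the order strict.
    edges[-1] = num_bins
    for k in range(num_bands - 1, 0, -1):
        edges[k] = min(edges[k], edges[k + 1] - 1)
    return [list(range(edges[k], edges[k + 1])) for k in range(num_bands)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_frame(features, queries, bands, outside_penalty=4.0):
    """Cross-attention from K queries over F per-bin features of one frame.

    features: F vectors of size D (used directly as keys and values here).
    queries:  K vectors of size D (learnable parameters in a real model).
    Returns (K compressed vectors, K x F attention weights).
    """
    dim = len(queries[0])
    compressed, weights = [], []
    for k, query in enumerate(queries):
        in_band = set(bands[k])
        scores = []
        for f, feat in enumerate(features):
            score = sum(q * x for q, x in zip(query, feat)) / math.sqrt(dim)
            # Hypothetical band-aware bias: penalize bins outside band G_k.
            if f not in in_band:
                score -= outside_penalty
            scores.append(score)
        w = softmax(scores)
        compressed.append([sum(w[f] * features[f][d] for f in range(len(features)))
                           for d in range(dim)])
        weights.append(w)
    return compressed, weights


random.seed(0)
F, K, D = 32, 4, 8  # toy sizes: frequency bins, queries, feature dimension
bands = make_bands(F, K)
features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(F)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
compressed, attn = compress_frame(features, queries, bands)
print([len(b) for b in bands], len(compressed), len(compressed[0]))
```

Because the attention weights are computed from the input features, the same K queries produce a different weighting of frequency bins for each frame, which is the sense in which such a module is input-adaptive. The bias only nudges each query toward its own band; it does not forbid attention outside it.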

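Experimental results

Research questions

RQ1: Does SFC consistently outperform BS across small and medium separator sizes and across different compression ratios?
RQ2: How do the inductive biases (the frequency-aware bias or the query-insertion strategy) affect performance and the receptive field?
RQ3: Does analysis of the attention and weights show that SFC adaptively captures frequency patterns from the input?
RQ4: Can the SFC variants maintain or improve separation quality while using fewer parameters than BS?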

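Key findings

SFC outperforms the BS module on MSS and CASS tasks across a range of separator sizes and compression ratios.
Analysis of the learned weights suggests that SFC adaptively captures frequency patterns from the input.
SFC matches or exceeds the performance of a BS-based encoder/decoder with far fewer parameters.
There are two strong variants: SFC-CA (cross-attention with an inductive bias) and SFC-Mamba (recurrence with interleaved query insertion based on a band-wise strategy).
The psychoacoustically inspired, band-based inductive bias (musical scale) is important for effective spectral compression.
The study includes ablations and visualizations that support the adaptivity and effectiveness of SFC.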
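The parameter-efficiency point can be made concrete with rough arithmetic. The sketch below assumes one linear layer per band for a BS-style encoder and one shared cross-attention block for an SFC-style encoder. The layer shapes, band sizes, and function names are hypothetical, and the resulting numbers are not the parameter counts reported in the paper.

```python
def bs_encoder_params(band_sizes, dim):
    """Assumed BS-style encoder: one linear layer per band mapping the
    real and imaginary parts of its bins (2 * n inputs) to dim outputs."""
    return sum((2 * n + 1) * dim for n in band_sizes)


def shared_encoder_params(num_queries, dim):
    """Assumed shared encoder: learnable queries, four dim x dim projections
    with biases (query, key, value, output), and one per-bin embedding from
    2 channels (real, imaginary) to dim."""
    queries = num_queries * dim
    projections = 4 * (dim * dim + dim)
    bin_embedding = 2 * dim + dim
    return queries + projections + bin_embedding


# Hypothetical layout: 59 bands covering 1025 bins, narrow at low frequencies.
band_sizes = [4] * 32 + [16] * 16 + [64] * 10 + [1]
dim = 128

bs_count = bs_encoder_params(band_sizes, dim)
shared_count = shared_encoder_params(len(band_sizes), dim)
print(sum(band_sizes), len(band_sizes), bs_count, shared_count)

# Doubling every band width grows the per-band count; the shared count
# depends only on the number of queries and the feature dimension.
bs_count_wide = bs_encoder_params([2 * n for n in band_sizes], dim)
print(bs_count_wide > bs_count)
```

Under these assumptions the per-band design scales with the number of frequency bins times the feature dimension, while the shared design does not depend on the number of bins at all.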


This review was generated by AI and checked by a human editor.