QUICK REVIEW

[Paper Review] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

Oscar Ovanger, Levi Harris|arXiv (Cornell University)|Feb 3, 2026

Animal Vocal Communication and Behavior|Citations: 0

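One-Line Summary

FINCH fuses a frozen audio classifier with a spatiotemporal prior by estimating, per sample, an adaptive weight on the contextual spatiotemporal evidence, which improves accuracy and robustness over both audio-only baselines and fixed-weight fusion. It retains an audio-only fallback and bounds the influence of context.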

ABSTRACT

Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family contains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available (anonymous repository): https://anonymous.4open.science/r/birdnoise-85CD/README.md
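
The Bayesian motivation mentioned in the abstract can be made explicit with a short derivation. As a sketch, assume the audio x and the spatiotemporal context s are conditionally independent given the species y. This is one natural reading of "independent conditional hypotheses", but it is an assumption made here for illustration; the paper's exact assumptions are not restated in this review.

```latex
\begin{align*}
p(y \mid x, s)
  &\propto p(y)\, p(x \mid y)\, p(s \mid y)
     && \text{(Bayes' rule and conditional independence)} \\
  &= p(y)\, \frac{p(y \mid x)\, p(x)}{p(y)} \cdot \frac{p(y \mid s)\, p(s)}{p(y)} \\
  &\propto \frac{p(y \mid x)\, p(y \mid s)}{p(y)}, \\[4pt]
\log p(y \mid x, s)
  &= \log p(y \mid x) + \log p(y \mid s) - \log p(y) + \text{const}.
\end{align*}
```

Under this reading the two discriminative posteriors multiply with a fixed unit weight on the context term. Replacing that unit weight with a learned, per-sample weight gives an adaptive log-linear family of the kind the abstract describes. How the class prior term log p(y) is handled is not stated in this summary.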

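Motivation and Objectives

Motivated by the need to robustly combine heterogeneous, approximately independent sources of evidence about the same target.
Develop a per-sample gating mechanism that adaptively weights contextual (spatiotemporal) evidence without retraining the base predictors.
Provide a fusion framework that is theoretically and empirically safe, with an explicit audio-only fallback.
Show state-of-the-art or competitive performance on large-scale bioacoustic benchmarks using a lightweight, interpretable approach.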

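Proposed Method

Adopts log-linear (product-of-experts) fusion: log p(y|x,s) = log p_theta(y|x) + omega(x,s) * log p_psi(y|s), up to normalization over the classes y.
Learns a non-negative, per-sample fusion weight omega(x,s) with a two-layer MLP gating network.
Computes omega(x,s) from uncertainty and informativeness features of the audio and context predictions, plus metadata; omega is constrained to [epsilon, omega_max].
Freezes the audio encoder and trains only the fusion/gating components and the context prior (an AdaSTEM prior on CBI, a metadata MLP on BirdSet).
Includes a variance-based regularizer that prevents gate collapse and keeps the gate genuinely adaptive.
Retains the audio-only fallback (omega = 0) and caps omega so that the contextual influence stays bounded and the fusion remains robust.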
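
The fusion rule and bounded gate described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature set in `gate_features` (entropies and maximum log-probabilities), the sigmoid squashing used to keep omega in [eps, omega_max], the default values of `eps` and `omega_max`, and the randomly initialized (untrained) gate weights are all hypothetical choices made for this example.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - np.max(z, axis=axis, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=axis, keepdims=True))

def entropy(log_p):
    """Shannon entropy of each row of log-probabilities (assumes finite values)."""
    return -np.sum(np.exp(log_p) * log_p, axis=-1)

def gate_features(log_p_audio, log_p_ctx):
    """Hypothetical uncertainty/informativeness features for the gate."""
    return np.stack([
        entropy(log_p_audio),          # audio uncertainty
        entropy(log_p_ctx),            # context uncertainty
        np.max(log_p_audio, axis=-1),  # audio confidence
        np.max(log_p_ctx, axis=-1),    # context confidence
    ], axis=-1)

def bounded_gate(features, W1, b1, W2, b2, eps=1e-3, omega_max=2.0):
    """Two-layer MLP whose output is squashed into [eps, omega_max].

    The sigmoid squashing is an assumption; the review only states the bound.
    """
    hidden = np.maximum(features @ W1 + b1, 0.0)   # ReLU layer
    raw = (hidden @ W2 + b2).squeeze(-1)           # one scalar per sample
    return eps + (omega_max - eps) / (1.0 + np.exp(-raw))

def fuse(log_p_audio, log_p_ctx, omega):
    """Log-linear fusion, renormalized over classes."""
    return log_softmax(log_p_audio + omega[..., None] * log_p_ctx, axis=-1)

# Toy usage with random predictions and an untrained gate.
rng = np.random.default_rng(0)
n, k, hidden_dim = 4, 5, 8
log_p_audio = log_softmax(rng.normal(size=(n, k)))
log_p_ctx = log_softmax(rng.normal(size=(n, k)))
feats = gate_features(log_p_audio, log_p_ctx)
W1, b1 = rng.normal(size=(feats.shape[1], hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, 1)), np.zeros(1)
omega = bounded_gate(feats, W1, b1, W2, b2)
log_p_fused = fuse(log_p_audio, log_p_ctx, omega)
```

Calling `fuse` with omega set to zero returns the audio predictions unchanged, which is the audio-only fallback. The cap `omega_max` limits how far the context term can move any class log-probability relative to the audio prediction.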

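Experimental Results

Research Questions

RQ1: Does per-sample adaptive weighting of contextual evidence improve bioacoustic classification beyond fixed-weight fusion and audio-only baselines?
RQ2: Can the FINCH framework exploit context priors when they are informative while preserving the audio-only fallback?
RQ3: How well does adaptive gating work with heterogeneous, weak contextual signals on large-scale bird-sound benchmarks?
RQ4: What is the effect of using different spatiotemporal priors within FINCH (AdaSTEM on CBI versus a learned metadata prior on BirdSet)?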

Key Findings

Model	CBI Acc	BirdSet PER (AUROC / cmAP / Top-1 Acc)	BirdSet NES (AUROC / cmAP / Top-1 Acc)	BirdSet UHH (AUROC / cmAP / Top-1 Acc)	BirdSet SSW (AUROC / cmAP / Top-1 Acc)
FINCH (ours)	0.826	0.824 / 0.232 / 0.429	0.936 / 0.245 / 0.679	0.927 / 0.536 / 0.747	0.642 / 0.025 / 0.688
Audio-only (p_theta(y|x))	0.806	–	–	–	–

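On CBI, FINCH reaches a test accuracy of 0.826 under the linear-probe protocol, above the audio-only baseline (0.806).
On the BirdSet subsets, FINCH matches or exceeds the audio-only baseline on retrieval AUROC, detection cmAP, and Top-1 accuracy.
A fixed global fusion weight yields only limited gains, which highlights the value of per-sample adaptivity.
The context priors perform poorly in isolation, confirming that the gains come from selective fusion rather than from context alone.
FINCH is state of the art on CBI and delivers competitive results on BirdSet with lightweight, interpretable fusion.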
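
To make the fixed-weight baseline in these findings concrete, the sketch below picks a single global omega by grid search on held-out data. It is an illustration only: the review does not say how the paper selects its fixed weight, the helper names `top1_accuracy` and `best_fixed_weight` are hypothetical, and the probabilities are made-up toy values rather than results from the paper.

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(scores, axis=-1) == labels))

def best_fixed_weight(log_p_audio, log_p_ctx, labels, grid):
    """Choose one global omega that maximizes Top-1 accuracy.

    Normalization is skipped because it does not change the argmax.
    Ties are broken in favor of the smallest omega in the grid.
    """
    accs = [top1_accuracy(log_p_audio + w * log_p_ctx, labels) for w in grid]
    best = int(np.argmax(accs))
    return float(grid[best]), accs[best]

# Toy data: the audio is confident and right on sample 0 and
# slightly wrong on sample 1, where the context points the right way.
labels = np.array([0, 1])
log_p_audio = np.log(np.array([[0.90, 0.10],
                               [0.55, 0.45]]))
log_p_ctx = np.log(np.array([[0.40, 0.60],
                             [0.20, 0.80]]))
grid = np.array([0.0, 0.5, 1.0])

best_w, best_acc = best_fixed_weight(log_p_audio, log_p_ctx, labels, grid)
```

A single global weight applies the same trust in context to every sample. A per-sample gate can instead lower omega where the audio is already confident and raise it where the audio is ambiguous, which is the adaptivity the findings credit for the gains.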

This review was generated by AI and checked by a human editor.