QUICK REVIEW

[Paper Review] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

Oscar Ovanger, Levi Harris|arXiv (Cornell University)|Feb 3, 2026

Animal Vocal Communication and Behavior | Cited by 0

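One-Sentence Summary

FINCH fuses a frozen audio classifier with a spatiotemporal prior and weights the contextual evidence adaptively per sample, improving accuracy and robustness over audio-only and fixed-weight fusion baselines. It retains an audio-only fallback and bounds the influence of context.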

ABSTRACT

Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family contains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available: https://anonymous.4open.science/r/birdnoise-85CD/README.md (anonymous repository)

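Motivation and Objectives

Motivate robust fusion of heterogeneous, approximately independent sources of evidence for the same prediction target.
Develop a per-sample gating mechanism that adaptively weights contextual (spatiotemporal) evidence without retraining the base predictors.
Provide a fusion framework that is safe both theoretically and empirically, with an explicit audio-only fallback.
Demonstrate state-of-the-art or competitive performance on large-scale bioacoustic benchmarks with a lightweight, interpretable method.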

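Proposed Method

Uses log-linear (product-of-experts) fusion: log p(y|x,s) = log p_theta(y|x) + omega(x,s) * log p_psi(y|s), up to a normalizing constant.
Learns a non-negative per-sample fusion weight omega(x,s) with a two-layer MLP gating network.
Computes omega(x,s) from uncertainty and informativeness features of the audio and context predictions, plus metadata; omega is constrained to the interval [epsilon, omega_max].
Freezes the audio encoder; only the fusion/gating components and the context prior are trained (an AdaSTEM prior on CBI, a metadata MLP on BirdSet).
Includes a variance-based regularizer to prevent gate collapse and keep the weights genuinely adaptive.
Retains the audio-only fallback (omega = 0) and bounds the influence of context for robustness.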
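
The steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature set (entropies and top log-probabilities), the sigmoid squashing into [eps, omega_max], the values of `eps` and `omega_max`, the hinge form of the variance penalty, and all function names are assumptions made for illustration. The gate weights here are random rather than trained.

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def entropy(logp):
    # Shannon entropy (in nats) of each row of log-probabilities.
    return -(np.exp(logp) * logp).sum(axis=-1)

def gate(features, W1, b1, W2, b2, eps=0.01, omega_max=2.0):
    # Two-layer MLP; a sigmoid maps its output into [eps, omega_max].
    # eps and omega_max are illustrative values, not taken from the paper.
    h = np.maximum(features @ W1 + b1, 0.0)          # ReLU hidden layer
    raw = (h @ W2 + b2).squeeze(-1)                  # one scalar per sample
    return eps + (omega_max - eps) / (1.0 + np.exp(-raw))

def fuse(logp_audio, logp_ctx, omega):
    # log p(y|x,s) = log p_theta(y|x) + omega(x,s) * log p_psi(y|s),
    # renormalized over classes.
    return log_softmax(logp_audio + omega[:, None] * logp_ctx)

def variance_penalty(omega, target=0.05):
    # Hypothetical anti-collapse term: penalize a batch whose gate values
    # have variance below `target` (i.e. the gate has become a constant).
    return max(0.0, target - float(omega.var()))

# Toy usage with random stand-ins for the two frozen predictors.
rng = np.random.default_rng(0)
n, k, hidden = 4, 5, 8
logp_audio = log_softmax(rng.normal(size=(n, k)))
logp_ctx = log_softmax(rng.normal(size=(n, k)))
feats = np.stack(
    [entropy(logp_audio), entropy(logp_ctx),
     logp_audio.max(axis=-1), logp_ctx.max(axis=-1)],
    axis=1,
)
W1, b1 = rng.normal(size=(4, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
omega = gate(feats, W1, b1, W2, b2)
fused_logp = fuse(logp_audio, logp_ctx, omega)
print("omega per sample:", omega)
print("fused predictions:", fused_logp.argmax(axis=-1))
```

Passing `omega = 0` to `fuse` returns the audio log-probabilities unchanged, which is the audio-only fallback; capping omega at `omega_max` limits how far the context term can shift any class log-odds.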

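Experimental Results

Research Questions

RQ1: Can per-sample adaptive weighting of contextual evidence outperform fixed-weight fusion and audio-only baselines in bioacoustic classification?
RQ2: Does the FINCH framework exploit the context prior when it is informative while preserving the audio-only fallback?
RQ3: How does adaptive gating perform on large bird-audio benchmarks when contextual signals are heterogeneous and weak?
RQ4: What is the effect of using different spatiotemporal priors in FINCH (AdaSTEM on CBI versus a learned metadata prior on BirdSet)?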

Key Findings

Model	CBI Acc	BirdSet PER (R/m/A)	BirdSet NES (R/m/A)	BirdSet UHH (R/m/A)	BirdSet SSW (R/m/A)
FINCH (ours)	0.826	0.824 / 0.232 / 0.429	0.936 / 0.245 / 0.679	0.927 / 0.536 / 0.747	0.642 / 0.025 / 0.688
Audio-only (p_theta(y|x))	0.806	–	–	–	–
R/m/A appears to denote retrieval AUROC / detection cmAP / Top-1 accuracy, the three BirdSet metrics this review reports.

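On CBI, FINCH reaches a test accuracy of 0.826, compared with 0.806 for the audio-only model (linear-probe protocol).
On the BirdSet subsets, FINCH matches or improves on the audio-only baseline in retrieval AUROC, detection cmAP, and Top-1 accuracy.
A fixed global fusion weight yields only limited gains, which highlights the importance of per-sample adaptation.
The context prior performs poorly in isolation, confirming that the gains come from selective integration of context rather than from context alone.
FINCH achieves state-of-the-art performance on CBI and competitive results on BirdSet with a lightweight, interpretable method.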
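
To put the CBI numbers in perspective, the short calculation below converts the two reported accuracies into an absolute gain and a relative error reduction. Only the values 0.826 and 0.806 come from the review; the relative error reduction is a derived figure, not one reported in the source.

```python
# Reported CBI test accuracies (from the findings above).
fused_acc = 0.826   # FINCH
audio_acc = 0.806   # audio-only baseline

# Absolute accuracy gain in accuracy units.
abs_gain = fused_acc - audio_acc

# Share of the baseline's errors that the fused model removes.
rel_err_reduction = abs_gain / (1.0 - audio_acc)

print(f"absolute gain: {abs_gain:.3f}")
print(f"relative error reduction: {rel_err_reduction:.1%}")
```

The gain of 2.0 accuracy points corresponds to removing roughly 10% of the audio-only model's errors on CBI.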

This review was generated by AI and reviewed by human editors.