QUICK REVIEW

[Paper Review] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda, Kaede Shiohara|arXiv (Cornell University)|Mar 25, 2026

Animal Vocal Communication and Behavior | Cited by 0

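One-Sentence Summary

BioVITA introduces a million-scale tri-modal dataset (audio, image, text), a unified representation model trained in two stages, and a cross-modal retrieval benchmark covering six directions and three taxonomic levels, advancing visual-textual-acoustic alignment for biodiversity research.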

ABSTRACT

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/

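Motivation and Objectives

Build BioVITATrain: a million-scale training dataset of audio, images, and taxonomic text annotations covering roughly 14k species (14,133 per the abstract) and 34 ecological traits.
Develop BioVITAModel: a unified audio-image-text representation model that uses a two-stage framework to align audio with the visual and textual modalities.
Create BioVITABench: a cross-modal retrieval benchmark organized by species, spanning six retrieval directions and three taxonomic levels, for comprehensive evaluation.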
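
The benchmark grid can be enumerated in a few lines. The sketch below assumes only what the abstract states: three modalities, every ordered query-to-target pair, and the levels Family, Genus, and Species. The names `MODALITIES`, `LEVELS`, and `retrieval_settings` are illustrative and do not come from the paper's code.

```python
from itertools import permutations

# Modalities and taxonomic levels as listed in the abstract.
MODALITIES = ("image", "audio", "text")
LEVELS = ("Family", "Genus", "Species")


def retrieval_settings():
    """Return every (query_modality, target_modality, level) triple."""
    # Ordered pairs of distinct modalities: 3 * 2 = 6 directions.
    directions = list(permutations(MODALITIES, 2))
    return [(query, target, level)
            for query, target in directions
            for level in LEVELS]


settings = retrieval_settings()
for query, target, level in settings:
    print(f"{query}-to-{target} @ {level}")
```

Six directions crossed with three levels gives 18 evaluation settings in this enumeration.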

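Proposed Method

Uses HTS-AT as the audio encoder, producing 768-dimensional embeddings from mel spectrograms.
Adopts the pretrained BioCLIP 2 image and text encoders (ViT-L/14 and a 12-layer Transformer), which also produce 768-dimensional embeddings.
Implements a two-stage training strategy: Stage 1 aligns audio with text through an audio-text contrastive loss (ATC); Stage 2 jointly aligns audio, image, and text through the ATC, AIC (audio-image), and ITC (image-text) losses.
Stage 1: trains on audio-text only, using batches of audio-label pairs with randomly chosen text prompts. Stage 2: trains across all three encoders with a weighted sum of the contrastive losses, gradually increasing the weights of L_AIC and L_ITC.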
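
A minimal sketch of how the two-stage objective could be assembled, written in plain Python so it runs without a deep-learning framework. Several details here are assumptions rather than the paper's specification: the symmetric InfoNCE (CLIP-style) form of each contrastive loss, the temperature of 0.07, the linear ramp in `stage2_weight`, and all function names.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between paired batches a[i] <-> b[i] (assumed form)."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(column) for column in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def stage1_loss(audio, text):
    """Stage 1: audio-text contrastive loss (ATC) only."""
    return contrastive_loss(audio, text)


def stage2_weight(step, total_steps, max_weight=1.0):
    """Hypothetical linear ramp for the AIC and ITC weights."""
    return max_weight * min(1.0, step / max(1, total_steps))


def stage2_loss(audio, image, text, step, total_steps):
    """Stage 2: weighted sum L_ATC + w * (L_AIC + L_ITC), with w ramping up."""
    w = stage2_weight(step, total_steps)
    l_atc = contrastive_loss(audio, text)
    l_aic = contrastive_loss(audio, image)
    l_itc = contrastive_loss(image, text)
    return l_atc + w * (l_aic + l_itc)
```

With this ramp, `stage2_loss` at step 0 equals the Stage 1 loss, so the switch between stages is smooth. The schedule the authors actually use may differ.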

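Experimental Results

Research Questions

RQ1: How well does a unified VITA (visual-textual-acoustic) embedding support cross-modal retrieval among images, text, and audio on biodiversity data?
RQ2: Does two-stage training improve cross-modal alignment more than training with all modalities from the start?
RQ3: How well does BioVITA generalize to unseen species, and how does it perform at different taxonomic levels (species, genus, family)?
RQ4: How much does using scientific names versus common names in text prompts affect retrieval performance?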

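Key Findings

BioVITA (Stage 2) performs strongly on species-level cross-modal retrieval, with average Top-1 and Top-5 accuracy of 71.7% and 89.2% across the six directions.
BioVITA Stage 1 already improves audio-text alignment; Stage 2 further strengthens all directions by introducing visual information.
On the unseen-species subset, BioVITA averages 51.9% Top-1 and 73.0% Top-5 accuracy, indicating robust generalization.
In several directions, taxonomy-aware prompts and scientific names yield higher retrieval accuracy than common names.
Retrieval at the higher taxonomic levels (genus and family) remains challenging, but BioVITA captures hierarchical structure: its misclassifications show meaningful consistency at the genus and family levels.
Trait prediction results indicate that ecological traits such as behavioral traits, for example migration and habitat traits, are predicted better from the audio modality.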
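
To make the Top-1 and Top-5 figures and the genus/family consistency of errors concrete, here is a sketch of Top-k retrieval accuracy scored at a chosen taxonomic level. The scoring rule (a query counts as correct if any of its top-k retrieved items shares its label at that level) is an assumption about the protocol, the function names are illustrative, and the similarity values are toy numbers rather than results from the paper.

```python
def labels_at_level(taxa, level):
    """Truncate (family, genus, species) tuples to the requested level."""
    depth = {"Family": 1, "Genus": 2, "Species": 3}[level]
    return [taxon[:depth] for taxon in taxa]


def top_k_accuracy(similarity, query_labels, gallery_labels, k=1):
    """Fraction of queries whose top-k gallery items include a label match."""
    hits = 0
    for scores, query_label in zip(similarity, query_labels):
        ranked = sorted(range(len(scores)),
                        key=lambda j: scores[j], reverse=True)[:k]
        if any(gallery_labels[j] == query_label for j in ranked):
            hits += 1
    return hits / len(query_labels)


# Toy gallery: two thrushes from the same genus and one raven.
taxa = [
    ("Turdidae", "Turdus", "merula"),
    ("Turdidae", "Turdus", "migratorius"),
    ("Corvidae", "Corvus", "corax"),
]

# Toy similarity matrix (rows: queries, columns: gallery items).
similarity = [
    [0.2, 0.9, 0.1],  # merula query ranks the other Turdus species first
    [0.1, 0.8, 0.3],  # migratorius query ranks itself first
    [0.7, 0.2, 0.6],  # corax query ranks a thrush first
]

for level in ("Species", "Genus", "Family"):
    labels = labels_at_level(taxa, level)
    accuracy = top_k_accuracy(similarity, labels, labels, k=1)
    print(f"Top-1 @ {level}: {accuracy:.3f}")
```

In this toy case the first query misses at the species level but matches at the genus level, which is the kind of hierarchically consistent error the findings describe.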

This review was generated by AI and reviewed by human editors.