QUICK REVIEW

[Paper Review] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda, Kaede Shiohara|arXiv (Cornell University)|Mar 25, 2026

Animal Vocal Communication and Behavior|Citations: 0

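One-line summary

BioVITA provides a million-scale trimodal dataset of audio, images, and text, a unified representation model trained in two stages, and a comprehensive retrieval benchmark spanning six directions and three taxonomic levels, advancing visual-textual-acoustic alignment for biodiversity research.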

ABSTRACT

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/

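Motivation and objectives

Construction of BioVITATrain: a million-scale training dataset with audio, image, and taxonomic text annotations covering 14k species and 34 ecological traits.
Development of BioVITAModel: a unified audio-image-text representation model trained with a two-stage framework that aligns audio, images, and text.
Creation of BioVITABench: a species-level cross-modal retrieval benchmark spanning six directions and three taxonomic levels, enabling comprehensive evaluation.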
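The "six directions × three levels" structure of the benchmark can be made concrete with a short enumeration. This is a minimal sketch: the modality and level names come from the abstract, while the function name and tuple layout are hypothetical and not part of the BioVITA release.

```python
from itertools import permutations

# Modalities and taxonomic levels as named in the abstract.
MODALITIES = ("image", "audio", "text")
LEVELS = ("Family", "Genus", "Species")


def benchmark_settings():
    """Return every (query modality, target modality, level) combination.

    Hypothetical helper for illustration only.
    """
    return [
        (query, target, level)
        # Ordered pairs of distinct modalities give the six retrieval directions.
        for query, target in permutations(MODALITIES, 2)
        for level in LEVELS
    ]


settings = benchmark_settings()
```

Each ordered pair of distinct modalities is one retrieval direction, so the three modalities give six directions, and crossing them with the three levels gives the full grid of evaluation settings.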

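Proposed method

Uses HTS-AT as the audio encoder, producing 768-dimensional embeddings from mel spectrograms.
Adopts the pretrained BioCLIP 2 image and text encoders (ViT-L/14 and a 12-layer Transformer), which produce 768-dimensional embeddings.
Implements a two-stage training strategy: Stage 1 aligns audio with text using an audio-text contrastive loss (ATC); Stage 2 jointly aligns audio, images, and text with a weighted sum of the ATC, AIC (audio-image), and ITC (image-text) losses.
Stage 1: trains on audio-text only, using batches of audio-label pairs with randomly chosen text prompts. Stage 2: trains all three encoders, gradually increasing the contrastive-loss weights so that L_AIC and L_ITC carry more weight.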
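The two-stage objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the review does not give the exact form of the contrastive loss, so a symmetric InfoNCE loss is assumed, and the temperature of 0.07 and the default loss weights are placeholder values. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    ASSUMPTION: the loss form and temperature are not stated in the review.
    """
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        for u in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


def stage1_loss(audio, text):
    """Stage 1: audio-text contrastive loss (ATC) only."""
    return contrastive_loss(audio, text)


def stage2_loss(audio, image, text, w_atc=1.0, w_aic=1.0, w_itc=1.0):
    """Stage 2: weighted sum of ATC, AIC, and ITC (weights are placeholders)."""
    return (
        w_atc * contrastive_loss(audio, text)
        + w_aic * contrastive_loss(audio, image)
        + w_itc * contrastive_loss(image, text)
    )
```

In this sketch, a weight schedule would amount to calling `stage2_loss` with `w_aic` and `w_itc` raised over the course of training. The actual schedule is not given in the review.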

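Experimental results

Research questions

RQ1: How well can a unified VITA (visual-textual-acoustic) embedding support cross-modal retrieval among images, text, and audio in biodiversity data?
RQ2: Does the two-stage training approach improve cross-modal alignment over training with all modalities from the start?
RQ3: How well does BioVITA generalize to unseen species, and how does it perform at the species, genus, and family levels?
RQ4: How does the choice between scientific and common names in text prompts affect retrieval performance?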

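Key findings

BioVITA (Stage 2) achieves strong species-level cross-modal retrieval, averaging 71.7% Top-1 and 89.2% Top-5 across the six directions.
BioVITA Stage 1 already improves audio-text alignment, and Stage 2 further strengthens all directions by incorporating visual cues.
On the unseen-species subset, BioVITA achieves an average Top-1 of 51.9% and Top-5 of 73.0%, indicating robust generalization.
Taxonomic prompting and scientific names yield higher retrieval accuracy than common names in several directions.
Retrieval at higher levels (genus/family) remains challenging, but BioVITA captures the hierarchical structure: even its misclassifications show meaningful consistency at the genus and family levels.
Trait-prediction results indicate that ecological traits such as behavioral characteristics (locomotion, habitat) are predicted better from the audio modality.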
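The Top-1 and Top-5 figures above are retrieval accuracies, and the remark about genus- and family-level consistency can be illustrated by scoring the same ranking at different taxonomic levels. The sketch below rests on assumptions: the review does not state how the benchmark defines a hit at each level, so a query is counted as a hit when any of its top-k gallery items shares its label at that level. The embeddings are toy values and the helper names are hypothetical.

```python
def dot(u, v):
    """Inner product used as the similarity score."""
    return sum(x * y for x, y in zip(u, v))


def top_k_accuracy(query_embs, query_labels, gallery_embs, gallery_labels, k=1):
    """Fraction of queries whose k best-scoring gallery items include a matching label."""
    hits = 0
    for q, q_label in zip(query_embs, query_labels):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda j: dot(q, gallery_embs[j]),
            reverse=True,
        )
        if any(gallery_labels[j] == q_label for j in ranked[:k]):
            hits += 1
    return hits / len(query_embs)


def labels_at_level(taxa, level):
    """Extract the label for one taxonomic level from a list of taxon records."""
    return [t[level] for t in taxa]


# Toy gallery: two crows from the same genus and one tit (illustrative values only).
gallery_taxa = [
    {"family": "Corvidae", "genus": "Corvus", "species": "Corvus corax"},
    {"family": "Corvidae", "genus": "Corvus", "species": "Corvus corone"},
    {"family": "Paridae", "genus": "Parus", "species": "Parus major"},
]
gallery_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# One query whose true species is Corvus corone but whose nearest item is Corvus corax.
query_taxa = [{"family": "Corvidae", "genus": "Corvus", "species": "Corvus corone"}]
query_embs = [[0.8, 0.3]]

species_top1 = top_k_accuracy(
    query_embs, labels_at_level(query_taxa, "species"),
    gallery_embs, labels_at_level(gallery_taxa, "species"), k=1,
)
genus_top1 = top_k_accuracy(
    query_embs, labels_at_level(query_taxa, "genus"),
    gallery_embs, labels_at_level(gallery_taxa, "genus"), k=1,
)
```

In this toy case the query misses at the species level for k=1 but hits at the genus level, which is the kind of hierarchy-consistent error the findings describe.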

This review was generated by AI and checked by a human editor.