QUICK REVIEW

[Paper Review] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda, Kaede Shiohara | arXiv (Cornell University) | 2026. 03. 25.

Animal Vocal Communication and Behavior | Citations: 0

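One-Line Summary

BioVITA advances visual-textual-acoustic alignment for biodiversity research by introducing a million-scale trimodal dataset (audio, image, text), a unified representation model trained in two stages, and a comprehensive cross-modal retrieval benchmark spanning six retrieval directions and three taxonomic levels.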

ABSTRACT

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/

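Research Motivation and Goals

Build BioVITATrain: a million-scale training dataset of audio, images, and taxonomic text annotations covering about 14k species and 34 ecological traits.
Develop BioVITAModel: a unified audio-image-text representation model trained with a two-stage framework that aligns audio with the visual and textual modalities.
Create BioVITABench: a species-level cross-modal retrieval benchmark spanning six directions and three taxonomic levels for comprehensive evaluation.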
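
The "six directions and three taxonomic levels" of the benchmark can be enumerated directly. The snippet below is only an illustrative sketch: the modality and level names come from the abstract, while the variable names and the `src-to-dst` string format are assumptions made here for readability.

```python
from itertools import permutations

# Modalities and taxonomic levels as named in the abstract.
MODALITIES = ("image", "audio", "text")
LEVELS = ("Family", "Genus", "Species")

# Every ordered pair of distinct modalities is one retrieval direction.
directions = [f"{src}-to-{dst}" for src, dst in permutations(MODALITIES, 2)]

# Each direction is evaluated at each taxonomic level.
settings = [(direction, level) for direction in directions for level in LEVELS]

print(len(directions))  # 6
print(len(settings))    # 18
```

Three modalities give 3 x 2 = 6 ordered pairs, and crossing them with the three levels gives 18 evaluation settings.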

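Proposed Method

HTS-AT serves as the audio encoder, producing 768-d embeddings from Mel-spectrograms.
The pretrained BioCLIP 2 image and text encoders (ViT-L/14 and a 12-layer Transformer) are adopted to produce 768-d embeddings.
Training proceeds in two stages: Stage 1 aligns audio with text through an audio-text contrastive loss (ATC); Stage 2 jointly aligns audio, image, and text with a weighted sum of the ATC, AIC (audio-image), and ITC (image-text) losses.
Stage 1 trains only the audio-text pairing, using batches of audio-label pairs with randomly chosen text prompts; Stage 2 trains the three encoders on the weighted sum of contrastive losses while gradually increasing the weights of L_AIC and L_ITC.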
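
The two-stage objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the linear ramp, and all function names (`info_nce`, `ramp`, `stage1_loss`, `stage2_loss`) are assumptions, since the review only states that Stage 2 uses a weighted sum of ATC, AIC, and ITC with gradually increasing weights on L_AIC and L_ITC.

```python
import math


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss over paired embeddings a[i] <-> b[i].

    Assumes the embeddings are already L2-normalised lists of floats.
    """
    n = len(a)
    # logits[i][j] = <a_i, b_j> / temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the matching pair (the diagonal) is the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


def ramp(step, warmup_steps, max_weight=1.0):
    """Hypothetical linear schedule that grows from 0 to max_weight."""
    return max_weight * min(1.0, step / warmup_steps)


def stage1_loss(audio, text):
    """Stage 1: audio-text contrastive loss (ATC) only."""
    return info_nce(audio, text)


def stage2_loss(audio, image, text, step, warmup_steps=1000):
    """Stage 2: ATC plus AIC and ITC, whose weight is ramped up over time."""
    w = ramp(step, warmup_steps)
    l_atc = info_nce(audio, text)
    l_aic = info_nce(audio, image)
    l_itc = info_nce(image, text)
    return l_atc + w * (l_aic + l_itc)
```

At `step=0` the Stage 2 loss in this sketch reduces to the Stage 1 loss, so the audio-text alignment learned first is not disrupted abruptly when the image terms are introduced.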

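Experimental Results

Research Questions

RQ1: How well do unified VITA (visual-textual-acoustic) embeddings support cross-modal retrieval among images, text, and audio for biodiversity data?
RQ2: Does the two-stage training scheme improve cross-modal alignment more than training on all modalities from the start?
RQ3: How well does BioVITA generalize to unseen species, and how does it perform at different taxonomic levels (Species, Genus, Family)?
RQ4: How does using scientific names versus common names in text prompts affect retrieval performance?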

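Key Results

BioVITA (Stage 2) achieves strong species-level cross-modal retrieval across the six directions, with average Top-1 and Top-5 accuracies of 71.7% and 89.2%, respectively.
BioVITA Stage 1 already improves audio-text alignment, and Stage 2 introduces visual signals that further improve every direction.
On the unseen-species set, BioVITA reaches an average Top-1 of 51.9% and Top-5 of 73.0%, indicating strong generalization.
Taxonomy-informed prompts and scientific names yield higher retrieval accuracy than common names in several directions.
Retrieval at higher levels (Genus/Family) remains harder, but BioVITA captures hierarchical structure: even its misclassifications show meaningful genus- and family-level consistency.
Trait prediction results suggest that the audio modality is the better predictor for ecological traits such as migration (a behavioral trait) and trohabitat.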
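
To make the Top-1 and Top-5 figures above concrete, here is a minimal sketch of how Top-k retrieval accuracy might be computed at a given taxonomic level. It is not the benchmark's actual protocol: the dot-product scoring, the rule that a hit is any top-k item sharing the query's label at that level, the function names, and the toy embeddings are all assumptions made for illustration.

```python
LEVELS = ("family", "genus", "species")


def label_at(taxonomy, level):
    """Truncate a (family, genus, species) tuple at the requested level."""
    return taxonomy[: LEVELS.index(level) + 1]


def top_k_accuracy(queries, gallery, query_taxa, gallery_taxa, level, k=1):
    """Fraction of queries whose k best-scoring gallery items contain a match."""
    hits = 0
    for q, q_tax in zip(queries, query_taxa):
        # Dot-product similarity between the query and every gallery item.
        scores = [sum(x * y for x, y in zip(q, g)) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
        target = label_at(q_tax, level)
        if any(label_at(gallery_taxa[j], level) == target for j in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy audio-to-image example with made-up 2-d embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
audio_taxa = [
    ("Corvidae", "Corvus", "Corvus corax"),
    ("Paridae", "Parus", "Parus major"),
]
images = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
image_taxa = [
    ("Corvidae", "Corvus", "Corvus corone"),
    ("Corvidae", "Corvus", "Corvus corax"),
    ("Paridae", "Parus", "Parus major"),
]

print(top_k_accuracy(audio, images, audio_taxa, image_taxa, "species", k=1))  # 0.5
print(top_k_accuracy(audio, images, audio_taxa, image_taxa, "genus", k=1))    # 1.0
print(top_k_accuracy(audio, images, audio_taxa, image_taxa, "species", k=2))  # 1.0
```

In the toy data the first query retrieves the wrong species but the right genus, which mirrors the kind of genus- and family-level consistency among misclassifications noted in the results.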


This review was generated by AI and reviewed by a human editor.