QUICK REVIEW

[Paper Review] NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

David J. Robinson, Marius Miron | arXiv (Cornell University) | November 11, 2024

Animal Vocal Communication and Behavior | Citations: 5

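One-Line Summary

NatureLM-audio is the first audio-language foundation model tailored to bioacoustics. It achieves zero-shot state-of-the-art performance on many BEANS-Zero tasks and transfers representations learned from speech and music to the bioacoustics domain.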

ABSTRACT

Large language models (LLMs) prompted with text and audio have achieved state-of-the-art performance across various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, their potential has yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior -- tasks that are crucial for conservation, biodiversity monitoring, and animal behavior studies. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music, designed to address the field's limited availability of annotated data. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. We evaluate NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets a new state of the art on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we release our model weights, benchmark data, and open-source the code for training and benchmark data generation and model training.

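Motivation and Objectives

Develop an audio-language foundation model specialized for bioacoustics that handles classification, detection, and captioning tasks.
Improve generalization in bioacoustics by exploiting cross-domain transfer from speech, music, and general audio.
Extend bioacoustics evaluation through the BEANS-Zero benchmark to cover unseen taxa and new tasks (captioning, life stage, counting individuals).
Release open-source training and benchmarking data to accelerate research and reproducibility.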

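Proposed Method

Use an audio-text architecture in which a pretrained BEATs audio encoder feeds a Q-Former, whose outputs are passed to an LLM (Llama-3.1-8b) adapted with LoRA.
Train in two stages inspired by curriculum learning: stage 1 (perception pretraining) focuses on species classification; stage 2 (generalization fine-tuning) adds detection, captioning, life stage, call type, and speech/music data.
Build a diverse text-audio training set spanning bioacoustics, speech, and music, including prompt-based labeling and procedurally augmented data.
Extend BEANS to BEANS-Zero to evaluate zero-shot transfer to unseen taxa and new tasks (captioning, counting).
Compare against baselines (CLAP-like models, BirdNET, Perch, SALMONN, Qwen-audio) and show state-of-the-art zero-shot performance on several BEANS-Zero datasets.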
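
To illustrate the prompt-based text-audio pairs and the two-stage curriculum listed above, here is a minimal Python sketch. The task names follow this review, but the prompt wording, the `TrainingInstance` type, the `STAGE_TASKS` mapping, and the `build_instances` helper are all hypothetical and are not taken from the NatureLM-audio code. Speech and music data are left out for brevity, and the review does not say whether stage 2 also keeps the stage 1 classification data, so the sketch does not assume it.

```python
from dataclasses import dataclass

# Hypothetical prompt templates, one per task named in the review.
PROMPTS = {
    "species_classification": "What is the common name of the focal species in the audio?",
    "detection": "Is there an animal vocalization in this recording?",
    "captioning": "Caption the audio.",
    "life_stage": "What is the life stage of the animal in the audio?",
    "call_type": "What type of call is this?",
}

# Assumed task mix per stage: stage 1 is species classification only,
# stage 2 covers the broader tasks the review lists for it.
STAGE_TASKS = {
    1: ["species_classification"],
    2: ["detection", "captioning", "life_stage", "call_type"],
}


@dataclass(frozen=True)
class TrainingInstance:
    """One text-audio pair: an audio clip, an instruction, and the target text."""
    audio_path: str
    instruction: str
    answer: str


def build_instances(audio_path, labels, stage):
    """Turn one labeled clip into instruction/answer pairs for a given stage.

    `labels` maps task name -> ground-truth text. Tasks that are not part of
    the stage, or that have no label for this clip, are skipped.
    """
    return [
        TrainingInstance(audio_path, PROMPTS[task], labels[task])
        for task in STAGE_TASKS[stage]
        if task in labels
    ]


if __name__ == "__main__":
    clip_labels = {"species_classification": "Zebra Finch", "call_type": "song"}
    for stage in (1, 2):
        for inst in build_instances("clips/example.wav", clip_labels, stage):
            print(stage, inst.instruction, "->", inst.answer)
```

Keeping the stage definition as plain data makes it easy to see which tasks a clip contributes to in each stage, which is the point of a curriculum: the same labeled clip yields different training instances as the task mix widens.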

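Experimental Results

Research Questions

RQ1: Can an audio-language foundation model trained on bioacoustics, speech, and music generalize to unseen taxa and tasks in bioacoustics?
RQ2: Does transferring representations from speech and music improve zero-shot classification and detection in bioacoustics?
RQ3: How does NatureLM-audio perform on the new BEANS-Zero tasks (e.g., life stage, call type, captioning, counting zebra finch individuals)?
RQ4: How does excluding speech/music data affect performance on downstream bioacoustics tasks?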

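Key Results

NatureLM-audio achieves zero-shot state-of-the-art performance on several BEANS-Zero tasks, including classification of unseen species.
Cross-domain transfer from speech and music to bioacoustics improves the model's generalization to unseen taxa.
It sets a new state of the art on the new BEANS-Zero tasks (e.g., life stage, call type, captioning, zebra finch counting).
In the unseen-species evaluation, NatureLM-audio substantially outperformed general-domain models and CLAP-based approaches.
The ablation analysis shows that including speech/music data in stage 2 training meaningfully improves zebra finch counting performance.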
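
To make the zero-shot classification setting concrete, here is a minimal Python sketch of how free-text model outputs could be scored against a closed label set. The matching rule (nearest label by `difflib` string similarity) and the function names `normalize`, `match_label`, and `zero_shot_accuracy` are assumptions chosen for illustration. This review does not describe how BEANS-Zero maps generated text to labels, so this should not be read as the benchmark's actual scoring procedure.

```python
import difflib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are not errors."""
    return " ".join(text.lower().split())


def match_label(generated, labels):
    """Map a free-text model output to the most similar candidate label."""
    norm = normalize(generated)
    return max(
        labels,
        key=lambda lab: difflib.SequenceMatcher(None, norm, normalize(lab)).ratio(),
    )


def zero_shot_accuracy(outputs, targets, labels):
    """Fraction of outputs whose nearest label equals the ground-truth label."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not outputs:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(match_label(o, labels) == t for o, t in zip(outputs, targets))
    return correct / len(outputs)


if __name__ == "__main__":
    # Made-up labels and outputs, for illustration only.
    candidate_labels = ["Zebra Finch", "Humpback Whale", "House Sparrow"]
    model_outputs = ["zebra  finch", "humpback whale.", "house sparrow"]
    ground_truth = ["Zebra Finch", "Humpback Whale", "Zebra Finch"]
    print(zero_shot_accuracy(model_outputs, ground_truth, candidate_labels))
```

Nearest-label matching always returns some label, even for an unrelated output, so a stricter variant might add a similarity threshold and count anything below it as wrong.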

Figure 2: Examples of training instances

This review was generated by AI and reviewed by a human editor.