[Paper Explainer] NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
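NatureLM-audio is the first audio-language foundation model built specifically for bioacoustics. It achieves state-of-the-art zero-shot performance on several BEANS-Zero tasks and transfers representations learned from speech and music to the bioacoustics domain.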
Large language models (LLMs) prompted with text and audio have achieved state-of-the-art performance across various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, their potential has yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior. These tasks are crucial for conservation, biodiversity monitoring, and animal behavior studies. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music, designed to address the field's limited availability of annotated data. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. We evaluate NatureLM-audio on a novel benchmark (BEANS-Zero), where it sets a new state of the art on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we release our model weights and benchmark data, and open-source the code for generating the training and benchmark data and for training the model.
Research Motivation and Goals
- Develop an audio-language foundation model specialized for bioacoustics to tackle classification, detection, and captioning tasks.
- Leverage cross-domain transfer from speech, music, and general audio to improve bioacoustic generalization.
- Extend bioacoustic evaluation with the BEANS-Zero benchmark to cover unseen taxa and new tasks (captioning, lifestage prediction, counting).
- Open-source training and benchmarking data to accelerate research and reproducibility in bioacoustics.
Proposed Method
- Use an audio-to-text architecture: a pretrained BEATs audio encoder feeds a Q-Former, whose outputs are passed to an LLM (Llama-3.1-8B) fine-tuned with LoRA adapters.
- Train in two stages inspired by curriculum learning: Stage 1, perception pretraining on focal species classification; Stage 2, generalization fine-tuning on detection, captioning, lifestage prediction, and call-type prediction, together with speech and music data.
- Curate a diverse text-audio training set spanning bioacoustics, speech, and music, including prompt-based labeling and procedurally augmented data.
- Extend BEANS with BEANS-Zero to assess zero-shot transfer to unseen taxa and novel tasks (captioning, counting).
- Compare against baselines (CLAP-like models, BirdNET, Perch, SALMONN, Qwen-audio) and demonstrate SotA zero-shot performance on several BEANS-Zero datasets.
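The LoRA adapters mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists instead of a deep learning framework, and the matrix sizes, the rank of 1, the `alpha` value, and the names `lora_forward`, `down`, and `up` are all assumptions chosen for illustration. It shows only the general LoRA idea: the pretrained weight stays frozen, and a small low-rank product is added to its output.

```python
# Minimal LoRA sketch in pure Python. All sizes and values are illustrative
# assumptions, not settings taken from the NatureLM-audio paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, down, up, alpha):
    """Return x @ W + (alpha / r) * (x @ down) @ up.

    w_frozen: (d_in, d_out) pretrained weight, never updated.
    down:     (d_in, r) trainable projection to the low-rank space.
    up:       (r, d_out) trainable projection back, zero-initialized.
    """
    rank = len(up)  # number of rows in `up` is the adapter rank r
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

x = [[1.0, 2.0, 3.0]]                            # one input vector, d_in = 3
w_frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_out = 2
down = [[0.5], [0.5], [0.5]]                     # rank r = 1
up_zero = [[0.0, 0.0]]                           # state before any training
up_trained = [[1.0, -1.0]]                       # hypothetical state after training

out_initial = lora_forward(x, w_frozen, down, up_zero, alpha=2.0)
out_adapted = lora_forward(x, w_frozen, down, up_trained, alpha=2.0)
print(out_initial)  # [[4.0, 5.0]]
print(out_adapted)  # [[10.0, -1.0]]
```

With `up` initialized to zero, the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained model's behavior and only the small `down` and `up` matrices need to be trained.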

Experimental Results
Research Questions
- RQ1: Can an audio-language foundation model trained on bioacoustics, speech, and music generalize to unseen taxa and tasks in bioacoustics?
- RQ2: Does transferring representations from speech and music improve bioacoustic zero-shot classification/detection?
- RQ3: How well does NatureLM-audio perform on new BEANS-Zero tasks such as captioning and individual counting?
- RQ4: What is the effect of excluding speech/music data on performance in downstream bioacoustic tasks?
Key Findings
- NatureLM-audio achieves state-of-the-art zero-shot performance on multiple BEANS-Zero tasks, including unseen species classification.
- The model shows strong cross-domain transfer from speech and music to bioacoustics, improving generalization to unseen taxa.
- On the novel BEANS-Zero tasks (e.g., lifestage, call-type, captioning, zebra finch counting), the model sets a new SotA.
- In unseen-species evaluation, NatureLM-audio substantially outperforms baseline general-domain models and CLAP-based approaches.
- Ablation shows including speech/music data in stage-2 training meaningfully improves counting performance for zebra finches.
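Because an audio-language model answers in free text, zero-shot classification results like those above have to be scored by mapping each generated string onto a fixed label set. The sketch below shows one plausible way to do that. The fuzzy-matching rule (Python's standard `difflib.get_close_matches`), the helper names `match_to_label` and `zero_shot_accuracy`, and the example species and outputs are all assumptions for illustration, not the paper's documented evaluation protocol or data.

```python
import difflib

def match_to_label(generated, candidate_labels):
    """Map free-form model output to the closest candidate label (case-insensitive)."""
    lowered = {label.lower(): label for label in candidate_labels}
    # cutoff=0.0 forces a match whenever at least one candidate label exists.
    hits = difflib.get_close_matches(
        generated.strip().lower(), list(lowered), n=1, cutoff=0.0
    )
    return lowered[hits[0]] if hits else None

def zero_shot_accuracy(generations, references, candidate_labels):
    """Fraction of generations whose matched label equals the reference label."""
    correct = sum(
        match_to_label(g, candidate_labels) == r
        for g, r in zip(generations, references)
    )
    return correct / len(references)

# Hypothetical label set and model outputs.
labels = ["Zebra Finch", "Humpback Whale", "Common Raven"]
generations = ["zebra finch", "Humpback whale.", "common ravn", "humpback whale"]
references = ["Zebra Finch", "Humpback Whale", "Common Raven", "Zebra Finch"]

accuracy = zero_shot_accuracy(generations, references, labels)
print(accuracy)  # 0.75
```

Forcing a match with `cutoff=0.0` means every output counts toward some label; a stricter cutoff would instead treat unrecognizable outputs as errors, which is a design choice that can shift reported accuracy.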

This explainer was generated by AI and reviewed by a human editor.