QUICK REVIEW

[Paper Review] SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu | arXiv (Cornell University) | October 20, 2023

Music and Audio Processing | Citations: 18

One-Line Summary

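SALMONN is an open neural network that integrates dual auditory encoders with an LLM to perceive and reason over general speech, audio, and music inputs, enabling both the cross-modal abilities it was trained on and emergent cross-modal abilities.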

ABSTRACT

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

Motivation and Goals

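Motivates the need for AI that can perceive and understand general auditory information (speech, audio events, and music), not speech alone.
Proposes SALMONN, a single multimodal LLM that fuses a speech encoder and an audio encoder with an LLM to handle a wide range of audio tasks.
Investigates emergent cross-modal abilities and explores how to activate them with a few-shot activation tuning stage.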

Proposed Method

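Uses a dual auditory encoder setup that combines the Whisper (speech) and BEATs (non-speech audio) encoders in a single model.
Uses a window-level Q-Former as the connection module to produce augmented audio tokens aligned with the LLM input space.
Fine-tunes LoRA adapters to align the augmented input space with the LLM output space, while the LLM and the encoders stay frozen.
Pre-trains on speech recognition and audio captioning data to establish cross-modal alignment between audio and text.
Performs instruction tuning on a collection of speech, audio, and music tasks to shape task-specific behavior.
Introduces an activation tuning stage that lowers the LoRA scaling to awaken cross-modal emergent abilities without overfitting to the trained tasks.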
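
The dual-encoder and window-level steps above can be illustrated with a small sketch. It is not the authors' implementation: it assumes both encoders emit the same number of frames, and it uses mean pooling as a stand-in for the Q-Former, which in the real model uses trainable queries and cross-attention. The function names and the window_size parameter are hypothetical. The sketch only shows why window-level processing makes the number of audio tokens grow with the length of the input.

```python
from typing import List

Frame = List[float]


def concat_encoder_frames(speech: List[Frame], audio: List[Frame]) -> List[Frame]:
    """Concatenate per-frame features of two encoders along the feature axis.

    Assumption: both encoders produce the same number of frames.
    """
    if len(speech) != len(audio):
        raise ValueError("both encoders must yield the same number of frames")
    return [s + a for s, a in zip(speech, audio)]


def split_into_windows(frames: List[Frame], window_size: int) -> List[List[Frame]]:
    """Split frames into consecutive non-overlapping windows (the last may be shorter)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]


def pool_window(window: List[Frame]) -> Frame:
    """Stand-in for the Q-Former: mean-pool one window into a single token."""
    dim = len(window[0])
    return [sum(frame[d] for frame in window) / len(window) for d in range(dim)]


def window_level_tokens(speech: List[Frame], audio: List[Frame],
                        window_size: int) -> List[Frame]:
    """Fuse both encoder outputs, then emit one token per window."""
    fused = concat_encoder_frames(speech, audio)
    return [pool_window(w) for w in split_into_windows(fused, window_size)]


if __name__ == "__main__":
    speech = [[1.0], [3.0], [5.0], [7.0], [9.0]]  # toy speech-encoder frames
    audio = [[0.0], [2.0], [4.0], [6.0], [8.0]]   # toy audio-encoder frames
    for token in window_level_tokens(speech, audio, window_size=2):
        print(token)
```

With five input frames and a window of two frames, the sketch yields three tokens, so a longer clip would produce proportionally more tokens for the LLM.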

Experimental Results

Research Questions

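RQ1: Can a single model perceive and understand general audio inputs consisting of speech, audio events, and music?
RQ2: Do cross-modal emergent abilities exist in such a model, and can they be activated with a lightweight training technique?
RQ3: How does activation tuning affect performance on trained tasks and on untrained cross-modal tasks?
RQ4: Which data, prompts, and architectural choices are needed to align audio encodings with the LLM for end-to-end reasoning?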

Key Findings

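SALMONN achieves competitive performance on trained tasks such as ASR, translation, and audio captioning.
Activation tuning enables emergent abilities such as audio-based storytelling and speech audio co-reasoning, and improves performance on level-2 and level-3 tasks.
Discounting (reducing) the LoRA scaling factor at test time can reveal cross-modal reasoning abilities without further training.
Activation tuning substantially improves performance on challenging tasks (SQQA, Story, SAC, etc.).
After activation tuning, the model keeps strong performance on the trained tasks while gaining new emergent abilities.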
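
The role of the LoRA scaling factor in the findings above can be made concrete with a minimal sketch. It assumes the standard LoRA formulation, in which the effective weight is W = W0 + scale * (B A), with W0 the frozen weight and B, A the low-rank adapter matrices. The helper names and the toy numbers are hypothetical and do not come from the SALMONN code base. Lowering scale moves the layer back toward the frozen LLM weight, which is one way to reduce the influence of task-specific tuning.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w0: Matrix, b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Return W0 + scale * (B @ A), the standard LoRA-adapted weight."""
    delta = matmul(b, a)
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]


if __name__ == "__main__":
    w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (toy values)
    b = [[1.0], [2.0]]             # rank-1 adapter factor B (2 x 1)
    a = [[3.0, 4.0]]               # rank-1 adapter factor A (1 x 2)
    for scale in (1.0, 0.5, 0.0):  # full, discounted, and disabled adapter
        print(scale, lora_effective_weight(w0, b, a, scale))
```

At scale 0.0 the layer reduces to the frozen weight, and intermediate values interpolate between the original LLM and the fully adapted model.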

This review was generated by AI and reviewed by a human editor.