QUICK REVIEW

[Paper Review] SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu|arXiv (Cornell University)|Oct 20, 2023

Music and Audio Processing | Citations: 18

One-line summary

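SALMONN is a speech audio language music open neural network that perceives and reasons over general audio input. It integrates dual auditory encoders with an LLM, supporting both the tasks it was trained on and new cross-modal abilities.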

ABSTRACT

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

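Motivation and objectives

Motivates the need for AI that can perceive and understand general auditory information: not only speech, but also audio events and music.
Proposes SALMONN, a single multimodal LLM that fuses speech and audio encoders with an LLM so that it can handle a diverse range of audio tasks.
Investigates emergent cross-modal abilities and how a few-shot activation tuning stage can activate them.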

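Proposed method

Uses a dual auditory encoder structure that combines a Whisper encoder (speech) and a BEATs encoder (non-speech audio) within a single model.
Uses a window-level Q-Former as the connection module to produce augmented audio tokens aligned with the LLM input space.
Uses LoRA adapters to align the augmented input space with the LLM output space, fine-tuning while the LLM and the encoders remain frozen.
Pre-trains on speech recognition and audio captioning data to establish cross-modal alignment between audio and text.
Performs instruction tuning on a set of speech, audio, and music tasks to shape task-specific behavior.
Introduces an activation tuning stage that lowers the LoRA scaling so that the model does not overfit to the trained tasks and its cross-modal emergent abilities are awakened.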
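The first two steps of the method (fusing the two encoder streams, then summarizing them window by window) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature sizes, and the window length are hypothetical, and plain mean pooling stands in for the Q-Former, which in the actual model uses learned queries and cross-attention. The sketch only shows how frame-aligned features might be concatenated and grouped into fixed-size windows, so that the number of audio tokens grows with the length of the input.

```python
from typing import List

Frame = List[float]  # one feature vector for one time step


def concat_encoder_frames(speech: List[Frame], audio: List[Frame]) -> List[Frame]:
    """Concatenate frame-aligned features from two encoders along the feature axis."""
    if len(speech) != len(audio):
        raise ValueError("both encoders must yield the same number of frames")
    return [s + a for s, a in zip(speech, audio)]


def split_into_windows(frames: List[Frame], window: int) -> List[List[Frame]]:
    """Split a frame sequence into consecutive, non-overlapping windows.

    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def mean_pool(window_frames: List[Frame]) -> Frame:
    """Placeholder for the Q-Former (assumption): average the frames in one window."""
    count = len(window_frames)
    dim = len(window_frames[0])
    return [sum(frame[d] for frame in window_frames) / count for d in range(dim)]


def audio_tokens(speech: List[Frame], audio: List[Frame], window: int) -> List[Frame]:
    """Produce one token per window from the fused encoder features."""
    fused = concat_encoder_frames(speech, audio)
    return [mean_pool(w) for w in split_into_windows(fused, window)]


if __name__ == "__main__":
    # Toy input: 5 frames, 2 speech features and 1 audio feature per frame.
    speech_feats = [[1.0, 2.0] for _ in range(5)]
    audio_feats = [[3.0] for _ in range(5)]
    tokens = audio_tokens(speech_feats, audio_feats, window=2)
    print(len(tokens), tokens[0])
```

Because each window yields its own token, a longer clip produces more tokens rather than being squeezed into a fixed-size summary, which is the motivation for working at the window level.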

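Experimental results

Research questions

RQ1: Can a single model perceive and understand general audio input, including speech, audio events, and music?
RQ2: Do cross-modal emergent abilities exist in such a model, and can a lightweight training method activate them?
RQ3: How does activation tuning affect performance on trained tasks and on untrained cross-modal tasks?
RQ4: What data, prompts, and architectural design are needed to align audio encodings with the LLM for end-to-end inference?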

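Key findings

SALMONN achieves competitive results on trained tasks such as ASR, translation, and audio captioning.
Activation tuning enables emergent abilities such as audio-based storytelling and speech audio co-reasoning, and improves performance on level 2 and level 3 tasks.
Discounting the LoRA scaling factor at test time can reveal cross-modal reasoning abilities in a few-shot setting.
Activation tuning substantially raises the following rate on difficult tasks (e.g., SQQA, Story, SAC).
The model retains strong performance on trained tasks while gaining new emergent abilities after activation tuning.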
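To make the idea of discounting the LoRA scaling factor concrete, here is a minimal Python sketch of the standard LoRA weight update, W = W0 + scale * (B A). The helper names and the toy matrices are hypothetical and are not taken from the SALMONN code. A scale of 0 recovers the frozen base weight, and intermediate values blend the adapter in more weakly, which is the knob a test-time discount would turn.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small list-of-lists matrices."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lora_effective_weight(w0: Matrix, lora_b: Matrix, lora_a: Matrix,
                          scale: float) -> Matrix:
    """Return W0 + scale * (B @ A), the weight a LoRA-adapted layer effectively uses."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


if __name__ == "__main__":
    # Toy rank-1 adapter on a 2x2 identity weight (illustrative values only).
    w0 = [[1.0, 0.0], [0.0, 1.0]]
    lora_b = [[1.0], [2.0]]
    lora_a = [[0.5, 0.0]]
    for scale in (2.0, 1.0, 0.0):
        print(scale, lora_effective_weight(w0, lora_b, lora_a, scale))
```

In this toy setup, lowering `scale` moves the effective weight back toward the frozen base model, which is one way to picture how a discount could weaken behavior learned during instruction tuning.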

This review was generated by AI and checked by a human editor.