QUICK REVIEW

[論文レビュー] Listen, Think, and Understand

Yuan Gong, Hongyin Luo|arXiv (Cornell University)|May 18, 2023

Music and Audio Processing被引用数 18

ひとこと要約

LTUは、強力な音声知覚エンコーダとオープンソースのLLMを組み合わせて、音声理解と新たな音声推論の両方を可能にするマルチモーダル音声基盤モデルである。音声質問応答ペアの大規模データセットOpenAQA-5Mで訓練されている。

ABSTRACT

The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general categories, but also to listen to the finer details of the sounds, explain the reason for the predictions, think about what the sound infers, and understand the scene and what action needs to be taken, if any. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities. Therefore, we ask the question: can we build a model that has both audio perception and a reasoning ability? In this paper, we propose a new audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and have used an autoregressive training framework with a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. More importantly, it exhibits emerging audio reasoning and comprehension abilities that are absent in existing audio models. To the best of our knowledge, LTU is one of the first multimodal large language models that focus on general audio (rather than just speech) understanding.

研究の動機と目的

音声の知覚を超えて音や場面について推論するAIシステムの構築を促す。
高性能な音声知覚モジュールを大規模言語モデルと統合し、統一的な音声理解と推論を実現する。
音声に基づく指示遵守モデルを訓練するための大規模で多様な音声QAデータセット(OpenAQA-5M)を作成する。
LTUが従来の音声タスク（分類、キャプション付け）と新たな音声推論の両方を実行できることを示す。
知覚から推論への学習を導く学習カリキュラム（知覚→理解）を提供する。

提案手法

音声エンコーダとしてAudio Spectrogram Transformer (AST)を用い、AudioSet-2Mでファインチューニングし、LLaMAの埋め込みに合わせて4096次元へ射影する。
LLaMAの自己注意層にLoRAアダプタを接続して、小さなパラメータ予算（4.2M）で効率的なファインチューニングを可能にする。
32個の音声埋め込みとテキスト埋め込みを組み合わせて、LLaMAベースの言語モデルへのLTU入力を形成する。
閉形式のタスクから始め、徐々にオープンエンドQAを含むように、知覚から理解へのカリキュラムを段階的に適用してLTUを訓練する。
8データセットから再ラベルされた音声とGPT支援のAudio Instruction Generation (AIG)を用いて多様な音声QAペアを生成し、OpenAQA-5Mを作成する。
過去のトークンと参照音声を条件とした次トークン予測で訓練を最適化する；生成設定は（temperature 0.1、TopK 500、TopP 0.95）を使用する。

Figure 1: The LTU architecture and real samples showing how it listens, thinks, and understands.

実験結果

リサーチクエスチョン

RQ1単一のモデルは、一般的な音声イベントを知覚しつつ、音声シーンについて推論を行うことができるのか？
RQ2音声エンコーダをLLMと統合することで、事実関係の正確さと指示遵守を伴うオープンエンドの音声質問応答が可能になるか？
RQ3音声重視のマルチモーダルLLMを効果的に訓練するには、どのデータとカリキュラムが必要か？
RQ4LTUは従来の音声タスクと出現的音声推論タスクのどちらでどのように性能を示すか？
RQ5適切なときに幻覚を避け、回答不能な質問を拒否する能力はLTUにあるか？

主な発見

LTUは8つの分類ベンチマークで従来の音声-テキストCLAPを上回り、平均相対改善率23.6%を示す。
LTUは、より多様なデータで訓練されているにもかかわらず、AudioCapsとClothoにおけるSPICE指標で最先端のキャプション生成モデルと競合する。
LTUは新たな音声推論と理解能力を示し、オープンエンドの質問応答を含み、指示遵守率約82.9%の事実的正確性を人間評価で示す。
LTUは推論時に事前定義済みのラベル集合を用いず直接テキストラベルを出力し、未知データセットやタスクに対して強い一般化を示す。
知覚から理解へのカリキュラムは重要であり、それを削除すると性能が著しく低下する。LoRAアダプタを用いた固定化されたLLaMAは、致命的な忘却なしに効果的な音声 groundingを実現する。

Figure 2: Screenshot of the human subjective evaluation questionnaire.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。