QUICK REVIEW

[Paper Review] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Rongjie Huang, Mingze Li | arXiv (Cornell University) | April 25, 2023

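Music and Audio Processing | Citations: 21

One-Line Summary

AudioGPT connects ChatGPT with audio foundation models and a modality transformer to understand and generate audio in multi-round dialogues, and it is evaluated for consistency, capability, and robustness.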

ABSTRACT

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.

Research Motivation and Goals

Motivate the need for audio-enabled LLMs and sidestep the data and compute constraints of training such a model from scratch.
Propose a system that uses ChatGPT as a general-purpose interface paired with audio foundation models and a modality transformer.
Outline evaluation criteria for multi-modal LLMs focusing on consistency, capability, and robustness.
Demonstrate AudioGPT’s ability to handle tasks involving speech, music, sound, and talking head in multi-round dialogues.

Proposed Method

Define AudioGPT as a prompt-based system comprising a modality transformer, an LLM, a prompt manager, a task handler, and a set of audio foundation models.
Transform diverse input modalities into a consistent textual query via the modality transformer.
Use a task analysis pipeline where the LLM and prompt manager extract structured task arguments for the selected audio model.
Assign and execute the selected audio foundation model with appropriate inputs to obtain task outputs.
Generate final responses by fusing model outputs with contextual information from the query and dialogue history.
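
The stages listed above can be sketched as a small pipeline. This is only an illustrative sketch, not the AudioGPT implementation: every name here (`Task`, `MODEL_REGISTRY`, `stub_tts`, `stub_asr`, `run_turn`, and the keyword rule in `analyze_task`) is hypothetical, the audio models are string-returning stubs, and the keyword match stands in for the LLM and prompt manager.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Structured task arguments produced by the task-analysis stage."""
    name: str
    args: Dict[str, str]


# Hypothetical stand-ins for audio foundation models (real models return audio or text).
def stub_tts(args: Dict[str, str]) -> str:
    return f"<waveform speaking '{args['text']}'>"


def stub_asr(args: Dict[str, str]) -> str:
    return f"<transcript of {args['audio']}>"


MODEL_REGISTRY: Dict[str, Callable[[Dict[str, str]], str]] = {
    "text-to-speech": stub_tts,
    "speech-recognition": stub_asr,
}


def modality_transform(user_input: Dict[str, str]) -> str:
    """Stage 1: turn any input modality into a textual query.

    Assumption: for "speech" inputs a real system would run ASR here; this
    stub pretends "content" already holds the recognized words.
    """
    return user_input["content"].strip()


def analyze_task(query: str) -> Task:
    """Stage 2: pick a task and extract its arguments.

    Assumption: a keyword rule replaces the LLM and prompt manager.
    """
    words = query.split()
    if "transcribe" in query.lower() and words:
        return Task("speech-recognition", {"audio": words[-1]})
    return Task("text-to-speech", {"text": query})


def generate_response(task: Task, output: str, history: List[str]) -> str:
    """Stage 4: combine the model output with dialogue context."""
    return f"Turn {len(history) + 1}: ran {task.name} -> {output}"


def run_turn(user_input: Dict[str, str], history: List[str]) -> str:
    """Run one dialogue round through all four stages."""
    query = modality_transform(user_input)
    task = analyze_task(query)
    output = MODEL_REGISTRY[task.name](task.args)  # Stage 3: model assignment
    response = generate_response(task, output, history)
    history.append(query)
    return response


if __name__ == "__main__":
    dialogue: List[str] = []
    print(run_turn({"kind": "text", "content": "Please transcribe clip.wav"}, dialogue))
    print(run_turn({"kind": "speech", "content": "Say hello"}, dialogue))
```

Keeping the models behind a registry keyed by task name mirrors the idea that the LLM only selects and parameterizes a model, while the model itself does the audio work.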

Experimental Results

Research Questions

RQ1: How effectively can an LLM coordinate multiple audio foundation models to understand and generate diverse audio modalities in a spoken dialogue?
RQ2: Can AudioGPT maintain context and handle multi-round interactions across speech, music, sound, and talking head tasks?
RQ3: What are the design principles for evaluating consistency, capability, and robustness of multi-modal LLMs like AudioGPT?

Key Results

AudioGPT enables multi-round dialogue capabilities over speech, music, sound, and talking head tasks by connecting ChatGPT with audio foundation models.
The evaluation covers consistency, capability, and robustness, and it includes human-annotated prompt tests and crowd-sourced assessments.
Audio foundation models are demonstrated across a broad set of tasks with metrics such as WER, BLEU, MOS, PESQ, STOI, FID, and others (per task).
The system supports diverse outputs (waveforms, text, audio, video) depending on the task, including talking head synthesis and text-to-speech.
Experiments indicate AudioGPT can maintain conversation context and handle follow-ups in a 12-round dialogue case.
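
Of the metrics named above, WER (word error rate) is the simplest to illustrate. The following is a minimal sketch of the standard definition, the word-level edit distance divided by the number of reference words; it is not the paper's evaluation code, and the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.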

This review was generated by AI and reviewed by a human editor.