QUICK REVIEW

[Paper Review] LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Zhihao Du, Jiaming Wang|arXiv (Cornell University)|2023. 10. 07.

Speech Recognition and Synthesis | Citations: 18

One-line summary

LauraGPT is a unified audio-and-text GPT model: a single decoder-only framework that accepts both audio and text as input and output, and covers tasks including ASR, S2TT, TTS, SE, AAC, SER, and SLU.

ABSTRACT

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modalities. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable to superior performance compared to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.

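Research motivation and goals

Build a single model within the GPT framework that handles both the audio and text modalities.
Develop a data representation that preserves audio fidelity (continuous input) while still allowing autoregressive generation (discrete output).
Enable multi-task fine-tuning so that one model covers a broad range of audio tasks.
Demonstrate competitive or superior performance against state-of-the-art baselines across a variety of audio benchmarks.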

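Proposed method

Uses a decoder-only Transformer (Qwen) as the backbone, augmented for audio-and-text modeling.
Represents input audio with a Conformer-based encoder that produces continuous features, and represents output audio as discrete codec tokens.
Introduces a one-step codec vocoder that reconstructs the waveform from aggregated codec token embeddings.
Trains multiple audio and text tasks jointly with a unified cross-entropy objective, using special task tokens to indicate the target task.
Keeps the pretrained codec vocoder frozen during multi-task fine-tuning while training the backbone and the encoder.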
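
The joint training objective can be illustrated with a small sketch. It is not the authors' implementation: the sequence layout ([input frames][task token][target tokens]), the token ids in TASK_TOKENS, and the IGNORE label are all assumptions made for illustration, and the logits are taken to be already aligned with the labels (the usual one-step shift of autoregressive training is omitted). It shows only the general idea of scoring the output tokens with cross-entropy while the conditioning part carries no loss.

```python
import math

# Hypothetical task-token ids; the real vocabulary layout is not given in this review.
TASK_TOKENS = {"ASR": 0, "S2TT": 1, "TTS": 2, "SE": 3}
IGNORE = -1  # label for positions that do not contribute to the loss


def build_example(num_input_frames, task, target_ids):
    """Return (task_id, labels) for the assumed layout
    [input frames][task token][target tokens].

    Input frames are continuous encoder outputs and have no token label,
    so they and the task token are marked IGNORE.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    labels = [IGNORE] * (num_input_frames + 1) + list(target_ids)
    return TASK_TOKENS[task], labels


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits_per_position, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE."""
    total, count = 0.0, 0
    for logits, label in zip(logits_per_position, labels):
        if label == IGNORE:
            continue
        total -= log_softmax(logits)[label]
        count += 1
    return total / count if count else 0.0


# Toy usage: 3 encoder frames, an ASR task token, and 2 target tokens.
task_id, labels = build_example(num_input_frames=3, task="ASR", target_ids=[2, 0])
vocab_size = 4
uniform_logits = [[0.0] * vocab_size for _ in labels]
loss = masked_cross_entropy(uniform_logits, labels)  # log(vocab_size) for uniform logits
```

Because every task is reduced to the same next-token loss, switching tasks in this sketch only changes the task token and the target sequence, not the loss function.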

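Experimental results

Research questions

RQ1: Can a single decoder-only model handle audio and text tasks together, using continuous inputs and discrete outputs?
RQ2: Does combining continuous audio representations for input with discrete codec tokens for output yield strong performance across ASR, S2TT, TTS, SE, AAC, SER, and SLU?
RQ3: How does the choice between continuous and discrete representations affect performance on recognition, understanding, and generation tasks?
RQ4: How effective is multi-task fine-tuning of a single audio-and-text GPT compared with task-specific baselines across benchmarks?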

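Key findings

LauraGPT achieves competitive or superior performance compared with strong baselines across multiple audio tasks and benchmarks.
Continuous audio input benefits recognition and signal-processing tasks, while discrete output enables robust audio generation within the same model.
The model supports a broad range of tasks, including ASR, S2TT, MT, SE, AAC, SER, and SLU, within one unified framework.
On S2TT, LauraGPT shows substantial BLEU gains over the baselines and, for some language pairs, approaches or outperforms some cascaded systems.
On SE, LauraGPT improves PESQ and STOI relative to the noisy input and approaches the state-of-the-art CMGAN on some metrics.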
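
The trade-off between continuous inputs and discrete outputs can be made concrete with a toy quantizer. This is a sketch under stated assumptions, not the codec used in the paper: a single hand-written two-entry codebook and nearest-neighbour lookup stand in for a real neural codec. It shows that mapping continuous frames to token ids discards some detail (nonzero reconstruction error), which is one plausible reason recognition-style tasks favor continuous input, while the resulting ids are exactly the kind of symbols an autoregressive decoder can generate.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(frame, codebook):
    """Index of the codebook entry closest to `frame`."""
    return min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))


def quantize(frames, codebook):
    """Map each continuous frame to a discrete token id."""
    return [nearest_code(frame, codebook) for frame in frames]


def reconstruction_error(frames, codebook):
    """Mean squared distance between each frame and its quantized version."""
    tokens = quantize(frames, codebook)
    total = sum(squared_distance(f, codebook[t]) for f, t in zip(frames, tokens))
    return total / len(frames)


# Toy data (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.0], [0.4, 0.4]]

tokens = quantize(frames, codebook)            # discrete ids a decoder could emit
error = reconstruction_error(frames, codebook)  # detail lost by discretizing
```

The third frame lies between the two codebook entries, so it contributes most of the error; a continuous encoder output would keep that in-between information.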


This review was generated by AI and reviewed by a human editor.