QUICK REVIEW

[Paper Review] DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

Leying Zhang, Tingxiao Zhou | arXiv (Cornell University) | 2026-01-22

Speech Recognition and Synthesis | Citations: 0

One-Line Summary

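DeepASMR presents a zero-shot ASMR speech generation framework built on an LLM-based content-style encoder and a flow-matching acoustic decoder. It synthesizes ASMR speech for any speaker from a minimal amount of ordinary speech and contributes DeepASMR-DB, a large-scale bilingual ASMR corpus.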

ABSTRACT

While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.

Motivation and Objectives

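Close the gap in TTS systems, which handle neutral, read-style speech well but struggle to generate low-intensity ASMR speech.
Achieve zero-shot ASMR synthesis for any speaker from ordinary speech samples alone.
Explore a token-level factorization that separates ASMR style from speaker timbre.
Provide a large-scale ASMR corpus (DeepASMR-DB) and a robust evaluation protocol that combines objective, subjective, and LLM-based metrics.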

Proposed Method

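Adopt a two-stage pipeline built from a large language model (LLM)-based text-to-semantic encoder and a flow-matching acoustic decoder.
Use discrete speech tokens, referred to as S3 tokens, as a soft factorization of ASMR style and timbre, enabling style manipulation without leaking speaker identity.
Train the LLM to predict the token sequence from the text and the prompt, optimizing it with a cross-entropy loss.
Decode the tokens into a mel-spectrogram with a conditional flow-matching network conditioned on the token sequence and the target speaker's timbre, then apply HiFi-GAN vocoding.
Implement a Task Prompt Selector that chooses style prompts for cross-style synthesis and draws on a virtual speaker pool to reduce timbre leakage.
Optionally refine the output by feeding the generated ASMR prompt back into the system for 2–3 additional passes.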
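
The flow-matching decoding step listed above can be illustrated with a toy example. The review does not say which flow-matching variant DeepASMR uses, so this minimal sketch assumes the common straight-line path between a noise sample x0 and a target frame x1, for which the regression target of the velocity network is x1 - x0. The function names (interpolate, target_velocity, fm_loss, euler_sample) are hypothetical, plain Python lists stand in for mel-spectrogram frames, and the conditioning on tokens and speaker timbre is omitted.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t (assumed linear path)."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    random.seed(0)
    x1 = [0.5, -1.0, 2.0]                      # toy "mel frame" (illustrative values)
    x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample

    # Stand-in for a trained network: an untrained predictor that outputs zeros.
    zero_model = lambda x, t: [0.0 for _ in x]
    print("loss of zero predictor:", fm_loss(zero_model, x0, x1, t=0.5))

    # Oracle predictor that knows the true velocity; Euler then lands on x1.
    oracle = lambda x, t: target_velocity(x0, x1)
    print("sampled frame:", euler_sample(oracle, x0))
```

In a real system the predictor would be a neural network trained by minimizing this loss over many (x0, x1, t) triples, and the sampled mel-spectrogram would then be passed to the vocoder.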

Experimental Results

Research Questions

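RQ1: Can ASMR style be separated from speaker timbre in token space, enabling zero-shot ASMR for unseen speakers?
RQ2: Is the two-stage LLM + flow architecture effective at preserving speaker identity while controlling ASMR style?
RQ3: How well does zero-shot Normal-to-ASMR synthesis perform compared with intra-style or cascade-based baselines?
RQ4: Which dataset and evaluation protocol best capture ASMR quality and unvoiced phonation in generated speech?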

Key Findings

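DeepASMR achieves state-of-the-art naturalness and ASMR style fidelity in zero-shot synthesis for unseen speakers.
Token-level analysis indicates that ASMR style is encoded mainly in the semantic tokens, while the residual timbre can be recovered by the flow-based decoder.
A virtual speaker pool and similarity-based task prompt retrieval mitigate timbre leakage and improve cross-style synthesis quality.
Unvoiced speech generation in the Normal-to-ASMR (N2A) setting can be achieved robustly while maintaining strong intelligibility (WER/CER) and timbre preservation (SIM).
An extensive evaluation protocol combining objective metrics, subjective MOS, LLM-based style scores, and unvoiced speech analysis supports a comprehensive assessment of ASMR quality.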
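
The objective metrics named in these findings can be sketched in a few lines. This is a minimal illustration, assuming WER and CER are computed as Levenshtein distance divided by reference length (over words and characters respectively) and SIM as cosine similarity between speaker embeddings. The review does not specify the ASR model or the speaker-embedding extractor, so the inputs here are plain strings and toy vectors, and all function names are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


if __name__ == "__main__":
    # Transcripts would come from an ASR system run on the generated audio.
    print("WER:", wer("close your eyes and relax", "close your eyes and relax"))
    # Embeddings would come from a speaker-verification model; toy values here.
    print("SIM:", cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

Lower WER/CER means the whispered output is still intelligible to a recognizer, while higher SIM means the generated voice stays close to the target speaker's timbre.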

This review was generated by AI and reviewed by a human editor.