[Paper Review] AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
AudioToken maps an audio signal to a text-like token that conditions a pre-trained text-to-image diffusion model, enabling audio-conditioned image generation with competitive objective and subjective performance.
In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although generating high-quality images, such models are mainly conditioned on textual descriptions. This begs the question: "how can we adopt such models to be conditioned on other modalities?". In this paper, we propose a novel method utilizing latent diffusion models trained for text-to-image-generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods, considering objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
Research Motivation and Goals
- Motivate and enable audio-conditioned image generation using existing text-to-image diffusion models.
- Create a lightweight adaptation layer that maps audio representations into a textual embedding space.
- Develop an audio token and training objective that leverage pre-trained audio encoders and diffusion models.
Proposed Method
- Use a pre-trained text-to-image diffusion model as the base generator.
- Introduce an Embedder that converts audio into an e_audio token in the textual space.
- Train only the Embedder (projection and pooling layers) while keeping the audio encoder and generator frozen.
- Adopt the latent diffusion model loss L_LDM and an optional classification loss L_CL to align audio tokens with video labels.
- Apply attentive pooling to compress temporal audio embeddings.
- Evaluate with AIS, IIS, AIC, FID, and human judgments, using VGGSound data.
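
The Embedder steps above (attentive pooling over time, then a projection into the text-embedding space) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-vector attention scoring, the pool-then-project order, and the toy dimensions are all hypothetical, and a real system would use learned weights in a tensor library.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(frames, attn_weights):
    """Collapse T frame vectors (each of size d) into one vector of size d.

    Each frame gets a scalar score (dot product with attn_weights); the
    softmax of those scores weights the average over time.
    """
    scores = [sum(w * x for w, x in zip(attn_weights, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * frame[i] for a, frame in zip(alphas, frames)) for i in range(dim)]

def project(vec, matrix, bias):
    """Linear projection; matrix has shape (d_out, d_in)."""
    return [sum(m * v for m, v in zip(row, vec)) + b for row, b in zip(matrix, bias)]

def embed_audio(frames, attn_weights, proj_matrix, proj_bias):
    """Hypothetical Embedder: pool over time, then project to the text-embedding size."""
    return project(attentive_pool(frames, attn_weights), proj_matrix, proj_bias)

# Toy usage with random stand-ins for frozen audio-encoder outputs.
random.seed(0)
T, d_audio, d_text = 5, 4, 3  # illustrative sizes, not the paper's
frames = [[random.gauss(0, 1) for _ in range(d_audio)] for _ in range(T)]
attn_weights = [random.gauss(0, 1) for _ in range(d_audio)]
proj_matrix = [[random.gauss(0, 1) for _ in range(d_audio)] for _ in range(d_text)]
proj_bias = [0.0] * d_text

e_audio = embed_audio(frames, attn_weights, proj_matrix, proj_bias)
print(len(e_audio))  # prints 3: one token of text-embedding size
```

In this sketch only `attn_weights`, `proj_matrix`, and `proj_bias` would be trainable, which mirrors the idea of training the Embedder alone while the audio encoder and the generator stay frozen.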
Experimental Results
Research Questions
- RQ1: Can audio signals be effectively encoded into a text-like token to condition a text-to-image diffusion model?
- RQ2: Does the AudioToken approach produce high-quality, diverse images aligned with audio scenes compared to baselines?
- RQ3: What evaluation framework best captures audio-to-image generation quality and semantic alignment?
Key Results
| Method | AIC | FID | AIS | IIS |
|---|---|---|---|---|
| Reference | 54.66 | - | - | - |
| SD (Text) | 71.28 | 52.85 | - | - |
| Wav2Clip [30] | 29.32 | 99.89 | 47.76 | 51.11 |
| ImageBind [37] | 39.15 | 67.42 | 67.48 | 75.50 |
| AudioToken with CL | 48.01 | 66.08 | 62.28 | 76.40 |
| AudioToken | 45.48 | 56.65 | 68.23 | 76.66 |
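- AudioToken achieves the highest AIS (68.23) and IIS (76.66), ahead of ImageBind (67.48, 75.50) and Wav2Clip (47.76, 51.11).
- Both AudioToken variants beat Wav2Clip and ImageBind on AIC and FID (lower is better); SD conditioned on text labels remains ahead on both (AIC 71.28, FID 52.85).
- Adding the classification loss (CL) raises AIC (48.01 vs. 45.48) but worsens FID (66.08 vs. 56.65) and AIS (62.28 vs. 68.23), while IIS is nearly unchanged (76.40 vs. 76.66).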
- Subjective evaluation shows AudioToken scoring 4.07±0.83, outperforming Wav2Clip (1.85±0.46) and approaching SD with text labels (4.58±0.60).
- On qualitative speaker visuals, the method captures distinctive voices (e.g., Barack Obama, Donald Trump) and gender cues for others.
- The approach uses a lightweight trainable Embedder and leverages a frozen pre-trained audio encoder and diffusion backbone.
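
The comparisons in the results table can be checked mechanically. The short sketch below copies the four audio-conditioned rows from the table and reports, per metric, which method scores best and how the classification loss shifts AudioToken's score. The helper names `best_method` and `cl_effect` are hypothetical, and treating higher as better for AIC, AIS, and IIS is an assumption of this sketch; only FID is treated as lower is better.

```python
# Scores copied from the results table (audio-conditioned methods only).
table = {
    "Wav2Clip": {"AIC": 29.32, "FID": 99.89, "AIS": 47.76, "IIS": 51.11},
    "ImageBind": {"AIC": 39.15, "FID": 67.42, "AIS": 67.48, "IIS": 75.50},
    "AudioToken with CL": {"AIC": 48.01, "FID": 66.08, "AIS": 62.28, "IIS": 76.40},
    "AudioToken": {"AIC": 45.48, "FID": 56.65, "AIS": 68.23, "IIS": 76.66},
}

# Assumption: FID is a distance (lower is better); the other metrics are
# treated as higher is better.
LOWER_IS_BETTER = {"FID"}

def best_method(metric):
    """Return the name of the method with the best score for a metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(table, key=lambda name: table[name][metric])

def cl_effect(metric):
    """Signed change from adding the classification loss (with CL minus without)."""
    delta = table["AudioToken with CL"][metric] - table["AudioToken"][metric]
    return round(delta, 2)

for metric in ("AIC", "FID", "AIS", "IIS"):
    print(f"{metric}: best={best_method(metric)}, CL change={cl_effect(metric):+.2f}")
```

A positive change is an improvement for AIC, AIS, and IIS, and a regression for FID.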
This review was generated by AI and reviewed by a human editor.