[Paper Review] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
EARTalking introduces an end-to-end GPT-style autoregressive framework for frame-by-frame audio-driven talking head generation with frame-wise in-context control and a Sink Frame Window Attention mechanism for variable-length inference and identity-consistent output.
Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
Motivation & Objective
- Motivate the need for end-to-end autoregressive talking head generation beyond intermediate representations.
- Develop a frame-by-frame streaming model that supports variable-length inference while maintaining identity fidelity.
- Introduce mechanisms for in-context, frame-wise control and stable attention anchoring to the reference image.
- Show that the proposed method outperforms autoregressive baselines and matches diffusion-based methods in key metrics.
Proposed method
- Propose EARTalking, a GPT-style end-to-end autoregressive framework for frame-by-frame talking head generation.
- Introduce Sink Frame Window Attention (SFA) with an adaLN sink to anchor generated frames to a reference image and support variable-length inference.
- Use Frame-wise Causal Autoregression (FCA), with a 3D VAE-based frame encoder and a masked autoregressive decoder, for per-frame generation.
- Adopt Frame Condition In-Context (FCIC) control to inject multi-modal conditions (e.g., audio, motion) per frame via in-context learning.
- Employ bidirectional audio-visual attention and a KV cache to enable streaming generation with reference-frame consistency.
- Train with fixed-length sequences but enable variable-length inference through the SFA framework and cyclic position embeddings.
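The SFA and cyclic-position ideas in the list above can be illustrated with a minimal sketch. It assumes SFA behaves like the familiar attention-sink-plus-sliding-window pattern, with the reference image at index 0, and that cyclic position embeddings reuse position indices modulo the training length. This review does not spell out either detail, and the adaLN sink is not modeled here. The names `sink_window_mask`, `cyclic_positions`, `window`, and `train_len` are hypothetical.

```python
def sink_window_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Frame-level attention mask; mask[i][j] is True if frame i may attend to frame j.

    Assumption: index 0 is the sink (reference) frame and is always visible.
    Every frame otherwise sees only the `window` most recent frames, itself included.
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            causal = j <= i             # never look at future frames
            recent = (i - j) < window   # inside the sliding window
            sink = j == 0               # reference frame stays visible
            row.append(causal and (recent or sink))
        mask.append(row)
    return mask


def cyclic_positions(num_frames: int, train_len: int) -> list[int]:
    """Assumed cyclic scheme: wrap position indices at the training length."""
    return [t % train_len for t in range(num_frames)]


if __name__ == "__main__":
    # Print the mask as a grid: "x" = may attend, "." = masked out.
    for row in sink_window_mask(num_frames=6, window=2):
        print("".join("x" if ok else "." for ok in row))
    print(cyclic_positions(num_frames=5, train_len=3))
```

Under these assumptions each row has at most `window + 1` visible entries, so the per-frame attention cost stays bounded however long the video runs, and no position index ever exceeds what was seen during fixed-length training.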
Experimental results
Research questions
- RQ1: Can a fully end-to-end autoregressive model achieve high lip-sync quality and natural facial dynamics without relying on intermediate representations?
- RQ2: Does frame-wise in-context control enable fine-grained, per-frame manipulation of audio-driven talking head generation?
- RQ3: Can Sink Frame Window Attention stabilize autoregressive generation and support variable-length outputs while maintaining identity consistency with a reference image?
- RQ4: How does FCIC-style control interact with audio-visual signals to improve controllability and extensibility of talking head generation?
- RQ5: How does EARTalking compare to diffusion-based and traditional autoregressive methods across standard THG metrics?
Key findings
| Dataset | Method | Params | Avg-R (↓) | FID (↓) | FVD (↓) | Sync-C (↑) | Sync-D (↓) | E-FID (↓) |
|---|---|---|---|---|---|---|---|---|
| HDTF | AniPortrait | 2B | 3.50 | 17.629 | 443.902 | 3.368 | 10.712 | 2.210 |
| HDTF | EchoMimic | 2B | 2.66 | 18.733 | 629.370 | 5.610 | 8.953 | 0.803 |
| HDTF | EchoMimicV3 | 1.3B | 3.83 | 21.054 | 380.812 | 2.824 | 11.794 | 1.987 |
| HDTF | AniTalker | 0.1B | 2.83 | 34.644 | 476.710 | 5.639 | 8.588 | 1.551 |
| HDTF | Ditto | 0.1B | 3.00 | 16.253 | 384.232 | 4.036 | 10.250 | 2.861 |
| HDTF | Ours | 0.6B | 1.66 | 18.981 | 363.909 | 5.707 | 8.903 | 1.326 |
| MEAD | AniPortrait | 2B | 4.16 | 63.460 | 519.822 | 1.324 | 12.650 | 1.886 |
| MEAD | EchoMimic | 2B | 3.00 | 51.546 | 775.951 | 5.276 | 9.565 | 1.562 |
| MEAD | EchoMimicV3 | 1.3B | 3.50 | 46.797 | 347.066 | 2.397 | 12.455 | 2.425 |
| MEAD | AniTalker | 0.1B | 2.66 | 95.210 | 627.765 | 6.145 | 8.610 | 1.538 |
| MEAD | Ditto | 0.1B | 2.66 | 28.503 | 329.314 | 4.412 | 9.767 | 2.004 |
| MEAD | Ours | 0.6B | 1.50 | 55.682 | 316.275 | 5.866 | 8.459 | 0.872 |
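As a quick reading aid for the table, the sketch below picks the best method per metric on HDTF, using the arrow directions from the header. The values are copied from the HDTF rows. It does not attempt to reproduce Avg-R, whose exact definition is not given in this review, and the names `HDTF`, `METRICS`, and `best_per_metric` are illustrative only.

```python
# HDTF rows copied from the table: (FID, FVD, Sync-C, Sync-D, E-FID)
HDTF = {
    "AniPortrait": (17.629, 443.902, 3.368, 10.712, 2.210),
    "EchoMimic":   (18.733, 629.370, 5.610, 8.953, 0.803),
    "EchoMimicV3": (21.054, 380.812, 2.824, 11.794, 1.987),
    "AniTalker":   (34.644, 476.710, 5.639, 8.588, 1.551),
    "Ditto":       (16.253, 384.232, 4.036, 10.250, 2.861),
    "Ours":        (18.981, 363.909, 5.707, 8.903, 1.326),
}
METRICS = ["FID", "FVD", "Sync-C", "Sync-D", "E-FID"]
HIGHER_IS_BETTER = {"Sync-C"}  # every other metric is marked with a down arrow


def best_per_metric(rows: dict) -> dict:
    """Return the method with the best value for each metric column."""
    best = {}
    for k, metric in enumerate(METRICS):
        pick = max if metric in HIGHER_IS_BETTER else min
        best[metric] = pick(rows, key=lambda name: rows[name][k])
    return best


if __name__ == "__main__":
    for metric, method in best_per_metric(HDTF).items():
        print(f"{metric}: {method}")
```

On HDTF this selects Ditto for FID, Ours for FVD and Sync-C, AniTalker for Sync-D, and EchoMimic for E-FID.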
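- Achieves the best average rank (Avg-R) on both datasets (1.66 on HDTF, 1.50 on MEAD) with 0.6B parameters; the paper reports that it outperforms existing autoregressive methods and is comparable to diffusion-based methods on standard THG metrics.
- Attains the lowest FVD on both datasets (363.909 on HDTF, 316.275 on MEAD), the highest Sync-C on HDTF (5.707), and the lowest Sync-D and E-FID on MEAD (8.459 and 0.872), indicating strong video quality, lip-sync, and expression fidelity. Its FID is not the best on either dataset (Ditto is lower on both).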
- SFA with adaLN sink anchors frames to the reference image, reducing autoregressive error accumulation and enabling variable-length inference.
- FCIC enables per-frame, multi-modal control without extra specialized networks, improving flexibility and extensibility.
- Bidirectional audio-visual attention and an audio/visual KV cache improve synchronization and temporal stability.
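The streaming behaviour described in the findings above can be sketched as a bounded cache that pins the reference entry and keeps only the most recent frames, with an optional control signal attached per frame. This is a toy illustration under stated assumptions, not the paper's implementation: entries are plain dicts rather than key/value tensors, and `SinkWindowCache`, `stream`, `window`, and the `controls` format are all hypothetical.

```python
from collections import deque


class SinkWindowCache:
    """Bounded per-frame cache: one pinned sink entry plus the most recent frames.

    Hypothetical stand-in for a KV cache; entries are plain dicts, not tensors.
    """

    def __init__(self, window: int) -> None:
        self.sink = None                    # reference-frame entry, never evicted
        self.recent = deque(maxlen=window)  # oldest frame drops out automatically

    def append(self, entry: dict) -> None:
        if self.sink is None:
            self.sink = entry               # first entry is the reference image
        else:
            self.recent.append(entry)

    def context(self) -> list:
        head = [] if self.sink is None else [self.sink]
        return head + list(self.recent)


def stream(reference: str, audio_chunks: list, controls: dict, window: int = 3):
    """Yield the cached context after each frame step.

    `controls` maps a frame index to an optional extra condition (assumed format),
    mimicking a control signal that is switched on at an arbitrary moment.
    """
    cache = SinkWindowCache(window)
    cache.append({"frame": reference})
    for t, audio in enumerate(audio_chunks):
        cache.append({"frame": f"f{t}", "audio": audio, "control": controls.get(t)})
        yield cache.context()


if __name__ == "__main__":
    steps = stream("ref", ["a0", "a1", "a2", "a3", "a4"], {2: "smile"})
    for ctx in steps:
        print([e["frame"] for e in ctx])
```

With `window=3` the context never grows beyond four entries, and the reference entry is present at every step, which is the property the SFA finding attributes to anchoring on the reference image.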
This review was created by AI and reviewed by human editors.