QUICK REVIEW

[Paper Review] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Yuzhe Weng, Haotian Wang | arXiv (Cornell University) | Mar 19, 2026

Generative Adversarial Networks and Image Synthesis | 0 citations

TL;DR

EARTalking introduces an end-to-end GPT-style autoregressive framework for frame-by-frame audio-driven talking head generation with frame-wise in-context control and a Sink Frame Window Attention mechanism for variable-length inference and identity-consistent output.

ABSTRACT

Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.

Motivation & Objective

Motivate the need for end-to-end autoregressive talking head generation beyond intermediate representations.
Develop a frame-by-frame streaming model that supports variable-length inference while maintaining identity fidelity.
Introduce mechanisms for in-context, frame-wise control and stable attention anchoring to the reference image.
Show that the proposed method outperforms autoregressive baselines and performs comparably to diffusion-based methods on key metrics.

Proposed method

Propose EARTalking, a GPT-style end-to-end autoregressive framework for frame-by-frame talking head generation.
Introduce Sink Frame Window Attention (SFA) with an adaLN sink to anchor generated frames to a reference image and support variable-length inference.
Use Frame-wise Causal Autoregression (FCA), with a 3D VAE-based frame encoder and a masked autoregressive decoder, for per-frame generation.
Adopt Frame Condition In-Context (FCIC) control to inject multi-modal conditions (e.g., audio, motion) per frame via in-context learning.
Employ bidirectional audio-visual attention and a kv-cache mechanism to enable streaming generation with reference-frame consistency.
Train with fixed-length sequences but enable variable-length inference through the SFA framework and cyclic position embeddings.
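
The sink-plus-window idea behind SFA, together with cyclic position embeddings, can be illustrated with a small frame-level sketch. This shows the general attention-sink pattern under stated assumptions, not the paper's implementation: the function names, the window size, the choice to reserve position 0 for the reference frame, and the wrap-around rule for position ids are all hypothetical, and the adaLN sink itself is not modeled.

```python
from typing import List


def sink_window_mask(num_frames: int, window: int) -> List[List[bool]]:
    """Frame-level causal mask: entry [i][j] is True if frame i may attend to frame j.

    Assumption: frame 0 is the sink (reference) frame and is always visible.
    Apart from the sink, only the most recent `window` frames (including
    frame i itself) are visible.
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            causal = j <= i            # never look at future frames
            in_window = i - j < window  # sliding local window
            is_sink = j == 0            # reference frame stays visible
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def cyclic_position(frame_index: int, train_length: int) -> int:
    """Map a frame index to a position id that stays inside the trained range.

    Assumption: position 0 is reserved for the sink frame, and generated
    frames (index >= 1) cycle through positions 1..train_length-1.
    """
    if frame_index == 0:
        return 0
    return 1 + (frame_index - 1) % (train_length - 1)


if __name__ == "__main__":
    for row in sink_window_mask(num_frames=6, window=3):
        print("".join("x" if visible else "." for visible in row))
    print([cyclic_position(t, train_length=4) for t in range(7)])
```

With a window of 3, frame 5 attends to the sink frame and to frames 3 to 5 only, so the attention cost per frame stays bounded however long the video runs. The wrapped position ids likewise never leave the range seen during fixed-length training.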

Experimental results

Research questions

RQ1: Can a fully end-to-end autoregressive model achieve high lip-sync quality and natural facial dynamics without relying on intermediate representations?
RQ2: Does frame-wise in-context control enable fine-grained, per-frame manipulation of audio-driven talking head generation?
RQ3: Can Sink Frame Window Attention stabilize autoregressive generation and support variable-length outputs while maintaining identity consistency with a reference image?
RQ4: How does FCIC-style control interact with audio-visual signals to improve controllability and extensibility of talking head generation?
RQ5: How does EARTalking compare to diffusion-based and traditional autoregressive methods across standard THG metrics?

Key findings

Dataset	Method	Params	Avg-R (↓)	FID (↓)	FVD (↓)	Sync-C (↑)	Sync-D (↓)	E-FID (↓)
HDTF	AniPortrait	2B	3.50	17.629	443.902	3.368	10.712	2.210
HDTF	EchoMimic	2B	2.66	18.733	629.370	5.610	8.953	0.803
HDTF	EchoMimicV3	1.3B	3.83	21.054	380.812	2.824	11.794	1.987
HDTF	AniTalker	0.1B	2.83	34.644	476.710	5.639	8.588	1.551
HDTF	Ditto	0.1B	3.00	16.253	384.232	4.036	10.250	2.861
HDTF	Ours	0.6B	1.66	18.981	363.909	5.707	8.903	1.326
MEAD	AniPortrait	2B	4.16	63.460	519.822	1.324	12.650	1.886
MEAD	EchoMimic	2B	3.00	51.546	775.951	5.276	9.565	1.562
MEAD	EchoMimicV3	1.3B	3.50	46.797	347.066	2.397	12.455	2.425
MEAD	AniTalker	0.1B	2.66	95.210	627.765	6.145	8.610	1.538
MEAD	Ditto	0.1B	2.66	28.503	329.314	4.412	9.767	2.004
MEAD	Ours	0.6B	1.50	55.682	316.275	5.866	8.459	0.872

Outperforms existing autoregressive methods and is comparable to diffusion-based methods on standard talking head generation (THG) metrics, with the best average rank (Avg-R) on both HDTF (1.66) and MEAD (1.50).
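Demonstrates strong lip-sync and expression fidelity: the highest Sync-C on HDTF (5.707), the lowest Sync-D and E-FID on MEAD (8.459 and 0.872), and the lowest FVD on both datasets (363.909 and 316.275), which indicates strong overall video quality.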
SFA with adaLN sink anchors frames to the reference image, reducing autoregressive error accumulation and enabling variable-length inference.
FCIC enables per-frame, multi-modal control without extra specialized networks, improving flexibility and extensibility.
Bidirectional audio-visual attention and an audio/visual kv-cache improve synchronization and temporal stability.
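
The per-metric leaders in the results table can be checked with a few lines of Python. The values are copied from the table; the helper name `best_per_metric` is hypothetical, and Avg-R is left out because this review does not state how it is computed.

```python
# Metric values copied from the results table; column order follows METRICS.
METRICS = ["FID", "FVD", "Sync-C", "Sync-D", "E-FID"]
HIGHER_IS_BETTER = {"Sync-C"}  # every other metric is lower-is-better

RESULTS = {
    "HDTF": {
        "AniPortrait": [17.629, 443.902, 3.368, 10.712, 2.210],
        "EchoMimic": [18.733, 629.370, 5.610, 8.953, 0.803],
        "EchoMimicV3": [21.054, 380.812, 2.824, 11.794, 1.987],
        "AniTalker": [34.644, 476.710, 5.639, 8.588, 1.551],
        "Ditto": [16.253, 384.232, 4.036, 10.250, 2.861],
        "Ours": [18.981, 363.909, 5.707, 8.903, 1.326],
    },
    "MEAD": {
        "AniPortrait": [63.460, 519.822, 1.324, 12.650, 1.886],
        "EchoMimic": [51.546, 775.951, 5.276, 9.565, 1.562],
        "EchoMimicV3": [46.797, 347.066, 2.397, 12.455, 2.425],
        "AniTalker": [95.210, 627.765, 6.145, 8.610, 1.538],
        "Ditto": [28.503, 329.314, 4.412, 9.767, 2.004],
        "Ours": [55.682, 316.275, 5.866, 8.459, 0.872],
    },
}


def best_per_metric(dataset):
    """Return {metric: name of the method with the best value} for one dataset."""
    rows = RESULTS[dataset]
    best = {}
    for k, metric in enumerate(METRICS):
        pick = max if metric in HIGHER_IS_BETTER else min
        best[metric] = pick(rows, key=lambda name: rows[name][k])
    return best


if __name__ == "__main__":
    for dataset in RESULTS:
        print(dataset, best_per_metric(dataset))
```

On these numbers, "Ours" leads FVD on both datasets, Sync-C on HDTF, and Sync-D and E-FID on MEAD, while Ditto has the lowest FID on both datasets.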

This review was created by AI and reviewed by human editors.