QUICK REVIEW

[Paper Review] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Yuzhe Weng, Haotian Wang|arXiv (Cornell University)|Mar 19, 2026
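Generative Adversarial Networks and Image Synthesis|Cited by 0
One-Sentence Summary

EARTalking introduces an end-to-end, GPT-style autoregressive framework for frame-by-frame, audio-driven talking head generation. It combines frame-wise in-context control with Sink Frame Window Attention to support variable-length inference and identity-consistent output.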

ABSTRACT

Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
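
To make the frame-by-frame, in-context streaming paradigm described in the abstract more concrete, here is a minimal Python sketch of how a per-frame token stream might be laid out. It is an illustration under assumptions, not the authors' implementation: `FrameStep`, `build_context`, the token labels, and the `motion` control are hypothetical names, and the actual tokenization, token ordering, and model are not specified in this review.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameStep:
    """Conditions supplied for one output frame (all names are hypothetical)."""
    index: int
    audio: str  # placeholder for one chunk of audio features
    controls: Dict[str, str] = field(default_factory=dict)  # optional extra signals


def build_context(reference: str, steps: List[FrameStep]) -> List[str]:
    """Lay out one flat token sequence: the reference first, then, for each
    frame, its condition tokens followed by a slot for the generated frame.

    Assumption: conditions are injected in-context, i.e. as extra tokens in
    the same sequence, rather than through separate control networks.
    """
    tokens = [f"REF:{reference}"]
    for step in steps:
        tokens.append(f"AUD[{step.index}]:{step.audio}")
        # Control signals may appear or disappear at any frame.
        for name in sorted(step.controls):
            tokens.append(f"CTRL[{step.index}]:{name}={step.controls[name]}")
        tokens.append(f"FRAME[{step.index}]")
    return tokens


steps = [
    FrameStep(0, "a0"),
    FrameStep(1, "a1", {"motion": "nod"}),  # control switched on mid-stream
    FrameStep(2, "a2"),
]
context = build_context("portrait", steps)
print(context)
```

In this layout the motion control is just one more token attached to frame 1, so it can be added or dropped at any frame without changing how the other frames are laid out.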

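Motivation and Objectives

  • Demonstrate the need for end-to-end autoregressive talking head generation that goes beyond intermediate facial representations.
  • Develop a frame-by-frame streaming model that supports variable-length inference while preserving identity fidelity.
  • Introduce frame-wise in-context control and a stable reference-image anchoring mechanism for alignment and controllability.
  • Show that the proposed method outperforms autoregressive baselines on key metrics and is comparable to diffusion-based methods.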

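Proposed Method

  • Proposes EARTalking, a GPT-style, end-to-end autoregressive framework for frame-by-frame talking head generation.
  • Introduces Sink Frame Window Attention (SFA) with an adaLN sink anchor, which ties generated frames to the reference image and supports variable-length inference.
  • Generates frame by frame using Frame-wise Causal Autoregression (FCA), built on a 3D VAE-based frame encoder and a masked autoregressive decoder.
  • Adopts Frame Condition In-Context (FCIC) control, which injects multimodal conditions (e.g., audio, motion) at every frame through in-context learning.
  • Uses bidirectional audio-visual attention and a kv-cache mechanism for streaming generation that stays consistent with the reference frame.
  • Trains on fixed-length sequences but achieves variable-length inference through SFA and cyclic positional embeddings.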
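
The interplay of the sink frame, the attention window, and cyclic positions listed above can be sketched at frame granularity. This is a hedged illustration, assuming SFA behaves like a causal sliding-window mask that always keeps the reference (sink) frame visible, and that cyclic positional embeddings reuse position indices modulo the training length. `sfa_mask`, `cyclic_position`, `window`, and `train_len` are hypothetical names, and the paper's actual formulation (including the adaLN anchoring and token-level attention) may differ.

```python
from typing import List


def sfa_mask(num_frames: int, window: int) -> List[List[bool]]:
    """Frame-level attention mask: row i marks which frames frame i may see.

    Assumed behaviour: frame 0 is the sink (reference) frame and is always
    visible; apart from that, each frame sees only itself and the
    `window - 1` most recent previous frames (causal sliding window).
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_sink = j == 0
            in_window = i - window < j <= i
            row.append(j <= i and (is_sink or in_window))
        mask.append(row)
    return mask


def cyclic_position(frame_index: int, train_len: int) -> int:
    """Map an unbounded frame index to a position id seen during training.

    Assumption: position 0 is reserved for the sink frame, and generated
    frames cycle through positions 1..train_len-1.
    """
    if frame_index == 0:
        return 0
    return 1 + (frame_index - 1) % (train_len - 1)


mask = sfa_mask(num_frames=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

positions = [cyclic_position(i, train_len=4) for i in range(8)]
print(positions)
```

With `window=3`, the row for frame 5 prints `x..xxx`: the sink frame plus frames 3 to 5. Each row has at most `window + 1` visible frames, so in this sketch the attention cost per new frame stays bounded however long the video runs, and the position ids never exceed the range seen in training.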

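Experimental Results

Research Questions

  • RQ1: Can a fully end-to-end autoregressive model achieve high lip-sync quality and natural facial dynamics without relying on intermediate representations?
  • RQ2: Can frame-wise in-context control enable fine-grained, per-frame manipulation of audio-driven talking head generation?
  • RQ3: Can Sink Frame Window Attention stabilize autoregressive generation and support variable-length output while maintaining identity consistency with the reference image?
  • RQ4: How does FCIC-style control interact with audio-visual signals to improve the controllability and extensibility of talking head generation?
  • RQ5: How does EARTalking perform on standard talking head generation (THG) metrics compared with diffusion-based and conventional autoregressive methods?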

Key Findings

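| Dataset | Method | Params | Avg-R (↓) | FID (↓) | FVD (↓) | Sync-C (↑) | Sync-D (↓) | E-FID (↓) |
|---|---|---|---|---|---|---|---|---|
| HDTF | AniPortrait | 2B | 3.50 | 17.629 | 443.902 | 3.368 | 10.712 | 2.210 |
| HDTF | EchoMimic | 2B | 2.66 | 18.733 | 629.370 | 5.610 | 8.953 | 0.803 |
| HDTF | EchoMimicV3 | 1.3B | 3.83 | 21.054 | 380.812 | 2.824 | 11.794 | 1.987 |
| HDTF | AniTalker | 0.1B | 2.83 | 34.644 | 476.710 | 5.639 | 8.588 | 1.551 |
| HDTF | Ditto | 0.1B | 3.00 | 16.253 | 384.232 | 4.036 | 10.250 | 2.861 |
| HDTF | Ours | 0.6B | 1.66 | 18.981 | 363.909 | 5.707 | 8.903 | 1.326 |
| MEAD | AniPortrait | 2B | 4.16 | 63.460 | 519.822 | 1.324 | 12.650 | 1.886 |
| MEAD | EchoMimic | 2B | 3.00 | 51.546 | 775.951 | 5.276 | 9.565 | 1.562 |
| MEAD | EchoMimicV3 | 1.3B | 3.50 | 46.797 | 347.066 | 2.397 | 12.455 | 2.425 |
| MEAD | AniTalker | 0.1B | 2.66 | 95.210 | 627.765 | 6.145 | 8.610 | 1.538 |
| MEAD | Ditto | 0.1B | 2.66 | 28.503 | 329.314 | 4.412 | 9.767 | 2.004 |
| MEAD | Ours | 0.6B | 1.50 | 55.682 | 316.275 | 5.866 | 8.459 | 0.872 |
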
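  • Outperforms existing autoregressive methods and reaches performance comparable to diffusion-based methods on standard THG metrics.
  • Shows strong lip sync (Sync-C) and expression fidelity (E-FID) while maintaining identity consistency, and achieves the lowest FVD on both HDTF (363.909) and MEAD (316.275).
  • SFA anchors frames to the reference image through the adaLN sink anchor, which reduces autoregressive error accumulation and enables variable-length inference.
  • FCIC enables frame-wise, multimodal control without additional dedicated networks, improving flexibility and extensibility.
  • Bidirectional audio-visual attention and audio/visual kv-caches improve synchronization and temporal stability.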
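
As a quick sanity check on the HDTF rows of the results table, the following Python sketch ranks each method per metric. It uses only the five metrics listed in this review. The Avg-R definition is not given here, so the final column of the printout is an observation about the listed numbers, not the paper's stated formula: the rank sum divided by 6, truncated to two decimals, happens to match the Avg-R column.

```python
# Metric values copied from the HDTF rows of the results table.
# Column order: FID, FVD, Sync-C, Sync-D, E-FID.
HDTF = {
    "AniPortrait": (17.629, 443.902, 3.368, 10.712, 2.210),
    "EchoMimic":   (18.733, 629.370, 5.610, 8.953, 0.803),
    "EchoMimicV3": (21.054, 380.812, 2.824, 11.794, 1.987),
    "AniTalker":   (34.644, 476.710, 5.639, 8.588, 1.551),
    "Ditto":       (16.253, 384.232, 4.036, 10.250, 2.861),
    "Ours":        (18.981, 363.909, 5.707, 8.903, 1.326),
}
METRICS = ("FID", "FVD", "Sync-C", "Sync-D", "E-FID")
HIGHER_IS_BETTER = {"Sync-C"}  # every other metric is marked (↓) in the table


def rank_methods(rows):
    """Return {method: {metric: rank}}, where rank 1 is the best method."""
    ranks = {name: {} for name in rows}
    for col, metric in enumerate(METRICS):
        ordered = sorted(
            rows,
            key=lambda name: rows[name][col],
            reverse=metric in HIGHER_IS_BETTER,
        )
        for position, name in enumerate(ordered, start=1):
            ranks[name][metric] = position
    return ranks


ranks = rank_methods(HDTF)
rank_sums = {name: sum(per_metric.values()) for name, per_metric in ranks.items()}

# Integer arithmetic: rank sum / 6 expressed in hundredths, truncated.
hundredths = {name: rank_sums[name] * 100 // 6 for name in HDTF}

for name in HDTF:
    print(f"{name:12s} ranks={ranks[name]} sum/6={hundredths[name] / 100:.2f}")
```

On these rows the proposed method ranks first on FVD and Sync-C, second on Sync-D and E-FID, and fourth on FID, which is consistent with its having the lowest Avg-R in the table.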

This review was generated by AI and reviewed by human editors.