QUICK REVIEW

[Paper Review] FairT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

Haonan Zhong, Wei Song|arXiv (Cornell University)|Jan 28, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

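One-Sentence Summary

FairT2V is a training-free framework for debiasing text-to-video diffusion outputs: it neutralizes encoder-induced gender bias in prompt embeddings via anchor-based spherical geodesic transformations and uses a dynamic denoising schedule to preserve temporal coherence.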

ABSTRACT

Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.

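Motivation and Objectives

Identify where demographic bias in text-to-video diffusion models originates, focusing on gender bias carried by prompts.
Develop a training-free debiasing method that preserves prompt semantics and the temporal coherence of generated videos.
Use a video-centric fairness evaluation protocol to quantify bias reduction and assess the impact on video quality.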

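Proposed Method

Analyze gender bias in the text-conditioning pathway and define a gender-leaning score for neutral prompts.
Introduce anchor-based spherical geodesic transformations that yield neutral, debiased prompt embeddings on the unit hypersphere.
Compute an adaptive debiasing strength lambda* from the embedding's angular proximity to the majority and minority anchors, and apply debiasing along the demographic axis.
Use a dynamic denoising schedule that applies the debiased embedding only during the early identity-forming steps, preserving temporal coherence.
Adopt a VideoLLM-based fairness evaluation protocol, supplemented by human verification, to assess video-level fairness.
Use a CLIP-based text encoder for conditioning, and study robustness across different encoders (CLIP and T5).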
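
The geodesic step and the adaptive strength can be illustrated with a small sketch. This is not the paper's implementation: the function names (`slerp`, `debias`) and the strength rule are assumptions made for illustration, since this summary does not give the formula for lambda*. The rule used here moves the embedding along the great-circle arc toward the anchor it is farther from, by just enough that an embedding lying on the arc between the two anchors ends up at the midpoint.

```python
import math

def normalize(v):
    """Scale v to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def angle(u, v):
    """Geodesic (angular) distance between two unit vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def slerp(u, v, t):
    """Point at fraction t along the great-circle arc from unit vector u to v.

    Assumes u and v are not antipodal (the arc would not be unique).
    """
    omega = angle(u, v)
    if omega < 1e-8:  # u and v coincide: nothing to interpolate
        return list(u)
    s = math.sin(omega)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return normalize([a * x + b * y for x, y in zip(u, v)])

def debias(embedding, anchor_a, anchor_b):
    """Pull a prompt embedding toward whichever anchor it is farther from.

    ASSUMPTION: this strength rule is illustrative, not the paper's lambda*.
    Returns the debiased unit vector and the strength that was applied.
    """
    e = normalize(embedding)
    a, b = normalize(anchor_a), normalize(anchor_b)
    d_a, d_b = angle(e, a), angle(e, b)
    # The nearer anchor plays the "majority" role, the farther one "minority".
    (d_near, _), (d_far, far) = sorted([(d_a, a), (d_b, b)], key=lambda p: p[0])
    if d_far < 1e-8:  # both anchors coincide with the embedding
        return e, 0.0
    lam = (d_far - d_near) / (2.0 * d_far)  # 0 when balanced, at most 0.5
    return slerp(e, far, lam), lam
```

With two-dimensional toy anchors `[1, 0]` and `[0, 1]`, an embedding at 20 degrees from the first anchor is moved to 45 degrees, equidistant from both. Real prompt embeddings are high-dimensional and rarely lie exactly on the arc between the anchors, so the result there is only approximately balanced.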

Figure 1: Bias source analysis in text-to-video generation. Neutral prompts are encoded by the text encoder (e.g., CLIP) into embeddings aligned with gender-associated directions, revealing implicit demographic bias in the text-conditioning space.

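Experimental Results

Research Questions

RQ1: Where does demographic bias in text-to-video diffusion models originate?
RQ2: Is training-free, embedding-level debiasing sufficient to reduce gender bias without harming video quality?
RQ3: How does dynamic scheduling affect bias mitigation and the temporal coherence of T2V outputs?
RQ4: Which video-level fairness evaluation protocol is most effective for T2V systems?
RQ5: Which text encoders support robust debiasing without sacrificing semantic fidelity?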

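Key Findings

Demographic bias in T2V models originates primarily from pretrained text encoders, which encode implicit gender associations even for neutral prompts.
FairT2V reduces encoder-induced bias by steering prompt embeddings toward a neutral point along occupation-related gender axes, using anchor-based spherical geodesic transformations.
The dynamic denoising schedule restricts debiasing to the early identity-forming steps, preserving temporal coherence and reducing frame-level artifacts.
Compared with training-free baselines, FairT2V strikes a better balance between bias reduction and video quality, and performs particularly well on temporal-coherence metrics.
In this setting, CLIP-based embeddings offer a more stable trade-off between debiasing effectiveness and video quality than alternatives such as T5.
Video-level fairness evaluation combining VideoLLM reasoning with human verification provides a reliable bias assessment that goes beyond frame-level methods.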
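
The idea of restricting debiasing to the early steps can be sketched as a per-step choice of conditioning. This is a simplified illustration, not the paper's schedule: the fixed `early_fraction` cutoff is an assumption standing in for the dynamic criterion, which this summary does not specify, and `pick_conditioning`, `denoise`, and `denoise_step` are hypothetical names.

```python
def pick_conditioning(step, num_steps, original, debiased, early_fraction=0.25):
    """Return the prompt embedding to condition on at a given denoising step.

    ASSUMPTION: a fixed early_fraction stands in for the paper's dynamic rule.
    """
    cutoff = int(num_steps * early_fraction)  # number of early, debiased steps
    return debiased if step < cutoff else original

def denoise(latent, num_steps, original, debiased, denoise_step):
    """Run a generic denoising loop.

    denoise_step is a caller-supplied function (latent, step, cond) -> latent;
    it is a placeholder for one update of whatever sampler is in use.
    """
    for step in range(num_steps):
        cond = pick_conditioning(step, num_steps, original, debiased)
        latent = denoise_step(latent, step, cond)
    return latent
```

With 8 steps and the default cutoff, the first 2 steps use the debiased embedding and the remaining 6 use the original one, so later steps refine appearance and motion under the unmodified prompt.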

Figure 2: Gender-leaning scores (Equation 5) from the CLIP text encoder for 16 occupations, using the prompt sets in Equation 3.
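
Equations 3 and 5 are not reproduced in this review, so the exact definition of the gender-leaning score is not available here. As a rough illustration of the kind of quantity involved, the sketch below compares a neutral prompt embedding with two gendered anchor embeddings by cosine similarity. The function `gender_leaning` and its formula are assumptions, not the paper's Equation 5, and the vectors are toy values rather than real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gender_leaning(neutral, male_anchor, female_anchor):
    """Difference in similarity to the two gendered anchors.

    ASSUMPTION: illustrative stand-in for the paper's Equation 5.
    Positive values lean toward the male anchor, negative values toward
    the female anchor, and values near zero indicate a balanced embedding.
    """
    return cosine(neutral, male_anchor) - cosine(neutral, female_anchor)

# Toy 2-D vectors; in practice all three would come from the text encoder.
score = gender_leaning([0.9, 0.2], [1.0, 0.0], [0.0, 1.0])
```

Here `score` is positive because the toy embedding points mostly along the first anchor's direction.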


This review was generated by AI and reviewed by a human editor.