QUICK REVIEW

[Paper Review] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Yichen Peng, Jyun-Ting Song|arXiv (Cornell University)|Feb 26, 2026

Human Motion and Animation | Citations: 0

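One-Line Summary

DyaDiT is a diffusion transformer that generates socially context-aware conversational gestures from the speech audio of two speakers. It combines social-context conditioning, ORCA audio fusion, and an optional partner-motion prior, and it outperforms baselines in realism, diversity, and user preference.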

ABSTRACT

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

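Motivation and Objectives

Motivate realistic, socially context-aware gesture generation for dyadic conversation, going beyond single-speaker models.
Design and validate a diffusion-transformer framework that fuses both speakers' audio, social cues, and the partner's motion.
Introduce ORCA, which orthogonally separates the two audio streams to enable more responsive gestures.
Leverage a motion dictionary and a motion tokenizer to enable style-aware, diverse gestures.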

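Proposed Method

Adopts a diffusion transformer conditioned on multi-modal inputs: ORCA(audio self, audio other), partner motion, relationship type, and personality scores.
Introduces ORCA (Audio Orthogonalization Cross Attention), which orthogonalizes and fuses the two speakers' audio streams using bidirectional cross-attention and learnable gates.
Incorporates a learnable motion dictionary that injects style priors and enables control of gesture style via classifier-free guidance (CFG).
Discretizes motion with a residual VQ-VAE to obtain latent tokens for efficient diffusion in motion space.
Trained on a curated Seamless Interaction subset of about 3,000 clips (≈182 hours) with a 6D upper-body motion representation.
Optionally adds conditioning on relationship and personality to adjust the generated gestures.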
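
The residual quantization step mentioned in the list above can be illustrated with a small sketch. This is not the paper's tokenizer: the codebooks, the 2-D latent, and the function names (`rvq_encode`, `rvq_decode`, `nearest_code`) are hypothetical and chosen only to show the general idea of residual vector quantization, where each stage quantizes whatever the previous stage left unexplained. A real residual VQ-VAE would learn the codebooks jointly with an encoder and decoder.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))


def rvq_encode(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes the residual of the previous one."""
    residual = list(x)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        # Subtract the chosen code so the next stage only sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(
    indices: Sequence[int], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Vector:
    """Reconstruct a vector by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

latent = [1.3, 0.9]  # stand-in for one encoded motion frame
tokens, leftover = rvq_encode(latent, CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
print(tokens, reconstruction, leftover)
```

In this toy setup the first stage picks the coarse code nearest to the latent and the second stage refines it, so the token list has one index per stage. A generative model can then operate on these compact tokens (or their embeddings) instead of raw motion.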

Figure 2 : Overview of DyaDiT. DyaDiT conditions on multiple input modalities, including audio, partner motion, relationship type, and personality scores. It employs an Audio Orthogonalization Cross Attention (ORCA) module to obtain cleaner audio representations and a motion dictionary to guide styl

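Experimental Results

Research Questions

RQ1: Can a diffusion-transformer framework generate contextually appropriate dyadic gestures conditioned on social attributes and dyadic audio?
RQ2: Does ORCA improve the separation of the two speakers' audio signals and lead to more realistic gestures?
RQ3: How do motion priors and social conditioning affect gesture diversity and realism?
RQ4: How does DyaDiT compare with existing dyadic gesture baselines on objective metrics and human preference?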

Key Findings

Method	FD (Static)	FD (Kinetic)	Diversity (Static)	Diversity (Kinetic)
GT	-	-	28.42	1.97
Random	14.94	3.74	33.85	2.05
ConvoFusion [29]	9.22	1.74	18.33	1.10
Audio2PhotoReal [32]	8.77	1.84	19.35	1.05
DyaDiT (w/o ORCA)	7.32	1.79	23.57	1.24
DyaDiT (w/o MD)	6.88	1.75	18.34	1.29
DyaDiT (Uncond)	7.40	1.63	21.65	1.16
DyaDiT (Random)	8.24	1.53	21.94	1.43
DyaDiT	6.40	1.37	27.46	1.38

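DyaDiT achieves the lowest Fréchet distance (FD) on both the static (6.40) and kinetic (1.37) metrics while maintaining high diversity (27.46 static, 1.38 kinetic).
Ablations show that ORCA and the motion dictionary both contribute to realism and style variation, and that social-context conditioning improves gesture quality.
Quantitatively, DyaDiT outperforms ConvoFusion and Audio2PhotoReal on the FD and Diversity metrics.
In the user study, DyaDiT gestures were not only preferred over ConvoFusion's but in some cases approached the perceived quality of ground-truth motion, underscoring their social coherence and naturalness.
Diffusion-based conditioning on social context yields more natural and coordinated dyadic gestures.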
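
For readers unfamiliar with the FD columns, the sketch below shows how a Fréchet distance between two sets of feature vectors can be computed. The review does not state which feature extractor or covariance treatment the paper uses, so this is a simplified illustration under loud assumptions: each feature set is modeled as a Gaussian with a diagonal covariance, in which case the distance reduces to the squared difference of the means plus the squared difference of the per-dimension standard deviations. The function names and toy data are hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def column_stats(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population variance of a list of feature vectors."""
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(
    real: Sequence[Sequence[float]], generated: Sequence[Sequence[float]]
) -> float:
    """Fréchet distance between two Gaussians fitted with DIAGONAL covariances.

    Assumption: off-diagonal covariance terms are ignored, so this is a
    simplification of the full-covariance formula, not a drop-in replacement.
    """
    mu_r, var_r = column_stats(real)
    mu_g, var_g = column_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    spread_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + spread_term


# Hypothetical toy features (e.g. two 2-D embeddings per set).
real_feats = [[0.0, 1.0], [2.0, 3.0]]
shifted_feats = [[1.0, 1.0], [3.0, 3.0]]  # first dimension shifted by +1

fd_same = frechet_distance_diag(real_feats, real_feats)
fd_shifted = frechet_distance_diag(real_feats, shifted_feats)
print(fd_same, fd_shifted)
```

Lower values mean the generated feature distribution sits closer to the real one, which is why the table treats a smaller FD as better. Identical sets give a distance of zero under this formula.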

Figure 3 : ORCA reduces ambiguity between the two audio streams, allowing DyaDiT to generate realistic motion even when one person interrupts the other during the conversation. The example demonstrates that the generated motion adjusts naturally as the conversation shifts.


This review was generated by AI and checked by a human editor.