QUICK REVIEW

[Paper Review] T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

Xingzu Zhan, Chen Xie|arXiv (Cornell University)|Feb 1, 2026

Human Motion and Animation | Citations: 0

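One-line summary

T2M Mamba couples keyframe saliency with motion periodicity and introduces a Periodic Differential Cross-modal Alignment Module (PDCAM) that improves the stability and robustness of text-to-motion generation for long sequences. It reports state-of-the-art results on HumanML3D and KIT-ML, with a very low FID and strong alignment metrics.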

ABSTRACT

Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields, such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on HumanML3D and KIT-ML datasets have been conducted, confirming the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.

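Motivation and objectives

Address the instability and drift that arise in long-sequence text-to-motion generation.
Model the coupling between keyframe saliency and motion periodicity to prevent forgetting of earlier motion history.
Improve robustness to paraphrase-induced embedding drift by strengthening cross-modal alignment.
Propose efficient algorithms for keyframe detection and periodicity estimation that keep computational overhead minimal.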

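Proposed method

Detect keyframe saliency within each motion segment using enhanced Density Peaks Clustering, and assign adaptive keyframe weights.
Estimate per-segment motion periodicity using FFT-accelerated autocorrelation together with spectral-entropy and saliency criteria.
Integrate the keyframe weights and the phase encoding into the Periodicity-Saliency Aware Mamba to emphasize important frames and rhythm.
Develop the Periodic Differential Cross-modal Alignment Module (PDCAM), which aligns text and motion embeddings robustly even when their time scales do not match.
Use phase-rotated query slices and differential attention to highlight discriminative cross-modal cues while mitigating confusion caused by paraphrases.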
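The periodicity step in the list above relies on FFT-accelerated autocorrelation. As a rough illustration only, and not the authors' implementation, the sketch below estimates the dominant period of a single 1-D motion feature through the Wiener-Khinchin relation and scores how concentrated its spectrum is with a normalized spectral entropy. The function names, the peak-picking rule, and the use of one feature channel are assumptions made for this example; the review does not state the paper's exact criteria.

```python
import numpy as np

def autocorr_fft(x):
    """Normalized (biased) autocorrelation of a 1-D signal, computed with FFTs."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    # Zero-pad to a power of two >= 2n - 1 so circular correlation equals linear.
    nfft = 1 << (2 * n - 1).bit_length()
    spec = np.fft.rfft(x, nfft)
    acf = np.fft.irfft(spec * np.conj(spec), nfft)[:n]
    return acf / acf[0] if acf[0] > 0 else np.zeros(n)

def estimate_period(x, min_lag=1):
    """Return (lag, strength) of the highest local autocorrelation peak.

    Hypothetical rule for this sketch: a lag k is a peak if acf[k] exceeds
    acf[k-1] and is at least acf[k+1]. Returns (0, 0.0) if no peak exists.
    """
    acf = autocorr_fft(x)
    k = np.arange(1, acf.size - 1)
    is_peak = (acf[k] > acf[k - 1]) & (acf[k] >= acf[k + 1]) & (k >= min_lag)
    peaks = k[is_peak]
    if peaks.size == 0:
        return 0, 0.0
    best = peaks[np.argmax(acf[peaks])]
    return int(best), float(acf[best])

def spectral_entropy(x):
    """Entropy of the power spectrum, scaled to [0, 1]; low means periodic."""
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean()))[1:] ** 2  # drop the DC bin
    total = power.sum()
    if power.size < 2 or total <= 0:
        return 1.0  # treat degenerate input as "no clear periodicity"
    p = power / total
    p = p[p > 0]  # avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(power.size))
```

On a synthetic 200-frame signal with a 20-frame cycle, `estimate_period` recovers a lag of 20 with a strength of about 0.9 (the biased estimator shrinks the peak at lag k by the factor (n - k)/n), and `spectral_entropy` is close to 0. White noise, by contrast, gives an entropy near 1, which is why an entropy threshold can serve as a gate before trusting the estimated period.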

Figure 1: The overview of the proposed T2M Mamba. (a) T2M Mamba. Our T2M Mamba consisting of N basic blocks aims to predict clean motion sequence (b) Inference Process. Starting from Gaussian noise, the model iteratively denoises to generate a clean motion sequence $M^{0}$ semantically aligned with

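Experimental results

Research questions

RQ1: How can keyframe saliency and motion periodicity be coupled to reduce forgetting of earlier motion history in long sequences?
RQ2: Can phase-encoded periodicity information improve the stability and rhythm of text-to-motion generation?
RQ3: Can PDCAM's cross-modal alignment cope robustly with semantic paraphrase perturbations?
RQ4: What performance gain and computational cost result from adding periodicity-saliency coupling to existing text-to-motion models?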

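Key findings

T2M Mamba achieves an FID of 0.068 in the HumanML3D/KIT-ML benchmark experiments, with consistent gains across the other metrics.
Ablations show that removing either keyframe weighting (M) or phase encoding (phi) degrades FID and R-Top3, confirming that the two play complementary roles.
PDCAM improves cross-modal alignment substantially over conventional differential attention, raising R-Top3 and lowering MM Dist.
Using M and phi together gives the best stability and motion fidelity, and 6 Mamba layers was the best of the depths examined.
In the paraphrase-robustness experiments, the model maintains stable motion under small changes to the input text, addressing the paraphrase sensitivity of earlier models.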
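R-Top3 and MM Dist in the findings above are retrieval-style alignment metrics computed on paired text and motion embeddings. The sketch below shows one common way to compute them, assuming Euclidean distance and that row i of each array is a matched text-motion pair. The function name and the toy inputs are hypothetical; real evaluations use embeddings from a pretrained evaluator, typically over fixed-size candidate pools, rather than raw features.

```python
import numpy as np

def r_precision_and_mm_dist(text_emb, motion_emb, top_k=3):
    """Return (R-Top-k, MM Dist) for row-aligned text and motion embeddings."""
    text_emb = np.asarray(text_emb, dtype=float)
    motion_emb = np.asarray(motion_emb, dtype=float)
    # dist[i, j] = Euclidean distance between text i and motion j.
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    matched = np.diag(dist)  # distance of each text to its own motion
    # Rank of the matched motion = number of motions strictly closer to the text.
    rank = (dist < matched[:, None]).sum(axis=1)
    r_top_k = float((rank < top_k).mean())  # higher is better
    mm_dist = float(matched.mean())         # lower is better
    return r_top_k, mm_dist
```

A higher R-Top3 means the matched motion is retrieved among the three nearest candidates more often, while a lower MM Dist means matched pairs sit closer together in the shared embedding space, which is the direction of improvement the review attributes to PDCAM.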

Figure 2: Illustration of our Periodicity-Saliency Aware Mamba. $\odot$ denotes dot product.


This review was generated by AI and checked by a human editor.