QUICK REVIEW

[Paper Review] AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Dongjie Cheng, Ruifeng Yuan|arXiv (Cornell University)|Jan 25, 2026

Speech and Audio Processing|Cited by 0

One-Sentence Summary

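AR-Omni is a single autoregressive Transformer that natively supports text, image, and streaming speech generation without external decoders. It combines task-aware loss reweighting, a token-level perceptual alignment loss, and finite-state decoding, and it runs in real time.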

ABSTRACT

Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a sequence of omni MLLMs has emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across three modalities while remaining real-time, achieving a 0.88 real-time factor for speech generation.

Motivation and Objectives

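Motivate the need for a truly unified omni multimodal model that handles both multimodal inputs and multimodal outputs.
Propose an autoregressive framework that uses a single token stream and a single decoder for text, images, and speech.
Address practical challenges in unified AR modeling: modality imbalance, visual fidelity, and the stability-creativity trade-off in decoding.
Demonstrate real-time capability and competitive quality for text, image, and speech generation on standard benchmarks.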

Proposed Method

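Tokenize text, images, and speech into a shared discrete joint vocabulary and train a single Transformer decoder to predict the next token.
Interleave modalities with boundary markers to form a single causal sequence for autoregressive generation.
Use a diffusion-free autoregressive approach that supports text and image generation as well as streaming speech generation.
Mitigate modality imbalance through task-aware loss reweighting applied to the response tail.
Introduce a token-level perceptual alignment loss that improves the visual fidelity of image tokens.
Adopt a finite-state decoding mechanism that selects greedy decoding for ASR/TTS and sampling for open-ended generation.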
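
The interleaving and decoding steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the boundary marker strings, the task names, and the task-to-strategy table are all hypothetical, and the review does not specify how the actual finite-state mechanism tracks its state.

```python
import math
import random

# Hypothetical boundary markers; the model's actual special tokens are not given in this review.
BOUNDARIES = {
    "text": ("<text>", "</text>"),
    "image": ("<image>", "</image>"),
    "speech": ("<speech>", "</speech>"),
}


def interleave(segments):
    """Flatten (modality, tokens) segments into one causal token sequence."""
    sequence = []
    for modality, tokens in segments:
        start, end = BOUNDARIES[modality]
        sequence.append(start)
        sequence.extend(tokens)
        sequence.append(end)
    return sequence


# Assumed state table: tasks with one correct answer decode greedily,
# open-ended tasks sample from the predicted distribution.
DECODING_STATE = {
    "asr": "greedy",
    "tts": "greedy",
    "chat": "sample",
    "image_gen": "sample",
}


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(logits, task, rng=random, temperature=1.0):
    """Pick the next token id using the decoding strategy assigned to `task`."""
    if DECODING_STATE[task] == "greedy":
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With this table, `next_token(logits, "asr")` always returns the highest-scoring token id, while `next_token(logits, "chat")` draws a token in proportion to its probability, which is one simple way to keep recognition and synthesis stable without making open-ended generation deterministic.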

Experimental Results

Research Questions

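RQ1: Can a single autoregressive model with a unified token space effectively understand and generate text, images, and speech without external decoders?
RQ2: How can modality imbalance, visual fidelity, and decoding stability be addressed in a unified AR multimodal framework?
RQ3: What is the trade-off between real-time performance and quality when using diffusion-free autoregressive generation for tri-modal tasks?
RQ4: How does AR-Omni perform on standard benchmarks for image captioning, ASR, and TTS in zero-shot or few-shot settings?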

Key Findings

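AR-Omni achieves unified text, speech, and image I/O with a single 7B Transformer backbone.
AR-Omni achieves a first-token latency of 146 ms and a real-time factor of 0.88 for speech generation.
AR-Omni reports a WER of 6.5 for zero-shot TTS and a WER of 9.4 for ASR, both on LibriSpeech test-clean.
Ablations show that weighted NTP, the perceptual loss, and swin-norm each contribute to stability and multimodal performance.
AR-Omni maintains competitive tri-modal performance while remaining diffusion-free and supporting real-time streaming.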
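
For readers unfamiliar with the real-time factor (RTF) cited above, it is conventionally the time spent generating audio divided by the duration of the audio produced, so values below 1 mean generation keeps ahead of playback. The sketch below uses that conventional definition; the function name and the timings are illustrative assumptions, not measurements or code from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return processing time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers chosen to match the reported value: 8.8 s of compute
# for 10 s of speech. These are not timings taken from the paper.
rtf = real_time_factor(8.8, 10.0)
print(f"RTF = {rtf:.2f}")
```

An RTF of 0.88 therefore leaves roughly 12% headroom relative to playback speed, which is what allows speech to be streamed as it is generated.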

This review was generated by AI and checked by a human editor.