QUICK REVIEW

[Paper Review] A unified multimodal understanding and generation model for cross-disciplinary scientific research

Xiaomeng Yang, Zhiyu Tan|arXiv (Cornell University)|Jan 4, 2026

Multimodal Machine Learning Applications | Citations: 0

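One-sentence summary

FuXi-Uni is a natively unified multimodal model that performs both understanding and generation within a single architecture spanning Earth science and biomedicine. By aligning scientific tokens with language tokens, it supports global weather forecasting, tropical cyclone forecast editing, spatial downscaling, and biomedical VQA.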

ABSTRACT

Scientific discovery increasingly relies on integrating heterogeneous, high-dimensional data across disciplines nowadays. While AI models have achieved notable success across various scientific domains, they typically remain domain-specific or lack the capability of simultaneously understanding and generating multimodal scientific data, particularly for high-dimensional data. Yet, many pressing global challenges and scientific problems are inherently cross-disciplinary and require coordinated progress across multiple fields. Here, we present FuXi-Uni, a native unified multimodal model for scientific understanding and high-fidelity generation across scientific domains within a single architecture. Specifically, FuXi-Uni aligns cross-disciplinary scientific tokens within natural language tokens and employs science decoder to reconstruct scientific tokens, thereby supporting both natural language conversation and scientific numerical prediction. Empirically, we validate FuXi-Uni in Earth science and Biomedicine. In Earth system modeling, the model supports global weather forecasting, tropical cyclone (TC) forecast editing, and spatial downscaling driven by only language instructions. FuXi-Uni generates 10-day global forecasts at 0.25° resolution that outperform the SOTA physical forecasting system. It shows superior performance for both TC track and intensity prediction relative to the SOTA physical model, and generates high-resolution regional weather fields that surpass standard interpolation baselines. Regarding biomedicine, FuXi-Uni outperforms leading multimodal large language models on multiple biomedical visual question answering benchmarks. By unifying heterogeneous scientific modalities within a native shared latent space while maintaining strong domain-specific performance, FuXi-Uni provides a step forward more general-purpose, multimodal scientific models.

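Motivation and objectives

Motivates the need for a unified multimodal model capable of cross-disciplinary scientific understanding and generation.
Proposes FuXi-Uni, an LLM with science-token alignment and domain-specific encoders/decoders for Earth science and biomedicine.
Demonstrates state-of-the-art performance on Earth system tasks (forecasting, TC editing, downscaling) and on biomedical VQA benchmarks.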

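Proposed method

Adopts a three-part design in which domain-specific science encoders produce structured tokens.
Processes text and high-dimensional scientific data jointly in a shared latent space, avoiding the information loss caused by discretization.
Conditions the model on task-specific natural-language prompts so that different scientific tasks run within the same architecture.
Uses an Earth science encoder that maps gridded fields to multimodal tokens compatible with the backbone, handling 4D inputs X ∈ R^(T×C×H×W).
Extends to an image-and-text VQA framework, unifying multiple benchmarks through dataset-specific prompts and instruction-based supervision.
Builds on the Qwen2.5-VL backbone, which has a vision pathway for images and a decoder-only language backbone for text generation.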
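
The review does not describe how the Earth science encoder tokenizes a gridded field, so the following is only a minimal sketch of one common approach: cutting a T×C×H×W input into non-overlapping spatial patches and flattening each patch (across all channels) into one token vector. The function names patchify_2d and patchify_4d, the patch size, and the choice to drop incomplete edge patches are all assumptions made for illustration, not details from the paper.

```python
def patchify_2d(field, patch):
    """Split an H x W grid (list of rows) into non-overlapping
    patch x patch blocks, each flattened in row-major order.
    Rows/columns that do not fill a whole patch are dropped
    (a simplifying assumption)."""
    h, w = len(field), len(field[0])
    tokens = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            tokens.append([field[i + di][j + dj]
                           for di in range(patch)
                           for dj in range(patch)])
    return tokens


def patchify_4d(x, patch):
    """x has shape T x C x H x W (nested lists). Returns one token per
    (time step, spatial patch); each token concatenates the patch
    values of all C channels."""
    tokens = []
    for frame in x:  # iterate over the T time steps
        per_channel = [patchify_2d(channel, patch) for channel in frame]
        for k in range(len(per_channel[0])):  # iterate over patches
            token = []
            for channel_tokens in per_channel:
                token.extend(channel_tokens[k])
            tokens.append(token)
    return tokens


# Toy input with T=2, C=3, H=4, W=8; each value encodes its own position.
x = [[[[t * 1000 + c * 100 + i * 10 + j for j in range(8)]
       for i in range(4)]
      for c in range(3)]
     for t in range(2)]

tokens = patchify_4d(x, patch=2)
print(len(tokens), len(tokens[0]))  # number of tokens, token dimension
```

With these toy dimensions each frame yields (4/2)×(8/2) = 8 patches, so two time steps give 16 tokens of dimension 3×2×2 = 12. A learned encoder would typically project such vectors into the backbone's embedding space; that projection is omitted here.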

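Experimental results

Research questions

RQ1: Can a single unified model perform both understanding and generation across different scientific modalities and domains?
RQ2: How does aligning scientific tokens with language tokens affect performance on high-dimensional Earth science and biomedical tasks?
RQ3: Can task prompts steer a unified model to perform forecasting, downscaling, TC editing, and biomedical VQA within one architecture?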

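Key findings

FuXi-Uni outperforms the state-of-the-art numerical weather prediction model on 10-day global forecasts at 0.25° resolution.
Tropical cyclone track and intensity forecasts improve after an intensity-enhancement prompt is applied.
For spatial downscaling from 1.5° to 0.25°, it produces predictions that beat linear interpolation in both accuracy and image quality.
In biomedicine, FuXi-Uni outperforms leading multimodal LLMs on multiple benchmarks (VQA-RAD, SLAKE, and PathVQA).
The framework supports high-dimensional, cross-domain generation and understanding through a single prompt-driven interface, reducing reliance on task-specific models.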
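
To make the interpolation baseline in the downscaling finding concrete, here is a minimal sketch of bilinear upsampling by a factor of 6 (1.5° / 0.25° = 6), scored with RMSE against a reference field. The function names bilinear_upsample and rmse, the grid-point-aligned layout, and the toy data are assumptions for illustration; the review does not specify the exact interpolation scheme or metrics used in the paper.

```python
import math


def bilinear_upsample(coarse, factor):
    """Bilinearly interpolate an h x w grid (h, w >= 2) onto a grid with
    `factor` times finer spacing. Grid points are assumed aligned, so the
    output has (h - 1) * factor + 1 rows and (w - 1) * factor + 1 columns."""
    h, w = len(coarse), len(coarse[0])
    out_h, out_w = (h - 1) * factor + 1, (w - 1) * factor + 1
    fine = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        i0 = min(i // factor, h - 2)   # upper coarse row of the cell
        fy = i / factor - i0           # fractional offset within the cell
        for j in range(out_w):
            j0 = min(j // factor, w - 2)
            fx = j / factor - j0
            top = coarse[i0][j0] * (1 - fx) + coarse[i0][j0 + 1] * fx
            bottom = coarse[i0 + 1][j0] * (1 - fx) + coarse[i0 + 1][j0 + 1] * fx
            fine[i][j] = top * (1 - fy) + bottom * fy
    return fine


def rmse(a, b):
    """Root-mean-square error between two grids of equal shape."""
    n = len(a) * len(a[0])
    total = sum((x - y) ** 2
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return math.sqrt(total / n)


# Toy 2 x 2 coarse field sampled from the linear function f(i, j) = 2i + j
# on the fine grid; bilinear interpolation should recover it exactly.
coarse = [[0.0, 6.0],
          [12.0, 18.0]]
fine = bilinear_upsample(coarse, factor=6)
truth = [[2.0 * i + j for j in range(7)] for i in range(7)]
print(len(fine), len(fine[0]), rmse(fine, truth))
```

A linear toy field is a best case for this baseline. Real weather fields contain fine-scale structure that interpolation cannot reconstruct, which is the gap a learned downscaling model aims to close.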


This review was generated by AI and checked by a human editor.