QUICK REVIEW

[Paper Review] UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos

Zhi Yang, Lingfeng Zeng|arXiv (Cornell University)|Jan 9, 2026

Stock Market Forecasting Methods|Citations: 0

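One-line summary

UniFinEval is a manually constructed bilingual (Chinese-English) multimodal benchmark that tests financial MLLMs on reasoning across text, images, and videos in five core scenarios. It compares 10 models under Zero-Shot and Zero-Shot CoT settings and highlights the gap that remains relative to financial experts.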

ABSTRACT

Multimodal large language models are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLM applications in real-world financial scenarios. Data and code are available at https://github.com/aifinlab/UniFinEval.

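Motivation and objectives

Evaluate the capability boundaries of multimodal large language models (MLLMs) in high-information-density financial environments.
Provide a unified cross-modality benchmark aligned with real-world financial workflows.
Enable evaluation of cross-modal consistency and multi-hop reasoning in finance.
Identify the limitations of existing benchmarks and guide the deployment of robust financial AI.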

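Proposed method

Manual construction of a bilingual (Chinese-English) dataset of 3,767 question-answer pairs.
Five financial scenarios: Financial Statement Auditing (FSA), Company Fundamental Reasoning (CFR), Industry Trend Insights (ITI), Financial Risk Sensing (FRS), and Asset Allocation Analysis (AAA).
Full-modality input support: text, images, videos, and their cross-modal combinations (text-image, text-video, image-video, text-image-video).
Two evaluation settings, Zero-Shot and Zero-Shot CoT, with answer extraction standardized using Qwen-Max for robust judging.
Expert-led quality control with four-stage verification to ensure alignment with real-world financial logic.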
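
As a rough illustration of the input space and the two evaluation settings listed above, the sketch below enumerates the seven modality combinations and builds a prompt for each setting. The function names and the CoT trigger phrase ("Let's think step by step.") are assumptions made for illustration; the review does not give the benchmark's actual prompts, and the Qwen-Max extraction step is not modeled here.

```python
from itertools import combinations

MODALITIES = ("text", "image", "video")

# Assumed Zero-Shot CoT trigger phrase; the benchmark's real prompt
# is not stated in this review.
COT_TRIGGER = "Let's think step by step."


def modality_combinations(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest first."""
    return [
        combo
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


def build_prompt(question, setting):
    """Build a hypothetical prompt for 'zero_shot' or 'zero_shot_cot'."""
    if setting == "zero_shot":
        return question
    if setting == "zero_shot_cot":
        return f"{question}\n{COT_TRIGGER}"
    raise ValueError(f"unknown setting: {setting}")


if __name__ == "__main__":
    # Three single modalities, three pairs, and one triple.
    for combo in modality_combinations():
        print("-".join(combo))
```

The three single modalities plus the four cross-modal combinations named in the list account for all seven non-empty subsets.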

Figure 1: UniFinEval is manually constructed and supports full-modality inputs including text, images, and videos. It is equipped with cross-modal reasoning capabilities and features high information density while closely aligning with real financial business practices.

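Experimental results

Research questions

RQ1: Can current MLLMs perform integrated cross-modal reasoning on high-information-density financial tasks?
RQ2: How close do existing models come to financial experts' performance on perception, reasoning, and decision-making tasks?
RQ3: What are the dominant error modes in processing multimodal financial information?
RQ4: How much does Chain-of-Thought prompting affect finance-specific cross-modal tasks?
RQ5: What are the limitations of existing benchmarks in simulating real-world financial decision loops?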

Key findings

Model	FSA Zero-Shot	FSA CoT	CFR Zero-Shot	CFR CoT	ITI Zero-Shot	ITI CoT	FRS Zero-Shot	FRS CoT	AAA Zero-Shot	AAA CoT	Average Zero-Shot	Average CoT
Gemini-3-pro-preview	83.5	83.8	82.2	82.8	73.3	74.7	68.8	70.1	61.1	55.4	73.8	73.4
Qwen3-VL-235B-A22B-Thinking	80.2	81.3	78.9	74.9	69.4	64.6	62.9	62.7	43.3	50.3	66.9	66.8
Qwen3-VL-32B-Thinking	75.1	76.2	71.0	70.3	65.6	65.2	54.8	56.6	40.8	43.3	61.5	62.3
GPT-5.1	76.9	77.8	67.1	65.0	65.8	60.4	50.0	54.1	47.8	48.4	61.5	61.1
Claude-Sonnet-4.5	70.8	71.9	65.4	68.2	61.7	61.4	50.0	50.6	40.8	42.0	57.7	58.6
InternVL3.5-241B-A28B	69.0	70.6	66.2	68.7	63.8	63.8	37.1	36.2	38.2	40.1	54.9	55.9
MiniCPM-V-4.5	65.9	66.2	62.3	64.1	53.2	57.9	30.6	38.0	33.1	29.9	49.0	51.2
InternVL3.5-30B-A3B	61.5	61.7	64.7	59.9	50.0	52.7	33.9	35.8	28.0	34.4	47.6	49.0
Grok-4.1-Fast-Reasoning	50.3	52.5	43.1	44.1	32.5	34.9	16.1	19.3	17.8	22.3	32.0	34.6
Llama-3.2-11B-Vision	22.2	23.1	20.9	23.7	19.0	21.4	14.1	15.7	11.5	10.8	17.5	18.9
Expert (one score per scenario, no Zero-Shot/CoT split)	FSA 97.5	CFR 95.3	ITI 90.1	FRS 88.5	AAA 85.2	Average 91.3
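
The Average columns appear to be the unweighted mean of the five scenario scores. The sketch below checks that reading for three rows copied from the table; the data layout and the helper name `scenario_mean` are illustrative assumptions, not part of the benchmark's released code.

```python
SCENARIOS = ("FSA", "CFR", "ITI", "FRS", "AAA")

# (Zero-Shot, CoT) accuracy per scenario, copied from the table above.
SCORES = {
    "Gemini-3-pro-preview": {
        "FSA": (83.5, 83.8), "CFR": (82.2, 82.8), "ITI": (73.3, 74.7),
        "FRS": (68.8, 70.1), "AAA": (61.1, 55.4),
    },
    "GPT-5.1": {
        "FSA": (76.9, 77.8), "CFR": (67.1, 65.0), "ITI": (65.8, 60.4),
        "FRS": (50.0, 54.1), "AAA": (47.8, 48.4),
    },
    "Grok-4.1-Fast-Reasoning": {
        "FSA": (50.3, 52.5), "CFR": (43.1, 44.1), "ITI": (32.5, 34.9),
        "FRS": (16.1, 19.3), "AAA": (17.8, 22.3),
    },
}

ZERO_SHOT, COT = 0, 1


def scenario_mean(model, setting):
    """Unweighted mean over the five scenarios for one setting."""
    values = [SCORES[model][name][setting] for name in SCENARIOS]
    return sum(values) / len(values)


if __name__ == "__main__":
    for model in SCORES:
        zero_shot = scenario_mean(model, ZERO_SHOT)
        cot = scenario_mean(model, COT)
        print(f"{model}: Zero-Shot {zero_shot:.1f}, CoT {cot:.1f}")
```

For these three rows the recomputed means round to the reported Average values, which suggests the scenarios are weighted equally regardless of how many questions each contains.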

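Gemini-3-pro-preview achieves the best overall performance, with a Zero-Shot average of 73.8%.
Seven of the 10 models improve on average with CoT, but the gains are small: the largest is 2.6 points (Grok-4.1-Fast-Reasoning, 32.0 to 34.6), and Gemini-3-pro-preview, Qwen3-VL-235B-A22B-Thinking, and GPT-5.1 drop slightly.
Human experts outperform every model by a wide margin, with large gaps in the ITI and AAA scenarios.
Error analysis shows major problems in image perception and cross-modal alignment, along with notable weaknesses in numerical computation.
Models struggle with cross-modal multi-hop reasoning and with maintaining long-range logical consistency on high-density tasks.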

Figure 2: UniFinEval covers five major financial scenarios and constructs datasets spanning text, images, videos, as well as multiple cross-modal combinations. It features high-information-density and manually constructed data, together with dedicated designs for cross-modal consistency checking and m


This review was generated by AI and checked by a human editor.