QUICK REVIEW

[Paper Review] UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos

Zhi Yang, Lingfeng Zeng | arXiv (Cornell University) | 2026-01-09

Topic: Stock Market Forecasting Methods | Citations: 0

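One-Line Summary

UniFinEval is a manually constructed multimodal benchmark covering text, images, and videos. It evaluates financial MLLMs in Chinese and English across five core scenarios that require cross-modal multi-hop reasoning. The paper compares 10 mainstream models under Zero-Shot and Zero-Shot CoT settings and highlights the gap that remains relative to financial experts.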

ABSTRACT

Multimodal large language models are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLM applications in real-world financial scenarios. Data and code are available at https://github.com/aifinlab/UniFinEval.

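Motivation and Objectives

Assess the capability boundaries of multimodal large language models (MLLMs) in high-information-density financial environments.
Provide a unified cross-modal benchmark aligned with real financial workflows.
Enable evaluation of cross-modal consistency and multi-hop reasoning in finance.
Identify common failure modes to guide the robust deployment of financial AI.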

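Proposed Method

Manual construction of a Chinese-English bilingual dataset of 3,767 question-answer pairs.
Five financial scenarios: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis.
Full-modality input support: text, images, videos, and cross-modal combinations (text-image, text-video, image-video, text-image-video).
Two evaluation settings, Zero-Shot and Zero-Shot CoT, with answer extraction from model outputs standardized using Qwen-Max for robust judgment.
Expert-led quality control with four-stage verification to keep the data aligned with real-world financial logic.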
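
The input coverage and the two evaluation settings listed above can be illustrated with a small sketch. It is not the benchmark's actual code: the function names, the prompt templates, and the trigger phrase "Let's think step by step." (a common zero-shot CoT cue) are assumptions made for illustration, and the Qwen-Max answer-extraction step is omitted because the review does not describe its interface.

```python
from itertools import combinations

# The three base modalities named in the review.
MODALITIES = ("text", "image", "video")

def modality_combinations(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest subsets first."""
    return [
        combo
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]

# Assumed trigger phrase; not confirmed as the wording used by UniFinEval.
COT_TRIGGER = "Let's think step by step."

def build_prompt(question, setting="zero_shot"):
    """Build a hypothetical prompt for one of the two evaluation settings."""
    if setting == "zero_shot":
        return f"Question: {question}\nAnswer:"
    if setting == "zero_shot_cot":
        return f"Question: {question}\nAnswer: {COT_TRIGGER}"
    raise ValueError(f"unknown setting: {setting}")

combos = modality_combinations()
labels = ["-".join(combo) for combo in combos]

if __name__ == "__main__":
    print(labels)
    print(build_prompt("Is the reported net margin consistent with the chart?", "zero_shot_cot"))
```

Enumerating the non-empty subsets of three modalities gives seven input types: the three single modalities plus the four combinations listed above.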

Figure 1: UniFinEval is manually constructed and supports full-modality inputs including text, images, and videos. It is equipped with cross-modal reasoning capabilities and features high information density while closely aligning with real financial business practices.

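Experimental Results

Research Questions

RQ1: Can current MLLMs perform unified cross-modal reasoning on high-information-density financial tasks?
RQ2: How close do existing models come to the performance of financial experts across perception, reasoning, and decision-making tasks?
RQ3: What are the dominant error modes when processing multimodal financial information?
RQ4: How does Chain-of-Thought prompting affect performance on finance-specific cross-modal tasks?
RQ5: What limitations does the current benchmark have in simulating real-world financial decision loops?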

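Key Results

Gemini-3-pro-preview achieved the best overall performance, with a Zero-Shot average of 73.8%.
Most models improve with CoT, but the gains are limited across tasks.
Human experts substantially outperform all models, with sizable gaps in the ITI (Industry Trend Insights) and AAA (Asset Allocation Analysis) scenarios.
Error analysis shows major problems in image recognition and cross-modal alignment, along with clear weaknesses in numerical calculation.
Models struggle with cross-modal multi-hop reasoning and with maintaining long-range logical consistency on high-information-density tasks.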
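
A per-scenario comparison of model and expert accuracy, like the one summarized above, could be computed roughly as follows. This is a minimal sketch under stated assumptions: the record format and helper names are hypothetical, and the toy records are invented for illustration only. They are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_scenario(records):
    """Map (scenario, is_correct) pairs to {scenario: accuracy in percent}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for scenario, is_correct in records:
        totals[scenario] += 1
        hits[scenario] += int(is_correct)
    return {s: 100.0 * hits[s] / totals[s] for s in totals}

def gap_to_expert(model_acc, expert_acc):
    """Expert accuracy minus model accuracy, for scenarios present in both."""
    return {s: expert_acc[s] - model_acc[s] for s in model_acc if s in expert_acc}

# Made-up toy records (NOT data from the paper), using two scenario codes.
toy_model = [("ITI", True), ("ITI", False),
             ("AAA", True), ("AAA", True), ("AAA", False), ("AAA", False)]
toy_expert = [("ITI", True), ("ITI", True),
              ("AAA", True), ("AAA", True), ("AAA", True), ("AAA", False)]

model_acc = accuracy_by_scenario(toy_model)
expert_acc = accuracy_by_scenario(toy_expert)
gaps = gap_to_expert(model_acc, expert_acc)

if __name__ == "__main__":
    for scenario, gap in gaps.items():
        print(f"{scenario}: model {model_acc[scenario]:.1f}%, "
              f"expert {expert_acc[scenario]:.1f}%, gap {gap:.1f} points")
```

Reporting the gap per scenario rather than only as an overall average makes it easier to see where models trail experts most.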

Figure 2: UniFinEval covers five major financial scenarios and constructs datasets spanning text, images, videos, as well as multiple cross-modal combinations. It features high-information-density and manually constructed data, together with dedicated designs for cross-modal consistency checking and m

This review was generated by AI and reviewed by a human editor.