QUICK REVIEW

[Paper Review] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li, Minghui Guo | arXiv (Cornell University) | 2026-03-05

Multimodal Machine Learning Applications | Citations: 0

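One-Line Summary

TL;DR: UniM motivates and operationalizes the any-to-any interleaved multimodal learning paradigm so that it reflects real-world interactions. It provides a large-scale, high-quality dataset spanning multiple modalities and domains, develops a principled evaluation suite that captures semantic correctness, structure, and interleaved coherence, and offers a strong baseline model with traceable reasoning for benchmarking future MLLMs. It also highlights the challenges of UniM tasks and future directions for unified interleaved multimodal intelligence.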

ABSTRACT

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.

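Motivation and Objectives

Motivate and operationalize the any-to-any interleaved multimodal learning paradigm so that it reflects real-world interactions.
Provide a large-scale, high-quality dataset spanning multiple modalities and domains.
Develop a principled evaluation suite that captures semantic correctness, structure, and interleaved coherence.
Provide a robust baseline model with traceable reasoning for benchmarking future MLLMs.
Highlight the challenges and directions for unified interleaved multimodal intelligence.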

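Proposed Method

Collect 31,026 high-quality interleaved multimodal instances across 30 domains and 7 modalities (text, image, audio, video, document, code, 3D).
Design an open-ended QA format with modality placeholders to simulate any-to-any interleaved inputs and outputs.
Introduce the UniM Evaluation Suite along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
Propose UniMA, an agentic baseline with a Traceable Evidence Reasoning (TER) module and a task-conditioned evidence approach for structured interleaved generation.
Apply a two-stage quality-control process, including manual review and independent inspection, to ensure data quality.
Evaluate models while aligning the automatic metrics with human judgment through Pearson correlation and ablation studies.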
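
The QA format with modality placeholders mentioned above can be pictured with a minimal sketch. The review does not state UniM's actual placeholder syntax, so the `<modality_index>` token form (for example `<image_1>`), the `Segment` type, and the `split_interleaved` helper are all hypothetical names chosen for illustration. The sketch only shows how an interleaved string might be split into an ordered sequence of text and non-text segments.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: placeholders look like <image_1>, <audio_2>, <3d_1>.
# The source does not specify UniM's real token format.
NON_TEXT_MODALITIES = ("image", "audio", "video", "document", "code", "3d")
PLACEHOLDER = re.compile(r"<(" + "|".join(NON_TEXT_MODALITIES) + r")_(\d+)>")


@dataclass(frozen=True)
class Segment:
    modality: str  # "text" or one of NON_TEXT_MODALITIES
    content: str   # the text span, or the placeholder token itself


def split_interleaved(message: str) -> list[Segment]:
    """Split a message into ordered text and placeholder segments."""
    segments: list[Segment] = []
    pos = 0
    for match in PLACEHOLDER.finditer(message):
        text = message[pos:match.start()].strip()
        if text:
            segments.append(Segment("text", text))
        segments.append(Segment(match.group(1), match.group(0)))
        pos = match.end()
    tail = message[pos:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


if __name__ == "__main__":
    # Hypothetical question mixing text, an image, and an audio clip.
    question = "Compare <image_1> with the narration in <audio_1>, then summarize."
    for seg in split_interleaved(question):
        print(seg.modality, "|", seg.content)
```

Keeping the segments in order preserves the interleaving itself, which is the property that structure-oriented and coherence-oriented checks would need to inspect.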

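Experimental Results

Research Questions

RQ1: How well can current MLLMs handle unified any-to-any interleaved multimodal tasks across diverse modalities and domains?
RQ2: What are the strengths and limitations of existing MLLMs under the unified interleaved paradigm?
RQ3: Can an agentic baseline with traceable reasoning improve performance and reliability on UniM tasks?
RQ4: How should evaluation metrics be designed to fairly assess semantic correctness, structural integrity, and interleaved coherence in interleaved multimodal generation?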

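Key Results

UniMA substantially outperforms the baseline models on multiple metrics across UniM, achieving higher scores in semantic correctness, generation quality, and interleaved coherence.
Baseline models have low absolute SQCS and ICS scores, and their structure and coherence scores drop sharply as task complexity increases.
UniMA achieves 2–6 times higher StS/LeS and roughly 15–40 times higher ICS than the baselines in several domains, indicating better modality coverage and coordination.
The evaluation metrics SQCS and ICS correlate strongly with human judgment (Pearson r ≈ 0.974 and 0.960, respectively).
UniM's data covers 30 domains and 7 modalities, emphasizes multi-task and multimodal reasoning, and is organized into three levels of increasing difficulty: Easy, Medium, and Hard.
Ablation studies show that TER is important for structural alignment and that the verification submodule is essential for the reliability of interleaved outputs.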
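
The metric-to-human agreement reported above is measured with the Pearson correlation coefficient. As a minimal sketch of how such a check could be run, the function below computes Pearson r between automatic metric scores and human ratings for the same responses. The `pearson_r` helper and the toy score lists are illustrative assumptions, not the paper's evaluation code or data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of std devs."""
    if len(xs) != len(ys):
        raise ValueError("both score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two paired scores are required")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for illustration only; not taken from the paper.
    metric_scores = [0.12, 0.35, 0.41, 0.68, 0.90]
    human_ratings = [1.0, 2.0, 2.5, 4.0, 5.0]
    print(f"Pearson r = {pearson_r(metric_scores, human_ratings):.3f}")
```

A value close to 1 means the automatic metric ranks and spaces responses much as human raters do, which is the sense in which the reported correlations support the metrics.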

This review was generated by AI and reviewed by a human editor.