QUICK REVIEW

[Paper Review] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li, Minghui Guo|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Cited by 0

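One-Sentence Summary

UniM introduces the first unified benchmark for any-to-any interleaved multimodal learning, comprising a dataset of 31K instances spanning 7 modalities and 30 domains, an evaluation suite, and a strong baseline, UniMA. Results show that current Multimodal Large Language Models (MLLMs) struggle with unified interleaved tasks, while UniMA offers a robust baseline and insights for future research.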

ABSTRACT

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.

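Research Motivation and Goals

Motivate and operationalize the any-to-any interleaved multimodal learning paradigm so that it reflects real-world interaction.
Provide a large-scale, high-quality dataset spanning multiple modalities and domains.
Develop a principled evaluation suite that captures semantic correctness, structural integrity, and interleaved coherence.
Provide a robust baseline model with traceable reasoning against which future MLLMs can be benchmarked.
Highlight the challenges and future directions for unified interleaved multimodal intelligence.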

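Proposed Method

Curate 31,026 high-quality interleaved multimodal instances across 30 domains and 7 modalities (text, image, audio, video, document, code, and 3D).
Design an open-ended question-answering format with modality placeholders to model any-to-any interleaved inputs and outputs.
Introduce the UniM Evaluation Suite, covering three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
Propose UniMA, an agentic baseline for structured interleaved generation, equipped with a Traceable Evidence Reasoning (TER) module and a task-conditioned evidence approach.
Apply a two-stage quality-control process, combining human review with independent checks, to ensure data quality.
Evaluate models with automatic metrics that agree with human judgment, validated through Pearson correlation and ablation studies.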
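
To make the placeholder-based question-answering format more concrete, here is a minimal Python sketch of splitting an interleaved string into ordered text and non-text segments. The placeholder syntax (`<image_1>`, `<audio_2>`, and so on), the `Segment` type, and the `split_interleaved` function are all hypothetical names chosen for illustration; this summary does not specify UniM's actual placeholder format.

```python
import re
from dataclasses import dataclass
from typing import List

# ASSUMPTION: placeholders look like <image_1>, <audio_2>, ... This syntax is
# illustrative only and is not taken from the UniM paper.
NON_TEXT_MODALITIES = ("image", "audio", "video", "document", "code", "3d")
PLACEHOLDER = re.compile(r"<(" + "|".join(NON_TEXT_MODALITIES) + r")_(\d+)>")


@dataclass
class Segment:
    modality: str  # "text" or one of NON_TEXT_MODALITIES
    content: str   # raw text, or the placeholder token for a non-text item


def split_interleaved(sequence: str) -> List[Segment]:
    """Split an interleaved string into ordered text and placeholder segments."""
    segments: List[Segment] = []
    cursor = 0
    for match in PLACEHOLDER.finditer(sequence):
        # Text between the previous placeholder and this one, if any.
        text = sequence[cursor:match.start()].strip()
        if text:
            segments.append(Segment("text", text))
        segments.append(Segment(match.group(1), match.group(0)))
        cursor = match.end()
    # Trailing text after the last placeholder.
    tail = sequence[cursor:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


if __name__ == "__main__":
    for seg in split_interleaved("Compare <image_1> with <audio_1> and summarize."):
        print(seg.modality, "|", seg.content)
```

Keeping the segments in order is what lets the same structure describe either an interleaved input or an interleaved response.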

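Experimental Results

Research Questions

RQ1: How well do current MLLMs perform on unified any-to-any interleaved multimodal tasks across modalities and domains?
RQ2: What are the strengths and limitations of existing MLLMs under the unified interleaved paradigm?
RQ3: Can an agentic baseline with traceable reasoning improve performance and reliability on UniM tasks?
RQ4: How should evaluation metrics be designed to fairly assess semantic correctness, structural integrity, and interleaved coherence in interleaved multimodal generation?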

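Key Findings

UniMA significantly outperforms baseline models on multiple metrics, achieving higher scores for semantic correctness, generation quality, and interleaved coherence.
Baseline models score low in absolute terms on SQCS and ICS, and their structure and coherence degrade markedly as task complexity increases.
UniMA improves StS/LeS by 2 to 6 times and ICS by roughly 15 to 40 times across multiple domains, indicating stronger modality coverage and coordination.
The SQCS and ICS metrics correlate strongly with human judgment (Pearson r of about 0.974 and 0.960, respectively).
UniM covers 30 domains and 7 modalities, emphasizes multi-task and multimodal reasoning, and is organized into graded difficulty levels (Easy, Medium, Hard).
Ablation studies show that TER is critical for structural adherence and that the verification submodule is also critical for reliable interleaved output.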
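
The metric validation mentioned above relies on the Pearson correlation between automatic scores and human ratings. The sketch below shows how such a check can be computed from paired scores. The function name `pearson_r` and the sample numbers are illustrative assumptions, not data or code from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # ASSUMPTION: made-up scores for illustration only, not results from UniM.
    metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
    human_ratings = [1.0, 2.0, 2.5, 4.0, 4.5]
    print(f"Pearson r = {pearson_r(metric_scores, human_ratings):.3f}")
```

A value of r close to 1 indicates that the automatic metric ranks and scales responses much as human raters do, which is the property reported for SQCS and ICS.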


This review was generated by AI and reviewed by human editors.