QUICK REVIEW

[Paper Review] ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang, Shijie Wang|arXiv (Cornell University)|May 18, 2023

Multimodal Machine Learning Applications | Cited by 42

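ONE-SENTENCE SUMMARY

ONE-PEACE is an extensible 4B-parameter model that combines modality adapters with a shared fusion encoder and uses generic pretraining tasks to align vision, audio, and language representations, achieving leading results on a wide range of uni-modal and multi-modal tasks without initialization from any pretrained vision or language model.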

ABSTRACT

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.

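RESEARCH MOTIVATION AND GOALS

Explore a scalable general representation model that can extend toward unlimited modalities.
Propose a flexible architecture built from modality adapters and a shared fusion encoder.
Introduce generic pretraining tasks that align modalities and capture fine-grained details within each modality.
Demonstrate strong performance on uni-modal and multi-modal vision, audio, and language tasks without external initialization.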

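PROPOSED METHOD

Use modality adapters (V-Adapter, A-Adapter, L-Adapter) to convert raw inputs into modality-specific feature sequences.
Use a modality fusion encoder with shared self-attention layers and modality-specific feed-forward networks (V-FFN, A-FFN, L-FFN).
Apply Sub-LayerNorm, GeGLU activation, relative position bias, and LayerScale to improve training stability and performance.
Pretrain with two generic tasks: cross-modal aligning contrast (vision-language and audio-language) and intra-modal denoising contrast over five data types (image, audio, text, image-text, audio-text).
Disassemble the model into modality-specific branches (V-Branch, A-Branch, L-Branch, and multi-modal branches) to handle different tasks, which also makes it easy to extend to new modalities.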
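The cross-modal aligning contrast task can be pictured as a CLIP-style symmetric InfoNCE loss over a batch of paired features. The sketch below is a minimal illustration under that assumption, not the authors' implementation: the function names, the `temperature` default, and the toy 2-D features are all hypothetical, and the paper's exact formulation may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cross_modal_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """Symmetric InfoNCE: feats_a[i] and feats_b[i] form the positive pair,
    every other pairing in the batch is a negative. `temperature` is an
    illustrative default, not a value taken from the paper."""
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    n = len(a)
    # sim[i][j] = cosine similarity between a[i] and b[j], scaled by temperature
    sim = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]  # b-to-a direction
    a_to_b = sum(-log_softmax(sim[i])[i] for i in range(n)) / n
    b_to_a = sum(-log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (a_to_b + b_to_a)

# Toy batch: correctly paired features give a much lower loss than swapped ones.
aligned = cross_modal_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = cross_modal_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < swapped)
```

In this picture, `feats_a` would hold image or audio features and `feats_b` the paired text features, matching the vision-language and audio-language pairings listed above.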

Figure 1: The architecture of ONE-PEACE. It consists of three modality adapters and a modality fusion encoder. ONE-PEACE can be disassembled into different branches to handle different tasks. For example, the vision adapter, self-attention layers, and vision FFNs can be combined into V-Branch to ha
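The branch idea in the caption can be sketched structurally: each branch reuses the same self-attention layers and swaps in its own adapter and FFNs. The code below is a toy illustration under that reading. `BranchSketch`, `named_layer`, and the modality key `"X"` are hypothetical names, and each "layer" only records its name rather than computing anything. Multi-modal branches, which fuse several modalities at once, are not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Layer = Callable[[List[str]], List[str]]

def named_layer(name: str) -> Layer:
    """Stand-in for a real layer: appends its name to a trace."""
    return lambda trace: trace + [name]

@dataclass
class BranchSketch:
    num_layers: int
    adapters: Dict[str, Layer] = field(default_factory=dict)
    ffns: Dict[str, List[Layer]] = field(default_factory=dict)
    shared_attention: List[Layer] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        # One self-attention layer per block, shared by every modality.
        self.shared_attention = [
            named_layer(f"SelfAttn[{i}]") for i in range(self.num_layers)
        ]

    def add_modality(self, key: str) -> None:
        """Extending to a new modality adds an adapter and FFNs only."""
        self.adapters[key] = named_layer(f"{key}-Adapter")
        self.ffns[key] = [
            named_layer(f"{key}-FFN[{i}]") for i in range(self.num_layers)
        ]

    def branch(self, key: str) -> Layer:
        """Assemble adapter + shared attention + modality FFNs into a branch."""
        def run(trace: List[str]) -> List[str]:
            trace = self.adapters[key](trace)
            for attn, ffn in zip(self.shared_attention, self.ffns[key]):
                trace = ffn(attn(trace))
            return trace
        return run

model = BranchSketch(num_layers=2)
for key in ("V", "A", "L"):
    model.add_modality(key)

v_trace = model.branch("V")([])
print(v_trace)

# Adding a hypothetical fourth modality leaves the shared layers untouched.
attn_before = list(model.shared_attention)
model.add_modality("X")
```

The point of the design, as the abstract describes it, is that new modalities only add adapters and FFNs while the self-attention layers stay shared.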

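EXPERIMENTAL RESULTS

Research Questions

RQ1: Can ONE-PEACE extend toward unlimited modalities while maintaining effective cross-modal alignment?
RQ2: Are the generic pretraining tasks (cross-modal aligning contrast and intra-modal denoising contrast) sufficient for strong performance on uni-modal and multi-modal tasks without modality-specific designs?
RQ3: How does the architecture compare with recent methods across a broad range of vision, audio, vision-language, and audio-language tasks?
RQ4: What benefits does a large-scale, modular, Transformer-based fusion approach bring to multi-modal learning?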

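Key Findings

Without initialization from a pretrained vision or language model, ONE-PEACE achieves leading results on ImageNet image classification (Top-1 89.8%).
In semantic segmentation, ONE-PEACE reaches 63.0 mIoU on ADE20K, setting a new state of the art under the evaluated protocol.
In audio-text retrieval, ONE-PEACE substantially outperforms prior state-of-the-art methods on AudioCaps and Clotho.
In audio classification, without using visual information, ONE-PEACE reaches 91.8% zero-shot accuracy on ESC-50 and 69.7% on FSD50K.
In image-text retrieval, ONE-PEACE reaches an R@1 of 84.1 on COCO and 97.6 on Flickr30K in the zero-shot/fine-tuning comparison, and reports 89.26/83.23/89.27 on the RefCOCO/+/g visual grounding benchmarks.
Across tasks, ONE-PEACE shows strong cross-modal and intra-modal learning ability without initialization from external models.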

Figure 2: The pretraining tasks of ONE-PEACE. Intra-modal denoising contrastive learning encourages the features of the masked units (e.g., image patches or text tokens) close to the positive units (indicated by the green lines) and get away from the negative units (indicated by the red lines). Note
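The pull-and-push behaviour described in the caption can be sketched as a per-position contrastive loss: the feature predicted at a masked position is scored against the reference features of all positions, with its own position as the positive. This is a minimal sketch under that assumption. The function names, the `temperature` default, and the toy features are hypothetical, and how the reference features are produced is not modelled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u)) or 1.0
    nv = math.sqrt(sum(y * y for y in v)) or 1.0
    return dot / (nu * nv)

def denoising_contrastive_loss(predicted, targets, masked_positions,
                               temperature=0.1):
    """predicted[i]: feature produced at position i from the masked input.
    targets[i]: reference feature for position i (assumed given).
    Only masked positions contribute; `temperature` is an illustrative
    default, not a value taken from the paper."""
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for i in masked_positions:
        logits = [cosine(predicted[i], t) / temperature for t in targets]
        m = max(logits)
        lse = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the positive unit (same position).
        total += lse - logits[i]
    return total / len(masked_positions)

targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
good_pred = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # matches its own target
bad_pred = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]   # matches a negative instead

good = denoising_contrastive_loss(good_pred, targets, masked_positions=[0, 1])
bad = denoising_contrastive_loss(bad_pred, targets, masked_positions=[0, 1])
print(good < bad)
```

A prediction that lands on its own target gives a loss near zero, while one that lands on a negative unit is penalized heavily, which is the behaviour the green and red lines in the figure indicate.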

This review was generated by AI and checked by a human editor.