QUICK REVIEW

[Paper Review] Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

Jinguo Zhu, Xizhou Zhu | arXiv (Cornell University) | June 9, 2022

Multimodal Machine Learning Applications | Citations: 28

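One-Line Summary

Introduces Conditional Mixture-of-Experts (Conditional MoEs) to mitigate task interference in generalist models and integrates them into Uni-Perceiver; with prompt tuning on 1% of downstream data, the model performs close to the state of the art while retaining zero-shot generalization.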

ABSTRACT

To build an artificial neural network like the biological intelligence system, recent works have unified numerous tasks into a generalist model, which can process various tasks with shared parameters and do not have any task-specific modules. While generalist models achieve promising results on various benchmarks, they have performance degradation on some tasks compared with task-specialized models. In this work, we find that interference among different tasks and modalities is the main factor to this phenomenon. To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models. Routing strategies under different levels of conditions are proposed to take both the training/inference cost and generalization ability into account. By incorporating the proposed Conditional MoEs, the recently proposed generalist model Uni-Perceiver can effectively mitigate the interference across tasks and modalities, and achieves state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data. Moreover, the introduction of Conditional MoEs still holds the generalization ability of generalist models to conduct zero-shot inference on new tasks, e.g., video-text retrieval and video caption. Code and pre-trained generalist models shall be released.

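Motivation and Objectives

Explains the task-interference problem in generalist multi-task models and how it affects performance.
Proposes Conditional MoEs with several routing strategies to mitigate interference while preserving generalization.
Shows that Uni-Perceiver equipped with Conditional MoEs achieves strong performance with limited downstream data and supports zero-shot generalization to new tasks.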

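Proposed Method

Analyzes task interference with a gradient-direction metric to quantify cross-task effects.
Defines Conditional MoEs through token-level, context-level, modality-level, task-level, and attribute-based conditional routing strategies.
Replaces the linear projections in Uni-Perceiver's self-attention and FFN blocks with Conditional-MoE layers.
Introduces an 8-dimensional token attribute embedding that enables routing decisions that generalize across data and tasks.
Compares data-dependent and data-independent routing variants in terms of training/inference cost and generalization.
Performs large-scale pre-training, then downstream evaluation with 1% of the data, including prompt tuning.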
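
The gradient-direction analysis in the first item can be illustrated with a small sketch. The exact interference metric used in the paper is not given in this review, so the code below uses cosine similarity between per-task gradients on shared weights as a stand-in; the toy tasks, the squared loss, and all function names are assumptions made for illustration only.

```python
import math

def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to the shared weights w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Shared parameters used by both toy tasks (hypothetical values).
w = [0.0, 0.0]

# Two toy tasks that pull the shared weights in different directions.
g_a = grad_squared_loss(w, x=[1.0, 0.0], y=1.0)
g_b = grad_squared_loss(w, x=[1.0, 1.0], y=-1.0)

# A negative value means a step that helps one task hurts the other,
# which is the kind of conflict this proxy is meant to expose.
conflict = cosine(g_a, g_b)
print(f"gradient cosine similarity: {conflict:.4f}")
```

Under this proxy, values near 1 suggest the tasks can share parameters comfortably, while negative values suggest that separating them into different experts may help.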

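Experimental Results

Research Questions

RQ1: How does cross-task interference from sharing parameters across tasks and modalities affect the performance of generalist models?
RQ2: Can Conditional MoEs reduce interference while preserving or improving generalization to unseen tasks?
RQ3: Which routing strategy (token, context, modality, task, attribute) offers the best trade-off between efficiency and accuracy?
RQ4: How do prompt tuning and data efficiency compare with fully supervised fine-tuning in generalist models with Conditional MoEs?
RQ5: Do models with Conditional MoEs retain zero-shot ability on new tasks such as video-text retrieval and video captioning?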

Key Results

Model	Training time	Inference time	ImageNet-1k (train accuracy)	COCO Caption (B@4, val)	MLM (train accuracy)	MLM (val perplexity)
Uni-Perceiver-Ti	1.0×	1.0×	47.3	68.3	49.2	5.86
Uni-Perceiver-Ti + Conditional MoEs (token)	1.8×	2.2×	53.1	72.7	52.9	4.96
Uni-Perceiver-Ti + Conditional MoEs (context)	2.2×	2.6×	52.5	73.1	52.8	4.86
Uni-Perceiver-Ti + Conditional MoEs (modality)	1.4×	1.0×	51.7	72.6	52.1	5.06
Uni-Perceiver-Ti + Conditional MoEs (task)	1.4×	1.0×	52.9	73.2	52.7	4.56
Uni-Perceiver-Ti + Conditional MoEs (attribute)	1.4×	1.0×	52.8	73.3	53.1	4.56
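
The table is easier to compare when read as gains over the fully shared baseline. The sketch below recomputes the ImageNet-1k accuracy differences and lists the variants that add no inference cost; the numbers are copied from the table above, and the variable names are illustrative.

```python
# Baseline ImageNet-1k train accuracy of Uni-Perceiver-Ti, from the table.
baseline_acc = 47.3

# (routing, training time x, inference time x, ImageNet-1k train accuracy)
rows = [
    ("token", 1.8, 2.2, 53.1),
    ("context", 2.2, 2.6, 52.5),
    ("modality", 1.4, 1.0, 51.7),
    ("task", 1.4, 1.0, 52.9),
    ("attribute", 1.4, 1.0, 52.8),
]

# Accuracy gain of each routing variant over the baseline.
gains = {name: round(acc - baseline_acc, 1) for name, _, _, acc in rows}

# Variants whose inference time matches the baseline (1.0x).
no_extra_inference = [name for name, _, infer, _ in rows if infer == 1.0]

for name, train, infer, _ in rows:
    print(f"{name:9s} +{gains[name]:.1f} acc, train {train}x, infer {infer}x")
```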

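Conditional MoEs mitigate task interference and improve on the fully shared Uni-Perceiver baseline.
Among the routing variants, attribute MoEs (which use an 8-bit token attribute embedding) deliver strong performance with better efficiency and generalization.
The data-independent MoE variants (modality, task, attribute) are highly efficient and can be merged into a single projection through re-parameterization; the data-dependent variants incur higher training and inference costs.
With 1% of downstream data for prompt tuning, Uni-Perceiver-MoEs achieve results competitive with SOTA methods that require more data and compute.
Uni-Perceiver-MoEs retain zero-shot generalization on new tasks such as video captioning and video-text retrieval, and also improve performance on the GLUE benchmark.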
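
The point about merging data-independent variants into a single projection follows from linearity: when the gate depends only on token attributes and not on token content, the gate-weighted sum of linear experts can be folded into one weight matrix ahead of time. The sketch below checks this numerically. It is a minimal illustration, not the paper's implementation: the dense softmax gate, the router weights, the attribute vector, and every function name are assumptions, and the paper's actual router (for example, any sparse expert selection) is not reproduced here.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def gate_from_attributes(attr, router):
    """Gate weights computed from the attribute vector only (data-independent)."""
    return softmax(matvec(router, attr))

def moe_forward(x, experts, gates):
    """Run every linear expert and combine the outputs with the gate weights."""
    outs = [matvec(W, x) for W in experts]
    return [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(len(outs[0]))]

def merge_experts(experts, gates):
    """Re-parameterization: fold the gate-weighted experts into one matrix."""
    n_rows, n_cols = len(experts[0]), len(experts[0][0])
    return [[sum(g * W[r][c] for g, W in zip(gates, experts))
             for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical 8-dimensional binary attribute vector for one token.
attr = [1, 0, 0, 1, 0, 0, 1, 0]
# Hypothetical router: one row of weights per expert.
router = [[0.5, -0.2, 0.1, 0.3, 0.0, 0.0, -0.4, 0.2],
          [-0.1, 0.4, 0.0, -0.3, 0.2, 0.1, 0.5, 0.0]]
# Two toy linear experts (2x2 projections) and a toy token feature.
experts = [[[1.0, 2.0], [0.0, 1.0]],
           [[-1.0, 0.5], [2.0, 0.0]]]
x = [0.3, -0.7]

gates = gate_from_attributes(attr, router)
y_moe = moe_forward(x, experts, gates)      # runs both experts
W_merged = merge_experts(experts, gates)    # precomputed once per attribute
y_merged = matvec(W_merged, x)              # single projection at inference
```

Because the merged matrix can be cached per attribute combination, inference needs one projection per token, which is consistent with the 1.0× inference time reported for the modality, task, and attribute variants. Token- and context-level gates change with the input, so this precomputation does not apply to them.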

This review was generated by AI and reviewed by a human editor.