QUICK REVIEW

[Paper Review] ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Junting Pan, Ziyi Lin|arXiv (Cornell University)|2022. 06. 27.

Domain Adaptation and Few-Shot Learning | Citations: 77

One-Line Summary

ST-Adapter introduces a lightweight spatio-temporal adapter that adapts a pre-trained image ViT to video action recognition, matching or surpassing full fine-tuning with roughly 8% task-specific parameters.

ABSTRACT

Capitalizing on large pre-trained models for various downstream tasks of interest have recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) of the pre-trained model. This creates a limit because in some specific modalities, (e.g., video understanding) such a strong pre-trained model with sufficient knowledge is less or not available. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency. The code and model are available at https://github.com/linziyi96/st-adapter

Motivation and Objectives

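Investigate parameter-efficient transfer learning from pre-trained image models to video understanding tasks.
Benchmark a range of fine-tuning strategies for image-to-video transfer with a ViT backbone.
Propose a spatio-temporal adapter (ST-Adapter) that enables temporal reasoning with minimal parameter overhead.
Demonstrate on action recognition datasets that ST-Adapter is competitive with full fine-tuning and state-of-the-art video models.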

Proposed Method

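Extends the NLP adapter design with a spatio-temporal bottleneck to obtain ST-Adapter.
Uses a down-projection, a depthwise 3D convolution for spatio-temporal reasoning, and an up-projection inside a residual block: ST-Adapter(X) = X + f(DWConv3D(X W_down)) W_up.
Reshapes the intermediate features X' from [T, N, d] to [T, h, w, d] so that DWConv3D can operate over the spatio-temporal grid.
Integrates the module by placing a single ST-Adapter before the multi-head self-attention (MHSA) of each transformer block.
Relies on standard operators only, which keeps implementation and deployment simple.
Keeps a small parameter footprint (~2% additional parameters) and low computational overhead.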
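
The formula ST-Adapter(X) = X + f(DWConv3D(X W_down)) W_up can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: ReLU stands in for the unspecified activation f, the class token is left out so that N = h * w, the convolution uses zero "same" padding, and all sizes and names (st_adapter, dwconv3d, the 3x1x1 kernel, the bottleneck width c) are hypothetical. A real model would most likely use a deep learning framework's grouped 3D convolution instead of explicit loops.

```python
import numpy as np

def dwconv3d(x, kernel):
    """Depthwise 3D convolution (cross-correlation) with zero 'same' padding.

    x:      [T, h, w, c] feature grid
    kernel: [kt, kh, kw, c], one filter per channel, odd sizes assumed
    """
    kt, kh, kw, _ = kernel.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw), (0, 0)))
    T, h, w, _ = x.shape
    out = np.zeros(x.shape, dtype=float)
    # Accumulate one shifted copy of the input per kernel tap.
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                out += xp[dt:dt + T, dh:dh + h, dw:dw + w, :] * kernel[dt, dh, dw, :]
    return out

def st_adapter(x, w_down, w_up, kernel, h, w):
    """x: [T, N, d] patch tokens with N == h * w (class token omitted, an assumption)."""
    T, N, d = x.shape
    assert N == h * w, "tokens must form an h x w grid"
    z = x @ w_down                      # down-projection: [T, N, c]
    z = z.reshape(T, h, w, -1)          # tokens -> spatio-temporal grid
    z = dwconv3d(z, kernel)             # mixes information across frames
    z = np.maximum(z, 0.0)              # f: ReLU here, an assumed choice
    z = z.reshape(T, N, -1)
    return x + z @ w_up                 # up-projection plus residual

# Hypothetical toy sizes: 4 frames, 3x3 patches, d=16, bottleneck c=8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 9, 16))
w_down = rng.standard_normal((16, 8)) * 0.1
w_up = rng.standard_normal((8, 16)) * 0.1
kernel = rng.standard_normal((3, 1, 1, 8))  # temporal-only kernel, assumed
out = st_adapter(x, w_down, w_up, kernel, h=3, w=3)
print(out.shape)  # (4, 9, 16)
```

Because of the residual connection, setting W_up to zero makes the adapter an identity map, which is one common way to start training from the unmodified image model. The source does not state which initialization the paper uses.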

Experimental Results

Research Questions

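RQ1: Can a pre-trained image model be adapted efficiently to video tasks without full fine-tuning?
RQ2: How does the spatio-temporal adapter compare with full fine-tuning and other parameter-efficient methods on video action recognition?
RQ3: Does ST-Adapter enable effective temporal reasoning when transferring from the image domain to the video domain?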

Key Results

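With a CLIP-pretrained ViT-B/16, ST-Adapter reaches 82.0% top-1 accuracy on Kinetics-400 and 66.3% on Something-Something-v2, matching or surpassing full fine-tuning with only 7.2M updated parameters (121.57M in total).
ST-Adapter outperforms other efficient fine-tuning methods such as Prompt Tuning and Partial Fine-tuning on both CLIP and ImageNet-21K pre-trained backbones.
Across datasets, ST-Adapter delivers strong accuracy with far fewer updated parameters and lower training cost, and it outperforms many state-of-the-art video models that use the same backbone initialization.
Ablation studies show robustness to the bottleneck width, the effectiveness of placing the adapter before MHSA, and a benefit from adapters in deeper ViT blocks; they also suggest that the temporal span of the depthwise kernel is critical to performance.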
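
To see how an adapter of this shape relates to the reported count of updated parameters, the weights can be tallied directly. The sketch below is a back-of-the-envelope count, not a figure from the paper: ViT-B/16 has 12 blocks with hidden size 768, but the bottleneck width of 384, the 3x1x1 depthwise kernel, and the inclusion of bias terms are assumptions made for illustration, and adapter_params is a hypothetical helper.

```python
def adapter_params(d, c, kernel=(3, 1, 1), bias=True):
    """Parameter count of one adapter: down-projection, depthwise conv, up-projection."""
    kt, kh, kw = kernel
    down = d * c + (c if bias else 0)
    conv = kt * kh * kw * c + (c if bias else 0)  # depthwise: one filter per channel
    up = c * d + (d if bias else 0)
    return down + conv + up

# Assumed configuration: one adapter in each of the 12 ViT-B/16 blocks.
per_adapter = adapter_params(d=768, c=384)
total = 12 * per_adapter
print(per_adapter)  # 592512
print(total)        # 7110144
```

Under these assumptions the total comes to roughly 7.1M, the same order as the 7.2M updated parameters quoted above. Any remaining gap would come from components this sketch leaves out, such as a task-specific classification head. The depthwise convolution contributes only a small share of the count, so widening its temporal span is cheap in parameters.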


This review was generated by AI and reviewed by a human editor.