QUICK REVIEW

[Paper Review] Kolmogorov-Arnold Transformer

Xingyi Yang, Xinchao Wang | arXiv (Cornell University) | September 16, 2024

Fusion and Plasma Physics Studies | Citations: 14

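One-line summary

This paper replaces the MLP layers of a vision transformer with Group-Rational Kolmogorov-Arnold Network (GR-KAN) layers to improve expressiveness and efficiency; the resulting model trains at ImageNet scale and outperforms the ViT and DeiT baselines.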

ABSTRACT

Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.

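Motivation and objectives

Identify the scalability problems of integrating KANs into transformers: the base function, parameterization, and weight initialization.
Propose rational activations, Group KAN, and variance-preserving initialization as solutions.
Develop and validate KAT by replacing the MLP with GR-KAN in ViT-like architectures.
Demonstrate performance gains across image classification, object detection, and semantic segmentation.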
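
To make the parameter problem (C2) and the grouping idea (S2) concrete, here is a minimal counting sketch. It compares a layer in which every input-output edge owns its own learnable function with a layer in which each edge keeps one scalar weight and the functions are shared per group. The function names, the layer sizes, the group count, and the number of coefficients per function are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def per_edge_params(d_in, d_out, coeffs_per_fn):
    """Every input-output edge owns its own learnable 1-D function."""
    return d_in * d_out * coeffs_per_fn


def grouped_params(d_in, d_out, groups, coeffs_per_fn):
    """One scalar weight per edge, plus one shared function per group."""
    return d_in * d_out + groups * coeffs_per_fn


# Hypothetical sizes: 192 -> 768 channels, 10 coefficients per function, 8 groups.
d_in, d_out = 192, 768
print(per_edge_params(d_in, d_out, 10))    # 1474560
print(grouped_params(d_in, d_out, 8, 10))  # 147536
```

Under this simplified model the per-edge layer grows with the coefficient count multiplied by the number of edges, while the grouped layer adds only a handful of coefficients on top of an ordinary linear layer.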

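Proposed method

Adopts rational functions as the base activation of KAN and implements them, including their gradients, in CUDA for efficiency.
Reduces parameters and computation with GR-KAN, in which the edges within a group share one base function.
Applies Horner's method to polynomial evaluation to speed up computation.
Uses variance-preserving initialization to stabilize training across GR-KAN layers.
Supports weight transfer from pre-trained ViTs, so that KAT can load ViT weights and be fine-tuned.
Evaluates KAT on ImageNet-1K, COCO (Mask R-CNN with ViTDet), and ADE20K (UperNet), showing scalability and performance gains.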
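
The first four points above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' CUDA implementation: the safe denominator form F(x) = P(x) / (1 + |Q(x)|) is assumed (this review does not spell out the exact rational form), the coefficient values are hypothetical, and the names `group_rational`, `variance_preserving_std`, and `gr_kan_layer` are invented for this example. The initialization assumes zero-mean, independent weights and unit-variance Gaussian inputs, in which case Var[y] = d_in * Var[W] * E[F(x)^2].

```python
import numpy as np


def group_rational(x, a, b):
    """Apply F(x) = P(x) / (1 + |Q(x)|) with one coefficient set per channel group.

    x: (..., d) activations; a: (g, m+1) numerator coefficients a0..am;
    b: (g, n) denominator coefficients b1..bn (Q has no constant term).
    """
    g, d = a.shape[0], x.shape[-1]
    assert d % g == 0, "channel count must be divisible by the number of groups"
    xg = x.reshape(*x.shape[:-1], g, d // g)

    # Horner's rule for P: (((am*x + a_{m-1})*x + ...)*x + a0).
    p = np.broadcast_to(a[:, -1, None], xg.shape)
    for k in range(a.shape[1] - 2, -1, -1):
        p = p * xg + a[:, k, None]

    # Horner's rule for Q(x)/x, followed by a single multiplication by x.
    q = np.broadcast_to(b[:, -1, None], xg.shape)
    for k in range(b.shape[1] - 2, -1, -1):
        q = q * xg + b[:, k, None]
    q = q * xg

    return (p / (1.0 + np.abs(q))).reshape(x.shape)


def variance_preserving_std(a, b, d_in, n_samples=100_000, seed=0):
    """Weight std chosen so that Var[y] is roughly 1 for x ~ N(0, 1).

    E[F(x)^2] is estimated by Monte Carlo with one channel per group.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, a.shape[0]))
    second_moment = np.mean(group_rational(x, a, b) ** 2)
    return 1.0 / np.sqrt(d_in * second_moment)


def gr_kan_layer(x, a, b, w, bias):
    """Group-rational activation followed by an ordinary linear map."""
    return group_rational(x, a, b) @ w + bias


rng = np.random.default_rng(0)
d_in, d_out, g = 256, 256, 8
a = np.tile([0.0, 1.0, 0.5], (g, 1))  # hypothetical numerator coefficients
b = np.tile([0.0, 0.5], (g, 1))       # hypothetical denominator coefficients
w = rng.standard_normal((d_in, d_out)) * variance_preserving_std(a, b, d_in)
x = rng.standard_normal((1000, d_in))
y = gr_kan_layer(x, a, b, w, np.zeros(d_out))
print(x.var(), y.var())  # the two variances are expected to be close
```

Because the coefficients are stored per group rather than per edge, the activation adds only g * (m + 1 + n) parameters to the linear layer, and Horner's rule evaluates each polynomial with one multiply-add per coefficient.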

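Experimental results

Research questions

RQ1: Can GR-KAN replace the MLP in ViT/DeiT-based transformers without hurting convergence or accuracy at ImageNet scale?
RQ2: Do rational activations with group-wise parameter sharing improve computational efficiency and accuracy over B-spline KANs?
RQ3: Under a similar compute budget, how does KAT compare with the ViT/DeiT baselines on standard vision tasks (classification, detection, segmentation)?
RQ4: How does transferring pre-trained weights from ViT to KAT affect final accuracy?
RQ5: What do the ablations show about the effect of activation choice and initialization on KAT performance?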

Key results

Model	Channel Mixer	#Param.	FLOPs	IN-1k Top-1
ViT-Ti/16	MLP	5.7M	1.08G	72.7
DeiT-T	MLP	5.7M	1.08G	72.2
ViT-T + KAN	KAN	12.8M	1.78G	64.9
KAT-T	KAN	5.7M	1.13G	74.6
KAT-T ∗	KAN	5.7M	1.13G	75.7
ViT-S/16	MLP	22.1M	4.25G	78.8
DeiT-S	MLP	22.1M	4.25G	79.8
ViT-S + KAN	KAN	50.4M	7.05G	62.9
KAT-S	KAN	22.1M	4.35G	81.2
KAT-S ∗	KAN	22.1M	4.35G	82.0
ViT-B/16	MLP	86.6M	16.87G	79.1
DeiT-B	MLP	86.6M	16.87G	81.8
ViT-B + KAN	KAN	199.8M	28.04G	NAN
KAT-B	KAN	86.6M	17.06G	82.3
KAT-B ∗	KAN	86.6M	17.06G	82.8

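The KAT variants consistently outperform MLP-based transformers on ImageNet-1K at similar FLOPs and parameter budgets.
KAT-T reaches 74.6% top-1 at the ViT-Ti/16 scale, and 75.7% when initialized from pre-trained ViT weights (the ∗ rows in the table), ahead of the ViT and DeiT baselines.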
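KAT-S reaches 81.2% top-1 without pre-trained initialization and 82.0% with it; per the table, 81.2% is 1.4 points above DeiT-S (79.8%) and 2.4 points above ViT-S/16 (78.8%).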
KAT-B reaches 82.3% top-1, or 82.8% when initialized from ViT, surpassing both the ViT-B/16 and DeiT-B baselines.
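The ViT + KAN rows support the need for the GR-KAN design: without the proposed fixes S1-S3, plugging a standard KAN into ViT roughly doubles parameters and FLOPs, lowers accuracy (64.9% and 62.9% at the T and S scales), and fails to converge at the B scale (NAN).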
Across detection and segmentation, KAT backbones show consistent gains over ViTDet and DeiT-like backbones, with larger relative improvements for smaller models.
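
The accuracy gaps can be recomputed directly from the table. The short sketch below copies the top-1 values of the models trained without ViT initialization and prints each KAT model's margin over its same-size baselines; the dictionary layout and the helper name `gap` are assumptions made for this example.

```python
# ImageNet-1K top-1 accuracy (%) copied from the table above.
top1 = {
    "ViT-Ti/16": 72.7, "DeiT-T": 72.2, "KAT-T": 74.6,
    "ViT-S/16": 78.8, "DeiT-S": 79.8, "KAT-S": 81.2,
    "ViT-B/16": 79.1, "DeiT-B": 81.8, "KAT-B": 82.3,
}


def gap(model, baseline):
    """Top-1 difference in percentage points, rounded to one decimal."""
    return round(top1[model] - top1[baseline], 1)


for kat, baselines in [
    ("KAT-T", ["ViT-Ti/16", "DeiT-T"]),
    ("KAT-S", ["ViT-S/16", "DeiT-S"]),
    ("KAT-B", ["ViT-B/16", "DeiT-B"]),
]:
    for base in baselines:
        print(f"{kat} vs {base}: +{gap(kat, base)} points")
```

The margin over the DeiT baselines shrinks from 2.4 points at the T scale to 0.5 points at the B scale.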

This review was generated by AI and reviewed by a human editor.