QUICK REVIEW

[Paper Review] Kolmogorov-Arnold Transformer

Xingyi Yang, Xinchao Wang|arXiv (Cornell University)|Sep 16, 2024

Fusion and Plasma Physics Studies | Cited by 14

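One-sentence summary

This paper replaces the MLP layers of vision transformers with Group-Rational Kolmogorov–Arnold Networks (GR-KAN), which improves expressiveness and efficiency, makes ImageNet-scale training feasible, and outperforms ViT/DeiT baselines.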

ABSTRACT

Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
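To make challenge (C2) and solution (S2) from the abstract concrete, the following is a rough parameter count, not the paper's own accounting. It assumes that an edge-wise KAN stores a fixed number of coefficients per input-output pair, and that a grouped layer keeps one dense weight matrix plus one shared function per group of input channels. The function names, the layer widths, and the coefficient count are all hypothetical.

```python
def mlp_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return d_in * d_out + d_out


def edgewise_kan_params(d_in: int, d_out: int, coeffs_per_fn: int) -> int:
    """One learnable 1-D function per input-output pair (edge).

    coeffs_per_fn is a hypothetical number of coefficients per function.
    """
    return d_in * d_out * coeffs_per_fn


def grouped_kan_params(d_in: int, d_out: int, groups: int, coeffs_per_fn: int) -> int:
    """A dense layer plus one shared function per group of input channels.

    This sharing scheme is an assumption made for illustration.
    """
    return mlp_params(d_in, d_out) + groups * coeffs_per_fn


# Illustrative widths and coefficient count, not taken from the paper.
d_in, d_out, coeffs, groups = 192, 768, 10, 8
print(mlp_params(d_in, d_out))                           # dense layer
print(edgewise_kan_params(d_in, d_out, coeffs))          # one function per edge
print(grouped_kan_params(d_in, d_out, groups, coeffs))   # shared per group
```

Under these assumptions the edge-wise count grows with the product of both widths and the coefficient count, while the grouped count stays within a few dozen parameters of the plain dense layer.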

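Motivation and objectives

Identify the scalability challenges of integrating KANs into transformers: the base function, parameter and computation inefficiency, and weight initialization.
Propose three solutions: a rational basis, Group KAN, and variance-preserving initialization.
Develop and validate KAT by replacing the MLP with GR-KAN in ViT-style architectures.
Demonstrate performance gains on image classification, object detection, and semantic segmentation.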

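Proposed method

Adopt rational functions as the base activation of the KAN and implement them, including their gradients, in CUDA for efficiency.
Introduce GR-KAN, in which groups of edges share a base function, reducing parameters and computation.
Apply Horner's method to polynomial evaluation to speed up CUDA execution.
Use variance-preserving initialization to stabilize training across GR-KAN layers.
Enable weight transfer from pretrained ViTs, so that KAT can load ViT weights and be fine-tuned.
Evaluate KAT on ImageNet-1K, COCO (Mask R-CNN with ViTDet), and ADE20K (UperNet) to show scalability and performance gains.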
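The first four method points can be sketched in plain Python. This is a minimal illustration, not the authors' CUDA kernel. It assumes a "safe" rational form P(x) / (1 + |Q(x)|) in which Q has no constant term, contiguous channel groups, and a Monte Carlo estimate for the variance-preserving weight scale. All function names, polynomial degrees, and group sizes are hypothetical.

```python
import math
import random


def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... with Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def rational(x, a, b):
    """Assumed rational form P(x) / (1 + |Q(x)|).

    a holds a_0..a_m for P; b holds b_1..b_n for Q (no constant term).
    """
    p = horner(a, x)
    q = x * horner(b, x)  # b_1*x + b_2*x**2 + ... + b_n*x**n
    return p / (1.0 + abs(q))


def group_rational(xs, a_groups, b_groups):
    """Apply one shared rational function per contiguous group of channels."""
    groups = len(a_groups)
    if len(xs) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(xs) // groups
    return [rational(x, a_groups[i // size], b_groups[i // size])
            for i, x in enumerate(xs)]


def gr_kan_layer(xs, a_groups, b_groups, weight, bias):
    """Group-rational activation followed by a dense layer (weight is d_out x d_in)."""
    h = group_rational(xs, a_groups, b_groups)
    return [sum(w * v for w, v in zip(row, h)) + b0
            for row, b0 in zip(weight, bias)]


def variance_preserving_std(a, b, d_in, samples=20000, seed=0):
    """Weight std that keeps output variance near 1 for inputs x ~ N(0, 1).

    Uses Var[out] = d_in * Var[W] * E[F(x)^2] for zero-mean, independent
    weights, with E[F(x)^2] estimated by sampling (an assumption of this sketch).
    """
    rng = random.Random(seed)
    second_moment = sum(rational(rng.gauss(0.0, 1.0), a, b) ** 2
                        for _ in range(samples)) / samples
    return math.sqrt(1.0 / (d_in * second_moment))
```

For example, `gr_kan_layer([1.0, 2.0, 3.0, 4.0], [[0.0, 1.0], [0.0, 2.0]], [[], []], [[1.0, 1.0, 1.0, 1.0]], [0.5])` applies the identity to the first two channels and a doubling function to the last two before the dense layer. Horner's rule needs one multiply and one add per coefficient, which is the property that makes it attractive inside a GPU kernel.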

Experimental results

Research questions

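RQ1: Can GR-KAN replace the MLP in a Vision Transformer at ImageNet scale without harming convergence or performance?
RQ2: Do rational activations with group-wise parameter sharing improve computational efficiency and accuracy over B-spline KANs?
RQ3: How does KAT perform against ViT/DeiT baselines at comparable compute on standard vision tasks (classification, detection, segmentation)?
RQ4: How does pretrained-weight transfer from ViT to KAT affect final accuracy?
RQ5: What do the ablations on activation choice and initialization show about their effect on KAT's performance?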

Key findings

Model	Channel mixer	#Params	FLOPs	IN-1k Top-1 (%)
ViT-Ti/16	MLP	5.7M	1.08G	72.7
DeiT-T	MLP	5.7M	1.08G	72.2
ViT-T + KAN	KAN	12.8M	1.78G	64.9
KAT-T	KAN	5.7M	1.13G	74.6
KAT-T ∗	KAN	5.7M	1.13G	75.7
ViT-S/16	MLP	22.1M	4.25G	78.8
DeiT-S	MLP	22.1M	4.25G	79.8
ViT-S + KAN	KAN	50.4M	7.05G	62.9
KAT-S	KAN	22.1M	4.35G	81.2
KAT-S ∗	KAN	22.1M	4.35G	82.0
ViT-B/16	MLP	86.6M	16.87G	79.1
DeiT-B	MLP	86.6M	16.87G	81.8
ViT-B + KAN	KAN	199.8M	28.04G	NAN
KAT-B	KAN	86.6M	17.06G	82.3
KAT-B ∗	KAN	86.6M	17.06G	82.8
∗ Initialized from pretrained ViT weights.

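KAT variants consistently outperform MLP-based transformers with comparable FLOPs and parameter budgets on ImageNet-1K.
KAT-T reaches 74.6% top-1 at ViT-Ti/16 scale, and 75.7% when initialized from pretrained ViT weights, ahead of both the ViT-Ti/16 (72.7%) and DeiT-T (72.2%) baselines.
KAT-S reaches 81.2% top-1 without pretrained initialization and 82.0% with it; the 81.2% result is 2.4 points above ViT-S/16 (78.8%) and 1.4 points above DeiT-S (79.8%).
KAT-B reaches 82.3% top-1, or 82.8% when initialized from ViT, ahead of the ViT-B/16 (79.1%) and DeiT-B (81.8%) baselines.
Without the proposed scalability fixes (S1-S3), ViT+KAN trains poorly at ImageNet scale: with more than twice the parameters it reaches only 64.9% (T) and 62.9% (S), and the B-size run ends in NaN, which supports the need for the GR-KAN design.
On detection and segmentation, KAT backbones improve consistently over ViTDet and DeiT-like backbones, with larger relative gains for smaller models.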
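The classification margins can be recomputed directly from the table above. The snippet below is a small arithmetic check; the dictionary layout and the `margins` helper are illustrative, and the numbers are the table's Top-1 values.

```python
# Top-1 accuracies (%) copied from the table above, keyed by model size.
top1 = {
    "T": {"ViT": 72.7, "DeiT": 72.2, "KAT": 74.6, "KAT*": 75.7},
    "S": {"ViT": 78.8, "DeiT": 79.8, "KAT": 81.2, "KAT*": 82.0},
    "B": {"ViT": 79.1, "DeiT": 81.8, "KAT": 82.3, "KAT*": 82.8},
}


def margins(size, model="KAT"):
    """Accuracy gain of `model` over each MLP baseline, in percentage points."""
    row = top1[size]
    return {base: round(row[model] - row[base], 1) for base in ("ViT", "DeiT")}


for size in top1:
    print(size, margins(size), margins(size, "KAT*"))
```

Trained from scratch, KAT leads DeiT by 2.4, 1.4, and 0.5 points at the T, S, and B sizes respectively, so the gap to the stronger baseline narrows as the model grows.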

This review was generated by AI and checked by a human editor.