QUICK REVIEW

[Paper Review] All Tokens Matter: Token Labeling for Training Better Vision Transformers

Zihang Jiang, Qibin Hou|arXiv (Cornell University)|Apr 22, 2021

Advanced Neural Network Applications|References: 56|Citations: 140

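One-sentence summary

This paper introduces token labeling, a dense training objective that supervises every patch token of a vision transformer. It yields consistent gains across ViT variants and lets smaller models reach high accuracy: the authors report 84.4% ImageNet Top-1 accuracy with 26M parameters and 86.4% with 150M, along with improved downstream semantic segmentation performance.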

ABSTRACT

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.

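Motivation and objectives

Rethink how ViTs are trained by applying dense, location-specific supervision to every image patch token.
Recast image classification as multiple token-level recognition problems to strengthen object grounding and recognition.
Demonstrate the generality and effectiveness of token labeling across a range of ViT variants and model scales.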

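Proposed method

Define a token labeling objective for dense supervision that uses a K-dimensional score map over the N patch tokens.
Combine the class-token classification loss with the dense token labeling loss: L_total = H(X^{cls}, y^{cls}) + beta * (1/N) * sum_{i=1}^{N} H(X^i, y^i), where H is the cross-entropy loss and beta weights the token labeling term.
Use a pretrained machine annotator to generate per-patch supervision, which serves as the token-level labels.
Introduce MixToken, a token-space augmentation that mixes tokens after patch embedding and mixes their labels in the same way.
In the LV-ViT family, replace the patch embedding module with small convolutional layers to better capture local information.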
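
The combined objective above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array shapes, and the default beta value are assumptions made for illustration, and the annotator's per-patch score maps are taken as already-normalized soft labels.

```python
import numpy as np


def soft_cross_entropy(logits, targets):
    """Cross-entropy H between logits and (possibly soft) target distributions.

    logits, targets: arrays of shape (..., K); targets sum to 1 on the last axis.
    Returns one loss value per row, with shape (...).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1)


def token_labeling_loss(cls_logits, cls_target, patch_logits, patch_targets, beta=0.5):
    """L_total = H(X^cls, y^cls) + beta * (1/N) * sum_i H(X^i, y^i).

    cls_logits, cls_target: shape (K,)
    patch_logits, patch_targets: shape (N, K)
    beta: weight of the dense term (placeholder value, not taken from the review)
    """
    cls_loss = soft_cross_entropy(cls_logits, cls_target)
    token_loss = soft_cross_entropy(patch_logits, patch_targets).mean()
    return float(cls_loss + beta * token_loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N = 10, 196  # hypothetical: 10 classes, 14 x 14 patch tokens
    patch_targets = rng.dirichlet(np.ones(K), size=N)  # stand-in for annotator output
    cls_target = np.eye(K)[3]
    loss = token_labeling_loss(
        rng.normal(size=K), cls_target, rng.normal(size=(N, K)), patch_targets
    )
    print(f"total loss: {loss:.4f}")
```

Setting beta to 0 recovers ordinary class-token training, which makes the dense term easy to ablate in a sketch like this.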

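Experimental results

Research questions

RQ1: Can dense, location-specific token supervision improve vision transformers beyond conventional class-token training?
RQ2: How does token labeling interact with existing augmentations and with model size across ViT variants?
RQ3: Is token labeling robust to differences in annotator quality and patch-embedding choices, and does it help downstream dense prediction tasks?
RQ4: How large are the gains on ImageNet and ADE20K when token labeling is applied to multiple ViT architectures?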

Key findings

With token labeling, LV-ViT reaches 84.4% ImageNet Top-1 accuracy with 26M parameters and 86.4% with 150M parameters, outperforming other transformer-based baselines at comparable budgets.
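Token labeling consistently improves seven ViT variants (including DeiT, T2T-ViT, and LV-ViT), with larger models gaining more.
MixToken outperforms CutMix for token-based transformers; LV-ViT-S with token labeling and MixToken reaches 83.3% Top-1 on ImageNet.
Dense supervision from token labeling also benefits downstream semantic segmentation on ADE20K; LV-ViT-L with UperNet, for instance, reaches 51.8 mIoU, a state-of-the-art result without ImageNet-22K pretraining.
The approach requires a pretrained annotator that produces token-level scores, but because the annotations are computed in advance, it adds almost no training cost.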
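
Since the findings compare MixToken with CutMix, a rough sketch of the token-space mixing may help. It assumes a rectangular, CutMix-style mask applied to the token grid after patch embedding, and it mixes the class label by the fraction of swapped tokens; the function name, argument layout, and that class-label weighting are illustrative assumptions, not details confirmed by this review.

```python
import numpy as np


def mix_token(tokens_a, tokens_b, labels_a, labels_b, cls_a, cls_b, box):
    """Blend two images' patch tokens and per-token labels with one shared mask.

    tokens_*: (H, W, D) patch-token grids taken after patch embedding.
    labels_*: (H, W, K) per-token label score maps.
    cls_*:    (K,) image-level class labels.
    box:      (top, left, bottom, right) region in token units, taken from image b.
    """
    top, left, bottom, right = box
    mask = np.zeros(tokens_a.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = True
    m = mask[..., None]  # broadcast the mask over the feature / class axis

    mixed_tokens = np.where(m, tokens_b, tokens_a)
    mixed_labels = np.where(m, labels_b, labels_a)

    # Assumption: weight the class label by the share of tokens taken from image b.
    lam = mask.mean()
    mixed_cls = (1.0 - lam) * cls_a + lam * cls_b
    return mixed_tokens, mixed_labels, mixed_cls
```

Because the mask is applied to whole tokens rather than pixels, each token in the mixed sequence comes from exactly one image, so its token-level label stays unambiguous.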


This review was generated by AI and checked by a human editor.