QUICK REVIEW

[Paper Review] Blockwisely Supervised Neural Architecture Search with Knowledge Distillation

Changlin Li, Jiefeng Peng|arXiv (Cornell University)|Nov 29, 2019

Advanced Neural Network Applications|References: 36|Citations: 26

Summary in Brief

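This paper proposes DNA, a block-wise supervised neural architecture search (NAS) method guided by knowledge distillation. By dividing the search space into modular blocks, the method allows candidate architectures to be trained fully and fairly, and it reduces the error introduced by parameter sharing. By distilling architecture knowledge from a teacher model through feature-map matching, DNA achieves 78.4% top-1 accuracy on ImageNet in the mobile setting, about 2.1% higher than EfficientNet-B0, and the searched architecture even surpasses the teacher model.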

ABSTRACT

Neural Architecture Search (NAS), aiming at automatically designing network architectures by machines, is hoped and expected to bring about a new revolution in machine learning. Despite this high expectation, the effectiveness and efficiency of existing NAS solutions are unclear, with some recent works going so far as to suggest that many existing NAS solutions are no better than random architecture selection. The inefficiency of NAS solutions may be attributed to inaccurate architecture evaluation. Specifically, to speed up NAS, recent works have proposed under-training different candidate architectures in a large search space concurrently by using shared network parameters; however, this has resulted in incorrect architecture ratings and furthered the ineffectiveness of NAS. In this work, we propose to modularize the large search space of NAS into blocks to ensure that the potential candidate architectures are fully trained; this reduces the representation shift caused by the shared parameters and leads to the correct rating of the candidates. Thanks to the block-wise search, we can also evaluate all of the candidate architectures within a block. Moreover, we find that the knowledge of a network model lies not only in the network parameters but also in the network architecture. Therefore, we propose to distill the neural architecture (DNA) knowledge from a teacher model as the supervision to guide our block-wise architecture search, which significantly improves the effectiveness of NAS. Remarkably, the capacity of our searched architecture has exceeded the teacher model, demonstrating the practicability and scalability of our method. Finally, our method achieves a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting, which is about a 2.1% gain over EfficientNet-B0. All of our searched models along with the evaluation code are available online.

Motivation and Objectives

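To correct the inefficiency and inaccuracy of existing one-shot NAS methods, which rate candidate architectures while their shared weights are still under-trained.
To improve the effectiveness of NAS by modularizing the search space into blocks, so that the candidate architectures within each block can be trained fully and fairly.
To overcome the lack of supervision in greedy block-wise search by introducing a new architecture distillation method that transfers knowledge from the teacher model's feature maps.
To demonstrate scalability and practicality by showing that the searched architecture can outperform the teacher model.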
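
A back-of-the-envelope count shows why modularizing the search space makes exhaustive rating within a block feasible. The sketch below assumes a hypothetical space with 6 candidate operations per layer and 12 layers split into 4 equal blocks; these numbers and the function names are illustrative assumptions and are not taken from the paper.

```python
# Illustrative count only: operation, layer, and block numbers are assumptions.
from typing import Sequence


def full_space_size(num_ops: int, num_layers: int) -> int:
    """Architectures to rate when every layer picks one of num_ops operations."""
    return num_ops ** num_layers


def blockwise_space_size(num_ops: int, layers_per_block: Sequence[int]) -> int:
    """Block candidates to rate when each block is searched independently."""
    return sum(num_ops ** depth for depth in layers_per_block)


if __name__ == "__main__":
    num_ops = 6
    layers_per_block = [3, 3, 3, 3]  # 12 layers split into 4 blocks
    print(full_space_size(num_ops, sum(layers_per_block)))  # 2176782336
    print(blockwise_space_size(num_ops, layers_per_block))  # 864
```

Under these assumptions, rating every candidate drops from billions of whole networks to a few hundred block-level candidates, which is what makes full training of each candidate plausible.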

Proposed Method

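Decompose the search space into discrete blocks, each containing a subset of the architecture choices, so that every candidate within a block can be fully trained.
Introduce a new distillation method (DNA) that matches the teacher model's feature maps, using an MSE loss between student and teacher activations.
Train the student supernet block by block, feeding each block the teacher's feature map from the preceding block as input, so that architecture knowledge is preserved during the search.
Introduce a multi-cell supernet design to increase variability in channel count and depth, strengthening the search capability.
Adopt a stage-wise, block-by-block training and evaluation strategy in which each block's architecture is selected based on full training and distillation guidance.
Retrain the final architecture without teacher supervision, which tests the generality and scalability of the method.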
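
The block-wise rating and selection steps above can be sketched in a few lines. This is a toy illustration under strong assumptions, not the authors' implementation: feature maps are flat lists of floats, each candidate is a fixed function standing in for an already trained block, and the final selection is a brute-force search under a hypothetical parameter budget. All names (`rate_block`, `search`, `BLOCKS`, `TEACHER_FEATURES`) are invented for this example.

```python
# Toy sketch of block-wise rating with a feature-map MSE loss (hypothetical names).
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Feature = List[float]
Candidate = Tuple[str, Callable[[Feature], Feature], int]  # (name, forward, param cost)


def mse(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a student output and a teacher feature map."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def rate_block(candidates: Sequence[Candidate], teacher_in: Feature,
               teacher_out: Feature) -> Dict[str, float]:
    """Feed the teacher's previous feature map to every candidate of one block
    and compare each output with the teacher's next feature map."""
    return {name: mse(forward(teacher_in), teacher_out)
            for name, forward, _ in candidates}


def search(blocks: Sequence[Sequence[Candidate]], teacher_features: Sequence[Feature],
           budget: int) -> Tuple[Optional[List[str]], float]:
    """Pick one candidate per block, minimising the summed block losses
    while keeping the total parameter cost within the budget."""
    ratings = [rate_block(cands, teacher_features[i], teacher_features[i + 1])
               for i, cands in enumerate(blocks)]
    best_names, best_loss = None, float("inf")
    for combo in product(*blocks):
        if sum(cost for _, _, cost in combo) > budget:
            continue  # violates the (assumed) parameter constraint
        loss = sum(ratings[i][name] for i, (name, _, _) in enumerate(combo))
        if loss < best_loss:
            best_names, best_loss = [name for name, _, _ in combo], loss
    return best_names, best_loss


# Made-up teacher feature maps at three block boundaries.
TEACHER_FEATURES = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]]
BLOCKS = [
    [("double", lambda x: [2 * v for v in x], 2), ("identity", lambda x: list(x), 1)],
    [("plus_one", lambda x: [v + 1 for v in x], 2), ("scale", lambda x: [1.2 * v for v in x], 1)],
]

if __name__ == "__main__":
    print(search(BLOCKS, TEACHER_FEATURES, budget=4))
    print(search(BLOCKS, TEACHER_FEATURES, budget=3))
```

Because every block receives the teacher's feature map as input, the blocks can be rated independently of one another. With a looser budget the sketch selects the candidates that reproduce the teacher's features exactly; with a tighter one it trades a small feature-matching loss for fewer parameters.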

Experimental Results

Research Questions

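RQ1: Does modularizing the search space into blocks improve the accuracy and reliability of architecture rating in one-shot NAS?
RQ2: Does distilling architecture knowledge from the teacher model's feature maps improve the effectiveness of block-wise architecture search?
RQ3: Can the searched architecture achieve higher accuracy than the teacher model, even when the teacher is not the best-performing architecture?
RQ4: How does the performance of the searched architecture scale with model size, and can it surpass a larger teacher model?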
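
For RQ1, one common way to quantify rating reliability is a rank correlation between the ratings produced during search and the accuracies of the same architectures trained stand-alone. The sketch below computes Kendall's tau for that purpose. The choice of metric, the function name, and all numbers are assumptions for illustration and are not results from the paper.

```python
# Kendall's tau-a between two ratings of the same candidates (illustrative data).
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / total pairs; tied pairs count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    standalone_acc = [70.1, 71.5, 72.0, 73.2]  # made-up stand-alone accuracies
    search_rating = [55.0, 54.0, 58.0, 60.0]   # made-up ratings from the search
    print(kendall_tau(standalone_acc, search_rating))
```

A value near 1 means the search ranks candidates in almost the same order as full training would; a value near 0 means the ranking is close to random.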

Key Findings

DNA achieves a state-of-the-art 78.4% top-1 accuracy on ImageNet in the mobile setting, about 2.1% higher than EfficientNet-B0.
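DNA-B7, a 5.28M-parameter model searched with the 66M-parameter EfficientNet-B7 as the teacher, reaches 77.8% top-1 accuracy, on par with the model searched under the much smaller EfficientNet-B0 teacher.
Scaled up to 64.9M parameters, DNA-B7 reaches 79.9% top-1 accuracy, 2.1% higher than its 5.28M-parameter version.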
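The method is robust to the choice of teacher: with EfficientNet-B0 as the teacher, the searched model (DNA-B0) outperforms the teacher by 1.5% at the same parameter count.
Ablation studies confirm that multi-cell search and the distillation strategy each improve accuracy noticeably; the proposed distillation strategy outperforms the S1 and S2 baselines by 0.3% and 0.2%, respectively.
The student supernet closely imitates the teacher's feature maps across all channels and spatial dimensions, even for the highly abstract 14×14 feature maps, which supports the effectiveness of the knowledge transfer.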
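
The reported margins can be cross-checked with simple arithmetic. The snippet below uses only numbers quoted in this review and assumes the margins are absolute percentage points; the variable names are chosen for this example.

```python
# Cross-check of the quoted numbers; margins are assumed to be absolute points.

def points(value: float) -> float:
    """Round to one decimal place to hide floating-point noise."""
    return round(value, 1)


DNA_MOBILE_TOP1 = 78.4           # best mobile-setting result
GAIN_OVER_EFFICIENTNET_B0 = 2.1  # quoted margin over EfficientNet-B0
DNA_B0_GAIN_OVER_TEACHER = 1.5   # quoted margin of DNA-B0 over its teacher
DNA_B7_TOP1 = 77.8
DNA_B7_SCALED_TOP1 = 79.9

efficientnet_b0_top1 = points(DNA_MOBILE_TOP1 - GAIN_OVER_EFFICIENTNET_B0)  # 76.3
dna_b0_top1 = points(efficientnet_b0_top1 + DNA_B0_GAIN_OVER_TEACHER)       # 77.8
scaling_gain = points(DNA_B7_SCALED_TOP1 - DNA_B7_TOP1)                     # 2.1
param_ratio = points(66.0 / 5.28)                                           # 12.5

if __name__ == "__main__":
    print(efficientnet_b0_top1, dna_b0_top1, scaling_gain, param_ratio)
```

The implied DNA-B0 accuracy equals the 77.8% quoted for DNA-B7, and the 66M-parameter EfficientNet-B7 is about 12.5 times larger than the 5.28M-parameter searched model.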


This review was generated by AI and checked by a human editor.