QUICK REVIEW

[Paper Review] A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

Akhilesh Deepak Gotmare, Nitish Shirish Keskar | arXiv (Cornell University) | October 29, 2018

Neural Networks and Applications | Citations: 107

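One-Line Summary

This paper is an empirical analysis that uses mode connectivity and SVCCA to understand the training dynamics and representations of deep networks, examining cosine learning rate restarts, learning rate warmup, and knowledge distillation.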

ABSTRACT

The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit such analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz., mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed to the deeper layers.

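Motivation and Objectives

Advance the understanding of widely used deep learning heuristics beyond their empirical success.
Investigate cosine annealing / SGDR, learning rate warmup, and knowledge distillation using modern analysis tools.
Evaluate how these heuristics affect the loss surface and the representations across network layers.
Provide insight into where and how these heuristics exert their influence during training.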

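Proposed Method

Apply mode connectivity to connect the optima found by different training schemes, and analyze the resulting curves and barriers.
For efficiency, use SVCCA with SVD/DFT preprocessing to measure the similarity of activation representations between networks and across training iterations.
Characterize SGDR through its learning rate schedule and compare it against standard SGD, with and without restarts.
Use CCA to study how layer-wise activations change in the warmup and distillation scenarios.
Run controlled experiments on VGG-16/ResNet variants with CIFAR-10 to observe the dynamics across layers.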
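
The learning rate heuristics named above can be made concrete with a small sketch. The cosine rule below follows the usual SGDR form, lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t_cur / t_i)), and the warmup is a plain linear ramp. The function names and every default value (lr_max, cycle_len, cycle_mult, warmup_steps) are illustrative assumptions, not settings taken from the paper.

```python
import math


def sgdr_lr(step, lr_max=0.1, lr_min=0.0, cycle_len=10, cycle_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR-style).

    All defaults are hypothetical values chosen for illustration.
    """
    t_cur, t_i = step, cycle_len
    # Skip over completed cycles; each new cycle is cycle_mult times longer.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= cycle_mult
    # Within a cycle, decay from lr_max toward lr_min along half a cosine.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))


def warmup_lr(step, lr_target=0.1, warmup_steps=5):
    """Linear warmup: ramp up to lr_target over the first warmup_steps steps."""
    if step < warmup_steps:
        return lr_target * (step + 1) / warmup_steps
    return lr_target


if __name__ == "__main__":
    for s in (0, 5, 9, 10, 20, 30):
        print(s, round(sgdr_lr(s), 4), round(warmup_lr(s), 4))
```

With these assumed defaults the rate jumps back to lr_max at steps 10 and 30, which are the restarts whose effect on loss-surface barriers the review discusses.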

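Experimental Results

Research Questions

RQ1: Do cosine annealing / SGDR restarts create barriers in the loss landscape or carry the iterates across them, and is this essential to their success?
RQ2: How does learning rate warmup affect training stability, and which layers of the network are most affected?
RQ3: Where in the student network's representations does the knowledge transferred by distillation appear?
RQ4: What do mode connectivity and SVCCA reveal about the training dynamics under these heuristics?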

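Key Findings

The benefit of cosine annealing is not consistently evidenced as escaping barriers; iterates after a restart do cross barriers, but this does not appear to be a sufficient explanation for the benefit.
Learning rate warmup mainly limits weight changes in the deeper layers, and freezing those layers yields comparable stability.
In distillation, the latent knowledge from the teacher is distributed mainly to the student's deeper (discriminative) layers.
Representation similarity analysis shows that, after training, activations in the shallow layers are more similar, while the deeper layers hold differentiated representations.
Mode connectivity yields robust, high-accuracy connecting curves between diverse optima, suggesting that the loss landscape is connected across training choices.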
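
To illustrate what a connecting curve and a barrier mean here, the sketch below compares a straight line between two low-loss points with a quadratic Bezier curve, phi(t) = (1-t)^2 * w1 + 2t(1-t) * theta + t^2 * w2, which is one common parameterization in the mode connectivity literature. Whether the reviewed experiments used exactly this form is an assumption. The 2-D loss, the endpoints, and the bend point theta are hypothetical toy values; in practice theta is trained to minimize the loss along the curve.

```python
def linear_point(w1, w2, t):
    """Point on the straight segment from w1 (t=0) to w2 (t=1)."""
    return [(1 - t) * p + t * r for p, r in zip(w1, w2)]


def bezier_point(w1, theta, w2, t):
    """Point on a quadratic Bezier curve from w1 to w2 with bend point theta."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * p + b * q + c * r for p, q, r in zip(w1, theta, w2)]


def max_loss_on_path(loss_fn, path_fn, num_points=21):
    """Largest loss over evenly spaced t in [0, 1]; a crude barrier estimate."""
    return max(loss_fn(path_fn(i / (num_points - 1))) for i in range(num_points))


def toy_loss(w):
    """Hypothetical 2-D loss: zero on the unit circle, high near the origin."""
    return (w[0] ** 2 + w[1] ** 2 - 1.0) ** 2


w1, w2 = [-1.0, 0.0], [1.0, 0.0]  # two toy "modes", both with zero loss
theta = [0.0, 2.0]                # hand-picked bend point (normally learned)

line_barrier = max_loss_on_path(toy_loss, lambda t: linear_point(w1, w2, t))
curve_barrier = max_loss_on_path(toy_loss, lambda t: bezier_point(w1, theta, w2, t))

if __name__ == "__main__":
    print("straight line:", line_barrier)
    print("bezier curve: ", curve_barrier)
```

In this toy setup the straight line passes through the high-loss region at the origin, while the bent curve stays near the low-loss circle, so its maximum loss along the path is much smaller.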


This review was generated by AI and reviewed by a human editor.