QUICK REVIEW

[Paper Review] A block coordinate descent optimizer for classification problems exploiting convexity

Ravi G. Patel, Nathaniel Trask|arXiv (Cornell University)|2020. 01. 01.

3D Shape Modeling and Analysis | References: 25 | Citations: 3

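One-line summary

This paper proposes a hybrid Newton/gradient descent (NGD) optimizer for deep learning classification that exploits the convexity of the cross-entropy loss in the weights of the linear layer. It alternates Newton steps on the linear layer, which yield globally optimal linear-layer weights for the current hidden layers, with gradient descent on the hidden layers, which speeds up convergence and improves test accuracy. On CIFAR-10 it converges up to 4x faster and reaches a 1.76% higher final test accuracy with a ConvNet architecture.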

ABSTRACT

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number weights in the linear layer and not the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architecture. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.

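Motivation and objectives

Develop a second-order optimization method that exploits convexity in the linear layer of deep neural networks.
Reduce training cost and improve convergence speed by decoupling the optimization of the linear and nonlinear weights.
Explore whether second-order methods can outperform stochastic gradient descent on classification tasks in both accuracy and convergence speed.
Investigate how the choice of optimizer affects the basis functions learned by the hidden layers.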

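Proposed method

The method uses block coordinate descent, alternating Newton steps on the linear layer weights W with gradient descent on the hidden layer weights ξ.
With ξ held fixed, the loss is convex in W, so Newton's method with a line search can reach a global minimum over W.
The size of the Hessian scales only with the number of weights in the linear layer, independent of the depth and width of the hidden layers.
To keep the computation efficient and stable, the Newton step is applied to mini-batches.
The algorithm is implemented in TensorFlow and released as open source at github.com/rgp62/.
The approach interprets the hidden layers as a data-driven adaptive basis, with the linear layer weights providing the optimal fit of that basis to the data.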
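The alternation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the single tanh hidden layer, the full-batch updates, the small damping term added to the Hessian, the one-Newton-step-per-iteration schedule, and all function names (`grad_and_hessian`, `newton_step`, `gd_step`, `train_ngd`) are choices made here for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, phi, Y):
    P = softmax(phi @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def grad_and_hessian(W, phi, Y):
    """Gradient (d, c) and Hessian (d*c, d*c) of the loss in W, with phi fixed."""
    n, d = phi.shape
    c = W.shape[1]
    P = softmax(phi @ W)
    G = phi.T @ (P - Y) / n
    # Per-sample curvature in class space: diag(p_i) - p_i p_i^T, shape (n, c, c)
    S = P[:, :, None] * np.eye(c)[None] - P[:, :, None] * P[:, None, :]
    # H[(a,k),(b,l)] = mean_i phi[i,a] * phi[i,b] * S[i,k,l]
    H = np.einsum("ia,ib,ikl->akbl", phi, phi, S) / n
    return G, H.reshape(d * c, d * c)

def newton_step(W, phi, Y, damping=1e-6, shrink=0.5, max_backtracks=20):
    """One damped Newton step on W with a backtracking (Armijo) line search."""
    G, H = grad_and_hessian(W, phi, Y)
    g = G.reshape(-1)
    # Assumption: damping makes the positive semi-definite Hessian invertible
    step = np.linalg.solve(H + damping * np.eye(H.shape[0]), g)
    slope = float(g @ step)  # positive whenever the gradient is nonzero
    step = step.reshape(W.shape)
    loss0 = cross_entropy(W, phi, Y)
    t = 1.0
    for _ in range(max_backtracks):
        W_new = W - t * step
        if cross_entropy(W_new, phi, Y) <= loss0 - 1e-4 * t * slope:
            return W_new
        t *= shrink
    return W  # no acceptable step length found; keep the current weights

def gd_step(xi, W, X, Y, lr=0.1):
    """One gradient descent step on the hidden weights xi, with W fixed."""
    phi = np.tanh(X @ xi)
    P = softmax(phi @ W)
    d_phi = (P - Y) @ W.T / X.shape[0]      # dLoss/dphi
    d_xi = X.T @ (d_phi * (1.0 - phi**2))   # chain rule through tanh
    return xi - lr * d_xi

def train_ngd(X, Y, width=8, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.normal(size=(X.shape[1], width))
    W = np.zeros((width, Y.shape[1]))
    history = []
    for _ in range(iters):
        phi = np.tanh(X @ xi)
        W = newton_step(W, phi, Y)      # second-order block: linear layer
        history.append(cross_entropy(W, phi, Y))
        xi = gd_step(xi, W, X, Y)       # first-order block: hidden layer
    return xi, W, history
```

The Hessian here has `d * c` rows and columns regardless of how the features `phi` are produced, which is the scaling property noted above. A fuller version would presumably iterate the Newton step to a tolerance and draw mini-batches, as the review describes.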

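Experimental results

Research questions

RQ1: Can exploiting convexity in the linear layer weights enable faster and more accurate training of deep networks for classification?
RQ2: How does the NGD optimizer compare with standard stochastic gradient descent in convergence speed and final accuracy?
RQ3: What qualitative differences arise in the basis functions encoded by the hidden layers when training with NGD versus GD?
RQ4: How does the tolerance of the Newton step affect the model's generalization and robustness?
RQ5: Can second-order optimization be applied efficiently to deep networks without incurring prohibitive computational cost?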

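Key results

On the CIFAR-10 benchmark, NGD reached its maximum validation accuracy in roughly one quarter of the iterations GD required.
With the CIFAR-10 ConvNet architecture, NGD improved final test accuracy by 1.76% over GD.
On the MNIST, Fashion MNIST, and peaks benchmarks, NGD reached higher validation accuracy faster than GD.
The basis functions learned with NGD showed far more regular, structured patterns than those obtained with GD, suggesting a qualitative difference in how the two methods explore parameter space.
The Hessian with respect to the linear layer weights is proved to be symmetric and positive semi-definite, so the Newton step realizes a global minimizer of the linear-layer subproblem.
The method showed consistent improvements across architectures, including dense and convolutional networks, without requiring architectural changes.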
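The positive semi-definiteness result can be motivated with a short derivation. The notation below is introduced here for illustration and may differ from the paper's: $\phi_i \in \mathbb{R}^d$ denotes the hidden-layer output for sample $i$ with ξ fixed, $y_i$ its one-hot label, $p_i = \mathrm{softmax}(W^\top \phi_i)$, and $W \in \mathbb{R}^{d \times C}$ is vectorized row by row.

```latex
\begin{align*}
\mathcal{L}(W) &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log p_{ic},
\qquad
\nabla_W \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \phi_i\,(p_i - y_i)^\top, \\
H &= \frac{1}{N}\sum_{i=1}^{N} \left(\phi_i \phi_i^\top\right) \otimes
     \left(\operatorname{diag}(p_i) - p_i p_i^\top\right), \\
v^\top\!\left(\operatorname{diag}(p_i) - p_i p_i^\top\right) v
  &= \sum_{c} p_{ic}\, v_c^2 - \Big(\sum_{c} p_{ic}\, v_c\Big)^{2}
   = \operatorname{Var}_{c \sim p_i}[v_c] \;\ge\; 0
   \quad \text{for all } v \in \mathbb{R}^{C}.
\end{align*}
```

Each summand of $H$ is a Kronecker product of two symmetric positive semi-definite matrices, so $H$ is itself symmetric positive semi-definite. The loss is therefore convex in $W$, and any stationary point in $W$ is a global minimizer for the current hidden layers.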


This review was generated by AI and reviewed by a human editor.