[Paper Review] BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations
BinaryDuo trains a binary activation network by decoupling a pre-trained ternary activation into two binary activations. This enables better gradient matching and improves accuracy over state-of-the-art BNNs on benchmarks such as CIFAR-10 and ImageNet-scale models.
Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.
Research Motivation and Objectives
- Motivate and quantify gradient mismatch in binary activation networks during training.
- Propose a better gradient-mismatch estimation method using the gradient of a smoothed loss.
- Introduce BinaryDuo: a two-stage training scheme that decouples a ternary activation into two binary activations and later fine-tunes.
- Demonstrate that BinaryDuo achieves state-of-the-art or competitive accuracy on CIFAR-10 and on ImageNet (with AlexNet and ResNet-18) at similar parameter and compute budgets.
Proposed Method
- Estimate gradient mismatch using the gradient of a smoothed loss via Coordinate Discrete Gradient (CDG).
- Show that higher-precision activations (ternary or 2-bit) mitigate gradient mismatch more effectively than sophisticated STEs.
- Propose BinaryDuo: train a network with ternary activation, then decouple into two binary activations with specific BN bias shifts to mimic the ternary function.
- Decoupling doubles the number of weights, so layer widths are proportionally reconfigured to keep the parameter count comparable; the decoupled binary network is then fine-tuned.
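The decoupling step described above can be illustrated with a minimal NumPy sketch. The activation levels ({0, 0.5, 1}), the thresholds (0.25 and 0.75), the bias shifts of ±0.25, and the halved weight copies are assumptions chosen for illustration; they are not taken from the paper, and the function names are hypothetical. The sketch only shows that two binary activations with shifted biases, feeding a doubled set of weights, can reproduce the output of a layer that follows a ternary activation.

```python
import numpy as np

def binary_act(x):
    """Hypothetical binary activation: outputs {0, 1}, step at 0.5."""
    return (x >= 0.5).astype(np.float64)

def ternary_act(x):
    """Hypothetical ternary activation: outputs {0, 0.5, 1}, steps at 0.25 and 0.75."""
    return 0.5 * (x >= 0.25) + 0.5 * (x >= 0.75)

def decouple(x, w):
    """Split one ternary activation into two binary ones (illustrative only).

    x: pre-activations, shape (batch, n)
    w: weights of the following layer, shape (n, m)
    Returns binary activations of shape (batch, 2n) and weights of shape (2n, m).
    """
    lo = binary_act(x + 0.25)  # bias shifted up: fires when x >= 0.25
    hi = binary_act(x - 0.25)  # bias shifted down: fires when x >= 0.75
    acts = np.concatenate([lo, hi], axis=-1)
    # Each binary activation gets its own copy of the weights, so the weight
    # count doubles; the 0.5 factor matches the assumed ternary levels.
    w2 = 0.5 * np.concatenate([w, w], axis=0)
    return acts, w2

# Pre-activations on an exact grid of multiples of 1/8 to avoid rounding at thresholds.
x = np.arange(-8, 17, dtype=np.float64).reshape(5, 5) / 8.0
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 3))

coupled_out = ternary_act(x) @ w
acts, w2 = decouple(x, w)
decoupled_out = acts @ w2
print("outputs match:", np.allclose(coupled_out, decoupled_out))
```

Because the decoupled layer has twice as many input weights, a real model would need narrower layers to stay at the same parameter count, which is the width reconfiguration mentioned in the last bullet.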
Experimental Results
Research Questions
- RQ1: Can gradient mismatch be better estimated with the gradient of a smoothed loss rather than cumulative differences between activation and approximation?
- RQ2: Does a training scheme that leverages ternary activations during training and decouples to binary activations during inference improve BNN performance at equal cost?
- RQ3: How does BinaryDuo compare to state-of-the-art BNN methods on standard benchmarks like CIFAR-10 and ImageNet in terms of accuracy and efficiency?
Key Results
| Network | Top-1 (%) | Top-5 (%) | Model size (Mbit) | FLOPs |
|---|---|---|---|---|
| AlexNet (BNN) | 41.8 | 67.1 | 62.3 | 82.3M |
| XNOR-Net | 44.2 | 69.2 | 191 | 126M |
| BNN+ | 46.1 | 75.7 | 191 | 126M |
| BinaryDuo | 52.7 | 76.0 | 189 | 119M |
| BinaryDuo(+sc)† | - | - | - | 164M |
| ResNet-18 (BNN with shortcut) | - | - | - | - |
| BinaryDuo(+sc)† | 60.9 | 82.6 | 31.9 | 164M |
- Cosine similarity between coarse gradients and Coordinate Discrete Gradient (CDG) degrades with binary activations, and is not improved by sophisticated STEs.
- Higher precision activations (ternary/2-bit) reduce gradient mismatch more effectively than improving the backward surrogate alone.
- Coupling two binary activations to emulate a ternary activation (BinaryDuo) followed by decoupling and fine-tuning yields superior accuracy under the same parameter and compute budget.
- On CIFAR-10 with VGG-7, the coupled ternary model reaches 89.69% test accuracy; after decoupling and fine-tuning, BinaryDuo reaches 90.44%, surpassing the 89.07% baseline binary model.
- On ImageNet with AlexNet and ResNet-18, BinaryDuo achieves top-1 accuracy of 52.7% (AlexNet) and 60.4% (ResNet-18), outperforming other BNN schemes at similar parameter and compute budgets; BinaryDuo(+sc) reaches 60.9% top-1 with shortcut on ResNet-18.
- BinaryDuo consistently outperforms state-of-the-art BNN methods across tested architectures while maintaining comparable model size and FLOPs.
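The cosine-similarity measurement in the first finding above can be sketched on a toy problem. This is not the paper's CDG procedure: as a stand-in for the gradient of a smoothed loss, the sketch uses a coordinate-wise central difference with a finite step `eps`, and it compares that against a coarse gradient computed with a simple straight-through estimator. The network size, the STE window, `eps`, and all function names are assumptions made for illustration.

```python
import numpy as np

def step(z):
    """Binary activation used in the forward pass: outputs {0, 1}."""
    return (z >= 0.0).astype(np.float64)

def loss(w1, w2, X, y):
    """Squared-error loss of a one-hidden-layer net with binary activations."""
    h = step(X @ w1)
    pred = h @ w2
    return 0.5 * np.mean((pred - y) ** 2)

def coarse_grad(w1, w2, X, y):
    """Gradient w.r.t. w1 using a straight-through estimator (assumed form):
    the step's derivative is replaced by 1 where |z| <= 1 and 0 elsewhere."""
    z = X @ w1
    err = (step(z) @ w2 - y) / len(y)   # dL/dpred, shape (N,)
    dh = np.outer(err, w2)              # dL/dh, shape (N, H)
    dz = dh * (np.abs(z) <= 1.0)        # surrogate derivative of the step
    return X.T @ dz                     # shape (D, H)

def smoothed_grad(w1, w2, X, y, eps=0.5):
    """Coordinate-wise central difference with a finite step, used here as a
    simple stand-in for the gradient of a smoothed loss."""
    g = np.zeros_like(w1)
    for idx in np.ndindex(*w1.shape):
        d = np.zeros_like(w1)
        d[idx] = eps
        g[idx] = (loss(w1 + d, w2, X, y) - loss(w1 - d, w2, X, y)) / (2 * eps)
    return g

def cosine(a, b):
    """Cosine similarity between two gradient tensors, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
y = rng.normal(size=256)
w1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=4)

g_coarse = coarse_grad(w1, w2, X, y)
g_smooth = smoothed_grad(w1, w2, X, y)
cos = cosine(g_coarse, g_smooth)
print("cosine similarity:", cos)
```

A value close to 1 would indicate that the coarse gradient points in nearly the same direction as the smoothed-loss estimate; lower values indicate larger mismatch. Swapping `step` for a multi-level activation in this toy setup is one way to explore the precision effect the review describes, though results on such a small problem should not be read as reproducing the paper's measurements.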
This review was generated by AI and reviewed by a human editor.