QUICK REVIEW

[Paper Review] Compressing GANs using Knowledge Distillation

Angeline Aguinaldo, Ping-yeh Chiang|arXiv (Cornell University)|Feb 1, 2019

Generative Adversarial Networks and Image Synthesis|19 references|61 citations

TL;DR

This paper demonstrates compressing over-parameterized GANs via knowledge distillation, producing small student GANs that closely match or outperform similarly-sized GANs trained from scratch across MNIST, CIFAR-10, and Celeb-A, with substantial compression ratios.

ABSTRACT

Generative Adversarial Networks (GANs) have been used in several machine learning tasks such as domain transfer, super resolution, and synthetic data generation. State-of-the-art GANs often use tens of millions of parameters, making them expensive to deploy for applications in low SWAP (size, weight, and power) hardware, such as mobile devices, and for applications with real time capabilities. There has been no work found to reduce the number of parameters used in GANs. Therefore, we propose a method to compress GANs using knowledge distillation techniques, in which a smaller "student" GAN learns to mimic a larger "teacher" GAN. We show that the distillation methods used on MNIST, CIFAR-10, and Celeb-A datasets can compress teacher GANs at ratios of 1669:1, 58:1, and 87:1, respectively, while retaining the quality of the generated image. From our experiments, we observe a qualitative limit for GAN's compression. Moreover, we observe that, with a fixed parameter budget, compressed GANs outperform GANs trained using standard training methods. We conjecture that this is partially owing to the optimization landscape of over-parameterized GANs which allows efficient training using alternating gradient descent. Thus, training an over-parameterized GAN followed by our proposed compression scheme provides a high quality generative model with a small number of parameters.

Motivation & Objective

Address the computational burden of large GANs, which makes them expensive to deploy on low-SWaP (size, weight, and power) hardware and in real-time applications.
Introduce knowledge distillation tailored for GANs to compress generator networks while preserving image quality.
Empirically evaluate compression on MNIST, CIFAR-10, and Celeb-A using Inception Score (IS) and Fréchet Inception Distance (FID) as quality metrics.
Analyze the limits of GAN compression and the role of over-parameterization in successful distillation.

Proposed method

Use a teacher-student framework where a large, over-parameterized GAN (teacher) guides a smaller GAN (student).
Adopt two training schemes for the student: (i) MSE loss minimizing pixel-wise distance to the teacher outputs; (ii) joint loss combining GAN objectives with an MSE term to align student outputs with the teacher.
Select teacher networks by training various sizes and choosing the best via Inception Score and FID.
Control model size via a depth scale factor d, exploring a range of teacher sizes and their corresponding parameter counts.
Evaluate compression using Inception Score, Fréchet Inception Distance, and, to measure blur, Variance of Laplacian (VoL).
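The two student training schemes above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the non-saturating form of the adversarial term, and the weighting parameter `alpha` are assumptions made for illustration, since this review does not state how the paper combines the GAN and MSE terms. Images are represented as flat lists of pixel values so the example runs with the standard library alone.

```python
import math


def mse_distillation_loss(student_pixels, teacher_pixels):
    """Scheme (i): mean squared pixel-wise distance between the student's
    and the teacher's output for the same latent vector."""
    if len(student_pixels) != len(teacher_pixels):
        raise ValueError("images must have the same number of pixels")
    squared = [(s - t) ** 2 for s, t in zip(student_pixels, teacher_pixels)]
    return sum(squared) / len(squared)


def adversarial_generator_loss(disc_prob_on_student, eps=1e-12):
    """Non-saturating generator loss -log D(S(z)).
    Assumption: the review does not say which GAN objective is used."""
    return -math.log(max(disc_prob_on_student, eps))


def joint_loss(student_pixels, teacher_pixels, disc_prob_on_student, alpha=0.5):
    """Scheme (ii): adversarial term plus an MSE term.
    Assumption: a convex combination weighted by a hypothetical `alpha`."""
    mse = mse_distillation_loss(student_pixels, teacher_pixels)
    adv = adversarial_generator_loss(disc_prob_on_student)
    return (1.0 - alpha) * adv + alpha * mse


# Toy data: a 4-pixel "image" from the teacher and a close student output.
teacher = [0.0, 0.5, 1.0, 0.5]
student = [0.1, 0.4, 0.9, 0.5]

print(mse_distillation_loss(student, teacher))
print(joint_loss(student, teacher, disc_prob_on_student=0.7))
```

In a real training loop the same quantities would be computed on batches of tensors in a deep learning framework, with the teacher's weights frozen and gradients flowing only into the student.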

Experimental results

Research questions

RQ1: Can a student GAN with far fewer parameters replicate the teacher GAN’s generation function across the latent space?
RQ2: What compression ratios are achievable for MNIST, CIFAR-10, and Celeb-A without substantial loss in image quality?
RQ3: Does knowledge distillation provide advantages over training similarly-sized GANs from scratch in terms of IS, FID, and sharpness?
RQ4: What are the visual and quantitative limits to GAN compression across datasets of varying complexity?
RQ5: How does a joint GAN+MSE loss compare to MSE alone for compression quality, particularly regarding image sharpness?

Key findings

At a fixed parameter budget, distilled student GANs outperform GANs of the same size trained with standard methods on all three datasets.
On MNIST, compression reaches 1,669:1 with 83% of the teacher’s Inception Score preserved.
On CIFAR-10 and Celeb-A, compression reaches 58:1 and 87:1 respectively, with competitive FID scores.
Compressed students approximate the teacher’s generation function across the latent space, indicating knowledge transfer rather than memorization.
Joint loss improves FID slightly and yields significantly sharper images (higher VoL) than MSE-only training, though some blur remains at high compression on more complex data.
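The sharpness comparison above relies on Variance of Laplacian: an image is filtered with a Laplacian kernel and the variance of the responses is reported, so blurrier images score lower. The sketch below is an illustration only. It assumes the common 4-neighbour kernel and skips border pixels; this review does not state which kernel or border handling the paper uses.

```python
def variance_of_laplacian(image):
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    `image` is a 2D list of grayscale values with at least 3 rows and
    3 columns. Border pixels are skipped (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3")
    responses = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] applied at (r, c).
            lap = (image[r - 1][c] + image[r + 1][c]
                   + image[r][c - 1] + image[r][c + 1]
                   - 4 * image[r][c])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((x - mean) ** 2 for x in responses) / len(responses)


# Toy images: a hard vertical edge and a smoothed version of the same edge.
sharp = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
blurred = [[0.0, 0.25, 0.75, 1.0] for _ in range(4)]

print(variance_of_laplacian(sharp))
print(variance_of_laplacian(blurred))
```

On these toy inputs the hard edge yields the larger value, which matches the direction of the reported finding that sharper images have higher VoL.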


This review was created by AI and reviewed by human editors.