Generative Adversarial Networks (GANs) are a class of deep learning models that learn to produce new (or pseudo-real) data. Since their introduction in 2014 and through later refinements, they have dominated the image generation domain and laid the foundations of a new paradigm: deepfakes. Their ability to mimic training data and produce new samples similar to it has gone more or less unmatched. As such, they hold the state of the art (SOTA) in most image generation tasks today.
Despite these advantages, GANs are notoriously hard to train and are prone to issues such as mode collapse and opaque, hard-to-interpret training dynamics. Moreover, researchers have found that GANs favor fidelity over capturing the full diversity of the training data's distribution. As a result, researchers have either tried to improve GANs on this front or looked to other architectures that might perform better at the same tasks.
Two researchers, Prafulla Dhariwal and Alex Nichol from OpenAI, one of the leading AI-research labs, took up the question and looked towards other architectures. In their latest work "Diffusion Models Beat GANs on Image Synthesis", published in the preprint repository arXiv this week, they show that a different deep learning architecture, called diffusion models, addresses the aforementioned shortcomings of GANs. They show that not only are diffusion models better at capturing a greater breadth of the training data's variance compared to GANs, but they also beat the SOTA GANs in image generation tasks.
"We show that models with our improved architecture achieve state-of-the-art on unconditional image synthesis tasks, and with classifier guidance achieve state-of-the-art on conditional image synthesis. When using classifier guidance, we find that we can sample with as few as 25 forward passes while maintaining FIDs comparable to BigGAN. We also compare our improved models to upsampling stacks, finding that the two approaches give complementary improvements and that combining them gives the best results on ImageNet 512x512."
Before moving further, it is important to understand the crux of diffusion models. Diffusion models are another class of deep learning models (specifically, likelihood-based models) that do well in image generation tasks. Unlike GANs, whose generator learns to map random noise to a sample from the training distribution in a single forward pass, diffusion models start from a noisy image and perform a series of de-noising steps that progressively remove the noise and reveal an image that belongs to the training data's distribution.
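The step-by-step de-noising described above can be sketched on a toy one-dimensional problem. This is a minimal illustration, not the authors' implementation: it assumes the standard ancestral sampling rule used by DDPM-style diffusion models, a linear noise schedule, and "data" drawn from a single Gaussian. Because the data is Gaussian, the ideal noise prediction has a closed form, so the hypothetical function `ideal_noise_prediction` stands in for the trained neural network. All names and parameter values are assumptions made for this example.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an assumed choice) and its cumulative products."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def ideal_noise_prediction(x_t, alpha_bar, data_mean, data_std):
    """Stand-in for the trained denoiser: E[noise | x_t] for Gaussian data."""
    var_t = alpha_bar * data_std ** 2 + (1.0 - alpha_bar)
    return (math.sqrt(1.0 - alpha_bar)
            * (x_t - math.sqrt(alpha_bar) * data_mean) / var_t)


def sample(num_steps, data_mean, data_std, rng):
    """Start from pure noise and de-noise one step at a time."""
    betas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)  # the "noisy image" (here a single number)
    for t in reversed(range(num_steps)):
        eps_hat = ideal_noise_prediction(x, alpha_bars[t], data_mean, data_std)
        # Remove the predicted noise for this step.
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / math.sqrt(1.0 - betas[t])
        # Re-inject a small amount of fresh noise on every step but the last.
        if t > 0:
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample(200, data_mean=3.0, data_std=0.5, rng=rng) for _ in range(1000)]
    print("sample mean:", sum(draws) / len(draws))
```

Each sample requires one denoiser call per step, which is why sampling from a diffusion model is slower than the single forward pass a GAN generator needs.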
Dhariwal and Nichol hypothesized that a series of upgrades to the architecture of contemporary diffusion models would improve their performance. They also gave their diffusion models a way to trade off fidelity against variance, a choice already available in GANs. Drawing on the attention layers of the Transformer architecture, they improved the UNet architecture, used Adaptive Group Normalization, and conditioned on class labels. They then trained a fleet of diffusion models and pitted them against the SOTA GANs in image generation tasks.
Both BigGAN and OpenAI's models were trained on the LSUN and ImageNet datasets for unconditional and conditional image generation tasks. The output images were compared using several metrics that measure precision, recall, and fidelity. Most notably, the authors used the venerable Fréchet Inception Distance (FID) and sFID metrics, which quantify the difference between two image distributions.
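To make Adaptive Group Normalization a little more concrete: the paper writes the layer as AdaGN(h, y) = y_s GroupNorm(h) + y_b, where h is an intermediate activation and y_s, y_b are a per-channel scale and shift derived from the timestep and class embedding. The sketch below is a plain-Python illustration of that formula only. The projection that produces the scale and shift is not shown, the function names are hypothetical, and the numbers in the usage example are made up.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize activations to zero mean and unit variance within each group.

    `channels` is a list of C channels, each a list of spatial activations.
    """
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = len(channels) // num_groups
    normalized = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mean) * inv_std for v in channel]
                           for channel in group])
    return normalized


def adaptive_group_norm(channels, scale, shift, num_groups):
    """AdaGN(h, y) = y_s * GroupNorm(h) + y_b, with one scale/shift per channel.

    In a real model, `scale` and `shift` would come from a learned projection
    of the timestep and class embedding; here they are passed in directly.
    """
    normalized = group_norm(channels, num_groups)
    return [[scale[c] * v + shift[c] for v in channel]
            for c, channel in enumerate(normalized)]


if __name__ == "__main__":
    h = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # 4 channels, 2 values each
    out = adaptive_group_norm(h, scale=[2.0] * 4, shift=[5.0] * 4, num_groups=2)
    print(out)
```

The point of the design is that the conditioning signal (which class, which de-noising step) modulates every normalized feature map, rather than being fed in only once at the input.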
OpenAI's diffusion models obtain the best FID on each task and the best sFID on all but one task. The table below shows the results. As stated earlier, FID measures the distance between two image distributions, so a perfect score is 0.0, meaning that the two distributions are identical. Thus, in the table below, the lower the score, the better.
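To see why identical distributions score 0.0, it helps to look at how a Fréchet distance is computed. FID fits a Gaussian to feature vectors extracted from each image set by an Inception network, then measures the distance between the two Gaussians: the squared difference of the means plus a term comparing the covariances. The sketch below is a deliberate simplification, not the metric used in the paper: it assumes diagonal covariances (so no matrix square root is needed) and takes already-extracted feature vectors as plain lists. The function names are hypothetical.

```python
import math


def feature_stats(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(row[d] for row in features) / n for d in range(dim)]
    variances = [sum((row[d] - means[d]) ** 2 for row in features) / n
                 for d in range(dim)]
    return means, variances


def frechet_distance_diagonal(features_a, features_b):
    """Fréchet distance between two Gaussians, assuming diagonal covariances.

    With diagonal covariances the general formula
        |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to a sum over dimensions of
        (mu_a - mu_b)^2 + (sigma_a - sigma_b)^2.
    """
    mu_a, var_a = feature_stats(features_a)
    mu_b, var_b = feature_stats(features_b)
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(va) - math.sqrt(vb)) ** 2
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


if __name__ == "__main__":
    real = [[0.0, 1.0], [2.0, 3.0]]
    generated = [[1.0, 2.0], [3.0, 4.0]]  # same spread, shifted by 1 in each dimension
    print(frechet_distance_diagonal(real, real))
    print(frechet_distance_diagonal(real, generated))
```

Both terms are sums of squares, so the distance is never negative and reaches zero only when the means and variances of the two feature sets match.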
Qualitatively, this leads to the following image outputs. The left column houses results from the SOTA BigGAN-deep model, the middle column has outputs from OpenAI's diffusion models, and the right column has images from the original training dataset.
More samples from the experiment are attached at the end of this article. The astute reader will notice that the images above look perceptually similar, but the authors point out that the diffusion models captured a greater breadth of information from the training set:
"While the samples are of similar perceptual quality, the diffusion model contains more modes than the GAN, such as zoomed ostrich heads, single flamingos, different orientations of cheeseburgers, and a tinca fish with no human holding it."
With these results out in the open now, the researchers believe that diffusion models are an "extremely promising direction" for generative modeling, a domain that has largely been dominated by GANs.
Despite the promising results, the researchers noted that diffusion models have limitations of their own. Currently, training diffusion models requires more computational resources than training GANs. Image synthesis is slower as well, because each image is produced through multiple de-noising steps rather than a single forward pass. In their paper, the researchers point to existing approaches that tackle these issues and that might be explored in the future.