Authors: Prafulla Dhariwal (OpenAI), Alex Nichol (OpenAI)
NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems, December 2021, Article No. 672, Pages 8780–8794
Published: 10 June 2024
Diffusion models beat GANs on image synthesis
ABSTRACT
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512.
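The classifier guidance described in the abstract steers each reverse diffusion step toward a target class: the Gaussian mean predicted by the diffusion model is shifted by the classifier's input gradient, scaled by the step's variance and a guidance scale s that trades diversity for fidelity. The following is a minimal NumPy sketch of that mean-shift rule only, not the paper's implementation; the function and variable names are ours, and the classifier gradient is assumed to be supplied externally (in practice it comes from a classifier trained on noised images).

```python
import numpy as np

def classifier_guided_mean(mu, sigma2, grad_log_p_y, scale=1.0):
    """Shift a reverse-step Gaussian mean using classifier guidance.

    mu:            model-predicted mean of p(x_{t-1} | x_t)
    sigma2:        (diagonal) variance of that Gaussian at this step
    grad_log_p_y:  gradient of log p(y | x_t) w.r.t. x_t, from a classifier
    scale:         guidance scale s; larger values trade diversity for fidelity

    Returns the guided mean  mu + s * sigma2 * grad_log_p_y,  which is then
    used as the mean of the Gaussian the next latent is sampled from.
    """
    return mu + scale * sigma2 * grad_log_p_y

# Toy usage with a hand-picked gradient: a zero mean, variance 0.25,
# unit gradient, and scale 2 shifts every coordinate by 0.5.
mu_guided = classifier_guided_mean(np.zeros(3), 0.25, np.ones(3), scale=2.0)
```

The larger the scale, the more the sampler concentrates on inputs the classifier finds most class-typical, which is exactly the diversity-for-fidelity trade-off the abstract refers to.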
Supplemental Material
3540261.3540933_supp.pdf (35.5 MB)
Published in
NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems, December 2021, 30517 pages
ISBN: 9781713845393
Editors: M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, J. Wortman Vaughan
Copyright © 2021 Neural Information Processing Systems Foundation, Inc.
Publisher: Curran Associates Inc., Red Hook, NY, United States