[Paper reading] Denoising Diffusion Probabilistic Models
2023. 9. 19. 16:55 | ArtificialIntelligence/PaperReading
Abstract
- We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.
- Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
- On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.
Diffusion Model
- This paper presents progress in diffusion probabilistic models. A diffusion probabilistic model is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time.
- Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until signal is destroyed. When the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization.
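As a concrete illustration of the forward process (my own sketch, not code from the paper's repository), the snippet below samples xt directly from x0 using the closed-form marginal q(xt|x0) = N(√ᾱt x0, (1 − ᾱt)I); the linear β schedule from 1e-4 to 0.02 with T = 1000 follows the paper's experimental setup, and all variable names are my own.

```python
import numpy as np

# Fixed linear variance schedule, as used in the paper's experiments.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)          # \bar{alpha}_t

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
    in one shot, instead of applying t small noising steps sequentially."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

x0 = np.zeros((32, 32, 3))      # toy "image", data assumed scaled to [-1, 1]
x_last = q_sample(x0, T - 1)    # essentially pure Gaussian noise
print(alphas_cumprod[-1])       # ~4e-5: almost no signal remains at t = T
```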
- Diffusion models are straightforward to define and efficient to train, but to the best of our knowledge, there has been no demonstration that they are capable of generating high quality samples. We show that diffusion models actually are capable of generating high quality samples, sometimes better than the published results on other types of generative models (Section 4).
- In addition, we show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling.
- Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models. We find that the majority of our models’ lossless codelengths are consumed to describe imperceptible image details.
- We present a more refined analysis of this phenomenon in the language of lossy compression, and we show that the sampling procedure of diffusion models is a type of progressive decoding that resembles autoregressive decoding along a bit ordering that vastly generalizes what is normally possible with autoregressive models.
Diffusion models and denoising autoencoders
- Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances βt of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process.
- We ignore the fact that the forward process variances βt are learnable by reparameterization and instead fix them to constants. Thus, in our implementation, the approximate posterior q has no learnable parameters, so LT is a constant during training and can be ignored.
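A quick numerical check of why LT can be ignored (my own illustration): with the βt fixed to the paper's linear schedule, q(xT|x0) is already almost exactly N(0, I), so the prior-matching KL is both constant in θ and tiny; the paper reports it is on the order of 10^-5 bits per dimension.

```python
import numpy as np

# With the betas fixed to constants, q has no trainable parameters, so
# L_T = KL( q(x_T|x_0) || N(0, I) ) never changes during training.
betas = np.linspace(1e-4, 0.02, 1000)
abar_T = np.prod(1.0 - betas)                 # \bar{alpha}_T, about 4e-5

# q(x_T|x_0) = N(sqrt(abar_T) * x_0, (1 - abar_T) * I), so per pixel:
# KL = 0.5 * (abar_T * x0^2 + (1 - abar_T) - 1 - log(1 - abar_T)).
x0_pixel = 1.0                                # worst case for data in [-1, 1]
var = 1.0 - abar_T
kl_nats = 0.5 * (abar_T * x0_pixel ** 2 + var - 1.0 - np.log(var))
print(kl_nats / np.log(2))                    # on the order of 1e-5 bits/dim
```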
- Since our simplified objective discards the weighting in Eq(12), it is a weighted variational bound that emphasizes different aspects of reconstruction compared to the standard variational bound.
- In particular, our diffusion process causes the simplified objective to down-weight loss terms corresponding to small t.
- These terms train the network to denoise data with very small amounts of noise, so it is beneficial to down-weight them so that the network can focus on more difficult denoising tasks at larger t terms. We will see in our experiments that this reweighting leads to better sample quality.
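For reference, a minimal sketch of one training step on the simplified objective ‖ε − εθ(√ᾱt x0 + √(1 − ᾱt) ε, t)‖², averaged over uniformly sampled t (Algorithm 1 in the paper); the uniform weighting over t is exactly the reweighting discussed above. Here `eps_model` stands in for the paper's U-Net and `alphas_cumprod` is assumed to be a PyTorch tensor; both are placeholders of my own.

```python
import torch

def training_step(eps_model, x0, alphas_cumprod):
    """One loss evaluation of L_simple (Algorithm 1)."""
    b = x0.shape[0]
    # Uniform t: no extra weight on the easy (small-t) denoising terms.
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    abar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps   # sample q(x_t | x_0)
    return ((eps - eps_model(x_t, t)) ** 2).mean()       # predict the noise
```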
- Progressive generation: We also run a progressive unconditional generation process given by progressive decompression from random bits. In other words, we predict the result of the reverse process, x̂0, while sampling from the reverse process using Algorithm 2. Figures 6 and 10 show the resulting sample quality of x̂0 over the course of the reverse process. Large scale image features appear first and details appear last. Figure 7 shows stochastic predictions x̂0 ∼ pθ(x0|xt) with xt frozen for various t. When t is small, all but fine details are preserved, and when t is large, only large scale features are preserved. Perhaps these are hints of conceptual compression.
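The prediction x̂0 used in these progressive plots can be read off from the ε parameterization: since xt = √ᾱt x0 + √(1 − ᾱt) ε, plugging in the network's noise estimate gives x̂0 = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt. A hedged sketch, with `eps_model` again a placeholder for the trained network:

```python
import torch

def predict_x0(eps_model, x_t, t, alphas_cumprod):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps using the
    network's noise estimate; this x0-hat is what the progressive plots show."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x0_hat = (x_t - (1.0 - abar).sqrt() * eps_model(x_t, t)) / abar.sqrt()
    return x0_hat.clamp(-1.0, 1.0)  # clipping to the data range is a common
                                    # practical choice, not part of the formula
```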
- We can therefore interpret the Gaussian diffusion model as a kind of autoregressive model with a generalized bit ordering that cannot be expressed by reordering data coordinates.
- Prior work has shown that such reorderings introduce inductive biases that have an impact on sample quality, so we speculate that the Gaussian diffusion serves a similar purpose, perhaps to greater effect since Gaussian noise might be more natural to add to images compared to masking noise.
- Moreover, the Gaussian diffusion length is not restricted to equal the data dimension; for instance, we use T = 1000, which is less than the dimension of the 32 × 32 × 3 or 256 × 256 × 3 images in our experiments. Gaussian diffusions can be made shorter for fast sampling or longer for model expressiveness.
Related Work
- While diffusion models might resemble flows and VAEs, diffusion models are designed so that q has no parameters and the top-level latent xT has nearly zero mutual information with the data x0.
- Our ε-prediction reverse process parameterization establishes a connection between diffusion models and denoising score matching over multiple noise levels with annealed Langevin dynamics for sampling.
- Diffusion models, however, admit straightforward log likelihood evaluation, and the training procedure explicitly trains the Langevin dynamics sampler using variational inference. The connection also has the reverse implication that a certain weighted form of denoising score matching is the same as variational inference to train a Langevin-like sampler.
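To make the Langevin-like structure concrete, here is a hedged sketch of ancestral sampling (Algorithm 2 in the paper), using σt² = βt, one of the two variance choices the paper discusses; `eps_model` is again a placeholder and `betas` is assumed to be a PyTorch tensor.

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas):
    """Ancestral sampling (Algorithm 2): each step takes a gradient-like move
    against the predicted noise, then adds fresh Gaussian noise, which is the
    same step-plus-noise shape as a Langevin update."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        mean = (x - betas[t] / (1.0 - abar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * z           # sigma_t^2 = beta_t
    return x                                     # final mean shown noiselessly
```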
- Other methods for learning transition operators of Markov chains include infusion training, variational walkback, generative stochastic networks, and others.
Conclusion
- We have presented high quality image samples using diffusion models, and we have found connections among diffusion models and variational inference for training Markov chains, denoising score matching and annealed Langevin dynamics (and energy-based models by extension), autoregressive models, and progressive lossy compression.
- Since diffusion models seem to have excellent inductive biases for image data, we look forward to investigating their utility in other data modalities and as components in other types of generative models and machine learning systems.
Question
- I am not sure what the alternative way of computing things mentioned toward the end of the background, the "Rao-Blackwellized fashion", actually means. Even after reading the Appendix, I still do not understand it.
- Consequently, all KL divergences in Eq(5) are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
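My reading of the "Rao-Blackwellized" remark: both q(xt−1|xt, x0) and pθ(xt−1|xt) are Gaussians, so each KL term has a closed-form expression and nothing needs to be estimated by sampling; the closed form can be seen as averaging out the Monte Carlo sampling noise, which is why it has lower variance. The small numerical check below is my own example, not from the paper.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ); no sampling required."""
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Naive Monte Carlo estimate of the same KL: average log q(x) - log p(x)
# over samples x ~ q. It agrees in expectation but is noisy for finite samples.
rng = np.random.default_rng(0)
m1, s1, m2, s2 = 0.3, 0.8, 0.0, 1.0
x = rng.normal(m1, s1, size=1000)
log_q = -0.5 * ((x - m1) / s1) ** 2 - np.log(s1)  # log-densities, omitting the
log_p = -0.5 * ((x - m2) / s2) ** 2 - np.log(s2)  # shared 0.5*log(2*pi) term
print(kl_gauss(m1, s1, m2, s2), (log_q - log_p).mean())
```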
- To summarize, we can train the reverse process mean function approximator μθ to predict μ̃t (the forward process posterior mean), or by modifying its parameterization, we can train it to predict ε. We have shown that the ε-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model’s variational bound to an objective that resembles denoising score matching.
- I do not understand what it means for the ε-prediction parameterization to "resemble" Langevin dynamics.
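My own summary of why the ε-prediction sampler "resembles" Langevin dynamics: the predicted noise acts as a (negatively) scaled score, so each reverse step has the same shape as a Langevin update, a small move along the score plus fresh Gaussian noise:

```latex
% DDPM reverse step (Eq. (11) / Algorithm 2), with z ~ N(0, I):
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \Big( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,
                \epsilon_\theta(x_t, t) \Big) + \sigma_t z
% Langevin dynamics on a density p, with step size \eta and z ~ N(0, I):
x' = x + \frac{\eta}{2}\, \nabla_x \log p(x) + \sqrt{\eta}\, z
% Identifying -\epsilon_\theta(x_t, t) / \sqrt{1-\bar{\alpha}_t} with the
% score \nabla_{x_t} \log p(x_t) makes the two updates structurally alike.
```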
- Similar to the discretized continuous distributions used in VAE decoders and autoregressive models, our choice here ensures that the variational bound is a lossless codelength of discrete data, without need of adding noise to the data or incorporating the Jacobian of the scaling operation into the log likelihood. At the end of sampling, we display μθ(x1, 1) noiselessly.
- I am not sure what it means in this part to add noise to the data, or to incorporate the Jacobian.
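My understanding of this passage: the pixels are integers in {0, ..., 255} scaled to [−1, 1], and the decoder assigns each pixel the Gaussian probability mass of its bin of width 2/255, so log pθ(x0|x1) is a genuine log probability of discrete data. The alternatives it avoids are dequantization (adding uniform noise to the integers so a continuous density can be used) and treating the scaling to [−1, 1] as a change of variables whose Jacobian would have to be added to the log likelihood. A hedged sketch of the binned likelihood (Eq. (13) in the paper), with a hypothetical decoder mean and scale:

```python
import math

def discretized_gaussian_logprob(x, mean, std):
    """log-probability of one pixel x in [-1, 1] (scaled from {0, ..., 255}):
    the Gaussian mass of the bin [x - 1/255, x + 1/255], with the two edge
    bins extended to -inf / +inf, cf. Eq. (13) in the paper."""
    def cdf(v):  # Gaussian CDF via the error function
        return 0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0))))
    upper = 1.0 if x >= 1.0 else cdf(x + 1.0 / 255)
    lower = 0.0 if x <= -1.0 else cdf(x - 1.0 / 255)
    return math.log(max(upper - lower, 1e-12))   # clamp for numerical safety

# Hypothetical decoder output for a pixel whose scaled value is 0.1;
# std = 0.01 roughly matches sigma_1 = sqrt(beta_1) from the linear schedule.
print(discretized_gaussian_logprob(0.1, mean=0.1, std=0.01))
```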