




Generative Modeling by Estimating Gradients of the Data Distribution

Yang Song, Stanford University
Stefano Ermon, Stanford University

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.

1 Introduction
Generative models have many applications in machine learning. To list a few, they have been used to generate high-fidelity images [26, 6], synthesize realistic speech and music fragments [58], improve the performance of semi-supervised learning [28, 10], detect adversarial examples and other anomalous data [54], perform imitation learning [22], and explore promising states in reinforcement learning [41]. Recent progress is mainly driven by two approaches: likelihood-based methods [17, 29, 11, 60] and generative adversarial networks (GANs [15]). The former uses log-likelihood (or a suitable surrogate) as the training objective, while the latter uses adversarial training to minimize f-divergences [40] or integral probability metrics [2, 55] between model and data distributions. Although likelihood-based models and GANs have achieved great success, they have some intrinsic limitations. For example, likelihood-based models either have to use specialized architectures to build a normalized probability model (e.g., autoregressive models, flow models), or use surrogate losses (e.g., the evidence lower bound used in variational auto-encoders [29], contrastive divergence in energy-based models [21]) for training. GANs avoid some of the limitations of likelihood-based models, but their training can be unstable due to the adversarial training procedure. In addition, the GAN objective is not suitable for evaluating and comparing different GAN models. While other objectives exist for generative modeling, such as noise contrastive estimation [19] and minimum probability flow [50], these methods typically only work well for low-dimensional data.

In this paper, we explore a new principle for generative modeling based on estimating and sampling from the (Stein) score [33] of the logarithmic data density, which is the gradient of the log-density function at the input data point. This is a vector field pointing in the direction where the log data density grows the most. We use a neural network trained with score matching [24] to learn this vector field from data. We then produce samples using Langevin dynamics, which approximately works by gradually moving a random initial sample to high density regions along the (estimated) vector field of scores.
However, there are two main challenges with this approach. First, if the data distribution is supported on a low dimensional manifold, as is often assumed for many real world datasets, the score will be undefined in the ambient space, and score matching will fail to provide a consistent score estimator. Second, the scarcity of training data in low data density regions, e.g., far from the manifold, hinders the accuracy of score estimation and slows down the mixing of Langevin dynamics sampling. Since Langevin dynamics will often be initialized in low-density regions of the data distribution, inaccurate score estimation in these regions will negatively affect the sampling process. Moreover, mixing can be difficult because of the need to traverse low density regions to transition between modes of the distribution.

To tackle these two challenges, we propose to perturb the data with random Gaussian noise of various magnitudes. Adding random noise ensures the resulting distribution does not collapse to a low dimensional manifold. Large noise levels will produce samples in low density regions of the original (unperturbed) data distribution, thus improving score estimation. Crucially, we train a single score network conditioned on the noise level and estimate the scores at all noise magnitudes. We then propose an annealed version of Langevin dynamics, where we initially use scores corresponding to the highest noise level, and gradually anneal down the noise level until it is small enough to be indistinguishable from the original data distribution. Our sampling strategy is inspired by simulated annealing [30, 37], which heuristically improves optimization for multimodal landscapes.

Our approach has several desirable properties. First, our objective is tractable for almost all parameterizations of the score networks without the need of special constraints or architectures, and can be optimized without adversarial training, MCMC sampling, or other approximations during training. The objective can also be used to quantitatively compare different models on the same dataset. Experimentally, we demonstrate the efficacy of our approach on MNIST, CelebA [34], and CIFAR-10 [31]. We show that the samples look comparable to those generated from modern likelihood-based models and GANs. On CIFAR-10, our model sets the new state-of-the-art inception score of 8.87 for unconditional generative models, and achieves a competitive FID score of 25.32. We show that the model learns meaningful representations of the data by image inpainting experiments.

2 Score-based generative modeling
Suppose our dataset consists of i.i.d. samples $\{\mathbf{x}_i \in \mathbb{R}^D\}_{i=1}^N$ from an unknown data distribution $p_\text{data}(\mathbf{x})$. We define the score of a probability density $p(\mathbf{x})$ to be $\nabla_\mathbf{x} \log p(\mathbf{x})$. The score network $\mathbf{s}_\theta : \mathbb{R}^D \to \mathbb{R}^D$ is a neural network parameterized by $\theta$, which will be trained to approximate the score of $p_\text{data}(\mathbf{x})$. The goal of generative modeling is to use the dataset to learn a model for generating new samples from $p_\text{data}(\mathbf{x})$. The framework of score-based generative modeling has two ingredients: score matching and Langevin dynamics.

2.1 Score matching for score estimation
Score matching [24] was originally designed for learning non-normalized statistical models based on i.i.d. samples from an unknown data distribution. Following [53], we repurpose it for score estimation. Using score matching, we can directly train a score network $\mathbf{s}_\theta(\mathbf{x})$ to estimate $\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})$ without training a model to estimate $p_\text{data}(\mathbf{x})$ first. Different from the typical usage of score matching, we opt not to use the gradient of an energy-based model as the score network, to avoid extra computation due to higher-order gradients. The objective minimizes $\frac{1}{2}\mathbb{E}_{p_\text{data}}[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\|_2^2]$, which can be shown equivalent to the following up to a constant

$$\mathbb{E}_{p_\text{data}(\mathbf{x})}\Big[\mathrm{tr}(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|_2^2\Big], \qquad (1)$$

where $\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})$ denotes the Jacobian of $\mathbf{s}_\theta(\mathbf{x})$. As shown in [53], under some regularity conditions the minimizer of Eq. (1) (denoted as $\mathbf{s}_{\theta^*}(\mathbf{x})$) satisfies $\mathbf{s}_{\theta^*}(\mathbf{x}) = \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})$ almost surely. In practice, the expectation over $p_\text{data}(\mathbf{x})$ in Eq. (1) can be quickly estimated using data samples. However, score matching is not scalable to deep networks and high dimensional data [53] due to the computation of $\mathrm{tr}(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x}))$.
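To make the cost of the Jacobian trace concrete, the following is a minimal sketch (not from the paper) of the objective in Eq. (1) in PyTorch, written for a small input dimension where computing the trace with one backward pass per dimension is still affordable; the network `ScoreNet`, its size, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical score network for illustration: maps R^D -> R^D.
class ScoreNet(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def exact_score_matching_loss(score_net, x):
    """Eq. (1): E[ tr(grad_x s(x)) + 0.5 * ||s(x)||^2 ].

    The Jacobian trace needs D backward passes per batch, which is why this
    objective does not scale to deep networks and images.
    """
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                   # (B, D)
    norm_term = 0.5 * (s ** 2).sum(dim=1)
    trace = torch.zeros(x.shape[0], device=x.device)
    for d in range(x.shape[1]):
        # d s_d / d x_d, one backward pass per dimension.
        grad_d = torch.autograd.grad(s[:, d].sum(), x, create_graph=True)[0]
        trace = trace + grad_d[:, d]
    return (trace + norm_term).mean()
```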
Below we discuss two popular methods for large scale score matching.

Denoising score matching. Denoising score matching [61] is a variant of score matching that completely circumvents $\mathrm{tr}(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x}))$. It first perturbs the data point $\mathbf{x}$ with a pre-specified noise distribution $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ and then employs score matching to estimate the score of the perturbed data distribution $q_\sigma(\tilde{\mathbf{x}}) \triangleq \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_\text{data}(\mathbf{x})\, \mathrm{d}\mathbf{x}$. The objective was proved equivalent to the following:

$$\frac{1}{2}\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p_\text{data}(\mathbf{x})}\big[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\|_2^2\big]. \qquad (2)$$

As shown in [61], the optimal score network (denoted as $\mathbf{s}_{\theta^*}(\mathbf{x})$) that minimizes Eq. (2) satisfies $\mathbf{s}_{\theta^*}(\mathbf{x}) = \nabla_\mathbf{x} \log q_\sigma(\mathbf{x})$ almost surely. However, $\mathbf{s}_{\theta^*}(\mathbf{x}) = \nabla_\mathbf{x} \log q_\sigma(\mathbf{x}) \approx \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})$ is true only when the noise is small enough such that $q_\sigma(\mathbf{x}) \approx p_\text{data}(\mathbf{x})$.
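As an illustration only, assuming the Gaussian corruption kernel that the paper later adopts (for which $\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = -(\tilde{\mathbf{x}} - \mathbf{x})/\sigma^2$), a denoising score matching step might look like the sketch below; the batch layout and the score network are assumptions carried over from the previous snippet.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    """Eq. (2) with q_sigma(x~ | x) = N(x~ | x, sigma^2 I).

    The score of the corruption kernel is -(x~ - x) / sigma^2 in closed form,
    so no Jacobian trace is needed.
    """
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2          # grad of log q_sigma(x~ | x)
    s = score_net(x_tilde)
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()
```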
Sliced score matching. Sliced score matching [53] uses random projections to approximate $\mathrm{tr}(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x}))$ in score matching. The objective is

$$\mathbb{E}_{p_\mathbf{v}}\mathbb{E}_{p_\text{data}}\Big[\mathbf{v}^\top \nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})\, \mathbf{v} + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|_2^2\Big], \qquad (3)$$

where $p_\mathbf{v}$ is a simple distribution of random vectors, e.g., the multivariate standard normal. As shown in [53], the term $\mathbf{v}^\top \nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})\, \mathbf{v}$ can be efficiently computed by forward mode auto-differentiation. Unlike denoising score matching, which estimates the scores of perturbed data, sliced score matching provides score estimation for the original unperturbed data distribution, but requires around four times more computation due to the forward mode auto-differentiation.
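A minimal sketch of Eq. (3) follows. Here the quadratic form $\mathbf{v}^\top \nabla_\mathbf{x}\mathbf{s}_\theta(\mathbf{x})\, \mathbf{v}$ is obtained with an extra reverse-mode pass through $(\mathbf{s}_\theta(\mathbf{x}) \cdot \mathbf{v})$, which computes the same quantity as the forward-mode approach used in the paper; the network and the single-projection default are assumptions.

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """Eq. (3): E_v E_x [ v^T grad_x s(x) v + 0.5 * ||s(x)||^2 ]."""
    x = x.detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)                       # v ~ N(0, I)
        s = score_net(x)                              # (B, D)
        sv = (s * v).sum()                            # scalar: sum over batch of s . v
        # grad of (s . v) w.r.t. x equals (grad_x s)^T v per sample.
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
        quad = (grad_sv * v).sum(dim=1)               # v^T grad_x s(x) v
        norm_term = 0.5 * (s ** 2).sum(dim=1)
        total = total + (quad + norm_term).mean()
    return total / n_projections
```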
2.2 Sampling with Langevin dynamics

Langevin dynamics can produce samples from a probability density $p(\mathbf{x})$ using only the score function $\nabla_\mathbf{x} \log p(\mathbf{x})$. Given a fixed step size $\epsilon > 0$ and an initial value $\tilde{\mathbf{x}}_0 \sim \pi(\mathbf{x})$, with $\pi$ being a prior distribution, the Langevin method recursively computes

$$\tilde{\mathbf{x}}_t = \tilde{\mathbf{x}}_{t-1} + \frac{\epsilon}{2} \nabla_\mathbf{x} \log p(\tilde{\mathbf{x}}_{t-1}) + \sqrt{\epsilon}\, \mathbf{z}_t, \qquad (4)$$

where $\mathbf{z}_t \sim \mathcal{N}(0, I)$. The distribution of $\tilde{\mathbf{x}}_T$ equals $p(\mathbf{x})$ when $\epsilon \to 0$ and $T \to \infty$, in which case $\tilde{\mathbf{x}}_T$ becomes an exact sample from $p(\mathbf{x})$ under some regularity conditions [62]. When $\epsilon > 0$ and $T < \infty$, a Metropolis-Hastings update is in principle needed to correct the error of Eq. (4), but it can often be ignored in practice when $\epsilon$ is small and $T$ is large.
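For concreteness, here is a minimal sketch of the Langevin update in Eq. (4), assuming a trained score network is available; the default step size and number of steps are illustrative, not values from the paper.

```python
import torch

@torch.no_grad()
def langevin_dynamics(score_net, x0, step_size=1e-4, n_steps=1000):
    """Iterate Eq. (4): x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_net(x) + (step_size ** 0.5) * z
    return x
```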
4 Noise Conditional Score Networks: learning and inference

4.1 Noise conditional score networks

Let $\{\sigma_i\}_{i=1}^L$ be a positive geometric sequence that satisfies $\sigma_1/\sigma_2 = \cdots = \sigma_{L-1}/\sigma_L > 1$. Let $q_\sigma(\mathbf{x}) \triangleq \int p_\text{data}(\mathbf{t})\, \mathcal{N}(\mathbf{x} \mid \mathbf{t}, \sigma^2 I)\, \mathrm{d}\mathbf{t}$ denote the perturbed data distribution. We choose the noise levels $\{\sigma_i\}_{i=1}^L$ such that $\sigma_1$ is large enough to mitigate the difficulties discussed in Section 3, and $\sigma_L$ is small enough to minimize the effect on data. We aim to train a conditional score network to jointly estimate the scores of all perturbed data distributions, i.e., $\forall \sigma \in \{\sigma_i\}_{i=1}^L: \mathbf{s}_\theta(\mathbf{x}, \sigma) \approx \nabla_\mathbf{x} \log q_\sigma(\mathbf{x})$. Note that $\mathbf{s}_\theta(\mathbf{x}, \sigma) \in \mathbb{R}^D$ when $\mathbf{x} \in \mathbb{R}^D$. We call $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ a Noise Conditional Score Network (NCSN).
Similar to likelihood-based generative models and GANs, the design of model architectures plays an important role in generating high quality samples. In this work, we mostly focus on architectures useful for image generation, and leave architecture design for other domains as future work. Since the output of our noise conditional score network $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ has the same shape as the input image $\mathbf{x}$, we draw inspiration from successful model architectures for dense prediction of images (e.g., semantic segmentation). In the experiments, our model $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ combines the architecture design of U-Net [46] with dilated/atrous convolution [64, 65, 8], both of which have been proved very successful in semantic segmentation. In addition, we adopt instance normalization in our score network, inspired by its superior performance in some image generation tasks [57, 13, 23], and we use a modified version of conditional instance normalization [13] to provide conditioning on $\sigma_i$. More details on our architecture can be found in Appendix A.
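The paper's exact conditioning layer is specified in its Appendix A; the following is only a simplified sketch of plain conditional instance normalization, where each noise level index selects its own learned scale and bias (layer sizes, names, and the per-index embedding are assumptions, not the paper's modified variant).

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance normalization whose affine parameters depend on the noise level index.

    Simplified sketch: one (gamma, beta) pair is learned per noise level.
    """
    def __init__(self, num_features, num_noise_levels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.embed = nn.Embedding(num_noise_levels, 2 * num_features)
        # Initialize scales near 1 and biases at 0.
        self.embed.weight.data[:, :num_features].normal_(1.0, 0.02)
        self.embed.weight.data[:, num_features:].zero_()

    def forward(self, x, sigma_idx):
        h = self.norm(x)
        gamma, beta = self.embed(sigma_idx).chunk(2, dim=1)
        return gamma[:, :, None, None] * h + beta[:, :, None, None]
```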
4.2 Learning NCSNs via score matching

Both sliced and denoising score matching can train NCSNs. We adopt denoising score matching as it is slightly faster and naturally fits the task of estimating scores of noise-perturbed data distributions. However, we emphasize that empirically sliced score matching can train NCSNs as well as denoising score matching. We choose the noise distribution to be $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}} \mid \mathbf{x}, \sigma^2 I)$; therefore $\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = -(\tilde{\mathbf{x}} - \mathbf{x})/\sigma^2$. For a given $\sigma$, the denoising score matching objective (Eq. (2)) is

$$\ell(\theta; \sigma) \triangleq \frac{1}{2}\mathbb{E}_{p_\text{data}(\mathbf{x})}\mathbb{E}_{\tilde{\mathbf{x}} \sim \mathcal{N}(\mathbf{x}, \sigma^2 I)}\left[\left\|\mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2}\right\|_2^2\right]. \qquad (5)$$

Then, we combine Eq. (5) for all $\sigma \in \{\sigma_i\}_{i=1}^L$ to get one unified objective

$$\mathcal{L}(\theta; \{\sigma_i\}_{i=1}^L) \triangleq \frac{1}{L}\sum_{i=1}^L \lambda(\sigma_i)\, \ell(\theta; \sigma_i), \qquad (6)$$

where $\lambda(\sigma_i) > 0$ is a coefficient function depending on $\sigma_i$. Assuming $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ has enough capacity, $\mathbf{s}_{\theta^*}(\mathbf{x}, \sigma)$ minimizes Eq. (6) if and only if $\mathbf{s}_{\theta^*}(\mathbf{x}, \sigma_i) = \nabla_\mathbf{x} \log q_{\sigma_i}(\mathbf{x})$ a.s. for all $i \in \{1, 2, \cdots, L\}$, because Eq. (6) is a conical combination of $L$ denoising score matching objectives.

There can be many possible choices of $\lambda(\cdot)$. Ideally, we hope that the values of $\lambda(\sigma_i)\,\ell(\theta; \sigma_i)$ for all $\{\sigma_i\}_{i=1}^L$ are roughly of the same order of magnitude. Empirically, we observe that when the score networks are trained to optimality, we approximately have $\|\mathbf{s}_\theta(\mathbf{x}, \sigma)\|_2 \propto 1/\sigma$. This inspires us to choose $\lambda(\sigma) = \sigma^2$, because under this choice we have $\lambda(\sigma)\,\ell(\theta; \sigma) = \sigma^2 \ell(\theta; \sigma) = \frac{1}{2}\mathbb{E}\big[\|\sigma\, \mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma}\|_2^2\big]$. Since $\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma} \sim \mathcal{N}(0, I)$ and $\|\sigma\, \mathbf{s}_\theta(\mathbf{x}, \sigma)\|_2 \propto 1$, we can easily conclude that the order of magnitude of $\lambda(\sigma)\,\ell(\theta; \sigma)$ does not depend on $\sigma$.

We emphasize that our objective Eq. (6) requires no adversarial training, no surrogate losses, and no sampling from the score network during training (e.g., unlike contrastive divergence). Also, it does not require $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ to have special architectures in order to be tractable. In addition, when $\lambda(\cdot)$ and $\{\sigma_i\}_{i=1}^L$ are fixed, it can be used to quantitatively compare different NCSNs.
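Reading Eqs. (5)-(6) together with the choice $\lambda(\sigma) = \sigma^2$, one training step can be sketched as follows; drawing a random noise level per example and the interface `score_net(x_tilde, idx)` are implementation assumptions, not prescriptions from the paper.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Eqs. (5)-(6) with lambda(sigma) = sigma^2.

    A noise level is drawn uniformly per example, a Monte Carlo estimate of the
    average over the L levels in Eq. (6). score_net(x_tilde, idx) is assumed to
    return s_theta(x_tilde, sigma_idx); `sigmas` is a 1-D tensor of noise levels.
    """
    idx = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(x.shape[0], *([1] * (x.dim() - 1)))  # broadcast over pixels
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2                      # grad of log q_sigma(x~ | x)
    s = score_net(x_tilde, idx)
    per_example = 0.5 * ((s - target) ** 2).flatten(start_dim=1).sum(dim=1)
    weight = sigmas[idx] ** 2                         # lambda(sigma) = sigma^2
    return (weight * per_example).mean()
```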
4.3 NCSN inference via annealed Langevin dynamics

Algorithm 1 Annealed Langevin dynamics.
Require: $\{\sigma_i\}_{i=1}^L$, $\epsilon$, $T$.
1: Initialize $\tilde{\mathbf{x}}_0$
2: for $i \leftarrow 1$ to $L$ do
3:   $\alpha_i \leftarrow \epsilon \cdot \sigma_i^2 / \sigma_L^2$   ($\alpha_i$ is the step size.)
4:   for $t \leftarrow 1$ to $T$ do
5:     Draw $\mathbf{z}_t \sim \mathcal{N}(0, I)$
6:     $\tilde{\mathbf{x}}_t \leftarrow \tilde{\mathbf{x}}_{t-1} + \frac{\alpha_i}{2}\, \mathbf{s}_\theta(\tilde{\mathbf{x}}_{t-1}, \sigma_i) + \sqrt{\alpha_i}\, \mathbf{z}_t$
7:   end for
8:   $\tilde{\mathbf{x}}_0 \leftarrow \tilde{\mathbf{x}}_T$
9: end for
return $\tilde{\mathbf{x}}_T$

After the NCSN $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ is trained, we propose a sampling approach, annealed Langevin dynamics (Alg. 1), to produce samples, inspired by simulated annealing [30] and annealed importance sampling [37]. As shown in Alg. 1, we start annealed Langevin dynamics by initializing the samples from some fixed prior distribution, e.g., uniform noise. Then, we run Langevin dynamics to sample from $q_{\sigma_1}(\mathbf{x})$ with step size $\alpha_1$. Next, we run Langevin dynamics to sample from $q_{\sigma_2}(\mathbf{x})$, starting from the final samples of the previous simulation and using a reduced step size $\alpha_2$. We continue in this fashion, using the final samples of Langevin dynamics for $q_{\sigma_{i-1}}(\mathbf{x})$ as the initial samples of Langevin dynamics for $q_{\sigma_i}(\mathbf{x})$, and tuning down the step size $\alpha_i$ gradually with $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$. Finally, we run Langevin dynamics to sample from $q_{\sigma_L}(\mathbf{x})$, which is close to $p_\text{data}(\mathbf{x})$ when $\sigma_L \approx 0$.
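As a runnable counterpart to Alg. 1, here is a sketch under the same interface assumption as the earlier snippets, i.e., `score_net(x, idx)` returns $\mathbf{s}_\theta(\mathbf{x}, \sigma_i)$ for the level index `idx`.

```python
import torch

@torch.no_grad()
def annealed_langevin_dynamics(score_net, x, sigmas, eps=2e-5, T=100):
    """Algorithm 1: T Langevin steps per noise level, largest sigma first.

    `sigmas` is assumed sorted in decreasing order; alpha_i = eps * sigma_i^2 / sigma_L^2.
    """
    sigma_L = sigmas[-1]
    for i, sigma in enumerate(sigmas):
        alpha = eps * (sigma / sigma_L) ** 2          # step size for this level
        idx = torch.full((x.shape[0],), i, device=x.device, dtype=torch.long)
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, idx) + alpha ** 0.5 * z
        # The final samples at this level initialize the next, smaller, level.
    return x
```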
Since the distributions $\{q_{\sigma_i}\}_{i=1}^L$ are all perturbed by Gaussian noise, their supports span the whole space and their scores are well-defined, avoiding difficulties from the manifold hypothesis. When $\sigma_1$ is sufficiently large, the low density regions of $q_{\sigma_1}(\mathbf{x})$ become small and the modes become less isolated. As discussed previously, this can make score estimation more accurate, and the mixing of Langevin dynamics faster. We can therefore assume that Langevin dynamics produce good samples for $q_{\sigma_1}(\mathbf{x})$. These samples are likely to come from high density regions of $q_{\sigma_1}(\mathbf{x})$, which means they are also likely to reside in the high density regions of $q_{\sigma_2}(\mathbf{x})$, given that $q_{\sigma_1}(\mathbf{x})$ and $q_{\sigma_2}(\mathbf{x})$ only slightly differ from each other. As score estimation and Langevin dynamics perform better in high density regions, samples from $q_{\sigma_1}(\mathbf{x})$ will serve as good initial samples for Langevin dynamics of $q_{\sigma_2}(\mathbf{x})$. Similarly, $q_{\sigma_{i-1}}(\mathbf{x})$ provides good initial samples for $q_{\sigma_i}(\mathbf{x})$, and finally we obtain samples of good quality from $q_{\sigma_L}(\mathbf{x})$.

Table 1: Inception and FID scores for CIFAR-10.

  Model                      Inception     FID
  CIFAR-10 Unconditional
    PixelCNN [59]            4.60          65.93
    PixelIQN [42]            5.29          49.46
    EBM [12]                 6.02          40.58
    WGAN-GP [18]             7.86 ± .07    36.4
    MoLM [45]                7.90 ± .10    18.9
    SNGAN [36]               8.22 ± .05    21.7
    ProgressiveGAN [25]      8.80 ± .05    -
    NCSN (Ours)              8.87 ± .12    25.32
  CIFAR-10 Conditional
    EBM [12]                 8.30          37.9
    SNGAN [36]               8.60 ± .08    25.5
    BigGAN [6]               9.22          14.73

Figure 4: Intermediate samples of annealed Langevin dynamics.

There could be many possible ways of tuning $\alpha_i$ according to $\sigma_i$ in Alg. 1. Our choice is $\alpha_i \propto \sigma_i^2$. The motivation is to fix the magnitude of the "signal-to-noise" ratio $\frac{\alpha_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)}{2\sqrt{\alpha_i}\, \mathbf{z}}$ in Langevin dynamics. Note that

$$\mathbb{E}\left[\left\|\frac{\alpha_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)}{2\sqrt{\alpha_i}\, \mathbf{z}}\right\|_2^2\right] \approx \mathbb{E}\left[\frac{\alpha_i \|\mathbf{s}_\theta(\mathbf{x}, \sigma_i)\|_2^2}{4}\right] \propto \frac{1}{4}\mathbb{E}\big[\|\sigma_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)\|_2^2\big].$$

Recall that empirically we found $\|\mathbf{s}_\theta(\mathbf{x}, \sigma)\|_2 \propto 1/\sigma$ when the score network is trained close to optimal, in which case $\mathbb{E}[\|\sigma_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)\|_2^2] \propto 1$. Therefore $\left\|\frac{\alpha_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)}{2\sqrt{\alpha_i}\, \mathbf{z}}\right\|_2^2 \propto \frac{1}{4}\mathbb{E}[\|\sigma_i\, \mathbf{s}_\theta(\mathbf{x}, \sigma_i)\|_2^2] \propto \frac{1}{4}$ does not depend on $\sigma_i$.

To demonstrate the efficacy of our annealed Langevin dynamics, we provide a toy example where the goal is to sample from a mixture of Gaussians with two well-separated modes using only scores. We apply Alg. 1 to sample from the mixture of Gaussians used in Section 3.2. In the experiment, we choose $\{\sigma_i\}_{i=1}^L$ to be a geometric progression, with $L = 10$, $\sigma_1 = 10$ and $\sigma_{10} = 0.1$. The results are provided in Fig. 3. Comparing Fig. 3 (b) against (c), annealed Langevin dynamics correctly recover the relative weights between the two modes whereas standard Langevin dynamics fail.
5 Experiments

In this section, we demonstrate that our NCSNs are able to produce high quality image samples on several commonly used image datasets. In addition, we show that our models learn reasonable image representations via image inpainting experiments.

Setup. We use MNIST, CelebA [34], and CIFAR-10 [31] datasets in our experiments. For CelebA, the images are first center-cropped to 140 × 140 and then resized to 32 × 32. All images are rescaled so that pixel values are in [0, 1]. We choose $L = 10$ different standard deviations such that $\{\sigma_i\}_{i=1}^L$ is a geometric sequence with $\sigma_1 = 1$ and $\sigma_{10} = 0.01$. Note that Gaussian noise of $\sigma = 0.01$ is almost indistinguishable to human eyes for image data. When using annealed Langevin dynamics for image generation, we choose $T = 100$ and $\epsilon = 2 \times 10^{-5}$, and use uniform noise as our initial samples. We found the results are robust w.r.t. the choice of $T$, and $\epsilon$ between $5 \times 10^{-6}$ and $5 \times 10^{-5}$ generally works fine. We provide additional details on model architecture and settings in the appendix.
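To tie the setup together, the geometric noise schedule and the sampler call described in this section could be assembled as in the sketch below; the image shape and the `score_net` / `annealed_langevin_dynamics` interfaces are the illustrative assumptions used in the earlier snippets.

```python
import torch

# Geometric sequence of L = 10 noise levels: sigma_1 = 1 down to sigma_10 = 0.01.
sigmas = torch.logspace(0, -2, steps=10)   # tensor([1.0000, ..., 0.0100])

# Uniform noise in [0, 1] as initial samples, matching the rescaled pixel range.
x0 = torch.rand(16, 3, 32, 32)

# With a trained NCSN, sampling would then be (see the earlier sketch):
# samples = annealed_langevin_dynamics(score_net, x0, sigmas, eps=2e-5, T=100)
```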