Implicit Generation and Modeling with Energy-Based Models

Yilun Du (MIT CSAIL), Igor Mordatch (Google Brain)

Abstract

Energy based models (EBMs) are appealing due to their generality and simplicity in likelihood modeling, but have been traditionally difficult to train. We present techniques to scale MCMC based EBM training on continuous neural networks, and we show its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches, while covering all modes of the data. We highlight some unique capabilities of implicit generation such as compositionality and corrupt image reconstruction and inpainting. Finally, we show that EBMs are useful models across a wide variety of tasks, achieving state-of-the-art out-of-distribution
classification, adversarially robust classification, state-of-the-art continual online class learning, and coherent long term predicted trajectory rollouts.

1 Introduction

Learning models of the data distribution and generating samples are important problems in machine learning for which a number of methods have been proposed, such as Variational Autoencoders (VAEs) [Kingma and Welling, 2014] and Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. In this work, we advocate for continuous energy-based models (EBMs), represented as neural networks, for generative modeling tasks and as
a building block for a wide variety of tasks. These models aim to learn an energy function $E(x)$ that assigns low energy values to inputs $x$ in the data distribution and high energy values to other inputs. Importantly, they allow the use of an implicit sample generation procedure, where sample $x$ is found from $x \sim e^{-E(x)}$ through MCMC sampling. Combining implicit sampling with energy-based models for generative modeling has a number of conceptual advantages compared to methods such as VAEs and GANs which use explicit functions to generate samples:

Simplicity and Stability: An EBM is the only object that needs
to be trained and designed. Separate networks are not tuned to ensure balance (for example, [He et al., 2019] point out unbalanced training can result in posterior collapse in VAEs or poor performance in GANs [Kurach et al., 2018]).

Sharing of Statistical Strength: Since the EBM is the only trained object, it requires fewer model parameters than approaches that use multiple networks. More importantly, the model being concentrated in a single network allows the training process to develop a shared set of features as opposed to developing them redundantly in separate networks.

Adaptive Computation
Time: Implicit sample generation in our work is an iterative stochastic optimization process, which allows for a trade-off between generation quality and computation time. This allows for a system that can make fast coarse guesses or more deliberate inferences by running the optimization process longer. It also allows for refinement of external guesses.

Work done at OpenAI. Correspondence to: Additional results, source code, and pre-trained models are available at
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Flexibility of Generation: The power of an explicit generator network can become a bottleneck on the generation quality. For example, VAEs and flow-based models are bound by the manifold structure of the prior distribution and consequently have issues modeling discontinuous data manifolds, often assigning probability mass
to areas unwarranted by the data. EBMs avoid this issue by directly modeling particular regions as high or low energy.

Compositionality: If we think of energy functions as costs for certain goals or constraints, summation of two or more energies corresponds to satisfying all their goals or constraints [Mnih and Hinton, 2004, Haarnoja et al., 2017]. While such composition is simple for energy functions (or product of experts [Hinton, 1999]), it induces complex changes to the generator that may be difficult to capture with explicit generator networks.

Despite these advantages, energy-based models with implicit generation have been difficult to use on complex high-dimensional data domains. In this work, we use Langevin dynamics [Welling and Teh, 2011], which uses gradient information for effective sampling and initializes chains from random noise for more mixing. We further maintain a replay buffer of past samples (similarly to [Tieleman, 2008] or [Mnih et al., 2013]) and use them to initialize Langevin dynamics to allow mixing between chains. An overview of our approach is presented in Figure 1.

[Figure 1 diagram labels: Training Data; Replay Buffer; Langevin Dynamics (Eq. 1), steps 1, 2, 3, ..., K; X, X-, X+; ML Objective (Eq. 2)]

Figure 1:
Overview of our method and the interrelationship of the components involved.

Empirically, we show that energy-based models trained on CIFAR-10 or ImageNet image datasets generate higher quality image samples than likelihood models, nearing the quality of contemporary GAN approaches, while not suffering from mode collapse. The models exhibit properties such as correctly assigning lower likelihood to out-of-distribution images than other methods (no spurious modes) and generating diverse plausible image completions (covering all data modes). Implicit generation allows our models to naturally
denoise or inpaint corrupted images, convert general images to an image from a specific class, and generate samples that are compositions of multiple independent models.

Our contributions in this work are threefold. Firstly, we present an algorithm and techniques for training energy-based models that scale to challenging high-dimensional domains. Secondly, we highlight unique properties of energy-based models with implicit generation, such as compositionality and automatic decorruption and inpainting. Finally, we show that energy-based models are useful across a series of domains, on tasks such as out-of-distribution generalization, adversarially robust classification, multi-step trajectory prediction and online learning.

2 Related Work

Energy-based models (EBMs) have a long history in machine learning. Ackley et al. [1985], Hinton [2006], Salakhutdinov and Hinton [2009] proposed latent based EBMs where energy is represented as a composition of latent and observable variables. In contrast Mnih and Hinton [2004], Hinton et al. [2006] proposed EBMs where inputs are directly mapped to outputs, a structure we follow. We refer readers to [LeCun et al., 2006] for a comprehensive tutorial on
energy models.

The primary difficulty in training EBMs comes from effectively estimating and sampling the partition function. One approach to training energy based models is to sample the partition function through amortized generation. Kim and Bengio [2016], Zhao et al. [2016], Haarnoja et al. [2017], Kumar et al. [2019] propose learning a separate network to generate samples, which makes these methods closely connected to GANs [Finn et al., 2016], but these methods do not have the advantages of implicit sampling noted in the introduction. Furthermore, amortized generation is prone to mode collapse, especially when training the sampling network without an entropy term, which is often approximated or ignored.

An alternative approach is to use MCMC sampling to estimate the partition function. This has an advantage of provable mode exploration and allows the benefits of implicit generation listed in the
introduction. Hinton [2006] proposed Contrastive Divergence, which uses gradient free MCMC chains initialized from training data to estimate the partition function. Similarly, Salakhutdinov and Hinton [2009] apply contrastive divergence, while Tieleman [2008] proposes PCD, which propagates MCMC chains throughout training. By contrast, we initialize chains from random noise, allowing each mode of the model to be visited with equal probability. But initialization from random noise comes at a cost of longer mixing times. As a result we use gradient based MCMC (Langevin dynamics) for more efficient sampling and to offset the increase of mixing time, which was also studied previously in [Teh et al., 2003, Xie et al., 2016]. We note that HMC [Neal, 2011] may be an even more efficient gradient algorithm for MCMC sampling, though we found Langevin dynamics to be more stable. To allow gradient based MCMC, we use continuous inputs, while most approaches have used discrete inputs. We build on the idea of PCD and maintain a replay buffer of past samples to additionally reduce mixing times.

3 Energy-Based Models and Sampling

Given a datapoint $x$, let $E_\theta(x) \in \mathbb{R}$ be the energy function. In our work this function is
represented by a deep neural network parameterized by weights $\theta$. The energy function defines a probability distribution via the Boltzmann distribution $p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}$, where $Z(\theta) = \int \exp(-E_\theta(x))\,dx$ denotes the partition function. Generating samples from this distribution is challenging, with previous
work relying on MCMC methods such as random walk or Gibbs sampling [Hinton, 2006]. These methods have long mixing times, especially for high-dimensional complex data such as images. To improve the mixing time of the sampling procedure, we use Langevin dynamics, which makes use of the gradient of the energy function to undergo sampling

$$\tilde{x}^k = \tilde{x}^{k-1} - \frac{\lambda}{2} \nabla_x E_\theta(\tilde{x}^{k-1}) + \omega^k, \quad \omega^k \sim \mathcal{N}(0, \lambda) \tag{1}$$

where we let the above iterative procedure define a distribution $q_\theta$ such that $\tilde{x}^K \sim q_\theta$. As shown by Welling and Teh [2011], as $K \to \infty$ and $\lambda \to 0$ then $q_\theta \to p_\theta$ and this procedure generates samples from the distribution defined by the energy function. Thus, samples are generated implicitly by the energy function $E$ as opposed to being explicitly generated by a feedforward network.

In the domain of images, if the energy network has a convolutional architecture, the energy gradient $\nabla_x E$ in (1) conveniently has a deconvolutional architecture. Thus it mirrors a typical image generator network architecture, but without it needing to be explicitly designed or balanced. We take two views of the energy function $E$: firstly, it is an object that defines a probability distribution over data and secondly it defines an implicit generator via (1).

3.1 Maximum Likelihood Training

We want the distribution defined by $E$ to model the data distribution $p_D$, which we do by minimizing the negative log likelihood of the data $\mathcal{L}_{ML}(\theta) = \mathbb{E}_{x \sim p_D}[-\log p_\theta(x)]$, where $-\log p_\theta(x) = E_\theta(x) + \log Z(\theta)$. This objective is known to have the gradient (see [Turner, 2005] for derivation) $\nabla_\theta \mathcal{L}_{ML} = \mathbb{E}_{x^+ \sim p_D}[\nabla_\theta E_\theta(x^+)] - \mathbb{E}_{x^- \sim p_\theta}[\nabla_\theta E_\theta(x^-)]$. Intuitively, this gradient decreases the energy of the positive data samples $x^+$, while increasing the energy of the negative samples $x^-$ from the model $p_\theta$. We rely on Langevin dynamics in (1) to generate $q_\theta$ as an approximation of $p_\theta$:

$$\nabla_\theta \mathcal{L}_{ML} \approx \mathbb{E}_{x^+ \sim p_D}[\nabla_\theta E_\theta(x^+)] - \mathbb{E}_{x^- \sim q_\theta}[\nabla_\theta E_\theta(x^-)]. \tag{2}$$

This is similar to the gradient of the Wasserstein GAN objective [Arjovsky et al., 2017], but with an implicit MCMC generating procedure and no gradient through sampling. This lack of gradient is important as it controls between the diversity in likelihood models and the mode collapse in GANs.

The approximation in (2) is exact when Langevin dynamics generates samples from $p_\theta$, which happens after a sufficient number of steps (mixing time). We show in the supplement that $p_D$ and $q_\theta$ appear to match each other in distribution, showing evidence that $p_\theta$ matches $q_\theta$. We note that even in cases when a particular chain does not fully mix, since our initial proposal distribution is a uniform distribution, all modes are still equally likely to be explored. The deterministic case of the procedure in (1) is $x = \arg\min E(x)$, which makes the connection to implicit functions more clear.

3.2 Sample Replay Buffer

Langevin dynamics does not place restrictions on sample initialization $\tilde{x}^0$ given sufficient sampling steps. However, initialization plays a crucial role in mixing time. Persistent Contrastive Divergence (PCD) [Tieleman, 2008] maintains a single persistent chain to improve mixing and sample quality. We use a sample replay buffer $\mathcal{B}$ in which we store past generated samples $\tilde{x}$ and use either these samples or uniform noise to initialize the Langevin dynamics procedure. This has the benefit of continuing to refine past samples, further increasing the number of sampling steps $K$ as well as sample diversity. In all our experiments, we sample from $\mathcal{B}$ 95% of the time and from uniform noise otherwise.

3.3 Regularization and Algorithm

Arbitrary energy models can have sharp changes in gradients that can make sampling with Langevin dynamics unstable. We found that constraining the Lipschitz constant of the energy network can ameliorate these issues. To constrain the Lipschitz constant, we follow the method of [Miyato et al., 2018] and add spectral normalization to all layers of the model. Additionally, we found it useful to weakly L2 regularize energy magnitudes for both positive and negative samples during training, as otherwise, while the difference between positive and negative samples was preserved, the actual values would fluctuate to numerically unstable values. Both forms of regularization also serve to ensure that the partition function is integrable over the domain of the input, with spectral normalization ensuring smoothness and the L2 coefficient bounding the magnitude of the unnormalized distribution. We present the algorithm below, where $\Omega(\cdot)$ indicates the stop gradient operator.

Algorithm 1 Energy training algorithm
Input: data dist. $p_D(x)$, step size $\lambda$, number of steps $K$
$\mathcal{B} \leftarrow \emptyset$
while not converged do
    $x_i^+ \sim p_D$
    $x_i^0 \sim \mathcal{B}$ with 95% probability and $\mathcal{U}$ otherwise
    ▷ Generate sample from $q_\theta$ via Langevin dynamics:
    for sample step $k = 1$ to $K$ do
        $\tilde{x}^k \leftarrow \tilde{x}^{k-1} - \frac{\lambda}{2} \nabla_x E_\theta(\tilde{x}^{k-1}) + \omega^k, \quad \omega^k \sim \mathcal{N}(0, \lambda)$
    end for
    $x_i^- = \Omega(\tilde{x}_i^K)$
    ▷ Optimize objective $\alpha \mathcal{L}_2 + \mathcal{L}_{ML}$ wrt $\theta$:
    $\Delta\theta \leftarrow \nabla_\theta \frac{1}{N} \sum_i \left[ \alpha \left( E_\theta(x_i^+)^2 + E_\theta(x_i^-)^2 \right) + E_\theta(x_i^+) - E_\theta(x_i^-) \right]$
    Update $\theta$ based on $\Delta\theta$ using Adam optimizer
    $\mathcal{B} \leftarrow \mathcal{B} \cup \tilde{x}_i$
end while

Figure 2: Conditional ImageNet32x32 EBM samples

4 Image Modeling

[Figure 3 panels: (a) GLOW Model, (b) EBM, (c) EBM (10 historical), (d) EBM Sample Buffer]
Figure 3: Comparison of image generation techniques on unconditional CIFAR-10 dataset.

In this section, we show that EBMs are effective generative models for images. We show EBMs are able to generate high fidelity images and exhibit mode coverage on CIFAR-10 and ImageNet. We further show EBMs exhibit adversarial robustness and better out-of-distribution behavior than other likelihood models. Our model is based on the ResNet architecture (using conditional gains and biases per class [Dumoulin et al.] for conditional models) with details in the supplement. We present sensitivity analysis, likelihoods, and ablations in the supplement in A.4. We provide a comparison between EBMs and other likelihood models in A.5. Overall, we find that EBMs are more parameter- and computation-efficient than likelihood models, though less efficient than GANs.

4.1 Image Generation

We show unconditional CIFAR-10 images in Figure 3, with comparisons to GLOW [Kingma and Dhariwal, 2018], and conditional ImageNet32x32 images in Figure 2. We provide qualitative images of ImageNet128x128 and other visualizations in A.1.

| Model | Inception* | FID |
|---|---|---|
| CIFAR-10 Unconditional | | |
| PixelCNN [Van Oord et al., 2016] | 4.60 | 65.93 |
| PixelIQN [Ostrovski et al., 2018] | 5.29 | 49.46 |
| EBM (single) | 6.02 | 40.58 |
| DCGAN [Radford et al., 2016] | 6.40 | 37.11 |
| WGAN + GP [Gulrajani et al., 2017] | 6.50 | 36.4 |
| EBM (10 historical ensemble) | 6.78 | 38.2 |
| SNGAN [Miyato et al., 2018] | 8.22 | 21.7 |
| CIFAR-10 Conditional | | |
| Improved GAN | 8.09 | - |
| EBM (single) | 8.30 | 37.9 |
| Spectral Normalization GAN | 8.59 | 25.5 |
| ImageNet 32x32 Conditional | | |
| PixelCNN | 8.33 | 33.27 |
| PixelIQN | 10.18 | 22.99 |
| EBM (single) | 18.22 | 14.31 |
| ImageNet 128x128 Conditional | | |
| ACGAN [Odena et al., 2017] | 28.5 | - |
| EBM* (single) | 28.6 | 43.7 |
| SNGAN | 36.8 | 27.62 |

Figure 4: Table of Inception and FID scores for ImageNet32x32 and CIFAR-10. Quantitative numbers for ImageNet32x32 from [Ostrovski et al., 2018]. (*) We use Inception Score (from original OpenAI repo) to compare with legacy models, but strongly encourage future work to compare solely with FID score, since Langevin dynamics converges to minima that artificially inflate Inception Score. (*) Conditional EBM models for 128x128 are smaller than those in SNGAN.

[Figure 5 panel labels: Salt and Pepper (0.1), Inpainting, Ground Truth Initialization]
Figure 5: EBM image restoration on images in the test set via MCMC. The right column shows failure (approx. 10% of objects change with ground truth initialization and 30% of objects change in salt/pepper corruption or inpainting. Bottom two rows show worst case of change.)

We quantitatively evaluate image quality of EBMs with Inception score [Salimans et al., 2016] and FID score [Heusel et al., 2017] in the table in Figure 4. Overall we obtain significantly better scores than likelihood models PixelCNN and PixelIQN, but worse than SNGAN [Miyato et al., 2018]. We found that in the unconditional case, mode exploration with Langevin dynamics took a very long time, so we also experimented in EBM (10 historical ensemble) with sampling jointly from the last 10 snapshots of the model. At training time, extensive exploration is ensured with the replay buffer (Figure 3d). Our models have a similar number of parameters to SNGAN, but we believe that significantly more parameters may be necessary to generate high fidelity images with mode coverage. On ImageNet128x128, due to computational constraints, we train a smaller network than SNGAN and do not train to convergence.

4.2 Mode Evaluation

We evaluate over-fitting and mode coverage in EBMs. To test over-fitting, we plotted a histogram of energies for the CIFAR-10 train and test datasets in Figure 11 and note
almost identical curves. In the supplement, we show that the nearest neighbors of generated images are not identical to images in the training dataset.

To test mode coverage in EBMs, we investigate MCMC sampling on corrupted CIFAR-10 test images. Since Langevin dynamics is known to mix slowly [Neal, 2011] and reach local minima, we believe that good den
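
The training procedure of Section 3 (Langevin sampling as in Eq. 1, the contrastive gradient of Eq. 2, the replay buffer with 95% reuse, and the weak L2 penalty on energies) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the energy is a hypothetical one-dimensional quadratic $E_\theta(x) = \frac{1}{2}(x - \theta)^2$ instead of a neural network, plain gradient descent is swapped in for Adam, spectral normalization is omitted, and the function names, step size, buffer cap, and uniform noise range are all invented for this example.

```python
import random


def energy(x, theta):
    """Toy energy E_theta(x) = 0.5 * (x - theta)^2 (hypothetical stand-in for a network)."""
    return 0.5 * (x - theta) ** 2


def grad_x(x, theta):
    """Gradient of the toy energy with respect to the input x."""
    return x - theta


def grad_theta(x, theta):
    """Gradient of the toy energy with respect to the parameter theta."""
    return theta - x


def langevin(x, theta, rng, step_size=0.1, num_steps=20):
    """Run K Langevin steps: x <- x - (step/2) * dE/dx + noise, noise ~ N(0, step)."""
    for _ in range(num_steps):
        noise = rng.gauss(0.0, step_size ** 0.5)  # std dev is sqrt(variance)
        x = x - 0.5 * step_size * grad_x(x, theta) + noise
    return x


def train(data, rng, iters=500, batch=16, lr=0.05, alpha=0.01,
          buffer_prob=0.95, buffer_cap=1000, noise_range=5.0):
    """Fit theta with the contrastive gradient plus a weak L2 penalty on energies."""
    theta = 0.0
    buffer = []  # replay buffer of past negative samples
    for _ in range(iters):
        grad = 0.0
        for _ in range(batch):
            x_pos = rng.choice(data)  # positive sample from the data
            # Initialize the chain from the buffer 95% of the time, else uniform noise.
            if buffer and rng.random() < buffer_prob:
                x0 = rng.choice(buffer)
            else:
                x0 = rng.uniform(-noise_range, noise_range)
            # The negative sample is a plain float, so no gradient flows
            # through the sampler (the stop-gradient in the algorithm).
            x_neg = langevin(x0, theta, rng)
            buffer.append(x_neg)
            e_pos, e_neg = energy(x_pos, theta), energy(x_neg, theta)
            g_pos, g_neg = grad_theta(x_pos, theta), grad_theta(x_neg, theta)
            # d/dtheta of alpha * (E+^2 + E-^2) + E+ - E-
            grad += 2.0 * alpha * (e_pos * g_pos + e_neg * g_neg) + g_pos - g_neg
        theta -= lr * grad / batch  # plain SGD step (assumption; not Adam)
        del buffer[:-buffer_cap]    # keep only the most recent samples
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(2.0, 1.0) for _ in range(500)]
    print("learned theta:", train(data, rng))
```

For this toy energy the model distribution is a unit-variance Gaussian centered at `theta`, so training should move `theta` toward the data mean. The short chains and the uniform restarts leave a small bias, which is the adaptive computation trade-off described in the introduction: more Langevin steps give better negative samples at higher cost.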