




Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

Santosh S. Vempala
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332

Andre Wibisono
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution ν = e^{−f} on ℝⁿ. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming ν satisfies a log-Sobolev inequality and f has bounded Hessian. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in Rényi divergence of order q > 1 assuming the limit of ULA satisfies either the log-Sobolev or the Poincaré inequality.

1 Introduction

Sampling is a fundamental algorithmic task. Many applications require sampling from probability distributions in high-dimensional spaces, and in modern applications the probability distributions are complicated and non-logconcave.
While the setting of logconcave functions is well studied, it is important to have efficient sampling algorithms with good convergence guarantees beyond the logconcavity assumption. There is a close interplay between sampling and optimization, either via optimization as a limit of sampling (annealing) [34, 55], or via sampling as optimization in the space of distributions [36, 62]. Motivated by the widespread use of non-convex optimization and sampling, there is resurgent interest in understanding non-logconcave sampling.

In this paper we study a simple algorithm, the Unadjusted Langevin Algorithm (ULA), for sampling from a target probability distribution ν = e^{−f} on ℝⁿ. ULA requires oracle access to the gradient ∇f of the log density f = −log ν. In particular, ULA does not require knowledge of f itself, which makes it applicable in practice, where we often know ν only up to a normalizing constant. As the step size ε → 0, ULA recovers the Langevin dynamics, which is a continuous-time stochastic process in ℝⁿ that converges to ν. We recall the optimization interpretation of the Langevin dynamics for sampling as the gradient flow of the Kullback-Leibler (KL) divergence with respect to ν in the space of probability distributions with the Wasserstein metric [36].
When ν is strongly logconcave, the KL divergence is a strongly convex objective function, so the Langevin dynamics as gradient flow converges exponentially fast [6, 60].

From the classical theory of Markov chains and diffusion processes, there are several known conditions milder than logconcavity that are sufficient for rapid convergence in continuous time. These include isoperimetric inequalities such as the Poincaré inequality or the log-Sobolev inequality (LSI). Along the Langevin dynamics in continuous time, the Poincaré inequality implies an exponential convergence rate in χ²-divergence, while LSI, which is stronger, implies an exponential convergence rate in KL divergence (as well as in Rényi divergence). However, in discrete time, sampling under the Poincaré inequality or LSI is a more challenging problem. ULA is an inexact discretization of the Langevin dynamics, and it converges to a biased limit ν_ε ≠ ν. When ν is strongly logconcave and smooth, it is known how to control the bias and prove a convergence guarantee on KL divergence along ULA [17, 21, 22, 24].
When ν is strongly logconcave, there are many sampling algorithms with provable rapid convergence; these include the ball walk and hit-and-run [37, 43, 44, 42] (which give truly polynomial algorithms), various discretizations of the overdamped or underdamped Langevin dynamics [21, 22, 24, 8, 26] (which have polynomial dependence on smoothness parameters but low dependence on dimension), and Hamiltonian Monte Carlo [47, 48, 25, 39, 16]. It is of great interest to extend these results to non-logconcave densities, where existing results require strong assumptions, with bounds that grow exponentially with the dimension or other parameters [2, 18, 45, 49]. There are recent works that analyze convergence of sampling using various techniques such as reflection coupling [28], kernel methods [29], and higher-order integrators [40], albeit still under some strong conditions such as distant dissipativity, which is similar to strong logconcavity outside a bounded domain.
In this paper we study the convergence along ULA under minimal (and necessary) isoperimetric assumptions, namely LSI and the Poincaré inequality. These are sufficient for fast convergence in continuous time; moreover, in the case of a logconcave distribution, the log-Sobolev and Poincaré constants can be bounded and lead to convergence guarantees for efficient sampling in discrete time. However, do they suffice on their own, without the assumption of logconcavity?

We note that LSI and the Poincaré inequality apply to a wider class of measures than logconcave distributions. In particular, LSI and the Poincaré inequality are preserved under bounded perturbation and Lipschitz mapping, whereas logconcavity is destroyed. Given these properties, it is easy to exhibit examples of non-logconcave distributions satisfying LSI or the Poincaré inequality. For example, we can take a small perturbation of a convex body that makes it nonconvex but still satisfies isoperimetry; then the uniform probability distribution (or a smooth version of it) on the body is not logconcave but satisfies LSI and the Poincaré inequality. Similarly, we can start with a strongly logconcave distribution and make bounded perturbations; the resulting (normalized) probability distribution is not logconcave, but it satisfies LSI and the Poincaré inequality. See Figure 1 for an illustration.

Figure 1: Illustrations of non-logconcave distributions satisfying isoperimetry: uniform distribution on a nonconvex set (left) and a perturbation of a logconcave distribution (right).
We measure the mode of convergence using KL divergence and Rényi divergence of order q ≥ 1, which is stronger. Our first main result says the only further assumption we need is smoothness. We say ν = e^{−f} is L-smooth if ∇f is L-Lipschitz. Here H_ν(ρ) is the KL divergence between ρ and ν. See Theorem 2 in Section 3.1 for more detail.

Theorem 2. Assume ν = e^{−f} satisfies the log-Sobolev inequality with constant α > 0 and is L-smooth. ULA with step size 0 < ε ≤ α/(4L²) satisfies

    H_ν(ρ_k) ≤ e^{−αεk} H_ν(ρ_0) + 8εnL²/α.

In particular, for any δ > 0, choosing ε ≤ αδ/(16nL²) makes the bias term at most δ/2, after which k ≥ (1/(αε)) log(2H_ν(ρ_0)/δ) iterations give H_ν(ρ_k) ≤ δ.

While Rényi divergence is a stronger measure of convergence than KL divergence, the situation is more complicated. First, we can hope to converge to the biased limit ν_ε only for finite q for any step size ε (as we illustrate with an example). Second, it is unclear how to bound the Rényi divergence between ν_ε and ν.
We first show the following convergence guarantees for Rényi divergence along the continuous-time Langevin dynamics under LSI or the Poincaré inequality; see Theorem 3 and Theorem 5. Here R_{q,ν}(ρ) is the Rényi divergence of order q between ρ and ν.

Theorem 3. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. Along the Langevin dynamics,

    R_{q,ν}(ρ_t) ≤ e^{−2αt/q} R_{q,ν}(ρ_0).

Theorem 5. Suppose ν satisfies the Poincaré inequality with constant α > 0. Let q ≥ 2. Along the Langevin dynamics,

    R_{q,ν}(ρ_t) ≤ R_{q,ν}(ρ_0) − 2αt/q,      if R_{q,ν}(ρ_0) ≥ 1, for as long as R_{q,ν}(ρ_t) ≥ 1;
    R_{q,ν}(ρ_t) ≤ e^{−2αt/q} R_{q,ν}(ρ_0),   if R_{q,ν}(ρ_0) ≤ 1.

Notice that under the Poincaré inequality, compared to LSI, the convergence is slower in the beginning, before it becomes exponential. For a reasonable starting distribution (such as a Gaussian centered at a stationary point), this leads to an extra factor of n compared to the convergence under LSI.

We then turn to discrete time and show the convergence of Rényi divergence along ULA to the biased limit ν_ε, under the assumption that ν_ε itself satisfies either LSI or the Poincaré inequality. We combine this with a decomposition result on Rényi divergence to derive a convergence guarantee for Rényi divergence to ν along ULA; see Theorem 4 and Theorem 6.

In what follows, we review KL divergence and its properties along the Langevin dynamics in Section 2, and prove a convergence guarantee for KL divergence along ULA under LSI in Section 3. We provide a review of Rényi divergence and its properties along the Langevin dynamics in Section 4. We then prove the convergence guarantee for Rényi divergence along ULA under LSI in Section 5, and under the Poincaré inequality in Section 6. We conclude with a discussion in Section 7.
2 Review of KL divergence along Langevin dynamics

In this section we review the definition of Kullback-Leibler (KL) divergence, the log-Sobolev inequality, and the convergence of KL divergence along the Langevin dynamics in continuous time under the log-Sobolev inequality. See Appendix A.1 for a review of notation.

2.1 KL divergence

Let ρ, ν be probability distributions on ℝⁿ, represented via their probability density functions with respect to the Lebesgue measure on ℝⁿ. We assume ρ, ν have full support and smooth densities. Recall the Kullback-Leibler (KL) divergence of ρ with respect to ν is

    H_ν(ρ) = ∫_{ℝⁿ} ρ(x) log(ρ(x)/ν(x)) dx.    (1)

KL divergence is the relative form of Shannon entropy H(ρ) = −∫_{ℝⁿ} ρ(x) log ρ(x) dx. Whereas Shannon entropy can be positive or negative, KL divergence is nonnegative and minimized at ν: H_ν(ρ) ≥ 0 for all ρ, and H_ν(ρ) = 0 if and only if ρ = ν. Therefore, KL divergence serves as a measure of (albeit asymmetric) "distance" of a probability distribution ρ from a base distribution ν. KL divergence is a relatively strong measure of distance; for example, Pinsker's inequality implies that KL divergence controls total variation distance. Furthermore, under the log-Sobolev (or Talagrand) inequality, KL divergence also controls the quadratic Wasserstein distance W₂, as we review below.

We say ν = e^{−f} is L-smooth if f has bounded Hessian: −LI ⪯ ∇²f(x) ⪯ LI for all x ∈ ℝⁿ. We provide the proof of Lemma 1 in Appendix B.1.1.

Lemma 1. Suppose ν = e^{−f} is L-smooth. Let ρ = N(x*, (1/L) I), where x* is a stationary point of f. Then

    H_ν(ρ) ≤ f(x*) + (n/2) log(L/(2π)).
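To make definition (1) concrete: for isotropic Gaussians ρ = N(0, σ²I) and ν = N(0, τ²I), the integral (1) evaluates in closed form to H_ν(ρ) = (n/2)(σ²/τ² − 1 − log(σ²/τ²)). The following minimal sketch (not part of the paper; all function names are ours) compares this closed form against a direct Monte Carlo estimate of (1).

```python
import numpy as np

def kl_gaussian_isotropic(sigma2, tau2, n):
    """Closed-form H_nu(rho) for rho = N(0, sigma2*I), nu = N(0, tau2*I) in R^n."""
    r = sigma2 / tau2
    return 0.5 * n * (r - 1.0 - np.log(r))

def kl_monte_carlo(sigma2, tau2, n, num_samples=200_000, seed=0):
    """Monte Carlo estimate of (1): E_rho[log rho(X) - log nu(X)]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=np.sqrt(sigma2), size=(num_samples, n))
    sq = (x ** 2).sum(axis=1)
    log_rho = -0.5 * sq / sigma2 - 0.5 * n * np.log(2 * np.pi * sigma2)
    log_nu = -0.5 * sq / tau2 - 0.5 * n * np.log(2 * np.pi * tau2)
    return (log_rho - log_nu).mean()

n, sigma2, tau2 = 10, 2.0, 1.0
print(kl_gaussian_isotropic(sigma2, tau2, n))  # exact: 5*(1 - log 2) ~ 1.534
print(kl_monte_carlo(sigma2, tau2, n))         # should agree up to Monte Carlo error
```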
2.2 Log-Sobolev inequality

Recall we say ν satisfies the log-Sobolev inequality (LSI) with a constant α > 0 if for all smooth functions g: ℝⁿ → ℝ with E_ν[g²] < ∞,

    E_ν[g² log g²] − E_ν[g²] log E_ν[g²] ≤ (2/α) E_ν[‖∇g‖²].    (2)

Taking g² = ρ/ν shows LSI is equivalent to the following relation between KL divergence and the relative Fisher information

    J_ν(ρ) = E_ρ[‖∇ log(ρ/ν)‖²]:    (3)

ν satisfies LSI with constant α if and only if for all probability distributions ρ,

    H_ν(ρ) ≤ (1/(2α)) J_ν(ρ).    (4)

A strongly logconcave distribution satisfies LSI: if f is α-strongly convex, then ν = e^{−f} satisfies LSI with constant α, by the Bakry-Émery criterion. LSI is preserved under bounded perturbation (Holley-Stroock) and under tensorization: if ν₁ and ν₂ satisfy LSI with constants α₁, α₂ > 0, respectively, then ν₁ ⊗ ν₂ satisfies LSI with constant min{α₁, α₂} > 0. Thus, there are many examples of non-logconcave distributions on ℝⁿ satisfying LSI (with a constant independent of dimension). There are also Lyapunov function criteria and exponential integrability conditions that can be used to verify when a probability distribution satisfies LSI; see for example [14, 15, 51, 61, 7].

2.2.1 Talagrand inequality

Recall the Wasserstein distance between ρ and ν is

    W₂(ρ, ν) = inf (E‖X − Y‖²)^{1/2},    (5)

where the infimum is over joint distributions of (X, Y) with the correct marginals X ~ ρ, Y ~ ν. Recall we say ν satisfies the Talagrand inequality with a constant α > 0 if for all ρ:

    (α/2) W₂(ρ, ν)² ≤ H_ν(ρ).    (6)

Talagrand's inequality implies concentration of measure of Gaussian type. It was first studied by Talagrand [58] for Gaussian ν, and extended by Otto and Villani [54] to all ν satisfying LSI; namely, if ν satisfies LSI with constant α > 0, then ν also satisfies Talagrand's inequality with the same constant [54, Theorem 1]. Therefore, under LSI, KL divergence controls the Wasserstein distance. Moreover, when ν is log-concave, LSI and Talagrand's inequality are equivalent [54, Corollary 3.1]. We recall in Appendix A.2 the geometric interpretation of LSI and Talagrand's inequality from [54].
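As a concrete check of (5) and (6): for centered isotropic Gaussians, W₂(N(0, σ²I), N(0, τ²I)) = √n |σ − τ|, and ν = N(0, (1/α)I) satisfies LSI, hence Talagrand's inequality, with constant α. The sketch below (ours, not from the paper) verifies (α/2)W₂² ≤ H_ν numerically across several values of σ².

```python
import numpy as np

def w2_gaussian_isotropic(sigma2, tau2, n):
    """W2 distance (5) between N(0, sigma2*I) and N(0, tau2*I) in R^n."""
    return np.sqrt(n) * abs(np.sqrt(sigma2) - np.sqrt(tau2))

def kl_gaussian_isotropic(sigma2, tau2, n):
    """Closed-form KL divergence (1) between the same pair of Gaussians."""
    r = sigma2 / tau2
    return 0.5 * n * (r - 1.0 - np.log(r))

alpha, n = 2.0, 10      # nu = N(0, (1/alpha) I) satisfies Talagrand with constant alpha
tau2 = 1.0 / alpha
for sigma2 in [0.1, 0.5, 1.0, 5.0]:
    lhs = 0.5 * alpha * w2_gaussian_isotropic(sigma2, tau2, n) ** 2
    rhs = kl_gaussian_isotropic(sigma2, tau2, n)
    print(f"sigma2={sigma2}: (alpha/2) W2^2 = {lhs:.4f} <= H_nu = {rhs:.4f}")
```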
2.3 Langevin dynamics

The Langevin dynamics for target distribution ν = e^{−f} is a continuous-time stochastic process (X_t)_{t≥0} in ℝⁿ that evolves following the stochastic differential equation

    dX_t = −∇f(X_t) dt + √2 dW_t,    (7)

where (W_t)_{t≥0} is the standard Brownian motion in ℝⁿ with W₀ = 0. If (X_t)_{t≥0} evolves following the Langevin dynamics (7), then the probability density functions (ρ_t)_{t≥0} evolve following the Fokker-Planck equation

    ∂ρ_t/∂t = ∇·(ρ_t ∇f) + Δρ_t = ∇·(ρ_t ∇ log(ρ_t/ν)).    (8)

Here ∇· is the divergence and Δ is the Laplacian operator. We provide a derivation in Appendix A.3. From (8), if ρ_t = ν, then ∂ρ_t/∂t = 0, so ν is the stationary distribution for the Langevin dynamics (7). Moreover, the Langevin dynamics brings any distribution ρ_t of X_t closer to the target distribution ν, as the following lemma shows.

Lemma 2. Along the Langevin dynamics (7) (or equivalently, the Fokker-Planck equation (8)),

    (d/dt) H_ν(ρ_t) = −J_ν(ρ_t).    (9)

We provide the proof of Lemma 2 in Appendix B.1.2. Since J_ν(ρ) ≥ 0, the identity (9) shows KL divergence is decreasing along the Langevin dynamics, so indeed the distribution ρ_t converges to ν.

2.3.1 Exponential convergence of KL divergence along Langevin dynamics under LSI

When ν satisfies LSI, KL divergence converges exponentially fast along the Langevin dynamics.

Theorem 1. Suppose ν satisfies LSI with constant α > 0. Along the Langevin dynamics (7),

    H_ν(ρ_t) ≤ e^{−2αt} H_ν(ρ_0).    (10)

Furthermore, W₂(ρ_t, ν) ≤ √(2H_ν(ρ_0)/α) e^{−αt}.

We provide the proof of Theorem 1 in Appendix B.1.3; in brief, combining LSI (4) with the identity (9) gives (d/dt) H_ν(ρ_t) ≤ −2α H_ν(ρ_t), integrating yields (10), and the W₂ bound then follows by applying Talagrand's inequality (6) to (10).
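For a Gaussian target, (10) can be checked in closed form: the Langevin dynamics (7) for ν = N(0, (1/α)I) is the Ornstein-Uhlenbeck process, whose marginals remain Gaussian, ρ_t = N(0, σ_t²I) with σ_t² = 1/α + (σ₀² − 1/α)e^{−2αt}. The sketch below (our illustration, not from the paper) evaluates H_ν(ρ_t) exactly along this flow and compares it with the bound e^{−2αt}H_ν(ρ_0).

```python
import numpy as np

def kl_gaussian_isotropic(sigma2, tau2, n):
    """Closed-form KL divergence (1) between N(0, sigma2*I) and N(0, tau2*I)."""
    r = sigma2 / tau2
    return 0.5 * n * (r - 1.0 - np.log(r))

# OU marginals: rho_t = N(0, s_t^2 I), s_t^2 = 1/alpha + (s0^2 - 1/alpha) e^{-2 alpha t}.
alpha, n, s0_2 = 1.0, 10, 4.0
H0 = kl_gaussian_isotropic(s0_2, 1.0 / alpha, n)
for t in [0.0, 0.5, 1.0, 2.0]:
    st_2 = 1.0 / alpha + (s0_2 - 1.0 / alpha) * np.exp(-2 * alpha * t)
    Ht = kl_gaussian_isotropic(st_2, 1.0 / alpha, n)
    bound = np.exp(-2 * alpha * t) * H0
    print(f"t={t:.1f}: H_nu(rho_t) = {Ht:8.4f} <= bound {bound:8.4f}")
```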
We also recall the optimization interpretation of the Langevin dynamics as the gradient flow of KL divergence in the space of distributions with the Wasserstein metric [36, 60, 54]. Then the exponential convergence rate in Theorem 1 is a manifestation of the general fact that gradient flow converges exponentially fast under a gradient domination condition. This provides a justification for using the Langevin dynamics for sampling from ν, as a natural steepest descent flow that minimizes the KL divergence H_ν.
3 Unadjusted Langevin Algorithm

Suppose we wish to sample from a smooth target probability distribution ν = e^{−f} on ℝⁿ. The Unadjusted Langevin Algorithm (ULA) with step size ε > 0 is the discrete-time algorithm

    x_{k+1} = x_k − ε∇f(x_k) + √(2ε) z_k,    (11)

where z_k ~ N(0, I) is an independent standard Gaussian random variable in ℝⁿ. Let ρ_k denote the probability distribution of x_k that evolves following ULA. As ε → 0, ULA recovers the Langevin dynamics (7) in continuous time. However, for fixed ε > 0, ULA converges to a biased limiting distribution ν_ε ≠ ν. Therefore, the KL divergence H_ν(ρ_k) does not tend to 0 along ULA, as it has an asymptotic bias H_ν(ν_ε) > 0.

Example 1. Let ν = N(0, (1/α) I). The ULA iteration is x_{k+1} = (1 − αε)x_k + √(2ε) z_k. For 0 < ε < 2/α, the iterates converge to the biased limit ν_ε = N(0, σ_ε² I), where the stationary variance satisfies σ_ε² = (1 − αε)² σ_ε² + 2ε, i.e., σ_ε² = 1/(α(1 − αε/2)) > 1/α, so indeed ν_ε ≠ ν.
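The iteration (11) is a few lines of code. The following minimal NumPy sketch (ours; names such as `ula` and `grad_f` are illustrative, not from the paper) runs many chains in parallel on the Gaussian target of Example 1 and compares the empirical stationary variance with the biased limit 1/(α(1 − αε/2)).

```python
import numpy as np

def ula(grad_f, x0, step, num_steps, rng):
    """Run the ULA iteration (11): x <- x - step*grad_f(x) + sqrt(2*step)*z."""
    x = x0.copy()
    for _ in range(num_steps):
        z = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2 * step) * z
    return x

# Example 1: nu = N(0, (1/alpha) I), i.e. f(x) = alpha*||x||^2/2, grad f(x) = alpha*x.
alpha, eps, n = 1.0, 0.1, 2
rng = np.random.default_rng(0)
x0 = np.zeros((10_000, n))                      # 10,000 independent chains
x = ula(lambda x: alpha * x, x0, eps, 500, rng)
print("empirical variance:", x.var())           # ~ 1/(alpha*(1 - alpha*eps/2))
print("biased limit      :", 1 / (alpha * (1 - alpha * eps / 2)))  # = 1.0526...
```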
3.1 Convergence of KL divergence along ULA

Suppose ν = e^{−f} satisfies LSI with constant α > 0 and is L-smooth. If 0 < ε ≤ α/(4L²), then each ULA step contracts the KL divergence up to a discretization error term; iterating this one-step bound yields our main result.

Theorem 2. Assume ν = e^{−f} satisfies LSI with constant α > 0 and is L-smooth. ULA with step size 0 < ε ≤ α/(4L²) satisfies

    H_ν(ρ_k) ≤ e^{−αεk} H_ν(ρ_0) + 8εnL²/α.

Thus ULA converges exponentially fast to within an O(ε) bias of ν in KL divergence; the choice of ε and k to reach a target accuracy δ is as discussed after the statement of Theorem 2 in Section 1.

4 Review of Rényi divergence along Langevin dynamics

4.1 Rényi divergence

Recall the Rényi divergence of order q > 0, q ≠ 1, of ρ with respect to ν is

    R_{q,ν}(ρ) = (1/(q − 1)) log F_{q,ν}(ρ),   where F_{q,ν}(ρ) = E_ν[(ρ/ν)^q].

Rényi divergence is nonnegative and increasing in q, and as q → 1 it recovers the KL divergence: R_{1,ν}(ρ) = H_ν(ρ). Even between Gaussians, Rényi divergence of large order can be infinite, as the following example shows.

Example 2. Let ρ = N(0, σ²I) and ν = N(0, τ²I). For σ² > τ² and q ≥ σ²/(σ² − τ²), we have R_{q,ν}(ρ) = ∞. Otherwise,

    R_{q,ν}(ρ) = (n/2) log(τ²/σ²) − (n/(2(q − 1))) log(q − (q − 1)σ²/τ²).

The following is analogous to Lemma 1. We provide the proof of Lemma 4 in Appendix B.3.2.

Lemma 4. Suppose ν = e^{−f} is L-smooth. Let ρ = N(x*, (1/L) I), where x* is a stationary point of f. Then for all q ≥ 1,

    R_{q,ν}(ρ) ≤ f(x*) + (n/2) log(L/(2π)).
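The closed form in Example 2 can be sanity-checked numerically. The sketch below (ours, not from the paper) estimates F_{q,ν}(ρ) = E_ρ[(ρ/ν)^{q−1}] by Monte Carlo with a numerically stable log-mean-exp, and compares (1/(q − 1)) log F_{q,ν}(ρ) with the formula; as σ² approaches the threshold in Example 2, the estimator's variance blows up, reflecting the divergence becoming infinite.

```python
import numpy as np

def renyi_gaussian_isotropic(q, sigma2, tau2, n):
    """Closed form from Example 2 for rho = N(0, sigma2*I), nu = N(0, tau2*I)."""
    if sigma2 > tau2 and q >= sigma2 / (sigma2 - tau2):
        return np.inf
    return 0.5*n*np.log(tau2/sigma2) - 0.5*n/(q-1)*np.log(q - (q-1)*sigma2/tau2)

def renyi_monte_carlo(q, sigma2, tau2, n, num_samples=400_000, seed=1):
    """Estimate R_q via F_q = E_rho[(rho/nu)^(q-1)], sampling X ~ rho."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=np.sqrt(sigma2), size=(num_samples, n))
    sq = (x**2).sum(axis=1)
    log_ratio = 0.5*sq*(1/tau2 - 1/sigma2) + 0.5*n*np.log(tau2/sigma2)  # log(rho/nu)
    m = (q - 1) * log_ratio
    log_F = np.log(np.mean(np.exp(m - m.max()))) + m.max()  # stable log-mean-exp
    return log_F / (q - 1)

q, sigma2, tau2, n = 2.0, 0.8, 1.0, 5
print(renyi_gaussian_isotropic(q, sigma2, tau2, n))  # closed form: ~0.1021
print(renyi_monte_carlo(q, sigma2, tau2, n))         # should roughly agree
```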
4.1.1 Log-Sobolev inequality

For q > 0, we define the Rényi information of order q of ρ with respect to ν as

    G_{q,ν}(ρ) = E_ν[(ρ/ν)^q ‖∇ log(ρ/ν)‖²] = E_ν[(ρ/ν)^{q−2} ‖∇(ρ/ν)‖²] = (4/q²) E_ν[‖∇(ρ/ν)^{q/2}‖²].    (15)

The case q = 1 recovers the relative Fisher information (3): G_{1,ν}(ρ) = E_ρ[‖∇ log(ρ/ν)‖²] = J_ν(ρ). We have the following relation under the log-Sobolev inequality. Note the case q = 1 recovers LSI (4). We provide the proof of Lemma 5 in Appendix B.3.3.

Lemma 5. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. For all ρ,

    G_{q,ν}(ρ)/F_{q,ν}(ρ) ≥ (2α/q²) R_{q,ν}(ρ).    (16)

4.2 Langevin dynamics

Along the Langevin dynamics (7) for ν, we can compute the rate of change of the Rényi divergence.

Lemma 6. For all q > 0, along the Langevin dynamics (7),

    (d/dt) R_{q,ν}(ρ_t) = −q G_{q,ν}(ρ_t)/F_{q,ν}(ρ_t).    (17)

We provide the proof of Lemma 6 in Appendix B.3.4.
In particular, (d/dt) R_{q,ν}(ρ_t) ≤ 0, so Rényi divergence is always decreasing along the Langevin dynamics. Furthermore, analogous to how the Langevin dynamics is the gradient flow of KL divergence under the Wasserstein metric, one can also show that the Langevin dynamics is the gradient flow of Rényi divergence with respect to a suitably defined metric (which depends on q) on the space of distributions; see [13].

4.2.1 Convergence of Rényi divergence along Langevin dynamics under LSI

When ν satisfies LSI, Rényi divergence converges exponentially fast along the Langevin dynamics. Note the case q = 1 recovers the exponential convergence rate of KL divergence from Theorem 1.

Theorem 3. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. Along the Langevin dynamics,

    R_{q,ν}(ρ_t) ≤ e^{−2αt/q} R_{q,ν}(ρ_0).

We provide the proof of Theorem 3 in the Appendix.
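Given Lemmas 5 and 6, the heart of the proof of Theorem 3 is a short Grönwall argument; the following is our compact rendering of that step.

```latex
% Sketch: Theorem 3 from Lemma 6 (time derivative) and Lemma 5 (LSI).
\begin{align*}
\frac{d}{dt} R_{q,\nu}(\rho_t)
  &= -\, q\, \frac{G_{q,\nu}(\rho_t)}{F_{q,\nu}(\rho_t)}
     && \text{by (17)} \\
  &\le -\, q \cdot \frac{2\alpha}{q^2}\, R_{q,\nu}(\rho_t)
   = -\frac{2\alpha}{q}\, R_{q,\nu}(\rho_t)
     && \text{by (16)}.
\end{align*}
% Gronwall's inequality then yields
%   R_{q,\nu}(\rho_t) \le e^{-2\alpha t/q}\, R_{q,\nu}(\rho_0).
```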