
Estimating Convergence of Markov Chains with L-Lag Couplings

Niloy Biswas (Harvard University), Pierre E. Jacob (Harvard University), Paul Vanetti (University of Oxford, paul.vanetti@spc.ox.ac.uk)

Abstract

Markov chain Monte Carlo (MCMC) methods generate samples that are asymptotically distributed from a target distribution of interest as the number of iterations goes to infinity. Various theoretical results provide upper bounds on the distance between the target and the marginal distribution after a fixed number of iterations. These upper bounds are derived on a case-by-case basis and typically involve intractable quantities, which limits their use for practitioners. We introduce L-lag couplings to generate computable, non-asymptotic upper bound estimates for the total variation or the Wasserstein distance of general Markov chains. We apply L-lag couplings to the tasks of (i) determining MCMC burn-in, (ii) comparing different MCMC algorithms with the same target, and (iii) comparing exact and approximate MCMC. Lastly, we (iv) assess the bias of sequential Monte Carlo and self-normalized importance samplers.

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms generate Markov chains that are invariant with respect to probability distributions that we wish to approximate. Numerous works help in understanding the convergence of these chains to their invariant distributions, hereafter denoted by π. Denote by π_t the marginal distribution of the chain (X_t)_{t≥0} at time t. The discrepancy between π_t and π can be measured in different ways, typically with the total variation (TV) distance or the Wasserstein distance in the MCMC literature. Various results provide upper bounds on this distance, of the form C(π_0) f(t), where C(π_0) depends on the initial distribution π_0.

2 L-lag couplings

An L-lag coupling runs two chains (X_t) and (Y_t), each marginally generated by the same kernel K and jointly evolved, after an initial lag of L steps, by a coupled kernel K̄ that allows the chains to meet. We assume that the L-lag meeting time τ^(L) := inf{t ≥ L : X_t = Y_{t−L}} is almost surely finite. Finally, we assume that the chains remain faithful after meeting, i.e. X_t = Y_{t−L} for all t ≥ τ^(L). Various constructions for K̄ have been derived in the literature: for instance coupled Metropolis–Hastings and Gibbs kernels in [31, 29], coupled Hamiltonian Monte Carlo kernels in [36, 5, 26], and coupled particle Gibbs samplers in [9, 3, 28].

Algorithm 1: Sampling L-lag meeting times
Input: lag L ≥ 1, initial distribution π_0, single kernel K, and joint kernel K̄.
Output: meeting time τ^(L), and chains (X_t)_{0≤t≤τ^(L)}, (Y_t)_{0≤t≤τ^(L)−L}.
Initialize: generate X_0 ∼ π_0, then X_t | X_{t−1} ∼ K(X_{t−1}, ·) for t = 1, ..., L, and Y_0 ∼ π_0.
for t > L do
  Sample (X_t, Y_{t−L}) | (X_{t−1}, Y_{t−L−1}) ∼ K̄((X_{t−1}, Y_{t−L−1}), ·).
  if X_t = Y_{t−L} then return τ^(L) := t, and chains (X_t)_{0≤t≤τ^(L)}, (Y_t)_{0≤t≤τ^(L)−L}.
end
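As a rough illustration, Algorithm 1 can be sketched in Python with the single and joint kernels passed in as functions. The function and parameter names below (sample_meeting_time, max_iter) are ours, not from the paper, and the sketch assumes scalar (or hashable) states so that exact equality can be tested.

```python
import numpy as np

def sample_meeting_time(pi0, kernel, joint_kernel, L, max_iter=10**6, rng=None):
    """Algorithm 1: run an L-lag coupled pair of chains until they meet.

    pi0: function rng -> draw from the initial distribution
    kernel: function (state, rng) -> draw from K(state, .)
    joint_kernel: function (x, y, rng) -> (x', y'), a draw from the joint kernel K-bar
    Returns the meeting time tau and the two chains as lists.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = [pi0(rng)]
    for _ in range(L):                  # advance the X chain alone for L steps
        X.append(kernel(X[-1], rng))
    Y = [pi0(rng)]
    t = L
    while t < max_iter:
        t += 1
        xt, yt = joint_kernel(X[-1], Y[-1], rng)
        X.append(xt)                    # X[-1] now has index t
        Y.append(yt)                    # Y[-1] now has index t - L
        if xt == yt:                    # the chains have met: tau^(L) = t
            return t, X, Y
    raise RuntimeError("no meeting before max_iter")
```

Running this N times with a concrete coupled kernel yields the independent meeting times used to estimate the bounds below.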

We next introduce integral probability metrics (IPMs, e.g. [52]).

Definition 2.1 (Integral Probability Metric). Let H be a class of real-valued functions on a measurable space X. For all probability measures P, Q on X, the corresponding IPM is defined as

  d_H(P, Q) := sup_{h ∈ H} | E_{X∼P}[h(X)] − E_{X∼Q}[h(X)] |.   (1)

Common IPMs include the total variation distance d_TV, with H := {h : sup_{x∈X} |h(x)| ≤ 1/2}, and the 1-Wasserstein distance d_W, with H := {h : |h(x) − h(y)| ≤ d_X(x, y) for all x, y ∈ X}, where d_X is a metric on X [42]. Our proposed method applies to IPMs such that sup_{h∈H} |h(x) − h(y)| ≤ M_H(x, y) for all x, y ∈ X, for some computable function M_H on X × X. For d_TV we have M_H(x, y) = 1, and for d_W we have M_H(x, y) = d_X(x, y).

With a similar motivation for the assessment of sample approximations, and not restricted to the MCMC setting, [25] considers a restricted class of functions H to develop a specific measure of sample quality based on Stein's identity. [35, 10] combine Stein's identity with reproducing kernel Hilbert space theory to develop goodness-of-fit tests. [24] obtains further results and draws connections to the literature on couplings of Markov processes. Here we directly aim at upper bounds on the total variation and Wasserstein distances. The total variation distance controls the maximal difference between the masses assigned by π_t and π to any measurable set, and thus directly helps in assessing the error of histograms of the target marginals.

The 1-Wasserstein distance controls the error made on expectations of 1-Lipschitz functions, which with X = R^d and d_X(x, y) = ‖x − y‖_1 (the L1 norm on R^d) include all first moments.

2.1 Main results

We make the three following assumptions, similar to those of [29].

Assumption 2.2 (Marginal convergence and moments). For all h ∈ H, E[h(X_t)] → E_{X∼π}[h(X)] as t → ∞. Also, there exist η > 0 and D < ∞ such that E[|h(X_t)|^{2+η}] ≤ D for all h ∈ H and all t ≥ 0.

Assumption 2.3 (Geometric tails of the meeting time). The meeting time τ^(L) := inf{t ≥ L : X_t = Y_{t−L}} satisfies P(τ^(L) − L > t) ≤ C δ^t for all t ≥ 0, for some constants C < ∞ and δ ∈ (0, 1).

Assumption 2.4 (Faithfulness). The chains stay together after meeting: X_t = Y_{t−L} for all t ≥ τ^(L).

Theorem 2.5 (Upper bounds). Under Assumptions 2.2–2.4, for any L ≥ 1 and t ≥ 0,

  d_H(π_t, π) ≤ E[ Σ_{j=1}^{J(t)} M_H(X_{t+jL}, Y_{t+(j−1)L}) ],  where J(t) := max(0, ⌈(τ^(L) − L − t)/L⌉);

the terms with index j > ⌈(τ^(L) − L − t)/L⌉ are equal to zero by Assumption 2.4. The above reasoning highlights that increasing L leads to sharper bounds through the use of fewer triangle inequalities. A formal proof is given in the supplementary material. Theorem 2.5 gives the following bounds for d_TV and d_W:

  d_TV(π_t, π) ≤ E[ max(0, ⌈(τ^(L) − L − t)/L⌉) ],   (4)

  d_W(π_t, π) ≤ E[ Σ_{j=1}^{⌈(τ^(L)−L−t)/L⌉} d_X(X_{t+jL}, Y_{t+(j−1)L}) ].   (5)

For the total variation distance, the boundedness part of Assumption 2.2 is directly satisfied. For the 1-Wasserstein distance on R^d with d_X(x, y) = ‖x − y‖_1 (the L1 norm on R^d), the boundedness part is equivalent to a uniform bound on the (2 + η)-th moments of the marginal distributions for some η > 0. We emphasize that the proposed bounds can be estimated directly by running Algorithm 1 N times independently and using empirical averages.
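For instance, the empirical version of bound (4) is a short average over N simulated meeting times. The function below is our own illustration, not code from the paper:

```python
import math

def tv_upper_bound(meeting_times, L, t):
    """Empirical estimate of bound (4): the average of
    max(0, ceil((tau - L - t) / L)) over independent L-lag meeting times."""
    terms = [max(0, math.ceil((tau - L - t) / L)) for tau in meeting_times]
    return sum(terms) / len(terms)
```

Plotting t ↦ tv_upper_bound(taus, L, t) for t = 0, 1, 2, ... produces curves of the kind shown in Figures 3-5.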

All details of the MCMC algorithms and their couplings mentioned below are provided in the supplementary material.

2.2 Stylized examples

2.2.1 A univariate Normal

We consider a Normal example where we can compute the total variation and 1-Wasserstein distances (using the L1 norm on R throughout) exactly. The target π is N(0, 1) and the kernel K is that of a Normal random walk Metropolis–Hastings (MH) algorithm with step size σ_MH = 0.5. We set the initial distribution π_0 to be a point mass at 10. The joint kernel K̄ operates as follows. Given (X_{t−1}, Y_{t−L−1}), sample (X*, Y*) from a maximal coupling of p := N(X_{t−1}, σ_MH²) and q := N(Y_{t−L−1}, σ_MH²). This is done using Algorithm 2, which ensures X* ∼ p, Y* ∼ q, and P(X* ≠ Y*) = d_TV(p, q).

Algorithm 2: A maximal coupling of p and q
Sample X ∼ p and W ∼ U(0, 1).
if p(X) W ≤ q(X) then
  set Y = X and return (X, Y)
else
  sample Y* ∼ q and W* ∼ U(0, 1) until q(Y*) W* > p(Y*); set Y = Y* and return (X, Y).

Having obtained (X*, Y*), sample U ∼ U(0, 1); set X_t = X* if U ≤ π(X*)/π(X_{t−1}) and X_t = X_{t−1} otherwise, and likewise set Y_{t−L} = Y* if U ≤ π(Y*)/π(Y_{t−L−1}) and Y_{t−L} = Y_{t−L−1} otherwise, so that the two MH acceptance steps share the same uniform variable.
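Algorithm 2 translates almost line by line into code. The sketch below (our illustration, with our own names) couples two univariate Normals with equal standard deviations, matching the setting above:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def maximal_coupling_normals(mean_p, mean_q, sd, rng):
    """Algorithm 2 for p = N(mean_p, sd^2) and q = N(mean_q, sd^2):
    returns (X, Y) with X ~ p, Y ~ q and P(X != Y) = d_TV(p, q)."""
    X = rng.normal(mean_p, sd)
    W = rng.uniform()
    if normal_pdf(X, mean_p, sd) * W <= normal_pdf(X, mean_q, sd):
        return X, X                     # both chains propose the same point
    while True:                         # rejection step: sample Y from q away from p
        Ystar = rng.normal(mean_q, sd)
        Wstar = rng.uniform()
        if normal_pdf(Ystar, mean_q, sd) * Wstar > normal_pdf(Ystar, mean_p, sd):
            return X, Ystar
```

When mean_p = mean_q the coupling always returns X = Y; when the means are many standard deviations apart, the overlap of p and q is negligible and the proposals essentially never meet.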

This yields bounds involving P(T_c > t), which itself is small if t is a large quantile of the meeting time T_c. A limitation of this result is its reliance on the quantity r, which might be unknown or very close to one in challenging settings. Another difference is that we rely on pairs of lagged chains and tune the lag L, while the tuning parameter in [31] is the number of coupled chains c.

3 Experiments and applications

3.1 Ising model

We consider an Ising model, where the target is defined on a large discrete space, namely a square lattice with 32 × 32 sites (each site has 4 neighbors) and periodic boundaries. For a state x ∈ {−1, +1}^{32×32}, we define the target probability π(x) ∝ exp(β Σ_{i∼j} x_i x_j), where the sum is over all pairs i, j of neighboring sites. As β increases, the correlation between nearby sites increases and single-site Gibbs samplers are known to perform poorly [39].
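To make the sampler concrete: a single-site Gibbs update redraws one site from its conditional distribution given its four neighbors. The sketch below is our own illustration (not the paper's code), with periodic boundaries as in the text:

```python
import numpy as np

def ssg_sweep(x, beta, rng):
    """One single-site Gibbs sweep over a periodic Ising lattice x in {-1,+1}^(n x n).
    Each site is redrawn from its conditional:
    P(x_ij = +1 | neighbors) = 1 / (1 + exp(-2 * beta * s)), with s the neighbor sum."""
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            # sum of the 4 neighboring spins, with periodic (wrap-around) boundaries
            s = (x[(i - 1) % n, j] + x[(i + 1) % n, j]
                 + x[i, (j - 1) % n] + x[i, (j + 1) % n])
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x
```

For large β the conditional probabilities are close to 0 or 1, so a sweep rarely flips any site inside an aligned region, which is one way to see why SSG mixes slowly at β = 0.46.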

Difficulties in the assessment of the convergence of these samplers are in part due to the discrete nature of the state space, which limits the possibilities of visual diagnostics. Users might observe trace plots of one-dimensional statistics of the chains, such as x ↦ Σ_{i∼j} x_i x_j, and declare convergence when the statistic seems to stabilize; see [55, 60], where trace plots of summary statistics are used to monitor Markov chains. Here we compute the proposed upper bounds on the TV distance for two algorithms: a single-site Gibbs sampler (SSG) and a parallel tempering (PT) algorithm, where different chains target different inverse temperatures with SSG updates and regularly attempt to swap their states [22, 53]. The initial distribution assigns −1 and +1 with equal probability to each site independently. For β = 0.46, we obtain TV bounds for SSG using a lag L = 10^6 and N = 500 independent repeats. For PT we use 12 chains, each targeting an inverse temperature within an equispaced grid ranging from 0.3 to 0.46, a frequency of swap moves of 0.02, and a lag L = 2 × 10^4. The results are in Figure 3, where we see a plateau for the TV bounds on SSG and faster convergence for the TV bounds on PT. Our results are consistent with theoretical work on faster mixing times of PT targeting multimodal distributions including Ising models [59]. Note that the targets are different for the two algorithms, as PT operates on an extended space. The behavior of meeting times of coupled chains motivated by the "coupling from the past" algorithm [44] for Ising models has been studied, e.g., in [11].

Figure 3: Single-site Gibbs (SSG) versus parallel tempering (PT) for an Ising model; bounds on the total variation distance between π_t and π, for t up to 10^6 and inverse temperature β = 0.46.

3.2 Logistic regression

We next consider a target on a continuous state space, defined as the posterior distribution in a Bayesian logistic regression. Consider the German Credit data from [34]. There are n = 1000 binary responses (Y_i)_{i=1}^n ∈ {−1, 1}^n indicating whether individuals are creditworthy or not, and d = 49 covariates x_i ∈ R^d for each individual i. The logistic regression model states P(Y_i = y_i | x_i) = (1 + e^{−y_i x_i^T β})^{−1}, with a Normal prior β ∼ N(0, 10 I_d). We can sample from the posterior using Hamiltonian Monte Carlo (HMC, [40]) or the Pólya-Gamma Gibbs sampler (PG, [43]). The former involves tuning parameters ε_HMC and S_HMC, corresponding to a step size and a number of steps in a leapfrog integration scheme performed at every iteration. We can use the proposed bounds to compare the convergence associated with HMC for different ε_HMC, S_HMC, and with the PG sampler. Figure 4 shows the total variation bounds for HMC with ε_HMC = 0.025 and S_HMC = 4, 5, 6, 7, and the corresponding bound for the parameter-free PG sampler, both starting from π_0 = N(0, 10 I_d). In this example, the bounds are smaller for the PG sampler than for all HMC samplers under consideration. We emphasize that the HMC tuning parameters associated with the fastest convergence to stationarity might not necessarily be optimal in terms of asymptotic variance of ergodic averages of functions of interest; see related discussions in [26]. Also, since the proposed upper bounds are not tight, the true convergence rates of the Markov chains under consideration may be ordered differently.
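For readers unfamiliar with the (ε_HMC, S_HMC) parametrization, a leapfrog scheme with step size eps and S steps can be sketched as follows. This is a generic sketch with our own names, not the paper's implementation; grad_log_post stands for the gradient of the log posterior density:

```python
import numpy as np

def leapfrog(theta, momentum, grad_log_post, eps, S):
    """S leapfrog steps of Hamiltonian dynamics with step size eps."""
    theta = np.copy(theta)
    momentum = momentum + 0.5 * eps * grad_log_post(theta)        # initial half step
    for s in range(S):
        theta = theta + eps * momentum                            # full position step
        g = grad_log_post(theta)
        momentum = momentum + (eps if s < S - 1 else 0.5 * eps) * g  # full or final half step
    return theta, momentum
```

An HMC iteration draws a fresh momentum from N(0, I_d), runs leapfrog, and accepts the endpoint with a Metropolis-Hastings correction; the integrator is time-reversible, which is what makes that correction valid.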

The proposed upper bounds still allow a comparison of how confident we can be about the bias of different MCMC algorithms after a fixed number of iterations.

Figure 4: Proposed upper bounds on d_TV(π_t, π) for a Pólya-Gamma Gibbs sampler and for Hamiltonian Monte Carlo on a 49-dimensional posterior distribution in a logistic regression model. For HMC the step size is ε_HMC = 0.025 and the number of steps is S_HMC = 4, 5, 6, 7.

3.3 Comparison of exact and approximate MCMC algorithms

In various settings, approximate MCMC methods trade off asymptotic unbiasedness for gains in computational speed, e.g. [30, 50, 14]. We compare an approximate MCMC method (the unadjusted Langevin algorithm, ULA) with its exact counterpart (the Metropolis-adjusted Langevin algorithm, MALA) in various dimensions. Our target is a multivariate Normal: π = N(0, Σ), where Σ_{i,j} = 0.5^{|i−j|} for 1 ≤ i, j ≤ d. Both MALA and ULA chains start from π_0 = N(0, I_d), and have step sizes of d^{−1/6} and 0.1 d^{−1/6} respectively. The step sizes are linked to an optimal scaling result of [47], and the 0.1 multiplicative factor for ULA ensures that the limiting distribution of ULA is close to π (see [13]). We can use couplings to study the mixing times t_mix(ε) of the two algorithms, where t_mix(ε) := inf{k ≥ 0 : d_TV(π_k, π) < ε}. Figure 5 highlights how the dimension impacts the estimated upper bounds on the mixing time t_mix(0.25), calculated as inf{k ≥ 0 : Ê[max(0, ⌈(τ^(L) − L − k)/L⌉)] < 0.25}, where Ê denotes an empirical average. The results are consistent with the theoretical analysis in [18].
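The estimated mixing time is then a simple scan over the empirical bound curve. The helper below is our own, reusing the empirical version of bound (4):

```python
import math

def estimated_mixing_time(meeting_times, L, eps=0.25, k_max=10**6):
    """Smallest k at which the empirical bound on d_TV(pi_k, pi) drops below eps:
    the average of max(0, ceil((tau - L - k)/L)) over simulated L-lag meeting times."""
    for k in range(k_max + 1):
        bound = sum(max(0, math.ceil((tau - L - k) / L))
                    for tau in meeting_times) / len(meeting_times)
        if bound < eps:
            return k
    return None  # bound never dropped below eps within k_max iterations
```

Since the bound is non-increasing in k, the first crossing is the estimate; a binary search would also work for long runs.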

For a strongly log-concave target such as N(0, Σ), Table 2 of [18] indicates mixing time upper bounds of order O(d) and O(d²) for ULA and MALA respectively (with a non-warm start centered at the unique mode of the target). In comparison to the theoretical studies in [13, 18], our bounds can be directly estimated by simulation. On the other hand, the bounds in [13, 18] are more explicit about the impact of different aspects of the problem, including dimension, step size, and features of the target.

Figure 5: Mixing time bounds for ULA and MALA targeting a multivariate Normal distribution, as a function of the dimension. The mixing time t_mix(0.25) denotes the first iteration t for which the estimated TV distance between π_t and π is less than 0.25.

4 Assessing the bias of sequential Monte Carlo samplers

Lastly, we consider the bias associated with samples generated by sequential Monte Carlo (SMC) samplers [16]; the bias of self-normalized importance samplers can be treated similarly. Let (w_n, θ_n)_{n=1}^N be the weighted sample from an SMC sampler with N particles targeting π, and let q(N) be the marginal distribution of a particle θ sampled among (θ_n)_{n=1}^N with probabilities (w_n)_{n=1}^N. Our aim is to upper bound a distance between q(N) and π for a fixed N. We denote by Ẑ the normalizing constant estimator generated by the SMC sampler. The particle independent MH algorithm (PIMH, [2]) operates as an independent MH algorithm using SMC samplers as proposals. Let (Ẑ_t)_{t≥0} be the normalizing constant estimates from a PIMH chain. Consider an L-lag coupling of a pair of such PIMH chains as introduced in [38], initializing the chains by running an SMC sampler. Here τ^(L) is constructed so that it can be equal to L with positive probability; more precisely,

  τ^(L) − (L − 1) | Ẑ_{L−1} ∼ Geometric(α(Ẑ_{L−1})),   (6)

where α(Z) := E[min(1, Ẑ/Z)], with Ẑ an independent copy of the normalizing constant estimator, is the average acceptance probability of PIMH from a state with normalizing constant estimate Z; see [38, Proposition 8] for a formal statement in the case of 1-lag couplings. With this insight, we can bound the TV distance between the target and particles generated by SMC samplers, using Theorem 2.5 applied with t = 0. Details are in the supplementary material. We obtain

  d_TV(q(N), π) ≤ E[ max(0, ⌈(τ^(L) − L)/L⌉) ] = E[ (1 − α(Ẑ_{L−1})) / (1 − (1 − α(Ẑ_{L−1}))^L) ].   (7)

The bound in (7) depends only on the distribution of the normalizing constant estimator Ẑ, and can be estimated using independent runs of the SMC sampler. We can also estimate the distribution of Ẑ from a single SMC sampler by appealing to large-N asymptotic results such as [4], combined with asymptotically valid variance estimators such as [33]. As N goes to infinity, we expect α(Ẑ_{L−1}) to approach one and the proposed upper bound to go to zero.
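Given draws of the normalizing constant estimator from independent SMC runs, both α(Z) and bound (7) can be estimated by plain Monte Carlo. The sketch below is our illustration, not the paper's code; zhat_draws stands for simulated realizations of Ẑ, and for simplicity the same draws are reused to estimate α at each point:

```python
import numpy as np

def alpha(z, zhat_draws):
    """Monte Carlo estimate of alpha(z) = E[min(1, Zhat / z)],
    the average PIMH acceptance rate from a state with estimate z."""
    return float(np.mean(np.minimum(1.0, zhat_draws / z)))

def smc_bias_bound(zhat_draws, L):
    """Estimate of bound (7): the average over draws z of
    (1 - alpha(z)) / (1 - (1 - alpha(z))**L)."""
    terms = []
    for z in zhat_draws:
        one_minus = 1.0 - alpha(z, zhat_draws)
        terms.append(one_minus / (1.0 - one_minus ** L))
    return float(np.mean(terms))
```

If the estimator is perfect (all draws equal), α(z) = 1 everywhere and the bound is exactly zero; increasing the lag L shrinks the bound, in line with the discussion after Theorem 2.5.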

The proposed bound aligns with the common practice of considering the variance of Ẑ as a measure of the global performance of SMC samplers. Existing TV bounds for particle approximations, such as those in [15, Chapter 8] and [27], are more informative qualitatively but harder to approximate numerically. The result also applies to self-normalized importance samplers (see [46, Chapter 3] and [41, Chapter 8]). In that case, [1, Theorem 2.1] shows d_TV(q(N), π) ≤ 6ρN^{−1} for ρ = E_q[w(θ)²] / E_q[w(θ)]², with w the importance sampling weight function, which is a simpler and more informative bound; see also [8] for related results and concentration inequalities.

5 Discussion

The proposed method can be used to obtain guidance on the choice of burn-in, to compare different MCMC algorithms targeting the same distribution, and to compare mixing times of approximate and exact MCMC methods. The main requirement for the application of the method is the ability to generate coupled Markov chains that can meet exactly after a random but finite number of iterations. The couplings employed here, and described in the supplementary material, are not optimal in any way. As the couplings are algorithm-specific and not target-specific, they can potentially be added to statistical software such as PyMC3 [51] or Stan [7]. The bounds are not tight, in part because the couplings are not maximal [54], but experiments suggest that they can be practical. The proposed bounds go to zero as t increases, making them informative at least for large enough t. The combination of time lags and couplings of more than two chains as in [31] could lead to new diagnostics. Further research might also complement the proposed upper bounds with lower bounds, obtained by considering specific functions among the classes of functions used to define the integral probability metrics.

Acknowledgments. The authors are grateful to Espen Bernton, Nicolas Chopin, Andrew Gelman, Lester Mackey, John O'Leary, Christian Robert, Jeffrey Rosenthal, James Scott, Aki Vehtari, and the reviewers for helpful comments on an earlier version of the manuscript. The second author gratefully acknowledges support from the National Science Foundation through awards DMS-1712872 and DMS-1844695. The figures were created with the packages [58, 57] in R [45].

References

[1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: intrinsic dimension and computational cost. Statistical Science, 32(3):405-431, 2017.
[2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342, 2010.
[3] Christophe Andrieu, Anthony Lee, and Matti Vihola. Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. Bernoulli, 24(2):842-872, 2018.
[4] Jean Bérard, Pierre Del Moral, and Arnaud Doucet. A lognormal central limit theorem for particle approximations of normalizing constants. Electronic Journal of
