Convergence Guarantees for Adaptive Bayesian Quadrature Methods

Motonobu Kanagawa∗,# and Philipp Hennig#
∗EURECOM, Sophia Antipolis, France
#University of Tübingen and Max Planck Institute for Intelligent Systems, Tübingen, Germany
motonobu.kanagawa@eurecom.fr

[...] see (4) for details. Adaptive methods have also been theoretically studied in the information-based complexity literature [23, 24, 25, 26]. The key result is that optimal points for quadrature can be obtained without observing actual function values if the hypothesis class of functions is symmetric and convex (e.g. the unit ball in a Hilbert space): in this case adaptation does not help improve the performance. On the other hand, if the hypothesis class is either asymmetric or nonconvex, then adaptation may be helpful. For instance, a class of positive functions is asymmetric, because only one of f and −f can be positive. These results thus support the choice of acquisition functions of existing ABQ methods, where the adaptivity to function values is motivated by modeling the positivity of the integrand.

Notation. ℕ denotes the set of positive integers, ℝ the real line, and ℝ^d the d-dimensional Euclidean space for d ∈ ℕ. L_p(Ω) for 1 ≤ p ≤ ∞ [...]

[...] f(x) > 0 for all x ∈ Ω. Such a constraint is modeled by considering a certain transformation T : ℝ → ℝ and assuming that there exists a latent function g : Ω → ℝ such that the integrand f is given as the transformation of g, i.e., f(x) = T(g(x)), x ∈ Ω. Examples of T for modeling positivity include i) the square transformation T(y) = α + ½y², where α ≥ 0 is a small constant such that [...]

[...] the GP posterior covariance

    k_{X_n}(x, x′) = k(x, x′) − k_n(x)^⊤ K_n^{−1} k_n(x′),   (3)

where k_n(x) := (k(x, x_1), …, k(x, x_n))^⊤ ∈ ℝ^n, g_n := (g(x_1), …, g(x_n))^⊤ ∈ ℝ^n and m_n := (m(x_1), …, m(x_n))^⊤ ∈ ℝ^n. Then a quadrature estimate³ for the integral ∫ f(x)π(x)dν(x) is given as the integral ∫ T(m_{g,X_n}(x))π(x)dν(x) of the transformed posterior mean function T(m_{g,X_n}), or as the integral ∫ E_ḡ[T(ḡ(x))]π(x)dν(x) of the posterior expectation of the transformation, where ḡ ∼ GP(m_{g,X_n}, k_{X_n}) is the posterior GP. The posterior covariance for ∫ f(x)π(x)dν(x) is given similarly; see [9, 16] for details.
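As a concrete illustration of these estimators, here is a minimal NumPy sketch (not from the paper; the kernel and its bandwidth, the latent function g, the warping T(y) = α + ½y², the zero prior mean and the uniform density π are all arbitrary example choices) that computes the posterior mean m_{g,X_n} and variance k_{X_n}(x,x) and forms the estimate ∫ T(m_{g,X_n}(x))π(x)dν(x) by Monte Carlo averaging over samples from π:

```python
import numpy as np

def k(x, y, ell=0.3):
    """Square-exponential covariance kernel (an arbitrary example choice)."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

def posterior(X, gX, m=lambda x: np.zeros_like(x)):
    """GP posterior mean m_{g,X_n} and variance k_{X_n}(x,x), in the style of (3)."""
    K = k(X, X) + 1e-10 * np.eye(len(X))   # jitter for numerical invertibility
    Kinv = np.linalg.inv(K)
    def mean(x):
        return m(x) + k(x, X) @ Kinv @ (gX - m(X))
    def var(x):
        kn = k(x, X)
        return k(x, x).diagonal() - np.einsum("ij,jk,ik->i", kn, Kinv, kn)
    return mean, var

T = lambda y: 0.01 + 0.5 * y ** 2          # square transformation, f = T(g)
g = lambda x: np.sin(2 * np.pi * x)        # hypothetical latent function

X = np.linspace(0.0, 1.0, 8)               # design points X_n
mean, var = posterior(X, g(X))

# Estimate of the integral of T(m_{g,X_n}) against pi = Uniform[0,1],
# approximated by averaging over samples from pi.
xs = np.random.default_rng(0).uniform(0.0, 1.0, 20000)
estimate = np.mean(T(mean(xs)))
```

As footnote 3 notes, for T(y) = α + ½y² with Gaussian k and π this integral is available in closed form [16]; the Monte Carlo average above is only a stand-in to keep the sketch self-contained.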

2.2 A Generic Form of Acquisition Functions

The key remaining question is how to select good design points x_1, …, x_n. ABQ methods sequentially and deterministically generate x_1, …, x_n using an acquisition function. Many of the acquisition functions can be formulated in the following generic form:

    x_{ℓ+1} ∈ argmax_{x∈Ω} a_ℓ(x),  where a_ℓ(x) = F(q²(x) k_{X_ℓ}(x,x)) b_ℓ(x),  (ℓ = 0, 1, …, n−1)   (4)

where k_{X_0}(x,x) := k(x,x), F : [0,∞) → [0,∞) is an increasing function such that F(0) = 0, q : Ω → (0,∞), and b_ℓ : Ω → ℝ is a function that may change at each iteration ℓ; e.g., it may depend on the function values f(x_1), …, f(x_ℓ) of the target integrand f. Intuitively, b_ℓ(x) is a data-dependent term that makes the point selection adaptive to the target integrand, q(x) may be seen as a proposal density in importance sampling, and F determines the balance between the uncertainty-sampling part q²(x)k_{X_ℓ}(x,x) and the adaptation term b_ℓ(x). We analyse ABQ with this generic form (4), aiming for results with wide applicability. Here are some representative choices.

Warped Sequential Active Bayesian Integration (WSABI) [16]: Gunter et al. [16] employ the square transformation f(x) = T(g(x)) = α + ½g²(x) with two acquisition functions: i) WSABI-L [16, Eq. 15], which is based on a linearization of T and is recovered with F(y) = y, q(x) = π(x) and b_ℓ(x) = m²_{g,X_ℓ}(x); and ii) WSABI-M [16, Eq. 14], the one based on moment matching, given by F(y) = y, q(x) = π(x) and b_ℓ(x) = ½k_{X_ℓ}(x,x) + m²_{g,X_ℓ}(x).

Moment-Matched Log-Transformation (MMLT) [9]: Chai and Garnett [9, 3rd row in Table 1] use the exponential transformation f(x) = T(g(x)) = exp(g(x)) with the acquisition function given by F(y) = exp(y) − 1, q(x) = 1 and b_ℓ(x) = exp(k_{X_ℓ}(x,x) + 2m_{g,X_ℓ}(x)).

Variational Bayesian Monte Carlo (VBMC) [1, 2]: Acerbi [2, Eq. 2] uses the identity f(x) = T(g(x)) = g(x) with the acquisition function given by F(y) = y^{β₁}, q(x) = 1 and b_ℓ(x) = π_ℓ^{β₂}(x) exp(β₃ m_{g,X_ℓ}(x)), where π_ℓ is the variational posterior at the ℓ-th iteration and β₁, β₂, β₃ > 0 are constants; setting β₁ = β₂ = β₃ = 1 recovers the original acquisition function [1, Eq. 9]. Acerbi [1, Sec. 2.1] considers an integrand f that is defined as the logarithm of a joint density, while π is an intractable posterior that is gradually approximated by the variational posteriors π_ℓ.

For the WSABI and MMLT, the acquisition function (4) is obtained by a certain approximation of the posterior variance of the integral ∫ f(x)π(x)dν(x) = ∫ T(g(x))π(x)dν(x); thus this is a form of uncertainty sampling. Such an approximation is needed because the posterior variance of the integral is not available in closed form, due to the nonlinear transformation T. The resulting acquisition function includes the data-dependent term b_ℓ(x), which encourages exploration in regions where the value of g(x) is expected to be large. This makes ABQ methods adaptive to the target integrand. Alas, it also complicates analysis: there has thus been no convergence guarantee for these ABQ methods, which is what we aim to remedy in this paper.

³ The point is that, in contrast to the integral over f, this estimate should be analytically tractable. This depends on the choices for T, k and π. For instance, for T(y) = y or T(y) = α + ½y² with k and π Gaussian, the estimate can be obtained analytically [16], while for T(y) = exp(y) one needs approximations; cf. [9].
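To make the generic rule (4) concrete, here is a minimal sketch of the sequential selection on a finite candidate grid (illustrative only: the grid stands in for the argmax over Ω, the kernel and the latent function g are arbitrary test choices, and the plug-ins follow the WSABI-L pattern listed above, F(y) = y, q = π and b_ℓ = m²_{g,X_ℓ}, with π uniform):

```python
import numpy as np

def k(x, y, ell=0.25):
    # square-exponential kernel (an example choice)
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

def posterior_mean_var(X, gX, x):
    # zero-mean GP posterior mean m_{g,X_l}(x) and variance k_{X_l}(x,x), cf. (3)
    K = k(X, X) + 1e-10 * np.eye(len(X))
    kn = k(x, X)
    mean = kn @ np.linalg.solve(K, gX)
    var = k(x, x).diagonal() - np.einsum("ij,ji->i", kn, np.linalg.solve(K, kn.T))
    return mean, var

# WSABI-L-style plug-ins for the generic acquisition (4):
F = lambda y: y                            # F(y) = y
q = lambda x: np.ones_like(x)              # q = pi, taken uniform on [0,1] here
b = lambda mean: mean ** 2                 # b_l(x) = m_{g,X_l}(x)^2

g = lambda x: 1.0 + np.sin(2 * np.pi * x)  # hypothetical latent function
grid = np.linspace(0.0, 1.0, 401)          # finite candidate set standing in for Omega

X = np.array([0.5])                        # one initial design point
for _ in range(7):
    mean, var = posterior_mean_var(X, g(X), grid)
    a = F(q(grid) ** 2 * var) * b(mean)    # a_l(x) = F(q^2(x) k_{X_l}(x,x)) b_l(x)
    X = np.append(X, grid[np.argmax(a)])   # x_{l+1} in argmax_x a_l(x)
```

Because the posterior variance collapses at already-selected points, the acquisition vanishes there and the rule keeps exploring, exactly the interplay between the uncertainty-sampling part and the adaptation term b_ℓ described above.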

2.3 Bounding the Quadrature Error with Transformation

Our first result, which may be of independent interest, is an upper bound on the error of the quadrature estimate ∫ T(m_{g,X_n}(x))π(x)dν(x) based on a transformation described in Sec. 2.1. It is applicable to any point set X_n = {x_1, …, x_n} ⊂ Ω, and the bound is given in terms of the posterior variance k_{X_n}(x,x). This gives us a motivation to study the behavior of this quantity for x_1, …, x_n generated by ABQ (4) in the later sections. Note that essentially the same bound holds for the other estimator ∫ E_ḡ[T(ḡ(x))]π(x)dν(x) with ḡ ∼ GP(m_{g,X_n}, k_{X_n}), which we describe in Appendix A.2.

To state the result, we need to introduce the reproducing kernel Hilbert space (RKHS) of the covariance kernel k of the GP prior. See e.g. [35, 36] for details on RKHSs, and [6, 18] for discussions of their close but subtle relation to the GP notion. Let H_k be the RKHS associated with the covariance kernel k of the GP prior (1), with ⟨·,·⟩_{H_k} and ‖·‖_{H_k} being its inner product and norm, respectively. H_k is a Hilbert space consisting of functions on Ω such that i) k(·,x) ∈ H_k for all x ∈ Ω, and ii) h(x) = ⟨k(·,x), h⟩_{H_k} for all h ∈ H_k and x ∈ Ω (the reproducing property), where k(·,x) denotes the function of the first argument, y ↦ k(y,x), with x being fixed. As a set of functions, H_k is given as the closure of the linear span of such functions k(·,x), i.e., H_k = span{k(·,x) | x ∈ Ω}, meaning that any h ∈ H_k can be written as h = Σ_{i=1}^∞ α_i k(·, y_i) for some (α_i)_{i=1}^∞ ⊂ ℝ and (y_i)_{i=1}^∞ ⊂ Ω such that ‖h‖²_{H_k} = Σ_{i,j=1}^∞ α_i α_j k(y_i, y_j) < ∞. We are now ready to state our assumption:

Assumption 1. T : ℝ → ℝ is continuously differentiable. For f : Ω → ℝ, there exists g : Ω → ℝ such that f(x) = T(g(x)), x ∈ Ω, and such that g̃ := g − m ∈ H_k. It holds that ‖k‖_{L∞(Ω)} := sup_{x∈Ω} k(x,x) < ∞ and ‖m‖_{L∞(Ω)} := sup_{x∈Ω} |m(x)| < ∞.

The assumption g̃ := g − m ∈ H_k is common in theoretical analyses of standard BQ methods, where T(y) = y and m = 0; see e.g. [7, 40, 8] and references therein. This assumption may be weakened by using proof techniques developed for standard BQ in the misspecified setting [19, 20], but we leave this for future work. The other conditions on T, k and m are weak.

Proposition 2.1 (proof in Appendix A.1). Let Ω be a compact metric space, X_n = {x_1, …, x_n} ⊂ Ω be such that the kernel matrix K_n = (k(x_i, x_j))_{i,j=1}^n ∈ ℝ^{n×n} is invertible, and π : Ω → [0,∞) and q : Ω → (0,∞) be continuous functions such that C_{π/q} := ∫ π(x)/q(x) dν(x) < ∞. Suppose that Assumption 1 is satisfied. Then there exists a constant C_{g̃,m,k,T} depending only on g̃, m, k and T such that

    | ∫ f(x)π(x)dν(x) − ∫ T(m_{g,X_n}(x))π(x)dν(x) | ≤ C_{g̃,m,k,T} C_{π/q} ‖g̃‖_{H_k} sup_{x∈Ω} q(x)√(k_{X_n}(x,x)).

Prop. 2.1 shows that, to establish convergence guarantees for ABQ methods, it is sufficient to analyze the convergence behavior of the quantity sup_{x∈Ω} q(x)√(k_{X_n}(x,x)) for points X_n = {x_1, …, x_n} generated from (4). This is what we focus on in the remainder.

3 Connections to Weak Greedy Algorithms in Hilbert Spaces

To analyze the quantity sup_{x∈Ω} q(x)√(k_{X_n}(x,x)) for points X_n = {x_1, …, x_n} generated from ABQ (4), we show here that ABQ can be interpreted as a certain weak greedy algorithm studied by DeVore et al. [13]. To describe this, let H be a (generic) Hilbert space and C ⊂ H be a compact subset. To define some notation, let h_1, …, h_n ∈ C be given. Denote by S_n := span(h_1, …, h_n) = {Σ_{i=1}^n α_i h_i | α_1, …, α_n ∈ ℝ} ⊂ H the linear subspace spanned by h_1, …, h_n. For a given h ∈ C, let dist(h, S_n) be the distance between h and S_n, defined by

    dist(h, S_n) := inf_{g∈S_n} ‖h − g‖_H = inf_{α_1,…,α_n∈ℝ} ‖h − Σ_{i=1}^n α_i h_i‖_H,

where ‖·‖_H denotes the norm of H. Geometrically, this is the distance between h and its orthogonal projection onto the subspace S_n. The task considered in [13] is to select h_1, …, h_n ∈ C such that the worst-case error in C, defined by

    e_n(C) := sup_{h∈C} dist(h, S_n),   (5)

becomes as small as possible: h_1, …, h_n ∈ C are to be chosen to approximate well the set C. The following weak greedy algorithm is considered in DeVore et al. [13]. Let γ be a constant such that 0 < γ ≤ 1 [...]

[...] Suppose that q(x) > 0 holds for all x ∈ Ω. Then for all x ∈ Ω we have

    q²(x) k_{X_n}(x,x) = dist²(h_x, S_n),   (7)

where k_{X_n}(x,x) is the GP posterior variance function given by (3). Moreover, we have

    e_n(C_{k,q}) = sup_{x∈Ω} q(x)√(k_{X_n}(x,x)),   (8)

where e_n(C_{k,q}) is the worst-case error defined by (5) with C := C_{k,q} and S_n defined here. Equation (8) of Lemma 3.1 suggests that we can analyze the convergence properties of sup_{x∈Ω} q(x)√(k_{X_n}(x,x)) for X_n = {x_1, …, x_n} generated from the ABQ rule (4) by analyzing those of the worst-case error e_n(C_{k,q}) for the corresponding elements h_{x_1}, …, h_{x_n}, where h_{x_i} := q(x_i) k(·, x_i).
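Identity (7) is easy to verify numerically: with h_x := q(x)k(·,x), both ⟨h_{x_i}, h_{x_j}⟩_{H_k} and ⟨h_{x_i}, h_x⟩_{H_k} reduce to kernel evaluations via the reproducing property, so dist²(h_x, S_n) is computable from Gram matrices. The sketch below (an illustration; the kernel, the weight q and the points are arbitrary choices) compares it against q²(x)k_{X_n}(x,x):

```python
import numpy as np

k = lambda x, y, ell=0.4: np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))
q = lambda x: 1.0 + 0.5 * np.cos(3.0 * np.asarray(x))   # an arbitrary positive weight

X = np.array([0.1, 0.35, 0.6, 0.9])   # design points x_1, ..., x_n
x = 0.47                              # an arbitrary test location

# dist^2(h_x, S_n) for h_x = q(x) k(.,x) and S_n = span{q(x_i) k(.,x_i)}:
# ||h_x - sum_i a_i h_{x_i}||^2 is minimized at a = G^{-1} c, giving
# dist^2 = <h_x, h_x> - c^T G^{-1} c, where G_ij = <h_{x_i}, h_{x_j}> and
# c_i = <h_{x_i}, h_x> follow from <k(.,u), k(.,v)> = k(u, v).
G = np.outer(q(X), q(X)) * k(X, X)
c = q(x) * q(X) * k(X, np.array([x]))[:, 0]
dist2 = q(x) ** 2 * k(x, x) - c @ np.linalg.solve(G, c)

# q^2(x) k_{X_n}(x,x) via the GP posterior variance (3):
kn = k(np.array([x]), X)[0]
post_var = k(x, x) - kn @ np.linalg.solve(k(X, X), kn)
lhs = q(x) ** 2 * post_var
# dist2 and lhs agree up to numerical round-off
```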

Adaptive Bayesian Quadrature as a Weak Greedy Algorithm. We now show that ABQ (4) gives a weak greedy approximation of the compact set C_{k,q} in the RKHS H_k in the sense of (6). We summarize the required conditions in Assumptions 3 and 4. As mentioned in Sec. 1, Assumption 3 is the crucial one: its implications for certain specific ABQ methods will be discussed in Sec. 4.2.

Assumption 3 (Weak Adaptivity Condition). There are constants C_L, C_U > 0 such that C_L ≤ b_ℓ(x) ≤ C_U holds for all x ∈ Ω and all ℓ ∈ ℕ ∪ {0}.

Intuitively, this condition forces ABQ not to focus overly on a specific local region in Ω and to explore the entire domain. For instance, consider the following two situations in which Assumption 3 does not hold: (a) b_ℓ(x) → +0 as ℓ → ∞ for some local region x ∈ A ⊂ Ω, while b_ℓ(x) remains bounded from below for x ∉ A; (b) b_ℓ(x) → +∞ as ℓ → ∞ for some local region x ∈ B ⊂ Ω, while b_ℓ(x) remains bounded from above for x ∉ B. In case (a), ABQ will not allocate any points to the region A at all after a finite number of iterations. Thus, information about the integrand f on the region A will not be obtained after a finite number of evaluations, which makes it difficult to guarantee the consistency of quadrature, unless f has a finite degree of freedom on A. Similarly, in case (b), ABQ will, after a finite number of iterations, generate points only in the region B and no points in the rest of the domain Ω \ B. Assumption 3 prevents such problematic situations from occurring.

Assumption 4. F : [0,∞) → [0,∞) is increasing and continuous, and F(0) = 0. For any 0 < c ≤ 1, there is a constant 0 < ω(c) ≤ 1 such that F^{−1}(c y) ≥ ω(c) F^{−1}(y) holds for all y ≥ 0.

For instance, if F(y) = y^β for some β > 0, then F^{−1}(y) = y^{1/β} and thus we have ω(c) = c^{1/β} for 0 < c ≤ 1; this covers F(y) = y for the WSABI [16] and F(y) = y^{β₁} with β₁ > 0 for the VBMC [1, 2]. If F(y) = exp(y) − 1, as in the MMLT [9], we have F^{−1}(y) = log(y + 1), and it can be shown that ω(c) = c works for 0 < c ≤ 1; see Appendix B.2. Note that in Assumption 4 the inverse F^{−1} is well defined, since F is increasing and continuous.
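The two example families for F in Assumption 4 can be checked numerically on a grid (an illustration only; the grids for c and y, and the tested values of β, are arbitrary):

```python
import numpy as np

c_grid = np.linspace(0.01, 1.0, 50)
y_grid = np.linspace(0.0, 20.0, 200)

# F(y) = y^beta: F^{-1}(y) = y^{1/beta}, and omega(c) = c^{1/beta} works
# (with equality, since (c*y)^{1/beta} = c^{1/beta} * y^{1/beta}).
power_ok = all(
    np.all((c * y_grid) ** (1 / beta) >= c ** (1 / beta) * y_grid ** (1 / beta) - 1e-12)
    for beta in (0.5, 1.0, 3.0) for c in c_grid
)

# F(y) = exp(y) - 1: F^{-1}(y) = log(1 + y), and omega(c) = c works,
# i.e. log(1 + c*y) >= c * log(1 + y), which holds by concavity of the logarithm.
exp_ok = all(np.all(np.log1p(c * y_grid) >= c * np.log1p(y_grid) - 1e-12) for c in c_grid)
```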

In our analysis, we allow the point selection procedure of ABQ itself to be "weak," in the sense that the optimization problem in (4) may be solved only approximately.⁴ That is, for a constant 0 < κ ≤ 1 we assume that the points x_1, …, x_n satisfy

    a_ℓ(x_{ℓ+1}) ≥ κ max_{x∈Ω} a_ℓ(x),  (ℓ = 0, 1, …, n−1).   (9)

The case κ = 1 amounts to exactly solving the global optimization problem of ABQ (4).

The following lemma guarantees that we can assume without loss of generality that the kernel matrix K_n for the points x_1, …, x_n generated from ABQ (4) is invertible under the assumptions above, since otherwise sup_{x∈Ω} k_{X_ℓ}(x,x) = 0 holds, implying that the quadrature error is 0 by Prop. 2.1. This guarantees the applicability of Lemma 3.1 for points generated from ABQ (4).

Lemma 3.2 (proof in Appendix B.3). Suppose that Assumptions 2, 3 and 4 are satisfied. For a constant 0 < κ ≤ 1, assume that x_1, …, x_n are generated by a κ-weak version of ABQ (4), i.e., that (9) is satisfied. Then either one of the following holds: i) the kernel matrix K_ℓ = (k(x_i, x_j))_{i,j=1}^ℓ ∈ ℝ^{ℓ×ℓ} is invertible for all ℓ = 1, …, n; or ii) there exists some ℓ = 1, …, n such that sup_{x∈Ω} k_{X_ℓ}(x,x) = 0.

Lemma 3.1 leads to the following theorem, which establishes a connection between ABQ and weak greedy algorithms.

Theorem 3.3 (proof in Appendix B.4). Suppose that Assumptions 2, 3 and 4 are satisfied. For a constant 0 < κ ≤ 1, [...]

⁴ We thank George Wynne for pointing out that our analysis can be extended to this weak version of ABQ.

[...] Exponential decay: Assume that there exist constants α > 0, C_0 > 0 and D_0 > 0 such that d_n(C) ≤ C_0 exp(−D_0 n^α) holds for all n ∈ ℕ. Then e_n(C) ≤ √(2C_0) γ^{−1} exp(−D_1 n^α) holds for all n ∈ ℕ, with D_1 := 2^{−1−2α} D_0.

Polynomial decay: Assume that there exist constants α > 0 and C_0 > 0 such that d_n(C) ≤ C_0 n^{−α} holds for all n ∈ ℕ. Then e_n(C) ≤ C_1 n^{−α} holds for all n ∈ ℕ, with C_1 := 2^{5α+1} γ^{−2} C_0.

Generic case: We have e_n(C) ≤ √2 γ^{−1} min_{1≤m<n} [...]

[...] square-exponential kernels k(x, x′) = exp(−‖x − x′‖²/(2σ²)) with σ > 0, multiquadric kernels k(x, x′) = (−1)^{⌈β⌉}(c² + ‖x − x′‖²)^{β} with β, c > 0 such that β ∉ ℕ, where ⌈β⌉ denotes the smallest integer greater than β, and inverse multiquadric kernels k(x, x′) = (c² + ‖x − x′‖²)^{−β} with β > 0. We have the following bound on the Kolmogorov n-width of the set C_{k,q} for these kernels; the proof is in Appendix C.2.

Proposition 4.2. Let Ω ⊂ ℝ^d be a cube, and suppose that Assumption 2 is satisfied. Let k be a square-exponential kernel or an (inverse) multiquadric kernel. Then there exist constants C_0, D_0 > 0 such that d_n(C_{k,q}) ≤ C_0 exp(−D_0 n^{1/d}) holds for all n ∈ ℕ.

The requirement for Ω to be a cube stems from the use of Wendland [38, Thm. 11.22] in our proof, which requires this condition. In fact, it can be weakened to Ω being a compact set satisfying an interior cone condition, but the resulting rate weakens to O(exp(−D_1 n^{1/(2d)})) (note that this is still exponential); see [38, Sec. 11.4]. This also applies to the following results. Combining Prop. 4.2 with Lemma 3.1, Thm. 3.3 and Lemma 4.1, we now obtain a bound on sup_{x∈Ω} q(x)√(k_{X_n}(x,x)).

Theorem 4.3 (proof in Appendix C.3). Suppose that Assumptions 2, 3 and 4 are satisfied. Let Ω ⊂ ℝ^d be a cube, and k be a square-exponential kernel or an (inverse) multiquadric kernel. For a constant 0 < κ ≤ 1, assume that x_1, …, x_n are generated by a κ-weak version of ABQ (4). Then there exist constants C_1, D_1 > 0 such that

    sup_{x∈Ω} q(x)√(k_{X_n}(x,x)) ≤ C_1 (κ C_L/C_U)^{−1/2} exp(−D_1 n^{1/d})   (n ∈ ℕ).

As a direct corollary of Prop. 2.1 and Thm. 4.3, we finally obtain a convergence rate of ABQ with an infinitely smooth kernel, which is exponentially fast.

Corollary 4.4. Suppose that Assumptions 1, 2, 3 and 4 are satisfied, and that C_{π/q} := ∫ π(x)/q(x) [...]
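The quantity sup_{x∈Ω} q(x)√(k_{X_n}(x,x)) bounded in Thm. 4.3 is easy to track empirically. The sketch below (illustrative only; taking q ≡ 1 and b_ℓ ≡ 1 reduces (4) to plain, non-adaptive uncertainty sampling, a uniform grid stands in for the cube Ω = [0,1], and the square-exponential kernel and its bandwidth are arbitrary) records the decay of the supremum of the posterior standard deviation as points are selected greedily:

```python
import numpy as np

k = lambda x, y, ell=0.2: np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))
grid = np.linspace(0.0, 1.0, 501)   # candidate set standing in for the cube Omega

X = [0.5]                           # initial design point
sup_sd = []
for _ in range(12):
    Xa = np.array(X)
    K = k(Xa, Xa) + 1e-12 * np.eye(len(Xa))
    kn = k(grid, Xa)
    var = 1.0 - np.einsum("ij,ji->i", kn, np.linalg.solve(K, kn.T))
    var = np.maximum(var, 0.0)                 # clip tiny negative round-off
    sup_sd.append(float(np.sqrt(var.max())))   # sup_x q(x) sqrt(k_{X_n}(x,x)), q = 1
    X.append(float(grid[np.argmax(var)]))      # rule (4) with F(y) = y, b_l = 1
```

Under Assumption 3, an adaptive term b_ℓ can only perturb this selection within the constant factor C_L/C_U, which is, roughly, why the exponential rate of Thm. 4.3 survives adaptivity.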
