A Stochastic Composite Gradient Method with Incremental Variance Reduction

Junyu Zhang (University of Minnesota, Minneapolis, Minnesota 55455) and Lin Xiao (Microsoft Research, Redmond, Washington 98052, lin.xiao)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.

1 Introduction

In this paper, we consider stochastic composite optimization problems

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; f\bigl(\mathbb{E}_\xi[g_\xi(x)]\bigr) + r(x), \qquad (1)$$

where $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and possibly nonconvex function, $\xi$ is a random variable, $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth vector mapping for a.e. $\xi$, and $r$ is convex and lower-semicontinuous.

5、ndomvariable,g:Rd Rp is a smooth vector mapping fora.e. , andris convex and lower-semicontinuous. A special case we will consider separately is whenis a discrete random variable with uniform distribution over 1,2,. .,n. In this case the problem is equivalent to a deterministic optimization problem m

6、inimize xRd f ? 1 n n X i=1 gi(x) ? + r(x).(2) The formulations in(1)and(2)cover a broader range of applications than classical stochastic optimization and empirical risk minimization (ERM) problems where eachgis a scalar function (p =1) andfis the scalar identity map. Interesting examples include t

7、he policy evaluation in reinforcement learning (RL) e.g.,30, the risk-averse mean-variance optimization (e.g.,28,29, through a reformulation by 35), the stochastic variational inequality (e.g.,12,15 through a reformulation in 10), the 2-level composite risk minimization problems 7, etc. For the ease

8、 of notation, we defi ne g(x) := Eg(x),F(x) := f(g(x),(x) := F(x) + r(x).(3) In addition, letf 0 andF0denote the gradients offandFrespectively, andg0 (x) R pd denote the Jacobian matrix of gat x. Then we have F0(x) = ? f ?E g(x) ? = ? Eg0 (x) ?T f 0?Eg(x)? . 33rd Conference on Neural Information Pro
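To make the composition structure and the chain rule above concrete, here is a small sketch (our own illustration, not part of the paper) that evaluates $F'(x)$ for a toy instance of problem (2); the names `f`, `f_grad`, `g`, `g_jac` and the data `A` are hypothetical placeholders, with $f$ smooth but nonconvex and each $g_i$ linear.

```python
import numpy as np

# Toy instance of problem (2): p = 2, d = 3, n = 5.
# Outer f(y) = y[0]**2 + sin(y[1]); inner g_i(x) = A_i @ x, so g_i'(x) = A_i.
rng = np.random.default_rng(0)
n, p, d = 5, 2, 3
A = rng.standard_normal((n, p, d))

def f(y):                     # outer function f : R^p -> R
    return y[0] ** 2 + np.sin(y[1])

def f_grad(y):                # gradient f'(y) in R^p
    return np.array([2.0 * y[0], np.cos(y[1])])

def g(i, x):                  # inner mapping g_i : R^d -> R^p
    return A[i] @ x

def g_jac(i, x):              # Jacobian g_i'(x) in R^{p x d}
    return A[i]

def F_grad(x):
    """Chain rule: F'(x) = (gbar'(x))^T f'(gbar(x)), with gbar the average of the g_i."""
    g_bar = np.mean([g(i, x) for i in range(n)], axis=0)
    J_bar = np.mean([g_jac(i, x) for i in range(n)], axis=0)
    return J_bar.T @ f_grad(g_bar)

print(F_grad(rng.standard_normal(d)))   # the exact (deterministic) composite gradient
```

The CIVR method developed below replaces the two exact averages $\bar g(x)$ and $\bar g'(x)$ with variance-reduced stochastic estimates.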

Table 1: Sample complexities of CIVR (Composite Incremental Variance Reduction). Common assumptions: $f$ and $g_\xi$ are Lipschitz and smooth, hence $F$ is smooth.

Problem | $F$ nonconvex, $r$ convex | $F$ $\tau$-gradient dominant, $r\equiv 0$ | $F$ convex, $r$ convex, $\Phi$ $\mu$-optimally strongly convex
(1) | $\mathcal{O}(\epsilon^{-3/2})$ | $\mathcal{O}(\tau\epsilon^{-1}\log(1/\epsilon))$ | $\mathcal{O}(\mu^{-1}\epsilon^{-1}\log(1/\epsilon))$
(2) | $\mathcal{O}(\min\{\epsilon^{-3/2},\, n^{1/2}\epsilon^{-1}\})$ | $\mathcal{O}((n + \tau n^{1/2})\log(1/\epsilon))$ | $\mathcal{O}((n + \mu^{-1} n^{1/2})\log(1/\epsilon))$

In practice, computing $F'(x)$ exactly can be very costly, if not impossible. Due to the nonlinearity of the outer composition, simply multiplying unbiased estimators of $\bar g(x) = \mathbb{E}_\xi[g_\xi(x)]$ and $\bar g'(x) = \mathbb{E}_\xi[g'_\xi(x)]$ results in a biased estimator for $F'(x)$, namely, $\mathbb{E}\bigl[g'_\xi(x)^T f'(g_\xi(x))\bigr] \neq F'(x)$; see, e.g., [35]. This is in great contrast to the classical stochastic optimization problem

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; \mathbb{E}_\xi[g_\xi(x)] + r(x), \qquad (4)$$

where one can always get an unbiased gradient estimator for the smooth part. This fact makes such a composition structure of independent interest for research on stochastic and randomized algorithms.
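As a quick numerical sanity check of this bias (again our own toy illustration, not from the paper), take scalar $f(y) = y^3$ and $g_\xi(x) = x + \xi$ with $\mathbb{E}[\xi] = 0$, so that $F'(x) = 3x^2$; the naive plug-in estimator $g'_\xi(x)^T f'(g_\xi(x))$ has expectation $3(x^2 + \mathrm{Var}[\xi]) \neq F'(x)$:

```python
import numpy as np

# Scalar toy instance of (1): f(y) = y**3, g_xi(x) = x + xi with E[xi] = 0,
# so gbar(x) = x, F(x) = x**3 and the true gradient is F'(x) = 3*x**2.
rng = np.random.default_rng(0)
x, num_samples = 1.0, 10**6
xi = rng.standard_normal(num_samples)

# Naive plug-in estimator g_xi'(x) * f'(g_xi(x)) = 1 * 3*(x + xi)**2.
plug_in = 3.0 * (x + xi) ** 2

print("true F'(x):           ", 3.0 * x**2)      # 3.0
print("mean plug-in estimate:", plug_in.mean())  # close to 3*(x**2 + Var[xi]) = 6.0
```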

In this paper, we develop an efficient stochastic composite gradient method called CIVR (Composite Incremental Variance Reduction) for solving problems of the forms (1) and (2). We measure efficiency by the sample complexity of the individual functions $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated at some point, in order to find an $\epsilon$-approximate solution. For nonconvex functions, an $\epsilon$-approximate solution is some random output of the algorithm $\bar x\in\mathbb{R}^d$ that satisfies $\mathbb{E}[\|\mathcal{G}(\bar x)\|^2]\le\epsilon$, where $\mathcal{G}(\bar x)$ is the proximal gradient mapping of the objective function at $\bar x$ (see details in Section 2). If $r\equiv 0$, then $\mathcal{G}(\bar x) = F'(\bar x)$ and the criterion for $\epsilon$-approximation becomes $\mathbb{E}[\|F'(\bar x)\|^2]\le\epsilon$. If the objective $\Phi$ is convex, we require $\mathbb{E}[\Phi(\bar x)] - \Phi^* \le \epsilon$, where $\Phi^* = \inf_x \Phi(x)$. For smooth and convex functions, these two notions are compatible, meaning that the dependence of the sample complexity on $\epsilon$ in terms of both notions is of the same order.

Table 1 summarizes the sample complexities of the CIVR method under different assumptions obtained in this paper. We can define a condition number $\kappa = \mathcal{O}(\tau)$ for $\tau$-gradient dominant functions and $\kappa = \mathcal{O}(1/\mu)$ for $\mu$-optimally strongly convex functions; then the complexities become $\mathcal{O}(\kappa\epsilon^{-1}\log(1/\epsilon))$ and $\mathcal{O}((n + \kappa n^{1/2})\log(1/\epsilon))$ for (1) and (2) respectively. In order to better position our contributions, we next discuss related work and then put these results into context.

1.1 Related Work

We first discuss the nonconvex stochastic optimization problem (4), which is a special case of (1). When $r\equiv 0$ and $\bar g(x) = \mathbb{E}_\xi[g_\xi(x)]$ is smooth, Ghadimi and Lan [9] developed a randomized stochastic gradient method with iteration complexity $\mathcal{O}(\epsilon^{-2})$. Allen-Zhu [2] obtained $\mathcal{O}(\epsilon^{-1.625})$ with an additional second-order guarantee. There are also many recent works on solving its finite-sum version

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; \frac{1}{n}\sum_{i=1}^n g_i(x) + r(x), \qquad (5)$$

which is also a special case of (2). By extending the variance reduction techniques SVRG [13, 34] and SAGA [6] to nonconvex optimization, Allen-Zhu and Hazan [3] and Reddi et al. [24, 25, 26] developed randomized algorithms with sample complexity $\mathcal{O}(n + n^{2/3}\epsilon^{-1})$. Under additional assumptions of gradient dominance or strong convexity, they obtained sample complexity $\mathcal{O}((n + \kappa n^{2/3})\log(1/\epsilon))$, where $\kappa$ is a suitable condition number. Allen-Zhu [1] and Lei et al. [17] obtained $\mathcal{O}(\min\{\epsilon^{-5/3},\, n^{2/3}\epsilon^{-1}\})$. Based on a new variance reduction technique called SARAH [21], Nguyen et al. [22] and Pham et al. [23] developed nonconvex extensions to obtain sample complexities $\mathcal{O}(\epsilon^{-3/2})$ and $\mathcal{O}(n + n^{1/2}\epsilon^{-1})$ for solving the expectation and finite-sum cases respectively. Fang et al. [8] introduced another variance reduction technique called Spider, which can be viewed as a more general variant of SARAH. They obtained sample complexities $\mathcal{O}(\epsilon^{-3/2})$ and $\mathcal{O}(\min\{\epsilon^{-3/2},\, n^{1/2}\epsilon^{-1}\})$ for the two cases respectively, but require small step sizes that are proportional to $\epsilon$. Wang et al. [33] extended Spider to obtain the same complexities with constant step sizes, and $\mathcal{O}((n + \kappa^2)\log(1/\epsilon))$ under the gradient-dominant condition. In addition, Zhou et al. [36] obtained similar results using a nested SVRG approach.

In addition to the above works on solving special cases of (1) and (2), there are also considerable recent works on a more general, two-layer stochastic composite optimization problem

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; \mathbb{E}_\zeta\Bigl[f_\zeta\bigl(\mathbb{E}_\xi[g_\xi(x)]\bigr)\Bigr] + r(x), \qquad (6)$$

where $f$ is parametrized by another random variable $\zeta$, which is independent of $\xi$. For the case $r\equiv 0$, Wang et al. [31] derived algorithms to find an $\epsilon$-approximate solution with sample complexities $\mathcal{O}(\epsilon^{-4})$, $\mathcal{O}(\epsilon^{-3.5})$ and $\mathcal{O}(\epsilon^{-1.25})$ for the smooth nonconvex case, the smooth convex case and the smooth strongly convex case respectively. For nontrivial convex $r$, Wang et al. [32] obtained improved sample complexities of $\mathcal{O}(\epsilon^{-2.25})$, $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-1})$ for the three cases mentioned above respectively. As a special case of (6), the following finite-sum problem also received significant attention:

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; \frac{1}{m}\sum_{j=1}^m f_j\Bigl(\frac{1}{n}\sum_{i=1}^n g_i(x)\Bigr) + r(x). \qquad (7)$$

When $r\equiv 0$ and the overall objective function is strongly convex, Lian et al. [19] derived two algorithms based on the SVRG scheme to attain sample complexities $\mathcal{O}((m + n + \kappa^3)\log(1/\epsilon))$ and $\mathcal{O}((m + n + \kappa^4)\log(1/\epsilon))$ respectively, where $\kappa$ is some suitably defined condition number. Huo et al. [11] also used the SVRG scheme to obtain an $\mathcal{O}(m + n + (m+n)^{2/3}\epsilon^{-1})$ complexity for the smooth nonconvex case and $\mathcal{O}((m + n + \kappa^3)\log(1/\epsilon))$ for strongly convex problems with nonsmooth $r$. More recently, Zhang and Xiao [35] proposed a composite randomized incremental gradient method based on the SAGA estimator [6], which matches the best known $\mathcal{O}(m + n + (m+n)^{2/3}\epsilon^{-1})$ complexity when $F$ is smooth and nonconvex, and obtained an improved complexity $\mathcal{O}((m + n + \kappa(m+n)^{2/3})\log(1/\epsilon))$ under either gradient-dominant or strongly convex assumptions. When applied to the special cases (1) and (2) we focus on in this paper ($m=1$), these results are strictly worse than ours in Table 1.

1.2 Contributions and Outline

We develop the CIVR method by extending the variance reduction technique of SARAH [21, 23] and Spider [8, 33] to solve the composite optimization problems (1) and (2). The complexities of CIVR in Table 1 match the best results for solving the non-composite problems (4) and (5), despite the additional outer composition and the composite-gradient estimator always being biased. In addition:

- By setting $f$ and the $g_i$'s to be the identity map and scalar mappings respectively, problem (2) includes problem (5) as a special case. Therefore, the lower bounds in [8] for the non-composite finite-sum optimization problem (5) indicate that our $\mathcal{O}(\min\{\epsilon^{-3/2},\, n^{1/2}\epsilon^{-1}\})$ complexity for solving the more general composite finite-sum problem (2) is near-optimal.

- Under the assumptions of gradient dominance or strong convexity, the $\mathcal{O}((n + \kappa n^{1/2})\log(1/\epsilon))$ complexity only appeared for the special case (5) in the recent work [18]. Our results indicate that the additional smooth composition in (1) and (2) does not incur higher complexity compared with (4) and (5), despite the difficulty of dealing with biased estimators.

- We believe these results can also be extended to the two-layer problems (6) and (7), by replacing $n$ with $m+n$ in Table 1. But the extensions require quite different techniques and we will address them in a separate paper.

The rest of this paper is organized as follows. In Section 2, we introduce the CIVR method. In Section 3, we present convergence results of CIVR for solving the composite optimization problems (1) and (2) and the required parameter settings. Better complexities of CIVR under the gradient-dominant and optimally strongly convex conditions are given in Section 4. In Section 5, we present numerical experiments for solving a risk-averse portfolio optimization problem on real-world datasets.

2 The composite incremental variance reduction (CIVR) method

With the notations in (3), we can write the composite stochastic optimization problem (1) as

$$\mathop{\mathrm{minimize}}_{x\in\mathbb{R}^d}\;\; \bigl\{\Phi(x) = F(x) + r(x)\bigr\}, \qquad (11)$$

where $F$ is smooth and $r$ is convex. The proximal operator of $r$ with parameter $\eta>0$ is defined as

$$\mathrm{prox}_r^\eta(x) := \arg\min_y \Bigl\{ r(y) + \frac{1}{2\eta}\|y - x\|^2 \Bigr\}. \qquad (12)$$

Algorithm 1: Composite Incremental Variance Reduction (CIVR)

input: initial point $x_0^1$, step size $\eta>0$, number of epochs $T\ge 1$, and a set of triples $\{\tau_t, B_t, S_t\}$ for $t=1,\ldots,T$, where $\tau_t$ is the epoch length and $B_t$ and $S_t$ are sample sizes in epoch $t$.

for $t = 1,\ldots,T$ do
    Sample a set $\mathcal{B}_t$ with size $B_t$ from the distribution of $\xi$, and construct the estimates
    $$y_0^t = \frac{1}{B_t}\sum_{\xi\in\mathcal{B}_t} g_\xi(x_0^t), \qquad z_0^t = \frac{1}{B_t}\sum_{\xi\in\mathcal{B}_t} g'_\xi(x_0^t). \qquad (8)$$
    Compute $\widetilde{\nabla} F(x_0^t) = (z_0^t)^T f'(y_0^t)$ and update: $x_1^t = \mathrm{prox}_r^\eta\bigl(x_0^t - \eta\,\widetilde{\nabla} F(x_0^t)\bigr)$.
    for $i = 1,\ldots,\tau_t - 1$ do
        Sample a set $\mathcal{S}_i^t$ with size $S_t$ from the distribution of $\xi$, and construct the estimates
        $$y_i^t = y_{i-1}^t + \frac{1}{S_t}\sum_{\xi\in\mathcal{S}_i^t}\bigl(g_\xi(x_i^t) - g_\xi(x_{i-1}^t)\bigr), \qquad (9)$$
        $$z_i^t = z_{i-1}^t + \frac{1}{S_t}\sum_{\xi\in\mathcal{S}_i^t}\bigl(g'_\xi(x_i^t) - g'_\xi(x_{i-1}^t)\bigr). \qquad (10)$$
        Compute $\widetilde{\nabla} F(x_i^t) = (z_i^t)^T f'(y_i^t)$ and update: $x_{i+1}^t = \mathrm{prox}_r^\eta\bigl(x_i^t - \eta\,\widetilde{\nabla} F(x_i^t)\bigr)$.
    end
    Set $x_0^{t+1} = x_{\tau_t}^t$.
end

output: $\bar x$ randomly chosen from $\{x_i^t\}$ for $t=1,\ldots,T$ and $i=0,\ldots,\tau_t-1$.

We assume that $r$ is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method (e.g., [20, 4]) for solving problem (11) is

$$x_{t+1} = \mathrm{prox}_r^\eta\bigl(x_t - \eta F'(x_t)\bigr), \qquad (13)$$

where $\eta$ is the step size. The proximal gradient mapping of $\Phi$ is defined as

$$\mathcal{G}_\eta(x) := \frac{1}{\eta}\Bigl(x - \mathrm{prox}_r^\eta\bigl(x - \eta F'(x)\bigr)\Bigr). \qquad (14)$$

As a result, the proximal gradient method (13) can be written as $x_{t+1} = x_t - \eta\,\mathcal{G}_\eta(x_t)$. Notice that when $r\equiv 0$, $\mathrm{prox}_r^\eta(\cdot)$ becomes the identity mapping and we have $\mathcal{G}_\eta(x)\equiv F'(x)$ for any $\eta>0$. Suppose $\bar x$ is generated by a randomized algorithm. We call $\bar x$ an $\epsilon$-stationary point in expectation if

$$\mathbb{E}\bigl[\|\mathcal{G}_\eta(\bar x)\|^2\bigr] \le \epsilon. \qquad (15)$$

(We assume that $\eta$ is a constant that does not depend on $\epsilon$.) As we mentioned in the introduction, we measure the efficiency of an algorithm by its sample complexity of $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated, in order to find a point $\bar x$ that satisfies (15).
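For a concrete instance of (12) and (14): if $r(x) = \lambda\|x\|_1$, the proximal operator is coordinate-wise soft-thresholding. The sketch below (our illustration; `F_grad` is a user-supplied oracle for $F'$) computes $\mathrm{prox}_r^\eta$ and the gradient mapping $\mathcal{G}_\eta(x)$.

```python
import numpy as np

def prox_l1(v, eta, lam):
    """Proximal operator (12) for r(x) = lam * ||x||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def gradient_mapping(x, eta, lam, F_grad):
    """Proximal gradient mapping (14): G_eta(x) = (x - prox(x - eta * F'(x))) / eta."""
    return (x - prox_l1(x - eta * F_grad(x), eta, lam)) / eta

# Example with F(x) = 0.5 * ||x||^2, so F'(x) = x.
x = np.array([1.0, -0.2, 0.05])
print(gradient_mapping(x, eta=0.1, lam=0.5, F_grad=lambda v: v))
```

When `lam = 0` the prox reduces to the identity and the mapping returns `F_grad(x)`, matching the remark above.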

Our goal is to develop a randomized algorithm that has low sample complexity.

We present in Algorithm 1 the Composite Incremental Variance Reduction (CIVR) method. This method employs a two time-scale variance-reduced estimator for both the inner function value $\bar g(\cdot) = \mathbb{E}_\xi[g_\xi(\cdot)]$ and its Jacobian $\bar g'(\cdot)$. At the beginning of each outer iteration $t$ (each called an epoch), we construct a relatively accurate estimate $y_0^t$ for $\bar g(x_0^t)$ and $z_0^t$ for $\bar g'(x_0^t)$ respectively, using a relatively large sample size $B_t$. During each inner iteration $i$ of the $t$-th epoch, we construct an estimate $y_i^t$ for $\bar g(x_i^t)$ and $z_i^t$ for $\bar g'(x_i^t)$ respectively, using a smaller sample size $S_t$ and incremental corrections from the previous iterations. Note that the epoch length $\tau_t$ and the sample sizes $B_t$ and $S_t$ are all adjustable for each epoch $t$. Therefore, besides setting a constant set of parameters, we can also adjust them gradually in order to obtain better theoretical properties and practical performance.
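To make the update rules of Algorithm 1 concrete, here is a compact Python sketch of CIVR specialized to the finite-sum case (2). It is an illustration under stated assumptions, not the authors' reference implementation: `g(i, x)`, `g_jac(i, x)`, `f_grad(y)` and `prox(v, eta)` are user-supplied oracles (for instance the toy oracles and the soft-thresholding prox from the earlier sketches).

```python
import numpy as np

def civr(x0, eta, T, epoch_len, batch_B, batch_S, n, g, g_jac, f_grad, prox, rng):
    """Sketch of CIVR (Algorithm 1) for the finite-sum problem (2).

    Oracles: g(i, x) -> g_i(x) in R^p, g_jac(i, x) -> Jacobian in R^{p x d},
    f_grad(y) -> f'(y) in R^p, prox(v, eta) -> proximal operator of r.
    """
    x = x0
    visited = []                                            # points where a step is taken
    for t in range(T):
        # Eq. (8): large-batch estimates y_0^t, z_0^t at the start of epoch t.
        big = rng.choice(n, size=min(batch_B, n), replace=False)
        y = np.mean([g(i, x) for i in big], axis=0)
        z = np.mean([g_jac(i, x) for i in big], axis=0)
        for i in range(epoch_len):
            if i > 0:
                # Eqs. (9)-(10): incremental corrections on a small batch of size S_t.
                small = rng.choice(n, size=batch_S, replace=True)
                y = y + np.mean([g(j, x) - g(j, x_prev) for j in small], axis=0)
                z = z + np.mean([g_jac(j, x) - g_jac(j, x_prev) for j in small], axis=0)
            visited.append(x)
            grad_est = z.T @ f_grad(y)                      # (z_i^t)^T f'(y_i^t)
            x_prev, x = x, prox(x - eta * grad_est, eta)    # proximal gradient step
    return visited[rng.integers(len(visited))]              # random output, as in Algorithm 1
```

For the expectation case (1) one would draw fresh samples of $\xi$ instead of indices from $\{1,\ldots,n\}$; in the finite-sum case the epoch-start batch can be taken to be the full index set, as discussed in Section 3.2.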

This variance-reduction technique was first proposed as part of SARAH [21], where it is called recursive variance reduction. It was also proposed in [8] in the form of a Stochastic Path-Integrated Differential EstimatoR (Spider). Here we simply call it incremental variance reduction. A distinct feature of this incremental estimator is that the inner-loop estimates $y_i^t$ and $z_i^t$ are biased, i.e.,

$$\mathbb{E}\bigl[y_i^t \,\big|\, x_i^t\bigr] = \bar g(x_i^t) - \bar g(x_{i-1}^t) + y_{i-1}^t \neq \bar g(x_i^t), \qquad
\mathbb{E}\bigl[z_i^t \,\big|\, x_i^t\bigr] = \bar g'(x_i^t) - \bar g'(x_{i-1}^t) + z_{i-1}^t \neq \bar g'(x_i^t). \qquad (16)$$

This is in contrast to two other popular variance-reduction techniques, namely SVRG [13] and SAGA [6], whose gradient estimators are always unbiased. Note that unbiased estimators for $\bar g(x_i^t)$ and $\bar g'(x_i^t)$ are not essential here, because the composite estimator $\widetilde{\nabla} F(x_i^t) = (z_i^t)^T f'(y_i^t)$ is always biased. Therefore our main task is to control the variance and bias altogether for the proposed estimator.

3 Convergence Analysis

In this section, we present theoretical results on the convergence properties of CIVR (Algorithm 1) when the composite function $F$ is smooth. More specifically, we make the following assumptions.

Assumption 1. The following conditions hold concerning problems (1) and (2):
- $f:\mathbb{R}^p\to\mathbb{R}$ is a $C^1$-smooth and $\ell_f$-Lipschitz function, and its gradient $f'$ is $L_f$-Lipschitz.
- Each $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a $C^1$-smooth and $\ell_g$-Lipschitz vector mapping, and its Jacobian $g'_\xi$ is $L_g$-Lipschitz. Consequently, $\bar g$ in (3) is $\ell_g$-Lipschitz and its Jacobian $\bar g'$ is $L_g$-Lipschitz.
- $r:\mathbb{R}^d\to\mathbb{R}$ is a convex and lower-semicontinuous function.
- The overall objective function $\Phi$ is bounded below, i.e., $\Phi^* = \inf_x \Phi(x) > -\infty$.

Assumption 2. For problem (1), we further assume that there exist constants $\sigma_g$ and $\sigma_{g'}$ such that

$$\mathbb{E}_\xi\bigl[\|g_\xi(x) - \bar g(x)\|^2\bigr] \le \sigma_g^2, \qquad \mathbb{E}_\xi\bigl[\|g'_\xi(x) - \bar g'(x)\|^2\bigr] \le \sigma_{g'}^2. \qquad (17)$$

As a result of Assumption 1, $F(x) = f(\bar g(x))$ is smooth and $F'$ is $L_F$-Lipschitz continuous with $L_F = \ell_g^2 L_f + \ell_f L_g$ (see proof in the supplementary materials). For convenience, we also define two constants

$$G_0 := 2\bigl(\ell_g^4 L_f^2 + \ell_f^2 L_g^2\bigr), \qquad \text{and} \qquad \sigma_0^2 := 2\bigl(\ell_g^2 L_f^2 \sigma_g^2 + \ell_f^2 \sigma_{g'}^2\bigr). \qquad (18)$$

It is important to notice that $G_0 = \mathcal{O}(L_F^2)$, hence the step size used later is $\eta = \Theta(1/\sqrt{G_0}) = \Theta(1/L_F)$. We are allowed to use this constant step size mainly due to the assumption that each $g_\xi(\cdot)$ is smooth, instead of the weaker assumption that $\mathbb{E}_\xi[g_\xi(x)]$ is smooth as in classical stochastic optimization.
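As a small numeric illustration of the constants in (18) and of the resulting step-size scale (our own example with made-up constant values; the step-size formula below mirrors the condition stated in Theorems 1 and 2):

```python
import math

# Hypothetical constants satisfying Assumptions 1 and 2 (made-up values).
ell_f, L_f = 2.0, 1.5               # f is ell_f-Lipschitz, f' is L_f-Lipschitz
ell_g, L_g = 3.0, 0.5               # each g_xi is ell_g-Lipschitz, g_xi' is L_g-Lipschitz
sigma_g_sq, sigma_gp_sq = 0.1, 0.2  # variance bounds from (17)

L_F = ell_g**2 * L_f + ell_f * L_g                   # F' is L_F-Lipschitz
G_0 = 2.0 * (ell_g**4 * L_f**2 + ell_f**2 * L_g**2)  # first constant in (18)
sigma_0_sq = 2.0 * (ell_g**2 * L_f**2 * sigma_g_sq + ell_f**2 * sigma_gp_sq)

# Step-size bound appearing in Theorems 1 and 2 below; note eta = Theta(1/L_F).
eta = 4.0 / (L_F + math.sqrt(L_F**2 + 12.0 * G_0))
print(L_F, G_0, sigma_0_sq, eta)
```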

In the next two subsections, we present the complexity analysis of CIVR for solving problems (1) and (2) respectively. Due to the space limitation, all proofs are provided in the supplementary materials.

3.1 The composite expectation case

The following results for solving problem (1) are presented with notations defined in (3), (14) and (18).

Theorem 1. Suppose Assumptions 1 and 2 hold. Given any $\epsilon>0$, we set $T = \lceil 1/\sqrt{\epsilon}\,\rceil$ and

$$\tau_t = \tau = \lceil 1/\sqrt{\epsilon}\,\rceil, \qquad B_t = B = \lceil \sigma_0^2/\epsilon \rceil, \qquad S_t = S = \lceil 1/\sqrt{\epsilon}\,\rceil, \qquad \text{for } t = 1,\ldots,T.$$

Then as long as $\eta \le \frac{4}{L_F + \sqrt{L_F^2 + 12 G_0}}$, the output $\bar x$ of Algorithm 1 satisfies

$$\mathbb{E}\bigl[\|\mathcal{G}_\eta(\bar x)\|^2\bigr] \le \Bigl(\frac{8\bigl(\Phi(x_0^1) - \Phi^*\bigr)}{\eta} + 6\Bigr)\,\epsilon = \mathcal{O}(\epsilon). \qquad (19)$$

As a result, the sample complexity of obtaining an $\epsilon$-approximate solution is $TB + 2T\tau S = \mathcal{O}(\epsilon^{-3/2})$.
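In code, the constant-parameter scheme of Theorem 1 reads as follows (a sketch; `sigma_0_sq` is the constant $\sigma_0^2$ from (18) and `eps` the target accuracy):

```python
import math

def theorem1_schedule(eps, sigma_0_sq):
    """Constant epoch/batch sizes from Theorem 1 for a target accuracy eps."""
    T   = math.ceil(1.0 / math.sqrt(eps))    # number of epochs
    tau = math.ceil(1.0 / math.sqrt(eps))    # epoch length tau_t = tau
    B   = math.ceil(sigma_0_sq / eps)        # large batch size B_t = B
    S   = math.ceil(1.0 / math.sqrt(eps))    # small batch size S_t = S
    total_samples = T * B + 2 * T * tau * S  # = O(eps**-1.5)
    return T, tau, B, S, total_samples

print(theorem1_schedule(eps=1e-3, sigma_0_sq=1.0))
```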

Note that in the above scheme, the epoch lengths $\tau_t$ and all the batch sizes $B_t$ and $S_t$ are set to be constant (depending on a pre-fixed $\epsilon$) without regard of $t$. Intuitively, we do not need as many samples in the early stage of the algorithm as in the later stage. In addition, it will be useful in practice to have a variant of the algorithm that can adaptively choose $\tau_t$, $B_t$ and $S_t$ throughout the epochs without dependence on a pre-fixed precision. This is done in the following theorem.

Theorem 2. Suppose Assumptions 1 and 2 hold. We set $\tau_t = S_t = \lceil a t + b\rceil$ and $B_t = \lceil \sigma_0^2 (a t + b)^2\rceil$, where $a>0$ and $b\ge 0$. Then as long as $\eta \le \frac{4}{L_F + \sqrt{L_F^2 + 12 G_0}}$, we have for any $T\ge 1$,

$$\mathbb{E}\bigl[\|\mathcal{G}_\eta(\bar x)\|^2\bigr] \le \frac{2}{aT^2 + (a+2b)T}\Biggl(\frac{8\bigl(\Phi(x_0^1) - \Phi^*\bigr)}{\eta} + \frac{6}{a+b} + \frac{6}{a}\ln\Bigl(\frac{aT+b}{a+b}\Bigr)\Biggr) = \mathcal{O}\Bigl(\frac{\ln T}{T^2}\Bigr). \qquad (20)$$

As a result, obtaining an $\epsilon$-approximate solution requires $T = \mathcal{O}(1/\sqrt{\epsilon})$ epochs and a total sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-3/2})$, where the $\widetilde{\mathcal{O}}(\cdot)$ notation hides logarithmic factors.
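The adaptive schedule of Theorem 2 can be generated epoch by epoch without fixing $\epsilon$ in advance (sketch, with hypothetical values for $a$, $b$ and $\sigma_0^2$):

```python
import math

def theorem2_schedule(t, a, b, sigma_0_sq):
    """Epoch-t parameters from Theorem 2: tau_t = S_t = ceil(a*t + b),
    B_t = ceil(sigma_0^2 * (a*t + b)**2)."""
    tau_t = math.ceil(a * t + b)
    B_t = math.ceil(sigma_0_sq * (a * t + b) ** 2)
    return tau_t, B_t, tau_t   # (tau_t, B_t, S_t)

for t in range(1, 4):
    print(t, theorem2_schedule(t, a=2.0, b=1.0, sigma_0_sq=1.0))
```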

3.2 The composite finite-sum case

In this section, we consider the composite finite-sum optimization problem (2). In this case, the random variable $\xi$ has a uniform distribution over the finite index set $\{1,\ldots,n\}$. At the beginning of each epoch in Algorithm 1, we use the full sample $\mathcal{B}_t = \{1,\ldots,n\}$ to compute $y_0^t$ and $z_0^t$. Therefore $B_t = n$ for all $t$ and Equation (8) in Algorithm 1 becomes

$$y_0^t = \bar g(x_0^t) = \frac{1}{n}\sum_{j=1}^n g_j(x_0^t), \qquad z_0^t = \bar g'(x_0^t) = \frac{1}{n}\sum_{j=1}^n g'_j(x_0^t). \qquad (21)$$

Also in this case, we no longer need Assumption 2.

Theorem 3. Suppose As
