


A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers

Hao Yu
Amazon

Abstract

The alternating direction method of multipliers (ADMM) has recently received tremendous interest for distributed large-scale optimization in machine learning, statistics, multi-agent networks, and related applications. In this paper, we propose a new parallel multi-block stochastic ADMM for distributed stochastic optimization, where each node is only required to perform simple stochastic gradient descent updates. The proposed ADMM is fully parallel, can solve problems with arbitrary block structures, and has a convergence rate comparable to or better than existing state-of-the-art ADMM methods for stochastic optimization. Existing stochastic (or deterministic) ADMMs require each node to exchange its updated primal variables across nodes at each iteration and hence cause a significant amount of communication overhead; they require roughly the same number of inter-node communication rounds as the number of in-node computation rounds. In contrast, the number of communication rounds required by our new ADMM is only the square root of the number of computation rounds.
1 Introduction

Fix an integer N ≥ 2. Consider multi-block linearly constrained stochastic convex programs given by:

  min_{x_i ∈ X_i, ∀i}  f(x) = Σ_{i=1}^{N} f_i(x_i)   s.t.  Σ_{i=1}^{N} A_i x_i = b,   (1)

where x_i ∈ R^{d_i}, A_i ∈ R^{m×d_i}, b ∈ R^m, X_i ⊆ R^{d_i} are closed convex sets, and f_i(x_i) = E_ξ[f_i(x_i; ξ)] are convex functions. To have a compact representation of (1), we define x = [x_1; x_2; …; x_N] ∈ R^{Σ_{i=1}^{N} d_i}, X = Π_{i=1}^{N} X_i, f(x) = Σ_{i=1}^{N} f_i(x_i), and A = [A_1, A_2, …, A_N] ∈ R^{m×Σ_{i=1}^{N} d_i}. Note that the constraint Σ_{i=1}^{N} A_i x_i = b can now be written as Ax = b.

Problem (1) captures many important applications in machine learning, network scheduling, statistics, and finance. For example, (stochastic) linear programs that are too huge to be solved over a single node can be written as (1). To solve such large-scale linear programs in a distributed manner, we can save each A_i and f_i(·) at a separate node and let each node iteratively solve smaller sub-problems (with necessary inter-node communication). Another important application of formulation (1) is the distributed consensus training of a machine learning model over N nodes [15, 17, 23], described as follows: In an online training setup, i.i.d. realizations of f_i(·; ξ) are sampled at each node. In an offline training setup, f_i(x_i) = E_ξ[f_i(x_i; ξ)] is approximated by (1/N_i) Σ_{j=1}^{N_i} f_i^j(x_i), where N_i is the number of training samples at node i and each f_i^j(·) represents one training sample. To enforce that all N nodes train the same model, the constraint Ax = b is given by x_i = x_j for all i ≠ j ∈ {1, 2, …, N}. (In fact, we only need such constraints for pairs (i, j) that construct a connected graph over all nodes.)

The Alternating Direction Method of Multipliers (ADMM) is an effective and popular method to solve linearly constrained convex programs, especially distributed consensus optimization [28, 5], since it often yields distributed implementations with low complexity [4]. Conventional ADMMs are developed for the special case of problem (1) with N = 2 and/or deterministic f_i(x_i). To solve a two-block problem (1) where f_1 is a stochastic function and f_2 is a deterministic function, previous works [21, 25, 31, 1] have developed stochastic (two-block) ADMMs. It is unclear whether these methods can be extended to solve the case N ≥ 3. In fact, even for problem (1) where all f_i(x_i) are deterministic, [6] proves that the classical (two-block) ADMM, on which the stochastic versions in [21, 25] are built, converges for N = 2 but diverges for N ≥ 3. To solve stochastic convex program (1) with N ≥ 3, randomized block coordinate update ADMMs with O(1/ε²) convergence are developed in [27, 11]. Due to the challenging stochastic objective functions, the convergence rate of stochastic ADMMs is fundamentally slower than that of deterministic ADMMs, i.e., O(1/ε²) vs. O(1/ε) [13, 7, 11]. The O(1/ε²) convergence is optimal since it is optimal even for unconstrained stochastic convex optimization without strong convexity [20]. However, in distributed implementations of ADMMs, each node has to pass its most recent x_i value to its neighbors or a fusion center and then update the dual variable λ. Existing stochastic ADMM methods [21, 25, 11] require a communication step immediately after each x_i computation step. In practice, inter-node communication over TCP/IP is much slower than in-node memory computation and often requires additional set-up time, such that communication overhead is the performance bottleneck of most distributed optimization methods.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

As a consequence, communication-efficient optimization has recently attracted a lot of research interest [29, 14, 24, 15, 17, 18, 23]. Work [17] proposes a primal-dual method that can solve problem (1) with stochastic objective functions using O(1/ε²) computation iterations and O(1/ε) communication iterations. However, the method in [17] requires each objective function f_i(·) to satisfy the stringent condition that there exists M such that f_i(u) ≤ f_i(v) + ⟨d, u − v⟩ + M‖u − v‖ for any u, v and d ∈ ∂f_i(v). Such a condition is more stringent than smoothness when u and v are far apart from each other. For example, the simple scalar smooth function f(x) = x² does not satisfy this condition over X = R. Work [18] proposes a communication-efficient method to solve deterministic convex programs based on the quadratic penalty method and can obtain an ε-optimal solution with O(1/ε^{2+δ}) computation rounds (δ is a positive constant) and O(1/ε) communication rounds. For distributed consensus optimization over a network, which can be formulated as a special case of problem (1) where A_i and b are chosen to ensure all x_i are identical, mixing or local-averaging based methods with fast convergence (and low communication overhead) have recently been developed in [26, 22, 23, 19].

Our Contributions:
This paper proposes a new communication-efficient stochastic multi-block ADMM which performs communication rounds less frequently than computation rounds. For stochastic convex programs with general convex objective functions, our algorithm can achieve an ε-solution with O(1/ε²) computation rounds¹ and O(1/ε) communication rounds. That is, our communication-efficient ADMM has the same computation convergence rate as the ADMM in [11] but only requires the square root of the number of communication rounds required by the method in [11]. For stochastic convex programs with strongly convex objective functions, our algorithm can achieve an ε-accuracy solution with O(1/ε) computation rounds and O(1/√ε) communication rounds². The fast computation convergence (and even faster communication convergence) for strongly convex stochastic programs is not possessed by the ADMM in [11]. When applying our new multi-block ADMM to the special case of two-block problems, our algorithm has the same computation convergence as the existing two-block stochastic ADMM methods in [21, 25, 31, 1]. However, the number of communication rounds used by our ADMM is only the square root of that used by these previous methods.

Notations: This paper uses ‖A‖ to denote the spectral norm of matrix A; ‖z‖ to denote the Euclidean norm of vector z; and ⟨y, z⟩ = yᵀz to denote the inner product of vectors y and z. If a symmetric matrix Q ⪰ 0 is positive semi-definite, then we define ‖z‖²_Q = zᵀQz for any vector z.

2 Formulation and New Algorithm

Following the convention in [8], a function h(x) is said to be convex with modulus μ, or equivalently, μ-convex, if h(x) − (μ/2)‖x‖² is convex. The μ-convex definition unifies the conventional definitions of convexity and strong convexity. That is, a general convex function, which is not necessarily strongly convex, is convex with modulus μ = 0; and a strongly convex function is convex with modulus μ > 0. Throughout this paper, convex program (1) is assumed to satisfy the following standard assumption:

¹A computation round of our algorithm is just a single iteration of the SGD update.
²A logarithm factor log(1/ε) is hidden in the O(·) notation.

Assumption 1. Convex program (1) has a saddle point (x*, λ*). That is, x* is an optimal solution and λ* ∈ R^m is a Lagrange multiplier attaining strong duality q(λ*) = f(x*), where q(λ) = inf_{x_i ∈ X_i, ∀i} { f(x) + ⟨λ, Ax − b⟩ } is the Lagrangian dual function.

Note that strong duality in Assumption 1 is often stated via its equivalent "KKT conditions", e.g., in [7]. A mild sufficient condition for Assumption 1 to hold is that (1) has at least one feasible point and the domain of each f_i(x_i) includes X_i as an interior³.

Assume that, for each function f_i(x_i), unbiased subgradients G_i(x_i; ξ) satisfying E[G_i(x_i; ξ)] ∈ ∂f_i(x_i), ∀x_i ∈ X_i, can be sampled.
Denote the stacked column vector G(x; ξ) = [G_1(x_1; ξ)ᵀ, …, G_N(x_N; ξ)ᵀ]ᵀ ∈ R^{Σ_{i=1}^{N} d_i}. We have E[G(x; ξ)] ∈ ∂f(x).

Consider the communication-efficient stochastic multi-block ADMM described in Algorithm 1. Since the f_i(x_i) are stochastic, ϕ_i^(t)(x_i) defined in (2) is fundamentally unknown. However, each ϕ_i^(t)(x_i) is γ^(t)-convex and its unbiased stochastic subgradient is available as long as we have unbiased stochastic subgradients of f_i(x_i). The sub-procedure STO-LOCAL involved in Algorithm 1 is a simple stochastic subgradient descent (SGD) procedure (with particular choices of parameters, starting points and averaging schemes) to minimize ϕ_i^(t)(·) over set X_i and is described in Algorithm 2.

Algorithm 1 Two-Layer Communication Efficient ADMM
1: Input: Algorithm parameters T, {ρ^(t)}_{t≥1}, {γ^(t)}_{t≥1} and {K^(t)}_{t≥1}.
2: Initialize arbitrary y_i^(0) ∈ X_i, ∀i, r^(0) = Σ_{i=1}^{N} A_i y_i^(0) − b, λ^(0) = 0, and t = 1.
3: while t ≤ T do
4:   Each node i defines
       ϕ_i^(t)(x_i) = f_i(x_i) + ⟨λ^(t−1) + ρ^(t) r^(t−1), A_i x_i − (1/N) b⟩ + (γ^(t)/2) ‖x_i − y_i^(t−1)‖²   (2)
     and in parallel updates x_i^(t), y_i^(t) using local sub-procedure Algorithm 2 via
       (x_i^(t), y_i^(t)) = STO-LOCAL(ϕ_i^(t)(·), X_i, y_i^(t−1), K^(t))   (3)
5:   Each node i passes x_i^(t) and y_i^(t) between nodes or to a parameter server. Update λ^(t) and r^(t) via
       λ^(t) = λ^(t−1) + ρ^(t) ( Σ_{i=1}^{N} A_i x_i^(t) − b )   (4)
       r^(t) = Σ_{i=1}^{N} A_i y_i^(t) − b   (5)
6:   Update t ← t + 1.
7: end while
8: Output: x̄^(T) = ( Σ_{t=1}^{T} γ^(t) x^(t) ) / ( Σ_{t=1}^{T} γ^(t) )

Algorithm 2 STO-LOCAL(ψ(z), Z, z_init, K)
1: Input: μ: strong convexity modulus of ψ(z); algorithm parameters: k₀ ≥ 0; η^(k) = 2/(μ(k + k₀)), ∀k ∈ {1, 2, …, K}.
2: Initialize z^(0) = z_init and k = 1.
3: while k ≤ K do
4:   Observe an unbiased (sub)gradient ζ^(k) such that E[ζ^(k)] ∈ ∂ψ(z^(k−1)) and update z^(k) via
       z^(k) = P_Z[ z^(k−1) − η^(k) ζ^(k) ]   (6)
     where P_Z is the projection onto Z.
5: end while
6: Output: (ẑ, z^(K)), where ẑ is the time average of z^(0), …, z^(K) defined in Lemma 1 or 2.

We now justify why Algorithm 1 is a two-layer ADMM method. (See Supplement 6.1 for a more detailed discussion.) The Lagrange multiplier update (4) is identical to that used in existing ADMM methods or other Lagrangian-based methods; it is helpful to enforce the linear constraint. At first sight, the primal update in Algorithm 1 is quite different from existing deterministic ADMMs in [10, 4, 7], which require solving an "argmin" problem, or stochastic ADMMs in [21, 25, 11], which perform a single gradient descent step. However, with a simple manipulation, it is not difficult to show that the function ϕ_i^(t)(x_i) in (2) is similar to the "argmin" target in the proximal Jacobi ADMM method [7], with the distinction that the proximal term ‖x_i − y_i^(t−1)‖² involves a newly introduced variable y_i^(t−1) rather than x_i^(t−1).

Recall that the fastest stochastic ADMMs in [21, 25, 11] can solve general convex problem (1) (with N = 2) with O(1/√T) convergence. That is, to obtain a solution with ε errors for both the objective value and the constraint violation, the ADMMs in [21, 25, 11] require O(1/ε²) computation steps, each of which uses a single gradient evaluation and variable update. The ADMMs in [21, 25, 11] have a single-layer structure and hence are communication inefficient in the sense that each computation step involves a communication step. Thus, the communication complexity of these stochastic ADMMs is also O(1/ε²). Compared with existing ADMMs in [21, 25, 11], Algorithm 1 has a two-layer structure where each outer-layer step involves a single inter-node communication step given by (4)-(5) and calls the sub-procedure, i.e., Algorithm 2, STO-LOCAL(ϕ_i^(t)(·), X_i, y_i^(t−1), K^(t)), which is run by each node locally and in parallel and hence does not incur any inter-node communication overhead. Since each call of Algorithm 2 incurs K^(t) SGD updates, T iterations of Algorithm 1 use Σ_{t=1}^{T} K^(t) computation steps. We shall show that to achieve an ε-solution for general convex problem (1), Algorithm 1 uses T = O(1/ε) communication rounds and Σ_{t=1}^{T} K^(t) = O(1/ε²) computation steps. That is, Algorithm 1 is as fast as the existing fastest stochastic ADMMs but uses only a square root of the number of communication rounds in [21, 25, 11].

Note that inter-node communication in Algorithm 1 can be either centralized or decentralized. To use centralized communication, we can let all nodes pass their x_i^(t) to a parameter server, where (4)-(5) are executed, and then pull the updated λ^(t) and r^(t) from the server. It is possible to implement (4)-(5) using decentralized communication by exploiting the structure of matrix A = [A_1, A_2, …, A_N]. For example, consider distributed machine learning in a line network where Ax = b is given by the N − 1 equality constraints x_i − x_{i+1} = 0, ∀i ∈ {1, 2, …, N − 1}. In this case, the components λ_i^(t) and r_i^(t), i ∈ {1, 2, …, N − 1}, only depend on x_i^(t) and x_{i+1}^(t) and are only used to update x_i^(t+1) and x_{i+1}^(t+1). Thus, to implement Algorithm 1, each node only needs to send its local x_i^(t) to, and pull λ_j^(t) and r_j^(t) from, its neighbors in the line network.

2.1 Basic Facts of Algorithm 2

Each iteration of Algorithm 1 calls Algorithm 2, which essentially applies SGD with carefully designed step size rules to the newly introduced objective functions ϕ_i^(t)(·). This subsection provides some useful insight into SGD for strongly convex stochastic minimization. It is known that SGD can have O(1/ε) convergence for strongly convex minimization. The next two lemmas summarize the convergence of the SGD in Algorithm 2. When characterizing the O(1/ε) rate, our lemmas also include a push-back term involving the last-iteration solution. This term ensures that when the SGD solution from Algorithm 2 is used in the outer-level ADMM dynamics, the accumulated error of our final solution does not explode. It also explains why we use y_i^(t−1), which is the last-iteration solution from the SGD sub-procedure, rather than the conventional x_i^(t−1), to define ϕ_i^(t)(x_i).

Lemma 1 ([16]). Assume ψ(z) is a μ-convex function (μ > 0) over set Z and there exists a constant B such that the unbiased subgradient ζ^(k) used in Algorithm 2 satisfies E[‖ζ^(k)‖²] ≤ B², ∀k ∈ {1, 2, …, K}. If we take k₀ = 1 in Algorithm 2, then for all z ∈ Z, we have

  E[ψ(ẑ)] − ψ(z) ≤ −(μ/2) E[‖z − z^(K)‖²] + 2B²/(μ(K + 1)),   (7)

where the first term on the right side is referred to as (7)-term (I), and ẑ = ( Σ_{k=0}^{K−1} (k + k₀) z^(k) ) / ( Σ_{k=0}^{K−1} (k + k₀) ).

Remark 1. It was first shown in [16] that Algorithm 2 with k₀ = 1 (vanilla SGD with a particular averaging scheme) has O(1/ε) convergence for non-smooth strongly convex problems. Note that (7) holds for all z ∈ Z (not necessarily the minimizer of ψ(·)). The push-back term (7)-term (I) is often ignored in convergence rate analyses of SGD but is important for our analysis of Algorithm 1.

Recall that a function h(x) is said to be L-smooth if its gradient ∇h(x) is Lipschitz continuous with modulus L. The next lemma is new and extends Lemma 1 to smooth minimization such that the error term depends only on the variance of the stochastic gradients (using a different averaging scheme).

Lemma 2. Assume ψ(z) is an L-smooth and μ-convex function (μ > 0) with condition number κ = L/μ, and there exists σ ≥ 0 such that the unbiased gradient ζ^(k) (at point z^(k−1)) in Algorithm 2 satisfies E[‖ζ^(k) − ∇ψ(z^(k−1))‖²] ≤ σ², ∀k ∈ {1, 2, …, K}. If we take an integer k₀ ≥ 2κ, then for any z ∈ Z, we have

  E[ψ(ẑ)] − ψ(z) ≤ 2σ²/(μK) + ( μ k₀(k₀ − 1) )/( 2K(K + 2k₀ − 1) ) E[‖z^(0) − z‖²] − ( μK )/( 2(K + 2k₀ − 1) ) E[‖z^(K) − z‖²],   (8)

where ẑ = ( Σ_{k=1}^{K} (k + k₀ − 1) z^(k) ) / ( Σ_{k=1}^{K} (k + k₀ − 1) ).

Proof. See Supplement 6.6.

3 Performance Analysis of Algorithm 1

This section shows that Algorithm 1 can achieve an ε-accuracy solution using O(1/ε²) computation rounds and O(1/ε) communication rounds for general convex stochastic programs; or using O(1/ε) computation rounds and O(1/√ε) communication rounds for strongly convex stochastic programs.

3.1 General objective functions (possibly non-smooth non-strongly convex)

Theorem 1. Consider convex program (1) under Assumption 1. Let (x*, λ*) be any saddle point defined in Assumption 1. Assume that the constraint set X is bounded, i.e., there exists a constant D > 0 such that ‖x‖ ≤ D for all x ∈ X.
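As an illustration of the inner layer, here is a minimal Python sketch of a projected-SGD sub-procedure in the spirit of Algorithm 2 under the setting of Lemma 1: k₀ = 1, step sizes η^(k) = 2/(μ(k + k₀)), and a weighted average of z^(0), …, z^(K−1) with weights (k + k₀). The toy quadratic objective, the noise model, and all function names are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def sto_local(grad_oracle, project, mu, z_init, K, k0=1, seed=0):
    """Sketch of a STO-LOCAL-style sub-procedure: projected SGD on a
    mu-convex psi with step sizes eta_k = 2 / (mu * (k + k0)).
    Returns (z_hat, z_K): the weighted average of z^(0..K-1) with
    weights (k + k0), as in Lemma 1, and the last iterate."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z_init, dtype=float)
    iterates = [z.copy()]                       # stores z^(0), ..., z^(K)
    for k in range(1, K + 1):
        zeta = grad_oracle(z, rng)              # unbiased (sub)gradient at z^(k-1)
        z = project(z - (2.0 / (mu * (k + k0))) * zeta)
        iterates.append(z.copy())
    weights = np.array([k + k0 for k in range(K)], dtype=float)
    z_hat = np.tensordot(weights, np.array(iterates[:K]), axes=1) / weights.sum()
    return z_hat, z

# Toy usage: psi(z) = (mu/2) * ||z - c||^2 over the box [-1, 1]^2,
# with additive Gaussian gradient noise (noise level is illustrative).
mu, c = 1.0, np.array([0.5, -0.25])
grad = lambda z, rng: mu * (z - c) + 0.1 * rng.standard_normal(2)
proj = lambda z: np.clip(z, -1.0, 1.0)
z_hat, z_K = sto_local(grad, proj, mu, z_init=np.zeros(2), K=5000)
print(np.linalg.norm(z_hat - c))   # distance to the minimizer; shrinks as K grows
```

The returned pair (z_hat, z_K) mirrors the pair (x_i^(t), y_i^(t)) handed back to the outer loop: the weighted average carries the O(1/(μK)) objective guarantee, while the last iterate seeds the next proximal term.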