Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients

Jun Sun, Zhejiang University, Hangzhou, China 310027 (sunjun16sj)
Tianyi Chen, Rensselaer Polytechnic Institute, Troy, New York 12180
Georgios B. Giannakis, University of Minnesota, Twin Cities, Minneapolis, MN 55455
Zaiyue Yang, Southern U. of Science and Technology, Shenzhen, China 518055 (yangzy3)

Jun Sun and Tianyi Chen contributed equally to this work.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. The key idea is to first quantize the computed gradients, and then skip less informative quantized gradient communications by reusing outdated gradients. Quantizing and skipping result in lazy worker-server communications, which justifies the term Lazily Aggregated Quantized gradient, henceforth abbreviated as LAQ. Our LAQ can provably attain the same linear convergence rate as gradient descent in the strongly convex case, while effecting major savings in the communication overhead, both in transmitted bits and in communication rounds. Empirically, experiments with real data corroborate a significant communication reduction compared to existing gradient- and stochastic gradient-based algorithms.

1 Introduction

Considering the massive number of mobile devices, centralized machine learning via cloud computing incurs considerable communication overhead, and raises serious privacy concerns. Today, the widespread consensus is that besides the cloud centers, future machine learning tasks have to be performed starting from the network edge, namely the devices [16, 19]. Typically, distributed learning tasks can be formulated as an optimization problem of the form

$$\min_{\theta} \sum_{m \in \mathcal{M}} f_m(\theta) \quad \text{with} \quad f_m(\theta) := \sum_{n=1}^{N_m} \ell(x_{m,n}; \theta) \tag{1}$$

where $\theta \in \mathbb{R}^p$ denotes the parameter to be learned, $\mathcal{M}$ with $|\mathcal{M}| = M$ denotes the set of workers, $x_{m,n}$ represents the $n$-th data vector at worker $m$ (e.g., feature and label), and $N_m$ is the number of data samples at worker $m$. In (1), $\ell(x; \theta)$ denotes the loss associated with $\theta$ and $x$, and $f_m(\theta)$ denotes the aggregated loss corresponding to $\theta$ and all data at worker $m$. For ease of exposition, we also define $f(\theta) := \sum_{m \in \mathcal{M}} f_m(\theta)$ as the overall loss function. In the commonly employed worker-server setup, the server collects local gradients from the workers and updates the parameter using a gradient descent (GD) iteration given by

$$\text{GD iteration} \qquad \theta^{k+1} = \theta^k - \alpha \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k) \tag{2}$$

where $\theta^k$ denotes the parameter value at iteration $k$, $\alpha$ is the stepsize, and $\nabla f(\theta^k) = \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k)$ is the aggregated gradient. When the data
samples are distributed across workers, each worker computes the corresponding local gradient $\nabla f_m(\theta^k)$ and uploads it to the server. Only when all the local gradients are collected can the server obtain the full gradient and update the parameter. To implement (2), however, the server has to communicate with all workers to obtain the fresh gradients $\{\nabla f_m(\theta^k)\}_{m=1}^M$. In several settings though, communication is much slower than computation [15]. Thus, as the number of workers grows, worker-server communication becomes the bottleneck [9]. This becomes even more challenging when incorporating popular deep learning-based models with high-dimensional parameters, and correspondingly large-scale gradients.
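To make the worker-server update (2) concrete, the following minimal sketch (not the authors' code; the data, sizes, and stepsize are hypothetical) simulates $M$ workers computing local least-squares gradients that a server aggregates:

```python
import numpy as np

def local_grad(X, y, theta):
    # gradient of the local least-squares loss f_m(theta) = 0.5 * ||X @ theta - y||^2
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
M, p, Nm = 4, 5, 20                  # workers, parameter dimension, samples per worker
data = [(rng.normal(size=(Nm, p)), rng.normal(size=Nm)) for _ in range(M)]

theta = np.zeros(p)
alpha = 0.001                        # stepsize
for k in range(500):
    # every worker uploads its fresh local gradient; the server sums them (iteration (2))
    grad = sum(local_grad(X, y, theta) for X, y in data)
    theta = theta - alpha * grad
```

Note that each of the 500 iterations costs $M$ uploads of a full 32-bit gradient vector; this is precisely the overhead LAQ targets.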
1.1 Prior art

Communication-efficient distributed learning methods have gained popularity recently [9, 22]. Most popular methods build on simple gradient updates, and are centered around the key idea of gradient compression to save communication, including gradient quantization and sparsification.

Quantization. Quantization compresses gradients by limiting the number of bits that represent floating point numbers during communication, and has been successfully applied to several engineering tasks employing wireless sensor networks [21]. In the context of distributed machine learning, a 1-bit binary quantization method has been developed in [5, 24]. Multi-bit quantization schemes have been studied in [2, 18], where an adjustable quantization level endows additional flexibility to control the tradeoff between the per-iteration communication cost and the convergence rate. Other variants of quantized gradient schemes include error compensation [32], variance-reduced quantization [34], quantization to a ternary vector [31], and quantization of the gradient difference [20].

Sparsification. Sparsification amounts to transmitting only the gradient coordinates with large enough magnitudes exceeding a certain threshold [27]. Empirically, the desired accuracy can be attained even after dropping 99% of the gradients [1]. To avoid losing information, small gradient components are accumulated and then applied when they are large enough [17]. The accumulated gradient offers variance reduction of the sparsified stochastic (S)GD iterates [11, 26]. Despite its impressive empirical performance, deterministic sparsification lacks performance analysis guarantees, except for recent efforts [3]. Randomized counterparts that come with the so-termed unbiased sparsification, however, have been developed to offer convergence guarantees [28, 30].

Quantization and sparsification have also been employed simultaneously [8, 12, 13]. Nevertheless, they both introduce noise to (S)GD updates, and thus deteriorate convergence in general. For problems with strongly convex losses, gradient compression algorithms either converge only to a neighborhood of the optimal solution, or converge at a sublinear rate. The exception is [18], where the first linear convergence rate has been established for quantized gradient-based approaches. However, [18] focuses only on reducing the required bits per communication round, not the total number of rounds. Yet for exchanging messages, e.g., the $p$-dimensional $\theta$ or its gradient, other latencies (initiating communication links, queueing, and propagating the message) are at least comparable to the message-size-dependent transmission latency [23]. This motivates reducing the number of communication rounds, sometimes even more so than the bits per round.

Distinct from the aforementioned gradient compression schemes, communication-efficient schemes that aim to reduce the number of communication rounds have been developed by leveraging higher-order information [25, 36], periodic aggregation [19, 33, 35], and recently adaptive aggregation [6, 10, 29]; see also [4] for a lower bound on communication rounds. However, whether one can save communication bits and rounds simultaneously
 without sacrificing the desired convergence properties remained unresolved. This paper aims to address this issue.

1.2 Our contributions

Before introducing our approach, we revisit the canonical form of popular quantized (Q)GD methods [20, 24] in the simple setup of (1) with one server and $M$ workers:
$$\text{QGD iteration} \qquad \theta^{k+1} = \theta^k - \alpha \sum_{m \in \mathcal{M}} Q_m(\theta^k) \tag{3}$$

where $Q_m(\theta^k)$ is the quantized gradient that coarsely approximates the local gradient $\nabla f_m(\theta^k)$. While the exact quantization scheme differs across algorithms, transmitting $Q_m(\theta^k)$ generally requires far fewer bits than transmitting $\nabla f_m(\theta^k)$. Similar to GD, however, only when all the local quantized gradients $Q_m(\theta^k)$ are collected can the server update the parameter.

In this context, the present paper puts forth a quantized gradient innovation method (as simple as QGD) that can skip communication in certain rounds. Specifically, in contrast to the server-to-worker downlink communication that can be performed simultaneously (e.g., by broadcasting $\theta^k$), the server has to receive the workers' gradients sequentially to avoid interference among workers, which incurs extra latency. For this reason, our focus here is on reducing the number of worker-to-server uplink communications, which we also refer to as uploads. Our algorithm, Lazily Aggregated Quantized gradient descent (LAQ), resembles (3), and is given by

$$\text{LAQ iteration} \qquad \theta^{k+1} = \theta^k - \alpha \nabla^k \quad \text{with} \quad \nabla^k = \nabla^{k-1} + \sum_{m \in \mathcal{M}^k} \delta Q_m^k \tag{4}$$

where $\nabla^k$ is an approximate aggregated gradient that summarizes the parameter change at iteration $k$, and $\delta Q_m^k := Q_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})$ is the difference between two quantized gradients of $f_m$, at the current iterate $\theta^k$ and at the old copy $\hat{\theta}_m^{k-1}$. With a judicious selection criterion that will be introduced later, $\mathcal{M}^k$ denotes the subset of workers whose local $\delta Q_m^k$ is uploaded in iteration $k$, while the local copies are given by $\hat{\theta}_m^k := \theta^k$ for $m \in \mathcal{M}^k$, and $\hat{\theta}_m^k := \hat{\theta}_m^{k-1}$ for $m \notin \mathcal{M}^k$. Instead of requesting a fresh quantized gradient from every worker as in (3), the trick is to obtain $\nabla^k$ by refining the previous aggregated gradient $\nabla^{k-1}$; that is, using only the new gradients from the selected workers in $\mathcal{M}^k$, while reusing the outdated gradients from the rest of the workers.
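The server-side bookkeeping in (4) can be sketched as follows (an illustrative, hypothetical implementation, not the authors' code): the server keeps the running aggregate $\nabla^{k-1}$ and adds only the innovations it actually received.

```python
import numpy as np

def laq_server_update(theta, nabla_prev, innovations, alpha):
    # One server-side LAQ step per (4): refine the previous aggregate with the
    # received innovations (workers that skipped contribute nothing), then descend.
    nabla = nabla_prev + sum(innovations, np.zeros_like(nabla_prev))
    return theta - alpha * nabla, nabla

# toy round: two workers upload innovations, the remaining workers skip
theta = np.zeros(3)
nabla_prev = np.array([1.0, -2.0, 0.5])          # stored aggregate from iteration k-1
received = [np.array([0.1, 0.0, -0.1]), np.array([-0.2, 0.3, 0.0])]
theta, nabla = laq_server_update(theta, nabla_prev, received, alpha=0.1)
```

The cost of one iteration is thus $|\mathcal{M}^k|$ uploads rather than $M$, at the price of storing one $p$-dimensional vector at the server.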
If $\nabla^{k-1}$ is stored at the server, this simple modification scales down the per-iteration communication rounds from QGD's $M$ to LAQ's $|\mathcal{M}^k|$. Throughout the paper, one round of communication means one worker's upload. Compared to the existing quantization schemes, LAQ first quantizes the gradient innovation, the difference between the current gradient and the previous quantized gradient, and then skips the communication of a worker whenever its gradient innovation is not large enough. We will rigorously establish that LAQ achieves the same linear convergence as GD under the strong convexity assumption on the loss function. Numerical tests will demonstrate that our approach outperforms existing methods in terms of both communication bits and rounds.

Notation. Bold lowercase letters denote column vectors; $\|x\|_2$ and $\|x\|_\infty$ denote the $\ell_2$-norm and $\ell_\infty$-norm of $x$, respectively; $x_i$ represents the $i$-th entry of $x$; $\lfloor a \rfloor$ denotes downward rounding of $a$; and $|\cdot|$ denotes the cardinality of a set or vector.

2 LAQ: Lazily aggregated quantized gradient

To reduce the communication overhead, two complementary stages are integrated in our algorithm design: 1) gradient innovation-based quantization; and 2) gradient innovation-based uploading, or aggregation, together giving the name Lazily Aggregated Quantized gradient (LAQ). The former reduces the number of bits per upload, while the latter cuts down the number of uploads, which together guarantee parsimonious communication. This section
explains the principles of our two-stage design.

2.1 Gradient innovation-based quantization

[Figure 1: Quantization example (b = 3).]

Quantization limits the number of bits used to represent a gradient vector during communication. Suppose we use $b$ bits to quantize each coordinate of the gradient vector, in contrast to the 32 bits used in most computers. With $Q$ denoting the quantization operator, the quantized gradient of worker $m$ at iteration $k$ is $Q_m(\theta^k) = Q(\nabla f_m(\theta^k), Q_m(\hat{\theta}_m^{k-1}))$, which depends on the gradient $\nabla f_m(\theta^k)$ and the previous quantization $Q_m(\hat{\theta}_m^{k-1})$. The gradient is element-wise quantized by projecting onto the closest point of a uniformly discretized grid. The grid is a $p$-dimensional hypercube centered at $Q_m(\hat{\theta}_m^{k-1})$ with radius $R_m^k = \|\nabla f_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})\|_\infty$. With $\tau := 1/(2^b - 1)$ defining the quantization granularity, the gradient innovation $\nabla f_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1})$ can be quantized with $b$ bits per coordinate at worker $m$ as

$$q_m(\theta^k)_i = \left\lfloor \frac{\nabla f_m(\theta^k)_i - Q_m(\hat{\theta}_m^{k-1})_i + R_m^k}{2\tau R_m^k} + \frac{1}{2} \right\rfloor, \quad i = 1, \ldots, p \tag{5}$$

which is an integer within $[0, 2^b - 1]$, and thus can be encoded by $b$ bits. Note that adding $R_m^k$ in the numerator ensures the non-negativity of $q_m(\theta^k)_i$, and adding $1/2$ in (5) guarantees rounding to the closest point. Hence, the quantized gradient innovation at worker $m$ is (with $\mathbf{1} := [1, \ldots, 1]^\top$)

$$\delta Q_m^k = Q_m(\theta^k) - Q_m(\hat{\theta}_m^{k-1}) = 2\tau R_m^k\, q_m(\theta^k) - R_m^k \mathbf{1} \qquad \text{(transmit } R_m^k \text{ and } q_m(\theta^k)\text{)} \tag{6}$$

which can be transmitted with $32 + bp$ bits ($32$ bits for $R_m^k$ and $bp$ bits for $q_m(\theta^k)$) instead of the original $32p$ bits.
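Under the definitions above, the encode/decode pair (5)-(6) can be sketched as below (an illustrative reimplementation with hypothetical function names, not the authors' code): the receiver reconstructs $Q_m(\theta^k)$ from the $b$-bit integers plus one scalar, and the per-coordinate error is at most $\tau R_m^k$.

```python
import numpy as np

def encode(grad, prev_q, b):
    # quantize the innovation per (5); assumes grad != prev_q so that R > 0
    tau = 1.0 / (2 ** b - 1)
    R = np.max(np.abs(grad - prev_q))                   # infinity-norm radius
    q = np.floor((grad - prev_q + R) / (2 * tau * R) + 0.5).astype(int)
    return q, R                                         # transmit: b*p bits + one float

def decode(prev_q, q, R, b):
    # recover the quantized gradient per (6): Q = prev_q + 2*tau*R*q - R*1
    tau = 1.0 / (2 ** b - 1)
    return prev_q + 2 * tau * R * q - R

rng = np.random.default_rng(1)
b = 3
grad = rng.normal(size=8)            # current local gradient
prev_q = np.zeros(8)                 # previously uploaded quantized gradient
q, R = encode(grad, prev_q, b)
Q = decode(prev_q, q, R, b)          # receiver-side reconstruction of Q_m(theta^k)
err = np.max(np.abs(grad - Q))       # coordinate-wise error, at most tau * R
```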
With the outdated quantized gradients $Q_m(\hat{\theta}_m^{k-1})$ stored in memory and $\tau$ known a priori, after receiving $\delta Q_m^k$ the server can recover the quantized gradient as $Q_m(\theta^k) = Q_m(\hat{\theta}_m^{k-1}) + \delta Q_m^k$. Figure 1 gives an example of quantizing one coordinate of the gradient with $b = 3$ bits: the original value is quantized to one of $2^3 = 8$ values, each of which covers a range of length $2\tau R_m^k$ centered at itself. With $\varepsilon_m^k := \nabla f_m(\theta^k) - Q_m(\theta^k)$ denoting the local quantization error, it is clear that the quantization error is at most half of the length of the range that each value covers, namely $\|\varepsilon_m^k\|_\infty \leq \tau R_m^k$. The aggregated quantized gradient is $Q(\theta^k) = \sum_{m \in \mathcal{M}} Q_m(\theta^k)$, and the aggregated quantization error is $\varepsilon^k := \nabla f(\theta^k) - Q(\theta^k) = \sum_{m=1}^M \varepsilon_m^k$; that is, $Q(\theta^k) = \nabla f(\theta^k) - \varepsilon^k$.

2.2 Gradient innovation-based aggregation

The idea of lazy gradient aggregation is that if the difference between two consecutive locally quantized gradients is small, it is safe to skip the redundant gradient upload and reuse the previous one at the server. In addition, we ensure that the server has a relatively "fresh" gradient for each worker by enforcing communication if a worker has not uploaded during the last $\bar{t}$ rounds. To this end, we keep a clock $t_m$ for each worker $m \in \mathcal{M}$, counting the number of iterations since it last uploaded information.

[Figure 2: Distributed learning via LAQ.]

Equipped with the quantization and selection stages, our LAQ update takes the form (4). It only remains to design the selection criterion deciding which workers upload their quantized gradient innovation. We propose the following communication criterion: worker $m \in \mathcal{M}$ skips the upload at iteration $k$ if it satisfies

$$\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\|_2^2 \leq \frac{1}{\alpha^2 M^2} \sum_{d=1}^{D} \xi_d \|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\left(\|\varepsilon_m^k\|_2^2 + \|\hat{\varepsilon}_m^{k-1}\|_2^2\right) \tag{7a}$$

$$t_m \leq \bar{t} \tag{7b}$$

where $D$, $\bar{t}$ and $\{\xi_d\}_{d=1}^D$ are predetermined constants, $\varepsilon_m^k$ is the current quantization error, and $\hat{\varepsilon}_m^{k-1} := \nabla f_m(\hat{\theta}_m^{k-1}) - Q_m(\hat{\theta}_m^{k-1})$ is the error of the last uploaded quantized gradient. In the next section we will prove the convergence and communication properties of LAQ under criterion (7).
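Criterion (7) is a purely local test at each worker. A minimal sketch of the check (hypothetical function name and arguments; `param_diffs` holds the recent differences $\theta^{k+1-d} - \theta^{k-d}$) could look as follows:

```python
import numpy as np

def skip_upload(delta_q, param_diffs, eps_now, eps_last, xi, alpha, M, t_m, t_bar):
    # Criterion (7): worker m may skip the upload when its would-be innovation
    # delta_q = Q_m(theta_hat^{k-1}) - Q_m(theta^k) is small relative to recent
    # progress (7a) and its stored gradient is not too stale (7b).
    rhs = sum(x * np.linalg.norm(d) ** 2 for x, d in zip(xi, param_diffs))
    rhs = rhs / (alpha ** 2 * M ** 2)
    rhs += 3 * (np.linalg.norm(eps_now) ** 2 + np.linalg.norm(eps_last) ** 2)
    return bool(np.linalg.norm(delta_q) ** 2 <= rhs) and t_m <= t_bar

# a tiny innovation relative to the last parameter move is skipped ...
small = skip_upload(np.array([0.01, 0.0]), [np.array([1.0, 0.0])],
                    np.zeros(2), np.zeros(2), xi=[0.5], alpha=0.1, M=2, t_m=0, t_bar=10)
# ... while a large one triggers an upload
large = skip_upload(np.array([5.0, 0.0]), [np.array([1.0, 0.0])],
                    np.zeros(2), np.zeros(2), xi=[0.5], alpha=0.1, M=2, t_m=0, t_bar=10)
```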
2.3 LAQ algorithm development

In summary, as illustrated in Figure 2, LAQ can be implemented as follows. At iteration $k$, the server broadcasts the learning parameter to all workers. Each worker calculates its gradient, and then quantizes it to judge whether it needs to upload the quantized gradient innovation $\delta Q_m^k$. The server then updates the learning parameter after it receives the gradient innovations from the selected workers. The scheme is summarized in Algorithm 2. To make the difference between LAQ and GD clear, we rewrite (4) as

$$\theta^{k+1} = \theta^k - \alpha \Big( Q(\theta^k) + \sum_{m \in \mathcal{M}_c^k} \big( Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k) \big) \Big) \tag{8a}$$

$$\phantom{\theta^{k+1}} = \theta^k - \alpha \Big( \nabla f(\theta^k) - \varepsilon^k + \sum_{m \in \mathcal{M}_c^k} \big( Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k) \big) \Big) \tag{8b}$$

where $\mathcal{M}_c^k := \mathcal{M} \setminus \mathcal{M}^k$ is the subset of
 workers that skip communication with the server at iteration $k$. Compared with the GD iteration in (2), the gradient employed here degrades due to the quantization error $\varepsilon^k$ and the missed gradient innovations $\sum_{m \in \mathcal{M}_c^k} (Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k))$.

Algorithm 1 QGD
1: Input: stepsize $\alpha > 0$, quantization bit $b$.
2: Initialize: $\theta^1$.
3: for $k = 1, 2, \ldots, K$ do
4:   Server broadcasts $\theta^k$ to all workers.
5:   for $m = 1, 2, \ldots, M$ do
6:     Worker $m$ computes $\nabla f_m(\theta^k)$ and $Q_m(\theta^k)$.
7:     Worker $m$ uploads $\delta Q_m^k$ via (6).
8:   end for
9:   Server updates following (4) with $\mathcal{M}^k = \mathcal{M}$.
10: end for

Algorithm 2 LAQ
1: Input: stepsize $\alpha > 0$, $b$, $D$, $\{\xi_d\}_{d=1}^D$ and $\bar{t}$.
2: Initialize: $\theta^1$, $Q_m(\hat{\theta}_m^0)$ and $t_m$ for all $m \in \mathcal{M}$.
3: for $k = 1, 2, \ldots, K$ do
4:   Server broadcasts $\theta^k$ to all workers.
5:   for $m = 1, 2, \ldots, M$ do
6:     Worker $m$ computes $\nabla f_m(\theta^k)$ and $Q_m(\theta^k)$.
7:     if (7) holds for worker $m$ then
8:       Worker $m$ uploads nothing.
9:       Set $\hat{\theta}_m^k = \hat{\theta}_m^{k-1}$ and $t_m \leftarrow t_m + 1$.
10:    else
11:      Worker $m$ uploads $\delta Q_m^k$ via (6).
12:      Set $\hat{\theta}_m^k = \theta^k$ and $t_m = 0$.
13:    end if
14:  end for
15:  Server updates according to (4).
16: end for

Table 1: A comparison of QGD and LAQ.

It is clear that if a large enough number of bits is used to quantize the gradient, and all $\{\xi_d\}_{d=1}^D$ are set to $0$ (and thus $\mathcal{M}^k := \mathcal{M}$), then LAQ reduces to GD. Thus, adjusting $b$ and $\{\xi_d\}_{d=1}^D$ directly influences the performance of LAQ.
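Putting the two stages together, Algorithm 2 can be exercised end-to-end on a toy least-squares problem. The simulation below is an illustrative sketch under simplifying assumptions (the data, $D = 1$, and the criterion constants `alpha`, `t_bar`, `xi1` are hypothetical choices, not the paper's tuned values), yet it exhibits the expected behavior: the iterates make progress while many uploads are skipped.

```python
import numpy as np

def quantize(grad, prev_q, b):
    # b-bit quantization of the gradient innovation per (5)-(6);
    # returns the reconstructed quantized gradient Q_m(theta^k)
    tau = 1.0 / (2 ** b - 1)
    R = np.max(np.abs(grad - prev_q))
    if R == 0.0:                       # nothing new to encode
        return prev_q.copy()
    q = np.floor((grad - prev_q + R) / (2 * tau * R) + 0.5)
    return prev_q + 2 * tau * R * q - R

rng = np.random.default_rng(0)
M, p, Nm, b = 4, 5, 20, 4
data = [(rng.normal(size=(Nm, p)), rng.normal(size=Nm)) for _ in range(M)]

def grad_fn(m, th):
    X, y = data[m]
    return X.T @ (X @ th - y)

alpha, t_bar, xi1 = 0.0005, 5, 0.1         # hypothetical constants (D = 1)
theta = np.zeros(p)
theta_prev = theta.copy()
Q_hat = [np.zeros(p) for _ in range(M)]    # last uploaded quantized gradients
eps_hat = [np.zeros(p) for _ in range(M)]  # quantization errors at last upload
clock = [0] * M
nabla = np.zeros(p)                        # server's aggregated gradient
uploads = 0

for k in range(300):
    rhs_common = xi1 * np.linalg.norm(theta - theta_prev) ** 2 / (alpha ** 2 * M ** 2)
    for m in range(M):
        g = grad_fn(m, theta)
        Qk = quantize(g, Q_hat[m], b)
        eps = g - Qk
        lhs = np.linalg.norm(Q_hat[m] - Qk) ** 2
        rhs = rhs_common + 3 * (np.linalg.norm(eps) ** 2
                                + np.linalg.norm(eps_hat[m]) ** 2)
        if lhs <= rhs and clock[m] <= t_bar:       # criterion (7): skip upload
            clock[m] += 1
        else:                                      # upload innovation (6)
            nabla += Qk - Q_hat[m]
            Q_hat[m], eps_hat[m], clock[m] = Qk, eps, 0
            uploads += 1
    theta_prev = theta.copy()
    theta = theta - alpha * nabla              # server update (4)
```

Counting `uploads` against the $300M$ uploads that QGD would use gives a rough feel for the round savings on this toy problem.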
The rationale behind selection criterion (7) lies in a judicious comparison between the descent amount of GD and that of LAQ. To compare the descent amounts, we first establish the one-step descent of both algorithms. For all the results in this paper, the following assumption holds.

Assumption 1. The local gradient $\nabla f_m(\cdot)$ is $L_m$-Lipschitz continuous and the global gradient $\nabla f(\cdot)$ is $L$-Lipschitz continuous; i.e., there exist constants $L_m$ and $L$ such that

$$\|\nabla f_m(\theta_1) - \nabla f_m(\theta_2)\|_2 \leq L_m \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2; \tag{9a}$$

$$\|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 \leq L \|\theta_1 - \theta_2\|_2, \quad \forall \theta_1, \theta_2. \tag{9b}$$

Building upon Assumption 1, the next lemma describes the descent in objective by GD.
Lemma 1. The gradient descent update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \leq -\Delta_{\mathrm{GD}}^k \tag{10}$$

where $\Delta_{\mathrm{GD}}^k := \alpha (1 - \alpha L / 2) \|\nabla f(\theta^k)\|_2^2$.

The descent of LAQ differs from that of GD due to the quantization and selection, as specified in the following lemma.

Lemma 2. The LAQ update yields the following descent:

$$f(\theta^{k+1}) - f(\theta^k) \leq -\Delta_{\mathrm{LAQ}}^k + \alpha \|\varepsilon^k\|_2^2 \tag{11}$$

where $\Delta_{\mathrm{LAQ}}^k := \frac{\alpha}{2} \|\nabla f(\theta^k)\|_2^2 - \frac{\alpha}{2} \big\| \sum_{m \in \mathcal{M}_c^k} \big( Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k) \big) \big\|_2^2 - \big( \frac{L}{2} - \frac{1}{2\alpha} \big) \|\theta^{k+1} - \theta^k\|_2^2$.

In the lazy aggregation stage we consider only $\Delta_{\mathrm{LAQ}}^k$, with the quantization error in (11) ignored; a rigorous theorem characterizing LAQ with the quantization error taken into account will be established in the next section. The following develops the intuition behind criterion (7a); it is not mathematically strict, but it conveys the main idea. The lazy aggregation mechanism selects the quantized gradient innovations by judging their contribution to decreasing the loss function. LAQ is expected to be more communication-efficient than GD; that is, each upload should result in more descent, which translates to

$$\frac{\Delta_{\mathrm{LAQ}}^k}{|\mathcal{M}^k|} \geq \frac{\Delta_{\mathrm{GD}}^k}{M} \tag{12}$$

which is tantamount to (see the derivations in the supplementary material)

$$\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^k)\|_2^2 \leq \frac{\|\nabla f(\theta^k)\|_2^2}{2M^2}, \quad \forall m \in \mathcal{M}_c^k. \tag{13}$$

However, for each worker to check (13) locally