Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation

Zengfeng Huang (School of Data Science, Fudan University, huangzf), Ziyue Huang (Department of CSE, HKUST, zhuangbq@cse.ust.hk), Yilei Wang (Department of CSE, HKUST, ywanggq@cse.ust.hk), Ke Yi (Department of CSE, HKUST, yike@cse.ust.hk)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We consider the problem of estimating the mean of a set of vectors that are stored in a distributed system. This is a fundamental task with applications in distributed SGD and many other distributed problems, where communication is a main bottleneck for scaling up computations. We propose a new sparsity-aware algorithm, which improves previous results both theoretically and empirically. The communication cost of our algorithm is characterized by Hoyer's measure of sparseness. Moreover, we prove that the communication cost of our algorithm is information-theoretically optimal up to a constant factor in all sparseness regimes. We have also conducted experimental studies, which demonstrate the advantages of our method and confirm our theoretical findings.

1 Introduction

Consider a distributed system with $n$ nodes, called clients, each of which holds a $d$-dimensional vector $X_i \in \mathbb{R}^d$. The goal of distributed mean estimation (DME) is to estimate the mean of these vectors, i.e., $\bar{X} := \frac{1}{n}\sum_{i=1}^n X_i$, subject to a constraint on the communication cost (i.e., the total number of bits transmitted by all clients).

DME is a fundamental task in distributed machine learning and optimization problems [3, 10, 18, 14, 12]. For example, gradient aggregation in distributed stochastic gradient descent (SGD) is a form of DME. In the standard synchronous implementation, in each round, clients evaluate their local gradients with respect to local mini-batches and communicate them to a central server; the server then computes the mean of all these gradients, which is used to update the model parameters. It is widely observed that the communication cost of gradient exchange has become a significant bottleneck for scaling up distributed training [5, 19, 24]. Therefore, communication-efficient gradient aggregation has received a lot of attention recently [1, 2, 15, 23, 26]. DME is also a critical subproblem in many other applications, such as the distributed implementation of Lloyd's algorithm for K-means clustering [16] and power iteration for computing eigenvectors [21].

However, the communication complexity of this fundamental problem has not been fully understood, especially when the input is sparse or skewed. In this paper, we provide a tight connection between communication complexity and input sparsity. Specifically, we propose a new sparsity-aware lossy compression scheme, which reduces the communication cost both theoretically and empirically. We also prove that the communication cost of our method is information-theoretically optimal up to a constant factor in all sparsity regimes, under Hoyer's measure of sparseness [9].

1.1 Problem definition and notation

The problem setting in this paper is the same as in [20]. Each client $i$ holds a private vector $X_i \in \mathbb{R}^d$ and transmits messages only to the central server according to some protocol; at the end, the server outputs an estimate of the mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ based on all the messages it has received. The communication cost of a protocol is measured by the total number of bits exchanged between the clients and the server.
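As a minimal illustration of this setup (illustrative sizes only; no communication protocol is modeled here), the following NumPy sketch generates the clients' vectors, computes the target mean $\bar{X}$, and evaluates the squared error of an arbitrary estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 1000                      # number of clients, dimension (assumed values)

# Each client i holds a private vector X_i in R^d.
X = rng.standard_normal((n, d))

# The quantity the server wants to estimate: the mean of the clients' vectors.
true_mean = X.mean(axis=0)

def squared_error(estimate, target):
    """||X_hat - X_bar||_2^2 for one realization of a (randomized) estimator."""
    return np.sum((estimate - target) ** 2)

# A toy "estimator": the mean of noisy copies of the clients' vectors.
noisy_estimate = (X + 0.01 * rng.standard_normal((n, d))).mean(axis=0)
print("squared error of the toy estimate:", squared_error(noisy_estimate, true_mean))
```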
Let $\hat{X}$ denote the estimated mean; we wish to minimize the mean squared error (MSE) of the estimate, i.e., $\mathcal{E} = \mathbb{E}\,\|\hat{X} - \bar{X}\|_2^2$, under a certain communication budget. Note that the problem considered here is non-stochastic, i.e., the input vectors are arbitrary or even adversarial. This is different from distributed statistical estimation [27, 8, 4], where the inputs are i.i.d. samples from some distribution and the goal is to estimate the parameters of the underlying distribution. In particular, the expectation in the above definition of the MSE is only over the randomness of the protocol.

We define $F_1 := \sum_{i=1}^n \|X_i\|_1$, i.e., the sum of the $\ell_1$-norms of the input vectors; let $F_2 := \sum_{i=1}^n \|X_i\|_2^2$ be the sum of the squared $\ell_2$-norms of the input vectors, and let $F_0$ be the total number of non-zero entries in all the input vectors. We always use $d$ to denote the dimensionality of the input vectors and $n$ the number of clients.

1.2 Previous results

Naively sending all vectors to the server needs $ndr$ bits of communication, where $r$ is the number of bits used to represent a floating-point number. In [20], several methods to save communication are proposed. The best of them uses $O(nd)$ bits of communication while achieving an MSE of $F_2/n^2$. Their algorithm first applies stochastic quantization and then encodes the quantized vectors by entropy encoding schemes such as arithmetic coding. Moreover, it is also proved that, in the worst case, this cost is optimal for one-round protocols. Similar bounds are also obtained in [2, 11].

One major limitation of the methods in [20] is that they cannot exploit sparseness in the inputs, due to the nature of their quantization and encoding methods. In many distributed learning scenarios, the input vectors can be very sparse or skewed, i.e., a large fraction of the entries can be zero or close to zero. The sparsity can be caused by either data unbalance (large entries occur in a few clients) or feature unbalance (large entries occur in a few dimensions). QSGD [2] works well in practice on sparse data, but does not have an upper bound on the cost that is parameterized by the input sparsity: to achieve an MSE of $F_2/n^2$, the cost is still $O(nd)$ bits (Theorem 3.2 and Corollary 3.3 in [2]). Intuitively, one could drop entries with small absolute values without affecting the MSE too much. Gradient sparsification utilizes this idea and has been successfully applied in distributed gradient compression [19, 1, 15, 22, 23]. However, such methods either do not have optimal sparsity-sensitive theoretical guarantees or work only under strong sparsity assumptions.

There are various sparsity notions, but it is currently not clear which notion best characterizes the inherent complexity of DME. To get meaningful sparsity-sensitive bounds, it is essential to identify an appropriate sparsity notion for DME. In this paper, we propose to use a modified version of Hoyer's measure of sparseness [9]. For a $d$-dimensional vector $X$, its sparseness is defined as $\frac{\|X\|_1^2}{d\|X\|_2^2}$.¹ Since our input can be viewed as an $nd$-dimensional vector, the global sparseness is defined as $s := F_1^2/(ndF_2)$. Note that $\frac{1}{nd} \le s \le 1$; $s = 1$ (densest) iff all entries are non-zero and have equal absolute values, and $s = \frac{1}{nd}$ (sparsest) iff the input contains a single non-zero entry. Wangni et al. [23] obtain a sparsity-aware bound based on a different sparsity notion, but our result implies theirs and can be much better for some inputs (see the supplementary material for details).

¹The original Hoyer measure is the ratio between the $\ell_1$- and $\ell_2$-norms, normalized to the range $[0, 1]$.
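The global sparseness $s = F_1^2/(ndF_2)$ is straightforward to compute. The following sketch (the helper name global_sparseness is ours, not the paper's) evaluates it and checks the two extreme cases mentioned above.

```python
import numpy as np

def global_sparseness(X):
    """Modified Hoyer sparseness s = F_1^2 / (n * d * F_2) of an n x d input matrix."""
    n, d = X.shape
    F1 = np.abs(X).sum()            # sum of l1-norms of the rows
    F2 = (X ** 2).sum()             # sum of squared l2-norms of the rows
    return F1 ** 2 / (n * d * F2)

n, d = 4, 6
dense = np.ones((n, d))             # all entries non-zero with equal magnitude
single = np.zeros((n, d))
single[0, 0] = 3.0                  # a single non-zero entry

print(global_sparseness(dense))     # 1.0        (densest case)
print(global_sparseness(single))    # 1/(n*d)    (sparsest case)
```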
1.3 Our contributions

First, we propose a sparsity-sensitive compression method that provably exploits the sparseness of the input. Specifically, to achieve an MSE of $\mathcal{E} \le \frac{F_2}{n^2}$, our protocol only needs to transmit roughly $nds\log(\frac{1}{s}+1)$ bits (ignoring some lower-order terms), where $s$ is the sparseness of the input defined earlier. Since $s\log(\frac{1}{s}+1) \le 1$ when $s \le 1$, this is always no worse than $nd$ (the cost of [20]) and can be much smaller on sparse inputs, i.e., when $s \ll 1$.

Secondly, we prove that, for any sparseness $s \le 1$, the communication cost of our protocol is optimal up to a constant factor. Specifically, for any $s \le 1$, we construct a family of inputs with sparseness equal to $s$, and prove that any protocol achieving an MSE of $\frac{F_2}{n^2}$ on this family must incur $\Omega(nds\log\frac{1}{s})$ bits of communication in expectation for some inputs in the family. This lower bound holds for multi-round protocols in the broadcasting model (where each message can be seen by all clients). As observed in [20], any lower bound for distributed statistical mean estimation can be translated to a lower bound for the DME problem. However, current lower bounds in this area do not suffice to obtain tight sparsity-sensitive bounds for DME.

Finally, we complement our theoretical findings with experimental studies. Empirical results show that, under the same communication budget, our proposed method has a much lower MSE, especially on sparse inputs, which verifies our theoretical analyses. Moreover, as a subroutine, our protocol consistently outperforms previous approaches in various distributed learning tasks, e.g., Lloyd's algorithm for K-means clustering and power iteration.

2 Sparsity-Sensitive DME Protocol

Overview of our techniques. The algorithms in [20] apply $k$-level stochastic quantization and then encode the quantized vectors using variable-length coding. Specifically, for each $X_i$, the client divides the range $[X_i^{\min}, X_i^{\max}]$ into $k-1$ intervals of equal length, then identifies the interval containing each $X_{ij}$ and rounds it either to the left endpoint or the right endpoint of that interval, with probability depending on its distance to the endpoints. After quantization, the vector can be viewed as a string of length $d$ over an alphabet of size $k$, which is then compressed using arithmetic coding. QSGD is similar, but encodes signs separately and uses Elias coding. Since the sparseness depends on the $\ell_1$-norm $F_1$, our quantization step size depends on $F_1$, as in Wang et al. [22]. In addition to $F_1$-based quantization, our protocol has the following algorithmic ingredients.

1) All clients in our protocol use the same interval size in stochastic quantization. This means that the number of levels may vary across clients, as opposed to all previous methods, where all clients use a fixed number of levels. This is another major difference in our quantization step, and it is necessary to obtain communication bounds in terms of the global sparseness.

2) As in QSGD, we encode the sign of each entry separately and only quantize the absolute values, which can be conveniently implemented by a scaling-and-rounding procedure.

3) Instead of encoding the quantized vectors directly with entropy coding methods, we first convert each integer vector into a bit string using a one-to-one map: for any integer vector $v = (v_1, v_2, \ldots, v_d)$, the length of the corresponding bit string is $d + \|v\|_1 - 1$, among which the number of 1s is exactly $d-1$.

4) We then apply efficient coding methods, e.g., arithmetic coding, to encode the entire bit string using roughly $\log\binom{d+\|v\|_1}{d} \approx \|v\|_1\log\frac{d+\|v\|_1}{\|v\|_1}$ bits. (A sketch of the map in steps 3) and 4) is given below.)
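To make the map in items 3) and 4) concrete, here is a minimal Python sketch (illustrative helper names, not the authors' code) of the one-to-one conversion between a non-negative integer vector and a bit string of length $d + \|v\|_1 - 1$ containing exactly $d-1$ ones, following the description above.

```python
from typing import List

def int_vector_to_bits(v: List[int]) -> List[int]:
    """Map a non-negative integer vector v = (v_1, ..., v_d) to a bit string of
    length d + sum(v) - 1 with exactly d - 1 ones: write v_i zeros followed by a
    single 1 for i = 1, ..., d-1, and finally write v_d zeros."""
    bits = []
    for vi in v[:-1]:
        bits.extend([0] * vi + [1])
    bits.extend([0] * v[-1])
    return bits

def bits_to_int_vector(bits: List[int]) -> List[int]:
    """Inverse map: the run lengths of zeros around the ones recover v."""
    v, run = [], 0
    for b in bits:
        if b == 1:
            v.append(run)
            run = 0
        else:
            run += 1
    v.append(run)
    return v

v = [2, 0, 3, 1]                      # d = 4, weight w = 6
bits = int_vector_to_bits(v)
assert len(bits) == len(v) + sum(v) - 1 and sum(bits) == len(v) - 1
assert bits_to_int_vector(bits) == v
print(bits)                           # [0, 0, 1, 1, 0, 0, 0, 1, 0]
```

Arithmetic coding (or any code for constant-weight binary strings) can then encode the resulting bit string in roughly $\log\binom{d+\|v\|_1-1}{d-1}$ bits.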
Scaling and rounding. We first introduce the scaling-and-rounding procedure (Algorithm 1), which is essentially equivalent to stochastic quantization (for the absolute values only). The next lemma summarizes the key properties of SaR; its proof is in the supplementary material.

Algorithm 1: Scaling and Rounding (SaR)
Input: $v \in \mathbb{R}^d$ and a scaling factor $F$
1: $u = \frac{1}{F}v$
2: Randomized rounding: for $j = 1, \ldots, d$,
$$\tilde{u}_j = \begin{cases} \lfloor u_j \rfloor + 1, & \text{with probability } u_j - \lfloor u_j \rfloor, \\ \lfloor u_j \rfloor, & \text{otherwise.} \end{cases}$$
3: return $\tilde{u}$

Lemma 2.1. Let $\tilde{v} = F\tilde{u}$. Then $\mathbb{E}[\tilde{v}] = v$ and $\mathbb{E}\,\|\tilde{v} - v\|_2^2 \le F\|v\|_1$. Moreover, $\mathbb{E}|\tilde{v}_i| = |v_i|$.

In our protocol, we apply Algorithm 1 to each $X_i$ with $F = F_1/C$, where $C$ is a tunable parameter. Let $\tilde{u}_i$ be the output for $X_i$ and $\tilde{X}_i = F\tilde{u}_i$. At the end, the server uses $\hat{X} = \frac{1}{n}\sum_{i=1}^n \tilde{X}_i$ as the estimate of the mean; then by Lemma 2.1, the MSE is
$$\mathcal{E} = \mathbb{E}\,\|\hat{X} - \bar{X}\|_2^2 = \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\,\|\tilde{X}_i - X_i\|_2^2 \le \frac{F}{n^2}\sum_{i=1}^n \|X_i\|_1 = \frac{F_1^2}{Cn^2}. \qquad (1)$$

Constant-weight binary sequences. The Hamming weight of a length-$d$ binary sequence $v$ is denoted by $w(v) = |\{i : v_i = 1\}|$. The constant-weight binary code $\mathcal{C}(d, w)$ is the set of all length-$d$ sequences with weight $w$. Since $|\mathcal{C}| = \binom{d}{w}$, the number of bits needed to represent a sequence in $\mathcal{C}$ is at least $\lceil\log\binom{d}{w}\rceil$. There exist efficient encoding methods, such as arithmetic coding or its variants [17], that encode sequences in $\mathcal{C}$ using binary strings of length very close to $\lceil\log\binom{d}{w}\rceil$, with encoding and decoding time $O(d)$.

Constant-weight non-negative integer vector coding. Denote the weight of a length-$d$ non-negative integer vector $v$ by $w(v) = \sum_{i=1}^d v_i$. The constant-weight integer code $\mathcal{I}(d, w)$ is the set of length-$d$ non-negative integer vectors with weight $w$. In our protocol, we map each $v \in \mathcal{I}(d, w)$ to a binary sequence $f(v) \in \mathcal{C}(d+w-1, d-1)$ as follows: for $i = 1, 2, \ldots, d-1$ we write $v_i$ 0s followed by one 1, and at the end we write $v_d$ 0s. The result is a codeword of the $(d+w-1, d-1)$ constant-weight binary code. One can verify that $f$ is a one-to-one and onto map from $\mathcal{I}(d, w)$ to $\mathcal{C}(d+w-1, d-1)$. By applying the encoding methods for $\mathcal{C}$ mentioned above, we obtain the following lemma.

Lemma 2.2. Sequences in $\mathcal{I}(d, w)$ can be encoded losslessly into $\lceil\log\binom{d+w-1}{d-1}\rceil$-bit binary strings, with encoding and decoding time $O(d+w)$.

2.1 The Protocol

We are now ready to describe our sparsity-sensitive DME protocol; an illustrative client-side sketch of steps 2-4 is given after the step list.

1. (Initialization) The clients and the server determine the scaling factor $F$ to be used in Algorithm 1; we use $F = F_1/C$ for some $C \le nd$. To compute $F_1$, each client $i$ sends $\|X_i\|_1$ to the server using $r$ bits, where $r$ is the number of bits used to represent a floating-point number. The server then computes $F_1 = \sum_i \|X_i\|_1$ and broadcasts it to all clients. This step uses $2r$ bits of communication per client.

2. (Quantization) Client $i$ runs SaR$(X_i, F_1/C)$ (Algorithm 1) and gets an integer vector $\tilde{u}_i$. The absolute value and the sign of each entry of $\tilde{u}_i$ are encoded separately. Let $v_i = (|\tilde{u}_{i1}|, \ldots, |\tilde{u}_{id}|)$.

3. (Encoding) Note that $v_i \in \mathcal{I}(d, w_i)$, where $w_i = w(v_i)$. Client $i$ encodes $v_i$ using a $\lceil\log\binom{d+w_i-1}{d-1}\rceil$-bit string (Lemma 2.2) and sends it to the server. The client also sends the value $\bar{w}_i = w_i - \lfloor C\|X_i\|_1/F_1\rfloor$ using $\lceil\log(2d+1)\rceil$ bits², since $\bar{w}_i$ is needed for decoding $v_i$.³

4. (Sending the signs) Let $s_i$ be a binary sequence indicating the signs of the non-zero entries of $\tilde{u}_i$. Client $i$ simply sends this sequence with $d_i$ bits of communication, where $d_i$ is the number of non-zero entries of $v_i$. Moreover, we can apply constant-weight coding to compress this sequence further.

5. (Decoding) The server decodes $v_i$, which contains the absolute values of $\tilde{u}_i$. Given the signs $s_i$ of its non-zero entries, the server can recover $\tilde{u}_i$ losslessly. It finally computes the estimated mean $\hat{X} = \frac{1}{n}\sum_i \tilde{X}_i = \frac{F_1}{Cn}\sum_i \tilde{u}_i$.

²Clearly, $\bar{w}_i$ is an integer and $|\bar{w}_i| \le d$. One can also use a universal code such as the Elias gamma code [7] to reduce the number of bits needed to transmit $\bar{w}_i$.
³One can also use entropy coding to encode $v_i$, but it is unclear whether such methods achieve the same theoretical guarantee as ours.
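The following NumPy sketch makes the client side of the protocol concrete. It is an illustrative reimplementation, not the authors' code: the system sizes, the 5% non-zero density, and the helper names sar and client_message_bits are assumptions for the example, and the server-side aggregation of step 5 is simulated directly on the integer vectors rather than on decoded bit strings.

```python
import numpy as np
from math import comb, ceil, log2

rng = np.random.default_rng(1)

def sar(v, F):
    """Algorithm 1 (Scaling and Rounding): scale by 1/F, then round each entry
    down or up at random so that the result is unbiased."""
    u = v / F
    low = np.floor(u)
    return (low + (rng.random(u.shape) < (u - low))).astype(int)

def client_message_bits(X_i, F1, C, d):
    """Approximate number of bits a single client pays in steps 2-4."""
    u_tilde = sar(X_i, F1 / C)                       # step 2: integer vector
    v_i = np.abs(u_tilde)                            # magnitudes, an element of I(d, w_i)
    w_i = int(v_i.sum())
    d_i = int((v_i != 0).sum())                      # sign bits, one per non-zero entry
    bits_v = ceil(log2(comb(d + w_i - 1, d - 1)))    # Lemma 2.2
    bits_wbar = ceil(log2(2 * d + 1))                # the offset w_i - floor(C*||X_i||_1/F1)
    return u_tilde, bits_v + bits_wbar + d_i

n, d, C = 8, 1000, 2000                              # assumed sizes for the example
X = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.05)   # sparse inputs
F1 = np.abs(X).sum()

total_bits, estimate = 0, np.zeros(d)
for i in range(n):
    u_tilde, bits = client_message_bits(X[i], F1, C, d)
    estimate += (F1 / (C * n)) * u_tilde             # server-side aggregation (step 5)
    total_bits += bits

print("total bits:", total_bits, " vs naive 32*n*d =", 32 * n * d)
# One realization of the error, compared against the expected-MSE bound F_1^2/(C n^2).
print("squared error:", np.sum((estimate - X.mean(axis=0)) ** 2),
      " bound:", F1 ** 2 / (C * n ** 2))
```

On sparse inputs the per-client message is dominated by the $\lceil\log\binom{d+w_i-1}{d-1}\rceil$ term, which is far below the naive $rd$ bits.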
The correctness of the protocol readily follows from Lemma 2.1 and (1): $\mathbb{E}[\hat{X}] = \bar{X}$ and $\mathbb{E}\,\|\hat{X} - \bar{X}\|_2^2 \le \frac{F_1^2}{Cn^2}$. Below we analyze its communication cost. By the last part of Lemma 2.1, we have $\mathbb{E}[w_i] = \sum_j \mathbb{E}|\tilde{u}_{ij}| = \sum_j \frac{C|X_{ij}|}{F_1} = \frac{C\|X_i\|_1}{F_1}$. Therefore, $\mathbb{E}\big[\sum_{i=1}^n w_i\big] = \sum_{i=1}^n \frac{C\|X_i\|_1}{F_1} = C$. Since $d_i \le w_i$, the expected total communication cost is at most
$$\mathbb{E}\left[\sum_{i=1}^n \left(2r + \log(2d+1) + \log\binom{d+w_i-1}{d-1} + d_i\right)\right] \le \mathbb{E}\left[\sum_{i=1}^n w_i\log\Big(\frac{d}{w_i}+1\Big)\right] + O(C + nr + n\log d).$$
From the concavity of the function $x\log(\frac{1}{x}+1)$ on $\mathbb{R}_{\ge 0}$ and Jensen's inequality, we have
$$\mathbb{E}\left[\sum_{i=1}^n w_i\log\Big(\frac{d}{w_i}+1\Big)\right] \le \mathbb{E}\left[\Big(\sum_{i=1}^n w_i\Big)\log\Big(\frac{nd}{\sum_{i=1}^n w_i}+1\Big)\right] \le C\log\Big(\frac{nd}{C}+1\Big).$$
Therefore, we obtain the following theorem; setting $C = F_1^2/F_2$ yields the corollary.

Theorem 2.3. For any $C \le nd$, there exists a DME protocol that achieves an MSE of $\frac{F_1^2}{Cn^2}$ with $C\log(\frac{nd}{C}+1) + O(C + nr + n\log d)$ bits of communication, where $r$ is the number of bits used to represent a floating-point number.

Corollary 2.4. There exists a DME protocol that achieves an MSE of $\frac{F_2}{n^2}$ using $nds\log(\frac{1}{s}+1) + O(nds + nr + n\log d)$ bits, where $s = F_1^2/(ndF_2)$ is Hoyer's measure of sparseness of the input.

Remark. The authors of [20] discuss how to use client or coordinate sampling to obtain a trade-off between MSE and communication. Their analysis shows that, to achieve an MSE of $F_2/(pn^2)$, the communication cost is $O(pnd)$ bits (ignoring lower-order terms), where $0 < p \le 1$ is the sampling probability. Applying our sparsity-sensitive protocol to the sampled clients or coordinates, we achieve the same MSE with a communication cost of $O(pnds\log(\frac{1}{s}+1))$ bits, which is never worse and can be much better on inputs with small sparseness $s$. We also point out that our algorithm can be run without the synchronization round. In this setting, we can derive a communication bound for each client by simply setting $n = 1$ in Corollary 2.4, although $s$ in the bound then becomes the local sparsity of that client. The local sparsity bound is worse than the global one when there is data unbalance, but it is still better than prior work as long as there is dimension unbalance across clients. This is also verified in our experimental results below.

Since the sparseness depends on the $\ell_1$-norm $F_1$, the key to getting a sparsity-sensitive bound is to understand the connection between $F_1$ and the MSE-communication trade-off. Hence our quantization step size depends on $F_1$, which is one of the main differences of our quantization step compared with [20, 2, 23]. Wang et al. [22] also use $F_1$-based quantization, but they only consider 1-level quantization and do not specify an encoding method that achieves an optimal sparsity-sensitive bound. Our protocol uses $C\log(\frac{nd}{C}+1)$ bits of communication to achieve an MSE of $F_1^2/(n^2C)$, where $C \le nd$ is a tunable parameter; if we set $C = F_1^2/F_2$, the MSE and the communication cost are $F_2/n^2$ and $nds\log(1/s+1)$, respectively, as claimed earlier.
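As a quick numerical illustration of this bound (the system sizes below are assumed, not from the paper), the snippet compares the sparsity-sensitive cost $nd\,s\log_2(\frac{1}{s}+1)$ of Corollary 2.4 with the roughly $nd$ bits required by a sparsity-oblivious scheme.

```python
from math import log2

n, d = 100, 10**6          # assumed system size
for s in [1.0, 0.1, 0.01, 0.001]:
    sparse_bound = n * d * s * log2(1 / s + 1)      # leading term of Corollary 2.4
    print(f"s = {s:>6}: ~{sparse_bound:,.0f} bits  vs  dense ~{n * d:,} bits")
```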
Having $C$ as a tunable parameter gives us better control over the cost of the protocol; our result in fact implies the MSE-communication trade-off of [22] (and can be much better), but not vice versa. Wang et al. [22] prove that their algorithm can compress a vector $X \in \mathbb{R}^d$ using $kr$ bits (where $r$ is the number of bits used to represent a floating-point number and $k$ is a tuning parameter) with MSE $F_1^2/k$; our algorithm (the special case where $n = 1$) can compress $X$ using $C\log(\frac{d}{C}+1)$ bits with MSE $F_1^2/C$.
