Generalization Error Analysis of Quantized Compressive Learning

Xiaoyun Li*
Department of Statistics
Rutgers University
Piscataway, NJ 08854

Ping Li
Cognitive Computing Lab
Baidu Research USA
Bellevue, WA 98004
liping11

Abstract

Compressive learning is an effective method for dealing with very high dimensional datasets by applying learning algorithms in a randomly projected lower dimensional space. In this paper, we consider the learning problem where the projected data is further compressed by scalar quantization, which is called quantized compressive learning. Generalization error bounds are derived for three models: the nearest neighbor (NN) classifier, the linear classifier, and least squares regression. Besides studying the finite sample setting, our asymptotic analysis shows that the inner product estimators have a deep connection with the NN and linear classification problems through the variance of their debiased counterparts. By analyzing the extra error term brought by quantization, our results provide useful implications for the choice of quantizers in applications involving different learning tasks. An empirical study is also conducted to validate our theoretical findings.

1 Introduction

The random projection (RP) method [32] has become a very popular tool for dimensionality reduction in numerous machine learning and database applications, e.g. [12, 8, 3], including classification, matrix sketching, compressive sensing, regression, etc. The great success of random projection lies in its favorable distance preserving property, with a fairly elegant statement given by the famous Johnson-Lindenstrauss Lemma [17, 9]. In short, under some conditions we can always project a set of $n$ points $X \in \mathbb{R}^{n \times d}$ in a high-dimensional space onto a lower $k$-dimensional space such that pairwise distances are approximately preserved, with high probability. Here $k \ll d$ is the number of random projections. This theoretical guarantee originated the study of the generalization performance of learning in the reduced dimensional space instead of the original space. This line of work is called compressive learning [14, 2, 26, 11, 18, 30, 31, 19].

In many cases, it is useful to further condense the projected data, possibly due to storage savings, privacy considerations, etc. Consequently, research on quantized random projections (QRP) has been conducted for a while. QRP itself has developed into many promising fields in computer science, such as 1-bit compressive sensing, SimHash, and so on [28, 1, 5]. More recently, it has been shown that quantized random projection is also very convenient for cosine estimation and similarity search [23, 22, 24]. However, to the best of our knowledge, a theoretical analysis of QRP in learning mechanisms is still missing in the literature. In this paper, we investigate the generalization error bounds of applying QRP in three models: the nearest neighbor classifier, the linear classifier, and least squares regression. Apart from the finite-$k$ analysis, we also consider the case where $k$ is asymptotically large.

* The work of Xiaoyun Li was conducted during the internship at Baidu Research USA.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
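As a concrete illustration of the projection step discussed above, the following sketch (a minimal example assuming numpy; the dimensions, sample size, and synthetic data are illustrative choices, not taken from the paper) projects unit-norm points with a Gaussian matrix, using the $1/\sqrt{k}$ scaling adopted later in the paper, and checks how well pairwise distances are preserved, in the spirit of the Johnson-Lindenstrauss Lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n points in d dimensions, projected down to k dimensions.
n, d, k = 100, 10000, 256

# Synthetic data with each row normalized to unit Euclidean norm.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Gaussian random projection with i.i.d. N(0, 1) entries, scaled by 1/sqrt(k).
R = rng.standard_normal((d, k))
X_R = X @ R / np.sqrt(k)

def pairwise_sq_dists(A):
    """All pairwise squared Euclidean distances between rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

orig = pairwise_sq_dists(X)
proj = pairwise_sq_dists(X_R)
off_diag = ~np.eye(n, dtype=bool)
ratios = proj[off_diag] / orig[off_diag]
# With k = 256 the ratios concentrate around 1, i.e. distances are roughly preserved.
print("squared-distance ratio range: [%.3f, %.3f]" % (ratios.min(), ratios.max()))
```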
Contributions. An important implication of our analysis is to answer the following question: the generalization performance using quantized random projections is determined by what factors of a quantizer? Our theoretical analysis illustrates that for nearest neighbor and linear classification, the extra loss due to quantization decreases as $k$ gets large, and the learning performance is determined by the variance of the debiased inner product estimator when the data samples lie on the unit sphere. For regression problems, the distortion of the quantizer becomes crucial. Our theoretical findings are validated by an empirical study. Practically, our results also suggest appropriate quantizing strategies for different learning models, which would be helpful in various applications.

2 Preliminaries

Problem setting. Assume a dataset $(X, Y) \sim \mathcal{D}^n$ with $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$, where $x_i$, $i = 1, \dots, n$, are i.i.d. draws from some marginal distribution $\mathcal{X}$. Throughout this paper, we assume that every sample in $X$ is standardized to have unit Euclidean norm. Therefore, the domain of $X$ is the unit Euclidean sphere $S^d$, which allows us to use "inner product" and "cosine" interchangeably. For classification problems, $Y \in \{0, 1\}^n$, while in the regression model $Y \in \mathbb{R}^n$. We focus on the Gaussian random projection matrix $R = [r_1, \dots, r_k] \in \mathbb{R}^{d \times k}$ with i.i.d. standard normal entries. Random projection is realized by $X_R = \frac{1}{\sqrt{k}} XR$, where the factor $\frac{1}{\sqrt{k}}$ is for ease of presentation.

Quantized RPs. An $M$-level scalar quantizer $Q(\cdot): A \to C$ is specified by $M + 1$ decision borders $t_0 \le t_1 \le \dots \le t_M$ and $M$ reconstruction levels (or codes) $\mu_i$, $i = 1, \dots, M$. Given a signal $v \in A$, the quantizing operator is defined as $Q_b(v) = \mu_i \in C$ such that $t_{i-1} \le v < t_i$. Here, $A$ is the domain of the original signal and $C$ is the set of codes. The number of bits is defined as $b = \log_2 M \ge 1$. We note that $t_0$ and $t_M$ can be either finite or infinite depending on the support of the signal. For generality, in this paper we do not restrict our analysis to any specific quantizer, but only impose the basic assumption of increasing and bounded reconstruction levels, i.e. $\mu_1 \le \dots \le \mu_M$ and $t_{i-1} \le \mu_i \le t_i$ for all $i = 1, \dots, M$.

Definition 1 (Maximal gap). For an $M$-level quantizer $Q$ defined above and an interval $[a, b]$, denote $\Omega = \{i : t_{i-1} \dots\}$ [...]

[...] $\gamma > 0$. $T$ is $\gamma$-separated if $\forall a, b \in T$ with $a \neq b$, $\|a - b\| \ge \gamma$ holds. The $\gamma$-packing number of $T$ is $N_{\|\cdot\|}(\gamma, T) = \max\{|T'| : T' \text{ is } \gamma\text{-separated},\ T' \subseteq T\}$.

Definition 5. The $\gamma$-entropy of $T$ is defined as $Z(\gamma, T) = \log N(\gamma, T)$, and the function $Z(\cdot, T)$ is called the metric entropy of $T$.

Theorem 1 [20]. Let $X \subseteq \mathbb{R}^d$, and let $R \in \mathbb{R}^{d \times k}$ be a random matrix with i.i.d. Gaussian or Rademacher entries with mean 0 and variance $\sigma^2$. Let $T = \{\frac{a - b}{\|a - b\|} : a, b \in X\}$ be the set of all pairwise normalized chords. Define the metric entropy integral as $\gamma(T) = \int_0^1 \sqrt{Z(u, T)}\, du$. Then there exists an absolute constant $c$ such that, $\forall \epsilon, \delta \in (0, 1)$, if $k \ge c\,\epsilon^{-2}\big(\gamma(T)^2 + \log(2/\delta)\big)$, then with probability at least $1 - \delta$, $R$ is an $\epsilon$-isometry on $X$, namely,
$$(1 - \epsilon)\, k\sigma^2 \|x - y\|^2 \le \|R^T x - R^T y\|^2 \le (1 + \epsilon)\, k\sigma^2 \|x - y\|^2, \quad \forall x, y \in X.$$

Theorem 1 is a generalization of the Johnson-Lindenstrauss Lemma, which characterizes the probability of obtaining a "good" projection matrix with the desired isometry property. By a careful analysis under a slightly different assumption on the domain $X$, we present the generalization bound on the compressive NN classifier (learning with $X_R$) in [19] as follows.

Theorem 2. $X \in \mathcal{X}^n$, $Y \in \{0, 1\}^n$ with $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$, where each $x_i$ is on the unit sphere. Assume that $\eta(x) = \Pr(y = 1 \mid x)$ is $L$-Lipschitz. Let $R \in \mathbb{R}^{d \times k}$, $k$ [...]

[...] In the projected and quantized spaces, the corresponding linear classifiers are
$$H_R(x) = 1\{h_R^T R^T x \ge 0\}, \qquad H_Q(x) = 1\{h_Q^T Q(R^T x) \ge 0\}. \qquad (6)$$
Now suppose $x$ is a test sample with unknown class $y$; we are interested in the probability of making a wrong prediction when the classifier is trained in the quantized space, $\Pr[H_Q(Q(R^T x)) \neq y] = \mathbb{E}\, L(H_Q(Q(R^T x)), y)$. Existing results on such compressive linear classifiers have studied bounds on the same type of objective in the projected space, with an exact expression in the finite-$k$ case [11].
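For concreteness, the sketch below implements an $M$-level scalar quantizer of the form defined in Section 2 and the quantized inner product estimator $\rho_Q = Q(R^T s)^T Q(R^T t)/k$ that drives the analysis in the next section. It is a minimal example assuming numpy; the symmetric uniform borders and the helper names (make_uniform_quantizer, quantize) are illustrative choices, not the paper's quantizer, and the raw estimate is biased, consistent with the scaling factor $\alpha$ that appears in Theorem 5 below.

```python
import numpy as np

def make_uniform_quantizer(b, width=3.0):
    """Build an M = 2^b level scalar quantizer: M + 1 decision borders
    t_0 <= ... <= t_M (with t_0 = -inf, t_M = +inf) and M reconstruction
    levels mu_1 <= ... <= mu_M satisfying t_{i-1} <= mu_i <= t_i.
    A symmetric uniform grid on [-width, width] is used purely for
    illustration; it is not the Lloyd-Max quantizer discussed later."""
    assert b >= 2, "this toy construction needs at least 4 levels"
    M = 2 ** b
    inner = np.linspace(-width, width, M - 1)             # t_1, ..., t_{M-1}
    step = inner[1] - inner[0]
    borders = np.concatenate(([-np.inf], inner, [np.inf]))
    levels = np.concatenate(([inner[0] - step / 2],        # mu_1
                             (inner[:-1] + inner[1:]) / 2,
                             [inner[-1] + step / 2]))      # mu_M
    return borders, levels

def quantize(v, borders, levels):
    """Q_b(v) = mu_i such that t_{i-1} <= v < t_i, applied elementwise."""
    idx = np.clip(np.searchsorted(borders, v, side="right") - 1, 0, len(levels) - 1)
    return levels[idx]

# Quantized inner product estimator rho_Q = Q(R^T s)^T Q(R^T t) / k versus the
# full-precision rho_R and the true cosine rho. Entries of R^T s are N(0, 1)
# because s has unit norm and R has i.i.d. standard normal entries.
rng = np.random.default_rng(0)
d, k, b = 1000, 2048, 2
s = rng.standard_normal(d)
s /= np.linalg.norm(s)
t = 0.8 * s + 0.6 * rng.standard_normal(d) / np.sqrt(d)    # cosine with s roughly 0.8
t /= np.linalg.norm(t)
R = rng.standard_normal((d, k))
borders, levels = make_uniform_quantizer(b)
zs, zt = R.T @ s, R.T @ t
rho, rho_R = s @ t, zs @ zt / k
rho_Q = quantize(zs, borders, levels) @ quantize(zt, borders, levels) / k
# rho_Q tracks rho only up to a scale factor (mean alpha * rho in Theorem 5).
print(f"rho = {rho:.3f}, rho_R = {rho_R:.3f}, rho_Q = {rho_Q:.3f}")
```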
Here we look at this misclassification probability in the asymptotic regime. When studying the error incurred by learning in the projected space, an important tool is the following definition.

Definition 6. Let $h, x \in \mathbb{R}^d$ be defined as above, with $\|h\| = \|x\| = 1$. Let $\langle h, x \rangle = \cos(h, x) = \rho \ge 0$, and $\rho_R = \frac{h^T R R^T x}{k}$, where $R \in \mathbb{R}^{d \times k}$ is an i.i.d. standard Gaussian random matrix. The flipping probability is defined as
$$f_k(\rho) = \Pr[\rho_R < 0]. \qquad (7)$$

Intuitively, this quantity measures the probability of a changed prediction when we project the data space by $R$ and use $R^T h$ as the classifier. [11] gives the exact formula for this quantity,
$$f_k(\rho) = \frac{\Gamma(k)}{\Gamma(k/2)^2} \int_0^{\frac{1 - \rho}{1 + \rho}} \frac{z^{(k - 2)/2}}{(1 + z)^k}\, dz = F_{k,k}\Big(\frac{1 - \rho}{1 + \rho}\Big), \qquad (8)$$
where $F_{k,k}$ is the cumulative distribution function (CDF) of the F-distribution with $(k, k)$ degrees of freedom. This formula also holds for $\rho < 0$.

Since it is well known that $\mathbb{E}[\rho_R] = \rho$ and $\mathrm{Var}[\rho_R] = \frac{1 + \rho^2}{k}$, $\rho_R$ should asymptotically follow $N\big(\rho, \frac{1 + \rho^2}{k}\big)$ as $k \to \infty$. So the asymptotic flipping probability should be
$$\tilde{f}_k(\rho) = \Phi\Big(-\frac{\sqrt{k}\,\rho}{\sqrt{1 + \rho^2}}\Big). \qquad (9)$$
The following calculation confirms this asymptotic convergence.

Proposition 1. As $k \to \infty$, we have $f_k(\rho) \to \tilde{f}_k(\rho)$ for $\rho \ge 0$.

For the quantized compressive classifier, sign flipping may also happen (an illustrative example is given in Figure 1). By analyzing this event, we now state the asymptotic generalization error bound for linear classifiers when working in the quantized space instead of the data space.

Theorem 5. Let the (0,1)-loss and the ERM classifier be defined as in (5) and (6), and let $R \in \mathbb{R}^{d \times k}$ be an i.i.d. standard normal random matrix. Let $\hat{L}_{(0,1)}(S, h) = \frac{1}{n}\sum_{i=1}^n L_{(0,1)}(H(x_i), y_i)$ be the empirical loss in the data space. $Q$ is a quantizer such that the quantized estimator $\rho_Q = \frac{Q(R^T s)^T Q(R^T t)}{k}$ has mean $\alpha\rho$, $\alpha > 0$, and debiased variance $\sigma_\rho^2 / k$ at $\rho = \cos(s, t)$, $s, t \in X$. Given a test sample $(x, y)$ with $y$ unknown, when $k \to \infty$, with probability at least $1 - 2\delta$ we have
$$\Pr[H_Q(Q(R^T x)) \neq y] \le \hat{L}_{(0,1)}(S, h) + 2\sqrt{\frac{(k + 1)\log\frac{2en}{k + 1} + \log\frac{1}{\delta}}{n}} + \frac{1}{n}\sum_{i=1}^n f_{k,Q}(\rho_i) + \min\left\{\sqrt{\frac{3\log\frac{1}{\delta}}{n}} \sqrt{\frac{1}{n}\sum_{i=1}^n f_{k,Q}(\rho_i)},\ \frac{1}{n}\sum_{i=1}^n f_{k,Q}(\rho_i)\right\},$$
where the flipping probability $f_{k,Q}(\rho_i) = \Phi\big(-\frac{\sqrt{k}\,\alpha |\rho_i|}{\sigma_{\rho_i}}\big)$, with $\rho_i$ the cosine between training sample $x_i$ and the ERM classifier $h$ in the data space.

In Theorem 5, the first term is the empirical loss in the data space, and the second term is the generic sample complexity term from learning theory. When $b \to \infty$ (full-precision RPs), the bound reduces to the one derived in [11] for the compressive linear classifier, according to Proposition 1. One important observation is that the quantization error again depends on the debiased variance of the quantized inner product estimator, at the different levels $\rho_i$, $i = 1, \dots, n$.

Choice of Q. Unlike the NN classifier, the extra generalization error of the linear classifier depends more on the region near $\rho = 0$. To see this, notice that the flipping probabilities (8) and (9) decrease as $\rho$ increases. Intuitively, label flipping is much more likely to occur for points near the decision boundary (i.e. with small $h^T x$). As a result, one should choose a quantizer with small debiased variance around $\rho = 0$ for linear classification tasks, e.g. logistic regression, linear SVM, etc. In fact, by the results and analysis of [24], one can show that the Lloyd-Max (LM) quantizer attains the minimal debiased variance of $\rho_Q$ at $\rho = 0$ among all quantizers with the same number of bits. Hence, we recommend using LM quantization for linear classification problems.
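As a quick numerical check of (8), (9), and Proposition 1, the sketch below evaluates both expressions; it assumes scipy is available, and the chosen values of $k$ and $\rho$ are arbitrary.

```python
import numpy as np
from scipy.stats import f, norm

def flip_exact(rho, k):
    """Exact flipping probability (8): f_k(rho) = F_{k,k}((1 - rho) / (1 + rho)),
    with F_{k,k} the CDF of the F-distribution with (k, k) degrees of freedom."""
    return f.cdf((1 - rho) / (1 + rho), k, k)

def flip_asymptotic(rho, k):
    """Gaussian approximation (9): Phi(-sqrt(k) * rho / sqrt(1 + rho^2))."""
    return norm.cdf(-np.sqrt(k) * rho / np.sqrt(1 + rho ** 2))

# The two expressions agree increasingly well as k grows (Proposition 1), and both
# decrease in rho: points with a small margin |h^T x| are the ones that flip.
for k in (10, 100, 1000):
    for rho in (0.0, 0.1, 0.3, 0.5):
        print(f"k={k:4d}  rho={rho:.1f}  exact={flip_exact(rho, k):.4f}  "
              f"normal approx={flip_asymptotic(rho, k):.4f}")
```

For the quantized classifier, Theorem 5 replaces $f_k$ with $f_{k,Q}(\rho) = \Phi(-\sqrt{k}\,\alpha|\rho| / \sigma_\rho)$, so a quantizer with small debiased variance near $\rho = 0$ keeps the extra error term small, which is the rationale behind the Lloyd-Max recommendation above.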
5 Quantized Compressive Least Squares Regression

Compressive least squares (CLS) regression has been studied in several papers, e.g. [26, 18]. [30] shows that in many cases CLS can match the performance of principal component regression (PCR) but runs faster by avoiding large-scale SVD or optimization, especially on high-dimensional data. In CLS, the projected design matrix $X_R$, instead of the original $X$, is used for ordinary least squares (OLS) regression. We are interested in the extra error brought by further quantizing the projections, where $X_Q$ is used as the new design matrix. We call this approach QCLS. In particular, we consider a fixed design problem where the data $X \in \mathbb{R}^{n \times d}$ is deterministic and $Y \in \mathbb{R}^n$ is treated as random. OLS regression with Gaussian error is modeled by
$$y_i = x_i^T \beta + \epsilon_i, \qquad (10)$$
with $\beta$ the parameter to estimate and $\epsilon_i$ i.i.d. Gaussian with mean 0 and variance $\sigma^2$. For projected and quantized data, $y_i$ stays the same while the predictors become $\frac{1}{\sqrt{k}} R^T x_i$ and $\frac{1}{\sqrt{k}} Q(R^T x_i)$, respectively. The corresponding models are given by
$$y_i = \frac{1}{\sqrt{k}} x_i^T R\, \beta_R + \epsilon_i, \qquad y_i = \frac{1}{\sqrt{k}} Q(x_i^T R)\, \beta_Q + \epsilon_i. \qquad (11)$$
The squared losses in the data space, projected space and quantized space are defined as
$$L(\beta) = \frac{1}{n}\mathbb{E}_Y \|Y - X\beta\|^2, \quad L_R(\beta_R) = \frac{1}{n}\mathbb{E}_{Y|R} \|Y - \tfrac{1}{\sqrt{k}} XR\,\beta_R\|^2, \quad L_Q(\beta_Q) = \frac{1}{n}\mathbb{E}_{Y|R} \|Y - \tfrac{1}{\sqrt{k}} Q(XR)\,\beta_Q\|^2. \qquad (12)$$
Note that here the expectation is taken w.r.t. $Y$, with $R$ given. Denote the true minimizers of the above losses by $\beta^*$, $\beta_R^*$ and $\beta_Q^*$, respectively. The risk of an estimator in the data space is defined as $r(w) = L(w) - L(\beta^*)$, and the analogues $r_R(w_R)$ and $r_Q(w_Q)$ are defined in the projected and quantized spaces. On the other hand, we have the empirical losses
$$\hat{L}(\beta) = \frac{1}{n}\|Y - X\beta\|^2, \quad \hat{L}_R(\beta_R) = \frac{1}{n}\|Y - \tfrac{1}{\sqrt{k}} XR\,\beta_R\|^2, \quad \hat{L}_Q(\beta_Q) = \frac{1}{n}\|Y - \tfrac{1}{\sqrt{k}} Q(XR)\,\beta_Q\|^2, \qquad (13)$$
which are computed from the data. The OLS estimates minimize the empirical losses in the corresponding spaces, namely, $\hat{\beta} = \mathrm{argmin}_{\beta \in \mathbb{R}^d} \hat{L}(\beta)$, $\hat{\beta}_R = \mathrm{argmin}_{\beta \in \mathbb{R}^k} \hat{L}_R(\beta)$ and $\hat{\beta}_Q = \mathrm{argmin}_{\beta \in \mathbb{R}^k} \hat{L}_Q(\beta)$.

Theorem 6. Consider the OLS problem defined in (10), (11) with losses given by (12) and (13). Suppose all samples in $X$ have unit norm, $\Sigma = X^T X / n$, $R \in \mathbb{R}^{d \times k}$ has i.i.d. standard normal entries, and $k \le n$. $Q$ is a quantizer with distortion $D_Q$ w.r.t. the standard Gaussian distribution. The expected QCLS excess risk of the estimator in the quantized space over the minimal loss in the data space is bounded by
$$\mathbb{E}_R\big[L_Q(\hat{\beta}_Q)\big] - L(\beta^*) \le \frac{k\sigma^2}{n} + \frac{1}{k}\|\beta^*\|^2_{\Sigma + \mathrm{Tr}(\Sigma) I_d} + \frac{1}{k}\|\beta_R^*\|^2_{D_Q I_k}, \qquad (14)$$
where $\|w\|_B = \sqrt{w^T B w}$ is the Mahalanobis norm and $I_p$ is the identity matrix of rank $p$.

In Theorem 6, the first two terms are the bound on the CLS risk, and the last term characterizes the extra loss brought by quantization. Note that there are several improvements upon the CLS error term [31, 30]. These results typically focus on one step in the proof of [18] that is independent of the quantization error, and the improved bounds are more complex and involve the singular values of $X$. We can safely replace the CLS error bound in Theorem 6 by the results mentioned above, and our major emphasis in this paper, the quantization error term, will not change.

Choice of Q. Since cosine estimation no longer plays a role in the QCLS regression model, the amount of encoded information becomes crucial, which is perfectly ev[...]
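To make the QCLS setup concrete, here is a minimal simulation of (10)-(13), assuming numpy; the sample sizes, noise level, and the crude 1-bit stand-in quantizer are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative fixed-design setup for model (10); sizes and noise level are arbitrary.
n, d, k, sigma = 500, 200, 50, 0.1
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm samples, as assumed
beta = rng.standard_normal(d)
Y = X @ beta + sigma * rng.standard_normal(n)        # y_i = x_i^T beta + eps_i

# Design matrices for CLS and QCLS, as in (11): X R / sqrt(k) and Q(X R) / sqrt(k).
R = rng.standard_normal((d, k))
Z = X @ R                                            # each entry is N(0, 1) given x_i

def Q(v):
    # A crude 1-bit stand-in for a general quantizer; the reconstruction level
    # sqrt(2/pi) equals E|g| for g ~ N(0, 1).
    return np.sqrt(2.0 / np.pi) * np.sign(v)

X_R = Z / np.sqrt(k)
X_Q = Q(Z) / np.sqrt(k)

def ols(A, y):
    # OLS estimate minimizing the empirical loss (13) over the given design matrix.
    return np.linalg.lstsq(A, y, rcond=None)[0]

beta_hat, beta_hat_R, beta_hat_Q = ols(X, Y), ols(X_R, Y), ols(X_Q, Y)

def emp_loss(A, b):
    return np.mean((Y - A @ b) ** 2)

# The quantized fit typically incurs extra loss relative to CLS; Theorem 6 bounds the
# corresponding excess risk, with the extra term scaling with the distortion D_Q.
print("data space :", emp_loss(X, beta_hat))
print("CLS        :", emp_loss(X_R, beta_hat_R))
print("QCLS       :", emp_loss(X_Q, beta_hat_Q))
```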
