Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator

Abstract

We study the problem of estimating a p-dimensional s-sparse vector in a linear model with Gaussian design and additive noise. In the case where the labels are contaminated by at most o adversarial outliers, we prove that the ℓ1-penalized Huber's M-estimator based on n samples attains the optimal rate of convergence (s/n)^{1/2} + (o/n), up to a logarithmic factor. For more general design matrices, our results highlight the importance of two properties: the transfer principle and the incoherence property. These properties with suitable constants are shown to yield the optimal rates, up to log-factors, of robust estimation with adversarial contamination.

1 Introduction

Is it possible to attain optimal rates of estimation in outlier-robust sparse regression using penalized empirical risk minimization (PERM) with convex loss and convex penalties? The current state of the literature on robust estimation does not answer this question. Furthermore, it contains some signals suggesting that the answer might be negative. First, it has been shown in (Chen et al., 2013, Theorem 1) that in the case of adversarially corrupted samples, no method based on penalized empirical loss minimization, with convex loss and convex penalty, can lead to consistent support recovery. The authors then advocate for robustifying the ℓ1-penalized least-squares estimator by replacing the usual scalar products with their trimmed counterparts. Second, (Chen et al., 2018) established that in the multivariate Gaussian model subject to Huber's contamination, the coordinatewise median, which is the ERM for the ℓ1-loss, is sub-optimal. A similar result was proved in (Lai et al., 2016, Prop. 2.1) for the geometric median, the ERM corresponding to the ℓ2-loss. These negative results prompted researchers to use other techniques, often of higher computational complexity, to solve the problem of outlier-corrupted sparse linear regression.

In the present work, we prove that the ℓ1-penalized empirical risk minimizer based on Huber's loss is minimax-rate-optimal, up to possible logarithmic factors. Naturally, this result is not valid in the most general situation; we demonstrate its validity under the assumptions that the design matrix satisfies an incoherence condition and that only the response is subject to contamination. The incoherence condition is shown to be satisfied by the Gaussian design with a covariance matrix whose diagonal entries are bounded and bounded away from zero. This relatively simple setting is chosen in order to convey the main message of this work: for properly chosen convex loss and convex penalty functions, the PERM is minimax-rate-optimal in sparse linear regression with adversarially corrupted labels.

To describe the aforementioned optimality result more precisely, let D*_n = {(X_i, y*_i); i = 1, ..., n} be iid feature-label pairs such that the X_i ∈ R^p are Gaussian with zero mean and covariance matrix Σ, and the y*_i are defined by the linear model

$$ y^*_i = X_i^\top \beta^* + \xi_i, \qquad i = 1, \ldots, n, $$

where the random noise ξ_i, independent of X_i, is Gaussian with zero mean and variance σ². Instead of observing the "clean" data D*_n, we have access to a contaminated version of it, D_n = {(X_i, y_i); i = 1, ..., n}, in which a small number o ∈ {1, ..., n} of the labels y*_i are replaced by an arbitrary value. Setting θ*_i = (y_i − y*_i)/√n and using matrix-vector notation, the described model can be written as

$$ Y = X\beta^* + \sqrt{n}\,\theta^* + \xi, \tag{1} $$

where X = [X_1; ...; X_n]^⊤ is the n × p design matrix, Y = (y_1, ..., y_n)^⊤ is the response vector, θ* = (θ*_1, ..., θ*_n)^⊤ is the contamination and ξ = (ξ_1, ..., ξ_n)^⊤ is the noise vector. The goal is to estimate the vector β* ∈ R^p. The dimension p is assumed to be large, possibly larger than n, but, for some small value s ∈ {1, ..., p}, the vector β* is assumed to be s-sparse: ‖β*‖_0 = Card{j : β*_j ≠ 0} ≤ s.
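As a concrete illustration of this data-generating process, here is a minimal simulation sketch of model (1). The sample sizes, the identity covariance and the magnitude of the corrupted labels are illustrative assumptions of this sketch, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, o, sigma = 200, 500, 5, 10, 1.0   # illustrative sizes (assumption)

# s-sparse signal beta* and Gaussian design (Sigma = I_p for simplicity)
beta_star = np.zeros(p)
beta_star[:s] = rng.normal(size=s)
X = rng.normal(size=(n, p))

# clean labels y*_i = X_i' beta* + xi_i
y_clean = X @ beta_star + sigma * rng.normal(size=n)

# adversarial contamination: o labels replaced by arbitrary values
y = y_clean.copy()
outliers = rng.choice(n, size=o, replace=False)
y[outliers] = 50.0 * rng.normal(size=o)    # arbitrary corruption (assumption)

# equivalent formulation (1): Y = X beta* + sqrt(n) theta* + xi,
# where theta* = (y - y*)/sqrt(n) is o-sparse by construction
theta_star = (y - y_clean) / np.sqrt(n)
assert np.count_nonzero(theta_star) <= o
```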
In such a setting, it is well known that if we have access to the clean data D*_n and measure the quality of an estimator β̂ by the Mahalanobis norm¹ ‖Σ^{1/2}(β̂ − β*)‖_2, the optimal rate is

$$ r(n, p, s) = \Big(\frac{s \log(p/s)}{n}\Big)^{1/2}. $$

¹ In the sequel, we use the notation ‖v‖_q = (Σ_j |v_j|^q)^{1/q} for any vector v ∈ R^p and any q ≥ 1.

In the outlier-contaminated setting, i.e., when D*_n is unavailable but one has access to D_n, the minimax optimal rate (Chen et al., 2016) takes the form

$$ r(n, p, s, o) = \Big(\frac{s \log(p/s)}{n}\Big)^{1/2} + \frac{o}{n}. \tag{2} $$

The first estimators proved to attain this rate (Chen et al., 2016; Gao, 2017) were computationally intractable² for large p, s and o. This motivated several authors to search for polynomial-time algorithms attaining a nearly optimal rate; the most relevant results will be reviewed later in this work.

² In the sense that there is no algorithm computing these estimators in time polynomial in (n, p, s, o).

The assumption that only a small number o of the labels are contaminated by outliers implies that the vector θ* in (1) is o-sparse. In order to take advantage of the sparsity of both β* and θ*, while ensuring computational tractability of the resulting estimator, a natural approach studied in several papers (Laska et al., 2009; Nguyen and Tran, 2013; Dalalyan and Chen, 2012) is to use some version of the ℓ1-penalized ERM. This corresponds to defining

$$ \hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \min_{\theta \in \mathbb{R}^n} \Big\{ \frac{1}{2n}\|Y - X\beta - \sqrt{n}\,\theta\|_2^2 + \lambda_s \|\beta\|_1 + \lambda_o \|\theta\|_1 \Big\}, \tag{3} $$

where λ_s, λ_o ≥ 0 are tuning parameters. This estimator is very attractive from a computational perspective, since it can be seen as the Lasso for the augmented design matrix M = [X, √n I_n], where I_n is the n × n identity matrix. To date, the best known rate for this type of estimator is

$$ \Big(\frac{s \log p}{n}\Big)^{1/2} + \Big(\frac{o}{n}\Big)^{1/2}, \tag{4} $$

obtained in (Nguyen and Tran, 2013) under some restrictions on (n, p, s, o). A quick comparison of (2) and (4) shows that the latter is sub-optimal. Indeed, the ratio of the two rates may be as large as (n/o)^{1/2}. The main goal of the present paper is to show that this sub-optimality is not an intrinsic property of the estimator (3), but rather an artefact of previous proof techniques. By using a refined argument, we prove that the β̂ defined by (3) does attain the optimal rate under very mild assumptions.

In the sequel, we refer to β̂ as the ℓ1-penalized Huber's M-estimator. The rationale for this term is that the minimization with respect to θ in (3) can be carried out explicitly. It yields (Donoho and Montanari, 2016, Section 6)

$$ \hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \lambda_o^2 \sum_{i=1}^n \Phi\Big(\frac{y_i - X_i^\top \beta}{\lambda_o \sqrt{n}}\Big) + \lambda_s \|\beta\|_1 \Big\}, \tag{5} $$

where Φ : R → R is Huber's function, defined by Φ(u) = u²/2 for |u| ≤ 1 and Φ(u) = |u| − 1/2 for |u| > 1.
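Both computational remarks above are easy to check numerically. The sketch below (which relies on scikit-learn's Lasso, a tool choice of this illustration rather than of the paper) computes (3) by running the Lasso on the augmented matrix M, with the identity block rescaled by λ_s/λ_o so that a single penalty level reproduces λ_s‖β‖₁ + λ_o‖θ‖₁; it then verifies, on random data, the identity behind (5): profiling out θ coordinatewise (a soft-thresholding step) yields exactly the Huber objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_penalized_huber(X, y, lam_s, lam_o):
    """Estimator (3) computed as a Lasso on the augmented design M = [X, sqrt(n) I_n].

    Rescaling the identity block by lam_s/lam_o turns the single Lasso penalty
    lam_s*||w||_1 into the two penalties lam_s*||beta||_1 + lam_o*||theta||_1.
    """
    n, p = X.shape
    M = np.hstack([X, (lam_s / lam_o) * np.sqrt(n) * np.eye(n)])
    # scikit-learn's Lasso minimizes ||y - Mw||_2^2 / (2n) + alpha * ||w||_1
    fit = Lasso(alpha=lam_s, fit_intercept=False, max_iter=50_000).fit(M, y)
    return fit.coef_[:p], (lam_s / lam_o) * fit.coef_[p:]  # beta_hat, theta_hat

def huber(u):
    """Huber's function: u^2/2 on |u| <= 1 and |u| - 1/2 beyond."""
    return np.where(np.abs(u) <= 1.0, 0.5 * u**2, np.abs(u) - 0.5)

# Numerical check of the identity behind (5): for any fixed beta, minimizing
# (3) over theta coordinatewise gives the Huber objective.
rng = np.random.default_rng(1)
n, p, lam_o = 50, 20, 0.3
X, y, beta = rng.normal(size=(n, p)), rng.normal(size=n), rng.normal(size=p)
r = y - X @ beta
theta = np.sign(r) * np.maximum(np.abs(r) / np.sqrt(n) - lam_o, 0.0)  # soft threshold
joint = ((r - np.sqrt(n) * theta) ** 2).sum() / (2 * n) + lam_o * np.abs(theta).sum()
profiled = lam_o ** 2 * huber(r / (lam_o * np.sqrt(n))).sum()
assert np.isclose(joint, profiled)
```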
To prove the rate-optimality of the estimator β̂, we first establish a risk bound for a general design matrix X, not necessarily formed by Gaussian vectors. This is done in the next section. Then, in Section 3, we state and discuss the result showing that all the necessary conditions are satisfied for the Gaussian design. Relevant prior work is presented in Section 4, while Section 5 discusses potential extensions. Section 7 provides a summary of our results and an outlook on future work. The proofs are deferred to the supplementary material.

2 Risk bound for the ℓ1-penalized Huber's M-estimator

This section is devoted to bringing forward sufficient conditions on the design matrix that allow for rate-optimal risk bounds for the estimator β̂ defined by (3) or, equivalently, by (5). There are two qualitative conditions that can easily be seen to be necessary; we call them restricted invertibility and incoherence. Indeed, even when there is no contamination, i.e., the number of outliers is known to be o = 0, the matrix X has to satisfy a restricted invertibility condition (such as restricted isometry, restricted eigenvalue or compatibility) in order for the Lasso estimator (3) to achieve the optimal rate ((s/n) log(p/s))^{1/2}. On the other hand, in the case where n = p and X = √n I_n, even in the extremely favorable situation where the noise is zero, the only identifiable vector is β* + θ*. Therefore, it is impossible to consistently estimate β* when the design matrix X is aligned with the identity matrix I_n, or close to being so.

The next definition formalizes what we call restricted invertibility and incoherence by introducing three notions: the transfer principle, the incoherence property and the augmented transfer principle. We will show that these notions play a key role in robust estimation by ℓ1-penalized least squares.

Definition 1. Let Z ∈ R^{n×p} be a (random) matrix and Σ ∈ R^{p×p}. We use the notation Z^{(n)} = Z/√n.

(i) We say that Z satisfies the transfer principle with a_1 ∈ (0, 1) and a_2 ∈ (0, ∞), denoted by TP(a_1; a_2), if for all v ∈ R^p,

$$ \|Z^{(n)} v\|_2 \ge a_1 \|\Sigma^{1/2} v\|_2 - a_2 \|v\|_1. \tag{6} $$

(ii) We say that Z satisfies the incoherence property IP(b_1; b_2; b_3) for some positive numbers b_1, b_2 and b_3, if for all v ∈ R^p and u ∈ R^n,

$$ |u^\top Z^{(n)} v| \le b_1 \|\Sigma^{1/2} v\|_2 \|u\|_2 + b_2 \|v\|_1 \|u\|_2 + b_3 \|\Sigma^{1/2} v\|_2 \|u\|_1. $$

(iii) We say that Z satisfies the augmented transfer principle ATP(c_1; c_2; c_3) for some positive numbers c_1, c_2 and c_3, if for all v ∈ R^p and u ∈ R^n,

$$ \|Z^{(n)} v + u\|_2 \ge c_1 \|(\Sigma^{1/2} v;\, u)\|_2 - c_2 \|v\|_1 - c_3 \|u\|_1. \tag{7} $$

These three properties are inter-related, and they are related to the extreme singular values of the matrix Z^{(n)}.

(P1) If Z satisfies ATP(c_1; c_2; c_3), then it also satisfies TP(c_1; c_2).

(P2) If Z satisfies TP(a_1; a_2) and IP(b_1; b_2; b_3), then it also satisfies ATP(c_1; c_2; c_3) with c_1² = a_1² − b_1ω², c_2 = a_2 + 2b_2/ω and c_3 = 2b_3/ω, for any positive ω.

We also recall the restricted eigenvalue condition: Σ satisfies the RE(s, c_0) condition with constant κ > 0 if ‖Σ^{1/2}v‖_2 ≥ κ‖v_J‖_2 for any vector v ∈ R^p and any set J ⊆ {1, ..., p} such that Card(J) ≤ s and ‖v_{J^c}‖_1 ≤ c_0‖v_J‖_1.
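The inequalities of Definition 1 are uniform over all directions, so they cannot be certified by sampling; nevertheless, a quick Monte Carlo evaluation of the TP margin (6) conveys what the property asserts. In the following sketch, the constants a_1 and a_2 are illustrative assumptions, chosen in the spirit of the explicit values provided in Section 3.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 500
a1, a2 = 0.5, 2.0 * np.sqrt(np.log(p) / n)   # illustrative constants (assumption)

Z = rng.normal(size=(n, p))                   # rows ~ N(0, I_p), so Sigma = I_p
Zn = Z / np.sqrt(n)                           # Z^(n) = Z / sqrt(n)

worst = np.inf
for _ in range(10_000):
    # sparse test directions, for which the right-hand side of (6) is largest
    v = np.zeros(p)
    support = rng.choice(p, size=int(rng.integers(1, 20)), replace=False)
    v[support] = rng.normal(size=support.size)
    lhs = np.linalg.norm(Zn @ v)
    rhs = a1 * np.linalg.norm(v) - a2 * np.abs(v).sum()
    worst = min(worst, lhs - rhs)

print(f"smallest TP margin over sampled directions: {worst:.4f}")   # nonnegative
```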
Theorem 1. Let Σ satisfy the RE(s, 5) condition with constant κ > 0. Let b_2, b_3, c_1, c_2, c_3 be some positive real numbers such that X satisfies the IP(0; b_2; b_3) and the ATP(c_1; c_2; c_3). Assume that, for some δ ∈ (0, 1), the tuning parameter λ satisfies

$$ \lambda\sqrt{n} \;\ge\; \sigma\sqrt{8\log(n/\delta)} \,\vee\, \Big(\max_{j=1,\ldots,p} \|X^{(n)}_{\cdot,j}\|_2\Big)\, \sigma\sqrt{8\log(p/\delta)}. $$

If the sparsity s and the number of outliers o satisfy the condition

$$ \frac{s}{\kappa^2} + o \;\le\; \frac{c_1^2}{400\,(c_2 \vee c_3 \vee 5b_2/c_1)^2}, \tag{8} $$

then, with probability at least 1 − 2δ, we have

$$ \big\|\Sigma^{1/2}(\hat\beta - \beta^*)\big\|_2 \;\le\; \frac{24\lambda}{c_1^2}\Big(\frac{2c_2}{c_1} \vee \frac{b_3}{c_1^2}\Big)\Big(\frac{s}{\kappa^2} + 7o\Big) + \frac{5\lambda\sqrt{s}}{6 c_1^2 \kappa}. \tag{9} $$

Theorem 1 is somewhat hard to parse. At this stage, let us simply mention that in the case of the Gaussian design considered in the next section, c_1 is of order 1, while b_2, b_3, c_2, c_3 are of order n^{−1/2}, up to a factor logarithmic in p, n and 1/δ. Here δ is an upper bound on the probability that the Gaussian matrix X does not satisfy either the IP or the ATP. Since Theorem 1 allows us to choose λ of the order σ(log(p + n)/n)^{1/2}, we infer from (9) that the error of estimating β*, measured in the Euclidean norm, is of order

$$ \frac{s}{n\kappa^2} + \frac{o}{n} + \Big(\frac{s}{n\kappa^2}\Big)^{1/2} \;=\; O\bigg(\frac{o}{n} + \Big(\frac{s}{n\kappa^2}\Big)^{1/2}\bigg), $$

up to logarithmic factors, under the assumption that (s/(nκ²) + o/n) log(np/δ) is smaller than a universal constant.
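Schematically, this order can be read off from (9) by a back-of-the-envelope substitution (ours, ignoring numerical constants and logarithmic factors): with c_1 ≍ 1, c_2 ∨ b_3 ≍ n^{−1/2} and λ ≍ σ(log(p + n)/n)^{1/2}, the bound (9) becomes

$$ \big\|\Sigma^{1/2}(\hat\beta - \beta^*)\big\|_2 \;\lesssim\; \lambda\, n^{-1/2}\Big(\frac{s}{\kappa^2} + o\Big) + \frac{\lambda\sqrt{s}}{\kappa} \;\asymp\; \sigma\Big(\frac{s}{n\kappa^2} + \frac{o}{n}\Big) + \sigma\Big(\frac{s}{n\kappa^2}\Big)^{1/2}, $$

and the first summand is dominated by the square-root term as soon as s/(nκ²) ≤ 1.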
To complete this section, we present a sketch of the proof of Theorem 1. In order to convey the main ideas without diving too deeply into technical details, we assume Σ = I_p. This means that the RE condition is satisfied with κ = 1 for any s and c_0. From the fact that the ATP holds for X, we infer that [X, √n I_n] satisfies the RE(s + o, 5) condition with the constant c_1/2. Using the well-known risk bounds for the Lasso estimator (Bickel et al., 2009), we get

$$ \|\hat\beta - \beta^*\|_2^2 + \|\hat\theta - \theta^*\|_2^2 \le C\lambda^2(s + o) \quad\text{and}\quad \|\hat\beta - \beta^*\|_1 + \|\hat\theta - \theta^*\|_1 \le C\lambda(s + o). \tag{10} $$

Note that these are the risk bounds established in³ (Candès and Randall, 2008; Dalalyan and Chen, 2012; Nguyen and Tran, 2013).

³ The first two references deal with the small-dimensional case only, that is, where s = p ≪ n.

These bounds are most likely unimprovable as long as the estimation of θ* is of interest. However, if we focus only on the estimation error of β*, considering θ* as a nuisance parameter, the following argument leads to a sharper risk bound. First, we note that

$$ \hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|Y - X\beta - \sqrt{n}\,\hat\theta\|_2^2 + \lambda\|\beta\|_1 \Big\}. $$

The KKT conditions of this convex optimization problem take the following form:

$$ \frac{1}{n} X^\top\big(Y - X\hat\beta - \sqrt{n}\,\hat\theta\big) \in \lambda\,\mathrm{sgn}(\hat\beta), $$

where sgn(β̂) is the subset of R^p containing all the vectors w such that w_j β̂_j = |β̂_j| and |w_j| ≤ 1 for every j ∈ {1, ..., p}. Multiplying the last displayed relation from the left by (β̂ − β*)^⊤, we get

$$ \frac{1}{n}(\hat\beta - \beta^*)^\top X^\top\big(Y - X\hat\beta - \sqrt{n}\,\hat\theta\big) \;\ge\; \lambda\big(\|\hat\beta\|_1 - \|\beta^*\|_1\big). $$

Recall now that Y = Xβ* + √n θ* + ξ, and set v = β̂ − β* and u = θ̂ − θ*. Rearranging the previous inequality, we arrive at

$$ \frac{1}{n}\|Xv\|_2^2 \;\le\; -\,v^\top (X^{(n)})^\top u + \frac{1}{n}\, v^\top X^\top \xi + \lambda\big(\|\beta^*\|_1 - \|\hat\beta\|_1\big). $$

On the one hand, the duality inequality and the lower bound on λ imply that |v^⊤X^⊤ξ| ≤ ‖v‖_1‖X^⊤ξ‖_∞ ≤ nλ‖v‖_1/2. On the other hand, well-known arguments yield ‖β*‖_1 − ‖β̂‖_1 ≤ 2‖v_S‖_1 − ‖v‖_1, where S is the support of β*. Therefore, we have

$$ \frac{1}{n}\|Xv\|_2^2 \;\le\; \big|v^\top (X^{(n)})^\top u\big| + \frac{\lambda}{2}\big(4\|v_S\|_1 - \|v\|_1\big). \tag{11} $$

Since X satisfies the ATP(c_1, c_2, c_3), which implies the TP(c_1, c_2), we get c_1²‖v‖_2² ≤ (2/n)‖Xv‖_2² + 2c_2²‖v‖_1². Combining with (11), this yields

$$
\begin{aligned}
c_1^2\|v\|_2^2 &\le 2\big|v^\top (X^{(n)})^\top u\big| + \lambda\big(4\|v_S\|_1 - \|v\|_1\big) + 2c_2^2\|v\|_1^2 \\
&\le 2b_3\|v\|_2\|u\|_1 + 2b_2\|v\|_1\|u\|_2 + \lambda\big(4\|v_S\|_1 - \|v\|_1\big) + 2c_2^2\|v\|_1^2 \tag{12} \\
&\le \frac{c_1^2}{2}\|v\|_2^2 + \frac{2b_3^2}{c_1^2}\|u\|_1^2 + \|v\|_1\big(2b_2\|u\|_2 - \lambda\big) + 4\lambda\|v_S\|_1 + 2c_2^2\|v\|_1^2,
\end{aligned}
$$

where the second line uses the IP(0, b_2, b_3). Using the first inequality in (10) and condition (8), we upper bound (2b_2‖u‖_2 − λ) by 0. To upper bound the second to last term, we use the Cauchy–Schwarz inequality: 4λ‖v_S‖_1 ≤ 4λ√s‖v‖_2 ≤ (4/c_1)²λ²s + (c_1/2)²‖v‖_2². Combining all these bounds and rearranging the terms, we arrive at

$$ \frac{c_1^2}{4}\|v\|_2^2 \;\le\; 2\big((b_3/c_1)^2 \vee c_2^2\big)\big(\|u\|_1 + \|v\|_1\big)^2 + (4/c_1)^2 \lambda^2 s. $$

Taking the square root of both sides and using the second inequality in (10), we obtain an inequality of the same type as (9), but with slightly larger constants. As a concluding remark for this sketch of proof, let us note that if, instead of using the last arguments, we replace all the error terms appearing in (12) by the upper bounds provided by (10), we do not get the optimal rate.

3 The case of Gaussian design

Our main result, Theorem 1, shows that if the design matrix satisfies the transfer principle and the incoherence property with suitable constants, then the ℓ1-penalized Huber's M-estimator achieves the optimal rate under adversarial contamination. As a concrete example of a design matrix for which the aforementioned conditions are satisfied, we consider the case of correlated Gaussian design. As opposed to most prior work on robust estimation for linear regression with Gaussian design, we allow the covariance matrix Σ to have a nondegenerate null space. We simply assume that the n rows of the matrix X are drawn independently from the Gaussian distribution N_p(0, Σ) with a covariance matrix Σ satisfying the RE(s, 5) condition. We also assume in this section that all the diagonal entries of Σ are equal to one: Σ_jj = 1. The more formal statements of the results, provided in the supplementary material, do not require this condition.

Theorem 2. Let δ ∈ (0, 1/7) be a tolerance level and n ≥ 100. For every positive semi-definite matrix Σ with all diagonal entries bounded by one, with probability at least 1 − 2δ, the matrix X satisfies the TP(a_1, a_2), the IP(b_1, b_2, b_3) and the ATP(c_1, c_2, c_3) with constants

$$ a_1 = 1 - \frac{4.3 + \sqrt{2\log(9/\delta)}}{\sqrt{n}}, \qquad a_2 = b_2 = 1.2\sqrt{\frac{2\log p}{n}}, $$
$$ b_1 = \frac{4.82 + \sqrt{2\log(81/\delta)}}{\sqrt{n}}, \qquad b_3 = 1.2\sqrt{\frac{2\log n}{n}}, $$
$$ c_1 = \frac{3}{4} - \frac{17.5 + 9.6\sqrt{2\log(2/\delta)}}{\sqrt{n}}, \qquad c_2 = 3.6\sqrt{\frac{2\log p}{n}}, \qquad c_3 = 2.4\sqrt{\frac{2\log n}{n}}. $$

The proof of this result is provided in the supplementary material. It relies on by-now-standard tools such as Gordon's comparison inequality, the Gaussian concentration inequality and the peeling argument. Note that the TP and related results have been obtained in Raskutti et al. (2010); Oliveira (2016); Rudelson and Zhou (2013). The IP is basically a combination of a high-probability version of Chevet's inequality (Vershynin, 2018, Exercises 8.7.3-4) and the peeling argument. A property similar to the ATP for Gaussian matrices with nondegenerate covariance was established in (Nguyen and Tran, 2013, Lemma 1) under further restrictions on n, p, s, o.

Theorem 3. There exist universal positive constants d_1, d_2, d_3 such that if

$$ \frac{s\log p}{\kappa^2} + o\log n \le d_1 n \qquad\text{and}\qquad \frac{1}{7} \ge \delta \ge 2e^{-d_2 n}, $$

then, with probability at least 1 − 4δ, the ℓ1-penalized Huber's M-estimator with λ_s²n = 9σ² log(p/δ) and λ_o²n = 8σ² log(n/δ) satisfies

$$ \big\|\Sigma^{1/2}(\hat\beta - \beta^*)\big\|_2 \;\le\; d_3\,\sigma\bigg(\Big(\frac{s\log(p/\delta)}{n\kappa^2}\Big)^{1/2} + \frac{o\log(n/\delta)}{n}\bigg). \tag{13} $$

Even though the constants appearing in Theorem 2 are reasonably small, and smaller than those in analogous results in prior work, the constants d_1, d_2 and d_3 are large, too large to be of any practical relevance. Finally, let us note that if s and o are known, it is very likely that, following the techniques developed in (Bellec et al., 2018, Theorem 4.2), one can replace the terms log(p/δ) and log(n/δ) in (13) by log(p/s) and log(n/o), respectively.

Comparing Theorem 3 with (Nguyen and Tran, 2013, Theorem 1), we see that our rate improvement is not only in terms of the dependence on the proportion of outliers, o/n, but also in terms of the condition number κ, which is now completely decoupled from o in the risk bound.
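The linear dependence of the bound (13) on o/n is easy to observe empirically. The following sketch sweeps the number of corrupted labels and recomputes the estimator in its augmented-Lasso form; the tuning follows Theorem 3 with δ = 0.1, while the problem sizes and the corruption mechanism are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, s, sigma, delta = 400, 800, 5, 1.0, 0.1
lam_s = 3.0 * sigma * np.sqrt(np.log(p / delta) / n)   # lam_s^2 n = 9 sigma^2 log(p/delta)
lam_o = sigma * np.sqrt(8.0 * np.log(n / delta) / n)   # lam_o^2 n = 8 sigma^2 log(n/delta)

beta_star = np.zeros(p)
beta_star[:s] = 1.0
X = rng.normal(size=(n, p))

for o in [0, 10, 20, 40, 80]:
    y = X @ beta_star + sigma * rng.normal(size=n)
    y[rng.choice(n, size=o, replace=False)] += 25.0    # adversarial shifts (assumption)
    # estimator (3) as a Lasso on the augmented design, as in the earlier sketch
    M = np.hstack([X, (lam_s / lam_o) * np.sqrt(n) * np.eye(n)])
    beta_hat = Lasso(alpha=lam_s, fit_intercept=False, max_iter=50_000).fit(M, y).coef_[:p]
    print(f"o = {o:3d}   ||beta_hat - beta*||_2 = {np.linalg.norm(beta_hat - beta_star):.3f}")
```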
The receiver has access188 to the corrupted version ofXas well as to the encoding matrixX. Theorem 3.1 from (Cands and189 Randall, 2008) establishes that the Dantzig selector (Cands and Tao, 2007), for a properly chosen190 tuning parameter proportional to the noise level, achieves the (sub-optimal) rate2(s + o)/n, u
