Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering

Ilias Diakonikolas (University of Wisconsin-Madison), Sushrut Karmalkar (UT Austin), Daniel Kane (University of California, San Diego), Eric Price (UT Austin), Alistair Stewart (Web3 Foundation)
Abstract

We study high-dimensional sparse estimation tasks in a robust setting where a constant fraction of the dataset is adversarially corrupted. Specifically, we focus on the fundamental problems of robust sparse mean estimation and robust sparse PCA. We give the first practically viable robust estimators for these problems. In more detail, our algorithms are sample and computationally efficient and achieve near-optimal robustness guarantees. In contrast to prior provable algorithms, which relied on the ellipsoid method, our algorithms use spectral techniques to iteratively remove outliers from the dataset. Our experimental evaluation on synthetic data shows that our algorithms are scalable and significantly outperform a range of previous approaches, nearly matching the best error rate without corruptions.

1 Introduction

1.1 Background

The task of leveraging sparsity to extract meaningful information from high-dimensional datasets is a fundamental problem of significant practical importance, motivated by a range of data analysis applications. Various formalizations of this general problem have been investigated in statistics and machine learning for at least the past two decades; see, e.g., [HTW15] for a recent textbook on the topic. This paper focuses on the unsupervised setting, and in particular on estimating the parameters of a high-dimensional distribution under sparsity assumptions. Concretely, we study the problems of sparse mean estimation and sparse PCA under natural data-generating models.

The classical setup in statistics is that the data was generated by a probabilistic model of a given type. This is a simplifying assumption that is only approximately valid, as real datasets are typically exposed to some source of contamination. The field of robust statistics [Hub64, HR09, HRRS86] aims to design estimators that are robust in the presence of model misspecification. In recent years, designing computationally efficient robust estimators for high-dimensional settings has become a pressing challenge in a number of applications. These include the analysis of biological datasets, where natural outliers are common [RPW+02, PLJD10, LAT+08] and can contaminate the downstream statistical analysis, and data poisoning attacks [BNJT10], where even a small fraction of fake data (outliers) can substantially degrade the learned model [BNL12, SKL17].

This discussion motivates the design of robust estimators that can tolerate a constant fraction of adversarially corrupted data. We will use the following model of corruptions (see, e.g., [DKK+16]):
Definition 1.1. Given $0 < \epsilon < 1/2$ and a distribution family $\mathcal{D}$, the adversary operates as follows: the algorithm specifies some number of samples $N$, and $N$ samples are drawn from some unknown $D \in \mathcal{D}$. The adversary is then allowed to inspect the samples, remove up to $\epsilon N$ of them, and replace them with arbitrary points. This modified set of $N$ points is given as input to the algorithm. We say that a set of samples is $\epsilon$-corrupted if it is generated by the above process.

For robust sparse mean estimation, we show:

Theorem 1.2 (Robust Sparse Mean Estimation). Let $D = N(\mu, I)$ be a Gaussian distribution on $\mathbb{R}^d$ with unknown $k$-sparse mean $\mu$, and let $0 < \epsilon < \epsilon_0$ for a sufficiently small universal constant $\epsilon_0$. Let $S$ be an $\epsilon$-corrupted set of samples from $D$ of size $N = \widetilde{O}(k^2 \log(d)/\epsilon^2)$. There exists an algorithm that, on input $S$, $k$, and $\epsilon$, runs in polynomial time and returns $\widehat{\mu}$ such that with probability at least $2/3$ it holds that $\|\widehat{\mu} - \mu\|_2 = O(\epsilon \sqrt{\log(1/\epsilon)})$.

Some comments are in order. First, the sample complexity of our algorithm is asymptotically the same as that of [BDLS17], and matches the lower bound of [DKS17] against Statistical Query algorithms for this problem. The major advantage of our algorithm over [BDLS17] is that while their algorithm made use of the ellipsoid method, ours uses only spectral techniques and is scalable.
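As a concrete illustration of Definition 1.1, the following minimal sketch (Python with NumPy; the function name and the particular attack are our own illustrative choices, not from the paper) generates an $\epsilon$-corrupted set of samples from $N(\mu, I)$ with a $k$-sparse mean:

```python
import numpy as np

def eps_corrupted_gaussian(N, d, k, eps, rng):
    """Draw N samples from N(mu, I) with a k-sparse mean mu, then let a
    (weak, illustrative) adversary replace an eps-fraction of them with
    arbitrary points, as in Definition 1.1."""
    mu = np.zeros(d)
    mu[:k] = 1.0 / np.sqrt(k)                # a k-sparse unit-norm mean
    X = rng.standard_normal((N, d)) + mu
    n_bad = int(eps * N)
    # One simple attack: shift the outliers coherently, biasing the
    # empirical mean while staying close to the bulk of the inliers.
    X[:n_bad] = mu + 0.5
    return X, mu

X, mu = eps_corrupted_gaussian(N=10_000, d=200, k=10, eps=0.1,
                               rng=np.random.default_rng(0))
```

Note that the adversary of Definition 1.1 is stronger than this sketch: it may inspect all the inliers before choosing its replacements.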
For robust sparse PCA in the spiked covariance model, we show:

Theorem 1.3 (Robust Sparse PCA). Let $D = N(0, I + \rho vv^{\top})$ be a Gaussian distribution on $\mathbb{R}^d$ with spiked covariance for an unknown $k$-sparse unit vector $v$, and let $0 < \epsilon < \epsilon_0$ for a sufficiently small universal constant $\epsilon_0$. Let $S$ be an $\epsilon$-corrupted set of samples from $D$ of size $N = \mathrm{poly}(k)\log(d)/\epsilon^2$. There exists an algorithm that, on input $S$, $k$, and $\epsilon$, runs in polynomial time and returns $\widehat{v}$ such that with probability at least $2/3$ it holds that $\|\widehat{v}\widehat{v}^{\top} - vv^{\top}\|_F = O(\epsilon \log(1/\epsilon)/\rho)$.

Algorithm 1 Robust Sparse Mean Estimation via Iterative Filtering (filtering steps 7-11; the earlier steps compute the sample mean $\widetilde{\mu}$, the sample covariance $\widetilde{\Sigma}$, the entry set $U$, a slack parameter $\delta$, a candidate direction $v$, and a failure probability $\tau$):

7: Find $T > 0$ such that $\Pr_{X \sim_u S}\big[|v \cdot (X - \widetilde{\mu})| \geq T + \delta\big] \geq 9\,\mathrm{erfc}(T/\sqrt{2}) + \frac{32\epsilon}{T^2 \ln(k \ln(Nd/\epsilon\tau))}$.
8: return the multiset $S' = \{x \in S : |v \cdot (x - \widetilde{\mu})| \leq T + \delta\}$.
9: Let $p(x) = \big((x - \widetilde{\mu})^{\top} (\widetilde{\Sigma} - I)^{(U)} (x - \widetilde{\mu}) - \mathrm{Tr}\big((\widetilde{\Sigma} - I)^{(U)}\big)\big) / \|(\widetilde{\Sigma} - I)^{(U)}\|_F$, where $M^{(U)}$ denotes $M$ with all entries outside $U$ zeroed out.
10: Find $T \geq 4$ such that $\Pr_{X \sim_u S}\big[|p(X)| \geq T\big] \geq 9 \exp(-T/4) + \frac{32\epsilon}{T \ln^2 T}$.
11: return the multiset $S' = \{x \in S : |p(x)| < T\}$.
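In Python, one iteration of the two filters above might look like the following sketch (NumPy/SciPy; the logarithmic factors in the tail thresholds of steps 7 and 10 are simplified, and the helper names are ours):

```python
import numpy as np
from scipy.special import erfc

def linear_filter(X, v, mu_tilde, eps, delta):
    """Steps 7-8: remove the tail of v.(x - mu_tilde) once the empirical
    tail exceeds the Gaussian tail by a margin (log factors simplified)."""
    z = np.abs((X - mu_tilde) @ v)
    for T in np.sort(z)[::-1]:
        if T <= 0:
            break
        if np.mean(z >= T + delta) >= 9 * erfc(T / np.sqrt(2)) + 32 * eps / T**2:
            return X[z <= T + delta]
    return None                       # no valid T: this filter does not fire

def quadratic_filter(X, A, mu_prime, eps):
    """Steps 9-11: filter on the normalized degree-2 score
    p(x) = ((x - mu')^T A (x - mu') - Tr(A)) / ||A||_F."""
    Y = X - mu_prime
    p = (np.einsum('ni,ij,nj->n', Y, A, Y) - np.trace(A)) / np.linalg.norm(A)
    q = np.abs(p)
    for T in np.sort(q)[::-1]:
        if T < 4:
            break
        if np.mean(q >= T) >= 9 * np.exp(-T / 4) + 32 * eps / (T * np.log(T)**2):
            return X[q < T]
    return None
```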
4 Robust Sparse Mean Estimation

Here we briefly describe the motivation and analysis of Algorithm 1, which gives a single iteration of our filter for the robust sparse mean setting. In order to estimate the $k$-sparse mean $\mu$, it suffices to ensure that our estimate $\mu'$ has $|v \cdot (\mu' - \mu)|$ small for every $2k$-sparse unit vector $v$ (since $\mu'$ and $\mu$ are both $k$-sparse, their difference is $2k$-sparse). The now-standard idea in robust statistics [DKK+16] is that if a small number of corrupted samples suffice to cause a large change in our estimate of $v \cdot \mu$, then this must lead to a substantial increase in the sample variance of $v \cdot x$, which we can detect. Thus, a very basic form of a robust algorithm might be to compute a sample covariance matrix $\widetilde{\Sigma}$ and let $v$ be the $2k$-sparse unit vector that maximizes $v^{\top} \widetilde{\Sigma} v$. If this number is close to $1$, it certifies that our estimate $\mu'$, obtained by truncating the sample mean to its $k$ largest entries, is a good estimate of the true mean $\mu$. If not, this allows us to filter our sample set by throwing away the values where $v \cdot x$ is furthest from the true mean. This procedure guarantees that we have removed more corrupted samples than uncorrupted ones. We then repeat the filter until the empirical variance in every sparse direction is close to $1$.
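The certificate described above can be made concrete, although the search over sparse directions is exactly the intractable step. The following sketch (Python/NumPy; `h_k` and `max_sparse_variance` are our illustrative names, and the exhaustive search is exponential in the sparsity, so it is for intuition on tiny $d$ only) computes the truncated-mean estimate and the best sparse variance:

```python
import itertools
import numpy as np

def h_k(x, k):
    """Truncate x to its k largest-magnitude entries (the estimate mu')."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def max_sparse_variance(Sigma_tilde, s):
    """Max of v^T Sigma v over s-sparse unit vectors v, by brute force
    over supports; a value near 1 certifies the truncated sample mean."""
    d = Sigma_tilde.shape[0]
    best = -np.inf
    for supp in itertools.combinations(range(d), s):
        sub = Sigma_tilde[np.ix_(supp, supp)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best
```

If `max_sparse_variance(Sigma_tilde, 2 * k)` is close to 1, then `h_k` of the sample mean is certified accurate; the next paragraph replaces this exponential search with a spectral surrogate.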
Unfortunately, the optimization problem of finding the optimal $v$ is computationally challenging, requiring a convex program. To circumvent the need for a convex program, we notice that $v^{\top}\widetilde{\Sigma}v - 1 = (\widetilde{\Sigma} - I) \cdot (vv^{\top})$ is large only if $\widetilde{\Sigma} - I$ has large entries on the $(2k)^2$ non-zero entries of $vv^{\top}$. Thus, if the $4k^2$ largest entries of $\widetilde{\Sigma} - I$ had small $\ell_2$-norm, this would certify that no such bad $v$ existed and would allow us to return the truncated sample mean. In case these entries have large $\ell_2$-norm, we show that we can produce a filter that removes more bad samples than good ones.

Let $A$ be the matrix consisting of the large entries of $\widetilde{\Sigma} - I$ (for the moment, assume that they are all off-diagonal, but this is not needed). We know that the sample mean of $p(x) = (x - \mu')^{\top} A (x - \mu')$ is $\widetilde{\Sigma} \cdot A = \|A\|_F^2$. On the other hand, if $\mu'$ approximates $\mu$ on the $O(k^2)$ entries in question, we would have that $\|p\|_2 \approx \|A\|_F$. This means that if $\|A\|_F$ is reasonably large, an $\epsilon$-fraction of corrupted points changed the mean of $p$ from roughly $0$ to $\|A\|_F^2 = \|A\|_F \cdot \|p\|_2$, so many of these errors must have had $|p(x)| \gtrsim (\|A\|_F/\epsilon)\,\|p\|_2$. This becomes very unlikely for good samples if $\|A\|_F$ is much larger than $\epsilon$ (by standard results on the concentration of Gaussian polynomials). Thus, if $\mu'$ is approximately $\mu$ on these $O(k^2)$ coordinates, we can produce a filter. To ensure this, we can use existing filter-based algorithms to approximate the mean on these $O(k^2)$ coordinates. This results in Algorithm 1.

For the analysis, we note that if the entries of $A$ are small, then $v^{\top}(\widetilde{\Sigma} - I)v$ must be small for any unit $k$-sparse $v$, which certifies that the truncated sample mean is good. Otherwise, we can filter the samples using the first kind of filter. This ensures that our mean estimate is sufficiently close to the true mean that we can then filter using the second kind of filter.
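A sketch of the resulting certify-or-filter step (Python/NumPy; the function name and the threshold parameter are our own illustrative choices):

```python
import numpy as np

def certify_or_filter_matrix(X, k, thresh):
    """Take the 4k^2 largest-magnitude entries of (sample covariance - I).
    A small Frobenius norm certifies the truncated sample mean; otherwise
    the entries form the matrix A that defines the filter polynomial p."""
    d = X.shape[1]
    mu_tilde = X.mean(axis=0)
    Sigma_tilde = np.cov(X, rowvar=False, bias=True)
    M = Sigma_tilde - np.eye(d)
    A = np.zeros_like(M)
    flat = np.argsort(np.abs(M), axis=None)[-4 * k * k:]
    A.ravel()[flat] = M.ravel()[flat]
    if np.linalg.norm(A) <= thresh:          # no bad sparse direction exists
        mu_hat = np.zeros(d)
        top = np.argsort(np.abs(mu_tilde))[-k:]
        mu_hat[top] = mu_tilde[top]
        return ("estimate", mu_hat)
    return ("filter", A)                     # feed A to the quadratic filter
```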
It is not hard to show that the above works if we are given sufficiently many samples, but to obtain a tight analysis of the sample complexity, we need a number of subtle technical ideas. The detailed analysis of the sample complexity is deferred to the full version of our paper.

5 Robust Sparse PCA

Here we briefly describe the motivation and analysis of Algorithm 2, which gives a single iteration of our filter for the sparse PCA setting. Note that estimating the $k$-sparse vector $v$ is equivalent to estimating $\mathbf{E}[XX^{\top}] - I = \rho vv^{\top}$. In fact, estimating $\mathbf{E}[XX^{\top}] - I$ to error $\delta$ in Frobenius norm allows one to estimate $v$ to error $O(\delta/\rho)$ in $\ell_2$-norm. Thus, we focus on the task of robustly approximating the mean of $Y = XX^{\top} - I$.

Our algorithm is going to take advantage of one fact about $X$ that even errors cannot hide: that $\mathrm{Var}[v \cdot X]$ is large. This is because removing uncorrupted samples cannot reduce the variance by much more than an $\epsilon$-fraction, and adding samples can only increase it. This means that an adversary attempting to fool our algorithm can only do so by creating other directions where the variance is large, or simply by adding other large entries to the sample covariance matrix in order to make it hard to find this particular $k$-sparse eigenvector. In either case, the adversary is creating large entries in the empirical mean of $Y$ that should not be there. This suggests that the largest entries of the empirical mean of $Y$, whether errors or not, will be of great importance. These large entries will tell us where to focus our attention.

In particular, we can find the $k^2$ largest entries of the empirical mean of $Y$ and attempt to filter based on them. When we do so, one of two things will happen: either we remove bad samples and make progress, or we verify that these entries ought to be large, and thus must come from the support of $v$.
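To see why approximating the mean of $Y$ suffices, here is a sketch of the reduction above (Python/NumPy; the function name is ours): given any estimate $\widehat{M}$ of $\mathbf{E}[XX^{\top}] - I = \rho vv^{\top}$, restrict to its $k^2$ largest entries and take a top eigenvector.

```python
import numpy as np

def v_from_moment_estimate(M_hat, k):
    """Recover a k-sparse direction from an estimate of rho * v v^T:
    keep the k^2 largest-magnitude entries, take the top eigenvector,
    and re-sparsify to k coordinates."""
    A = np.zeros_like(M_hat)
    flat = np.argsort(np.abs(M_hat), axis=None)[-k * k:]
    A.ravel()[flat] = M_hat.ravel()[flat]
    A = (A + A.T) / 2                        # symmetrize before eigh
    eigvals, eigvecs = np.linalg.eigh(A)
    u = eigvecs[:, -1]                       # top eigenvector
    v_hat = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    v_hat[idx] = u[idx]
    return v_hat / np.linalg.norm(v_hat)
```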
In particular, when we reach the second case, since the adversary cannot shrink the empirical variance of $v \cdot X$ by much, almost all of the entries on the support of $v$ must remain large, and this can be captured by our algorithm.

Algorithm 2 Robust Sparse PCA via Iterative Filtering

1: procedure ROBUST-SPARSE-PCA($S$, $k$, $\widetilde{\Sigma}$, $\epsilon$)
input: a multiset $S$, an estimate $\widetilde{\Sigma}$ of the true covariance, and a real number $\epsilon > 0$.
output: a multiset $S'$ with a smaller fraction of corrupted samples, or a matrix $\Pi'$ with $\|\Pi' - \rho vv^{\top}\|_F \leq O(\epsilon \log(1/\epsilon))$.
2: For any $x \in \mathbb{R}^d$ define $\Phi(x) := \mathrm{vec}(xx^{\top} - I) \in \mathbb{R}^{d^2}$.
3: Compute $\widetilde{\mu} := \mathbf{E}_{X \sim_u S}[\Phi(X)]$, $\widehat{\mu} = h_{k^2}(\widetilde{\mu})$, and $Q := \mathrm{Supp}(\widehat{\mu})$, where $h_{k^2}(\cdot)$ truncates to the $k^2$ largest-magnitude entries.
4: Compute $M_Q := \mathbf{E}_{X \sim_u S}\big[(\Phi(X) - \widetilde{\mu})(\Phi(X) - \widetilde{\mu})^{\top}\big]_{QQ} \in \mathbb{R}^{k^2 \times k^2}$.
5: Let $\lambda^*, v^*$ be the maximum eigenvalue and corresponding eigenvector of $M_Q - \mathrm{Cov}_{X \sim N(0,\widetilde{\Sigma})}(\Phi(X)_Q)$.
6: if $\lambda^* \leq C\log(1/\epsilon)$, return the matrix $\Pi'$ supported on $Q$ with entries given by $\widehat{\mu}$; otherwise, find $T$ satisfying $\Pr_{X \sim_u S}\big[|(\Phi(X) - \widetilde{\mu})_Q \cdot v^*| \geq CT\big] \geq \frac{3\epsilon}{T^2 \log^2(T)}$, and return the multiset $S' = \{x \in S : |(\Phi(x) - \widetilde{\mu})_Q \cdot v^*| \leq T\}$.

The above algorithm works under a set of deterministic conditions on the good set of samples that are satisfied
with high probability given $\mathrm{poly}(k)\log(d)/\epsilon^2$ samples. Our current analysis does not establish the information-theoretically optimal sample size of $O(k^2 \log(d)/\epsilon^2)$, though we believe that this is plausible via a tighter analysis. We note that a naive implementation of this algorithm will achieve error $\mathrm{poly}(\epsilon)$ in our final estimate for $v$, while our goal is to obtain $O(\epsilon)$ error. To achieve this, we need to overcome two difficulties: First, when trying to filter $Y$ on subsets of its coordinates, we do not know the true variance of $Y$, and thus cannot expect to obtain $O(\epsilon)$ error. This is fixed with a bootstrapping method, similar to that in [Kan18], to estimate the covariance of a Gaussian. In particular, we do not know $\mathrm{Var}[Y]$ a priori, but ...
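For concreteness, a heavily simplified single iteration of Algorithm 2 might look as follows (Python/NumPy; we approximate the covariance of $\Phi(X)_Q$ under $N(0, \widetilde{\Sigma})$ by Monte Carlo and replace the tail-based choice of $T$ with a quantile, so this is a sketch of the structure, not an implementation with the paper's guarantees):

```python
import numpy as np

def sparse_pca_filter_step(X, k, eps, Sigma_tilde, C=10.0, seed=0):
    """One certify-or-filter iteration in the spirit of Algorithm 2."""
    N, d = X.shape
    # Phi(x) = vec(x x^T - I), computed for every sample.
    Phi = np.einsum('ni,nj->nij', X, X).reshape(N, d * d) - np.eye(d).ravel()
    mu_tilde = Phi.mean(axis=0)
    Q = np.argsort(np.abs(mu_tilde))[-k * k:]      # k^2 largest entries
    Z = Phi[:, Q] - mu_tilde[Q]
    M_Q = (Z.T @ Z) / N
    # Covariance of Phi(X)_Q under N(0, Sigma_tilde), by Monte Carlo
    # (the paper instead uses this quantity's closed form).
    rng = np.random.default_rng(seed)
    G = rng.multivariate_normal(np.zeros(d), Sigma_tilde, size=4 * N)
    PhiG = np.einsum('ni,nj->nij', G, G).reshape(-1, d * d) - np.eye(d).ravel()
    lam, V = np.linalg.eigh(M_Q - np.cov(PhiG[:, Q], rowvar=False, bias=True))
    if lam[-1] <= C * np.log(1 / eps):             # nothing looks corrupted
        Pi = np.zeros(d * d)
        Pi[Q] = mu_tilde[Q]
        return ("estimate", Pi.reshape(d, d))
    scores = np.abs(Z @ V[:, -1])                  # project on worst direction
    T = np.quantile(scores, 1 - 2 * eps)           # crude threshold
    return ("filter", X[scores <= T])
```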