




Outlier Detection and Robust PCA Using a Convex Measure of Innovation

Mostafa Rahmani and Ping Li
Cognitive Computing Lab, Baidu Research USA
10900 NE 8th St. Bellevue, WA 98004
{mostafarahmani, liping11}

Abstract

This paper presents a provable and strong algorithm, termed Innovation Search (iSearch), for robust Principal Component Analysis (PCA) and outlier detection. An outlier, by definition, is a data point which does not participate in forming a low-dimensional structure with a large number of the other data points. In other words, an outlier carries some innovation with respect to most of the other data points. iSearch ranks the data points based on their values of innovation. A convex optimization problem is proposed whose optimal value is used as our measure of innovation. We derive analytical performance guarantees for the proposed robust PCA method under different models for the distribution of the outliers, including randomly distributed outliers, clustered outliers, and linearly dependent outliers. Moreover, it is shown that iSearch provably recovers the span of the inliers when the inliers lie in a union of subspaces. In the challenging scenarios in which the outliers are close to each other or close to the span of the inliers, iSearch is shown to outperform most of the existing methods.

1 Introduction

Outlier detection is an important research problem in unsupervised machine learning. Outliers are associated with important rare events such as malignant tissues [14], the failures of a system [10, 12, 31], web attacks [16], and misclassified data points [9, 27]. In this paper, the proposed outlier detection method is introduced as a robust Principal Component Analysis (PCA) algorithm, i.e., the inliers lie in a low-dimensional subspace. In the literature of robust PCA, two main models for data corruption are considered: the element-wise model and the column-wise model. These two models correspond to two different robust PCA problems. In the element-wise model, it is assumed that a small subset of the elements of the data matrix is corrupted and that the support of the corrupted elements is random. This problem is known as the low-rank-plus-sparse matrix decomposition problem [1, 3, 4, 23, 24]. In the column-wise model, a subset of the columns of the data is affected by the corruption [5, 7, 8, 11, 17, 20, 25, 26, 36-39]. Section 2 provides a review of the robust (to column-wise corruption) PCA methods.
This paper focuses on the column-wise model, i.e., we assume that the given data follows Data Model 1.

Data Model 1. The data matrix D ∈ R^(M_1×M_2) can be expressed as D = [B (A + N)] T, where A ∈ R^(M_1×n_i), B ∈ R^(M_1×n_o), T is an arbitrary permutation matrix, and [B (A + N)] represents the concatenation of B and (A + N). The columns of A lie in an r-dimensional subspace U. The columns of B do not lie entirely in U, i.e., the n_i columns of A are the inliers and the n_o columns of B are the outliers. The matrix N represents additive noise. The orthonormal matrix U ∈ R^(M_1×r) is a basis for U. Evidently, M_2 = n_i + n_o.

In the robust PCA problem, the main task is to recover U. Clearly, if U is estimated accurately, the outliers can be located using a simple subspace projection [22].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Summary of Contributions: The main contributions can be summarized as follows. The proposed approach introduces a new idea to the robust PCA problem: iSearch uses a convex optimization problem to measure the innovation of the data points. It is shown that iSearch mostly outperforms the existing methods in handling close outliers and noisy data. To the best of our knowledge, the proposed approach and the CoP method presented in [27] are the only robust PCA methods which are supported with analytical performance guarantees under different models for the distribution of the outliers, including randomly distributed outliers, clustered outliers, and linearly dependent outliers. In addition to considering different models for the distribution of the outliers, we provide analytical performance guarantees under different models for the distribution of the inliers too. The presumed models include the union of subspaces and the uniformly-at-random distribution on U ∩ S^(M_1−1), where S^(M_1−1) denotes the unit ℓ_2-norm sphere in R^(M_1).

Notation: Given a matrix A, ‖A‖ denotes its spectral norm. For a vector a, ‖a‖_p denotes its ℓ_p-norm and a(i) its i-th element. Given two matrices A_1 and A_2 with an equal number of rows, the matrix A_3 = [A_1 A_2] is the matrix formed by concatenating their columns. For a matrix A, a_i denotes its i-th column. The subspace U^⊥ is the complement of U. The cardinality of set I is defined as |I|. Also, for any positive integer n, the index set {1, ..., n} is denoted [n]. The coherence between vector a and subspace H with orthonormal basis H is defined as ‖a^T H‖_2.
2 Related Work

In this section, we briefly review some of the related works. We refer readers to [18, 27] for a more comprehensive review of the topic. One of the early approaches to robust PCA was to replace the Frobenius norm in the cost function of PCA with the ℓ_1-norm, because the ℓ_1-norm was shown to be robust to the presence of outliers [2, 15]. The method proposed in [6] leveraged the column-wise structure of the corruption matrix and replaced the ℓ_1-norm minimization problem with an ℓ_{1,2}-norm minimization problem. In [19] and [39], the optimization problem used in [6] was relaxed to a convex optimization problem, and it was proved that under some sufficient conditions the optimal point is a projection matrix which spans U. In [34], a provable outlier rejection method was presented. However, [34] assumed that the outliers are randomly distributed on S^(M_1−1) and the inliers are distributed randomly on U ∩ S^(M_1−1). In [36], a convex optimization problem was proposed which decomposes the data into a low-rank component and a column-sparse component. The approach presented in [36] is provable, but it requires n_o to be significantly smaller than n_i. In [32], it was assumed that the outliers are randomly distributed on S^(M_1−1) and a small number of them are not linearly dependent. The method presented in [32] detects a data point as an outlier if it does not have a sparse representation with respect to the other data points.

Connection and Contrast to Coherence Pursuit: In [27], Coherence Pursuit (CoP) was proposed as a provable robust PCA method. CoP computes the Coherence Values of all the data points in order to rank them. The Coherence Value corresponding to a data column d is a measure of resemblance between d and the rest of the data columns; CoP uses the inner products between d and the rest of the data points to measure this resemblance. In sharp contrast, iSearch finds an optimal direction corresponding to each data column. The optimal direction corresponding to data column d is used to measure the innovation of d with respect to the rest of the data columns. We show through theoretical studies and numerical experiments that finding the optimal directions makes iSearch significantly stronger than CoP in detecting outliers which carry weak innovation.

Connection and Contrast to Innovation Pursuit: In [28, 29], Innovation Pursuit was proposed as a new subspace clustering method. The optimization problem proposed in [28] finds a direction in the span of the data which is orthogonal to the maximum number of data points. We present a new discovery about the applications of Innovation Pursuit: it is shown that the idea of innovation search can be used to design a strong outlier detection algorithm. iSearch uses an optimization problem similar to the linear optimization problem used in [28] to measure the innovation of the data points.

3 Proposed Approach

Algorithm 1 presents the proposed method along with the definitions of the symbols used. iSearch consists of 4 steps.
In the next subsections, Step 2 and Step 4 are discussed. In this paper, we use an ADMM solver to solve (1). The computation complexity of the solver is O(max(M_1 M_2^2, M_1^2 M_2)).

Algorithm 1 Subspace Recovery Using iSearch
1. Data Preprocessing. The input is the data matrix D ∈ R^(M_1×M_2).
1.1 Define Q ∈ R^(M_1×r_d) as the matrix of the first r_d left singular vectors of D, where r_d is the number of non-zero singular values. Set D = Q^T D. If dimensionality reduction is not required, skip this step.
1.2 Normalize the ℓ_2-norm of the columns of D, i.e., set d_i equal to d_i/‖d_i‖_2 for all 1 ≤ i ≤ M_2.
2. Direction Search. Define C* ∈ R^(r_d×M_2) such that c*_i ∈ R^(r_d×1) is the optimal point of

    min_c ‖c^T D‖_1   subject to   c^T d_i = 1,

or define C* ∈ R^(r_d×M_2) as the optimal point of

    min_C ‖(C^T D)^T‖_1   subject to   diag(C^T D) = 1.   (1)

3. Computing the Innovation Values. Define the vector x ∈ R^(M_2×1) such that x(i) = 1/‖D^T c*_i‖_1.
4. Building Basis. Construct the matrix Y from the columns of D corresponding to the smallest elements of x such that they span an r-dimensional subspace.
Output: The column-space of Y is the identified subspace.

If PCA is used in the preprocessing step to reduce the dimensionality of the data to r_d, the computation complexity of the solver is O(max(r_d M_2^2, r_d^2 M_2)).^1

^1 If the data is noisy, r_d should be set equal to the number of dominant singular values. In this paper, we do not theoretically analyze iSearch in the presence of noise. In the numerical experiments, we set r_d equal to the index of the largest singular value which is less than or equal to 0.01% of the first singular value.
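To make Steps 1-3 concrete, the following is a minimal sketch in Python. It solves the column-wise form of the direction search with CVXPY as a stand-in for the ADMM solver used in the paper; the function name innovation_values and all implementation details are illustrative assumptions, not the authors' code.

```python
# Sketch of Steps 1-3 of Algorithm 1. CVXPY is used here for clarity as a
# stand-in for the paper's ADMM solver; an illustrative sketch only.
import numpy as np
import cvxpy as cp

def innovation_values(D, r_d=None):
    """Innovation Values x(i) = 1 / ||D^T c*_i||_1 for the columns of D."""
    # Step 1.1: optional dimensionality reduction via the left singular vectors.
    if r_d is not None:
        Q, _, _ = np.linalg.svd(D, full_matrices=False)
        D = Q[:, :r_d].T @ D
    # Step 1.2: normalize every column to unit ell_2-norm.
    D = D / np.linalg.norm(D, axis=0, keepdims=True)
    m, M2 = D.shape
    x = np.zeros(M2)
    for i in range(M2):
        c = cp.Variable(m)
        # Direction search: min ||c^T D||_1 subject to c^T d_i = 1.
        problem = cp.Problem(cp.Minimize(cp.norm1(D.T @ c)),
                             [D[:, i] @ c == 1])
        problem.solve()
        # Step 3: the Innovation Value of column i.
        x[i] = 1.0 / np.linalg.norm(D.T @ c.value, 1)
    return x
```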
3.1 An Illustrative Example for Innovation Value

We use a synthetic numerical example to explain the idea of computing the Innovation Value. Suppose D ∈ R^(20×250), n_i = 200, n_o = 50, and r = 3. Assume that D follows Assumption 1.

Assumption 1. The columns of A are drawn uniformly at random from U ∩ S^(M_1−1). The columns of B are drawn uniformly at random from S^(M_1−1).

To simplify the exposition and notation, it is assumed without loss of generality that T in Data Model 1 is the identity matrix, i.e., D = [B A]. Suppose d is a column of D, define c* as the optimal point of

    min_c ‖c^T D‖_1   subject to   c^T d = 1,   (2)

and define the Innovation Value corresponding to d as 1/‖D^T c*‖_1. The main idea of iSearch is that c* behaves in two completely different ways with respect to U, depending on whether d is an outlier or an inlier. Suppose d is an outlier. The optimization problem (2) searches for a direction whose projection on d is non-zero and whose projection on the rest of the data points is as small as possible. As d is an outlier, d has a non-zero projection on U^⊥. In addition, as n_i is large, (2) searches for a direction in the ambient space whose projection on U is as weak as possible. Thus, c* lies in U^⊥ or it is close to U^⊥.

[Figure 1: three panels sharing the x-axis "Element Index": D^T c* when d is an outlier (left), D^T c* when d is an inlier (middle), and the Innovation Values x (right).]

Figure 1: The first 50 columns are outliers. The left panel shows the vector D^T c* when d is an outlier. The middle panel depicts D^T c* when d is an inlier. The right panel shows the Innovation Values corresponding to all the data points (the vector x was defined in Algorithm 1).

The left plot of Figure 1 shows D^T c* when d is an outlier. In this case, c* is orthogonal to all the inliers. Accordingly, when d is an outlier, ‖D^T c*‖_1 is approximately equal to ‖B^T c*‖_1.
On the other hand, when d is an inlier, the linear constraint strongly discourages c* from lying in U^⊥ or from being close to U^⊥. Inliers lie in a low-dimensional subspace and mostly they are close to each other. Since c* has a strong projection on d, it has strong projections on many of the inliers. Accordingly, the value of ‖A^T c*‖_1 is much larger when d is an inlier. Therefore, the Innovation Value corresponding to an inlier is smaller than the Innovation Value corresponding to an outlier. Figure 1 compares the vector D^T c* when d is an outlier with the same vector when d is an inlier. In addition, it shows the vector of Innovation Values (right plot). One can observe that the Innovation Values make the outliers clearly distinguishable.
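The example of this section can be reproduced, up to randomness, with the sketch below. It samples D per Assumption 1 (with T the identity, so the first n_o columns are outliers) and reuses the illustrative innovation_values function from the sketch in Section 3; the expected outcome is that the 50 outlier columns receive visibly larger Innovation Values, as in the right panel of Figure 1.

```python
# Synthetic data per Assumption 1: inliers uniform on the intersection of a
# random r-dimensional subspace U with the unit sphere, outliers uniform on
# the sphere. Reuses innovation_values() from the sketch above (illustrative).
import numpy as np

rng = np.random.default_rng(0)
M1, ni, no, r = 20, 200, 50, 3

U, _ = np.linalg.qr(rng.standard_normal((M1, r)))   # orthonormal basis of U
A = U @ rng.standard_normal((r, ni))                # inliers in span(U)
A /= np.linalg.norm(A, axis=0, keepdims=True)       # uniform on U intersect sphere
B = rng.standard_normal((M1, no))                   # unstructured outliers
B /= np.linalg.norm(B, axis=0, keepdims=True)
D = np.hstack([B, A])                               # T = identity: D = [B A]

x = innovation_values(D)
print("mean Innovation Value, outliers:", x[:no].mean())
print("mean Innovation Value, inliers: ", x[no:].mean())
```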
3.2 Building the Basis Matrix

The data points corresponding to the least Innovation Values are used to construct the basis matrix Y. If the data follows Assumption 1, the r data points corresponding to the r smallest Innovation Values span U with overwhelming probability [35]. In practice, the algorithm should continue adding new columns to Y until the columns of Y span an r-dimensional subspace. This approach requires checking the singular values of Y several times. We propose two techniques to avoid this extra step. The first approach is based on the side information that we mostly have about the data: in many applications, we can have an upper bound on n_o because outliers are mostly associated with rare events. If we know that the number of outliers is less than y percent of the data, the matrix Y can be constructed using the (100 − y) percent of the data columns corresponding to the least Innovation Values. The second approach is the adaptive column sampling method proposed in [27], which avoids sampling redundant columns.
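The following is a minimal sketch of Step 4 under the basic rank-checking strategy described above: columns are added in increasing order of Innovation Value until they span an r-dimensional subspace. The singular-value tolerance used in the rank test is an illustrative choice, not prescribed by the paper.

```python
# Step 4 sketch: build Y from the columns with the smallest Innovation Values,
# stopping once Y spans an r-dimensional subspace (illustrative tolerance).
import numpy as np

def build_basis(D, x, r, tol=1e-6):
    order = np.argsort(x)                  # columns ranked by Innovation Value
    cols = []
    for i in order:
        cols.append(D[:, i])
        Y = np.column_stack(cols)
        # Count singular values above tol to test whether rank(Y) reached r.
        if np.sum(np.linalg.svd(Y, compute_uv=False) > tol) == r:
            # Orthonormal basis: the top-r left singular vectors of Y.
            return np.linalg.svd(Y, full_matrices=False)[0][:, :r]
    raise ValueError("columns never spanned an r-dimensional subspace")
```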
4 Theoretical Studies

In this section, we analyze the performance of the proposed approach with three different models for the distribution of the outliers: unstructured outliers, clustered outliers, and linearly dependent outliers. Moreover, we analyze iSearch with two different models for the distribution of the inliers. These models include the union of subspaces and the uniformly-at-random distribution on U ∩ S^(M_1−1). In the theoretical investigations, we do not consider noisy inliers. In Section 5, it is shown with real and synthetic data that iSearch accurately estimates U even in the low signal-to-noise-ratio cases and that it mostly outperforms the existing approaches when the data is noisy. The theoretical results are followed by short discussions which highlight the important aspects of the theorems. The proofs of the presented theorems are available in an extended version of this work [30].

4.1 Randomly Distributed Outliers

In this section, it is assumed that D follows Assumption 1. In order to guarantee the performance of the proposed approach, it is enough to show that the Innovation Values corresponding to the outliers are greater than the Innovation Values corresponding to the inliers.
In other words, it suffices to show

    max { 1/‖D^T c*_i‖_1 : n_o + 1 ≤ i ≤ M_2 } < min { 1/‖D^T c*_i‖_1 : 1 ≤ i ≤ n_o }.   (3)

Theorem 1. Suppose D follows Assumption 1 and define c*_i as the optimal point of (2) with d = d_i. If n_i/r sufficiently dominates both of the following quantities (the exact constants are given in the extended version [30]):

    n_o/M_1 + 2 √(n_o/M_1) + √(2 n_o log(1/δ) / ((M_1 − 1) M_1)) + √(n_o c″_δ log(n_o/δ) / M_1^2) + n_o η √(c″_δ / M_1^2) + √((n_o/M_1^2 + 1/M_1) log(n_o/δ))

and

    max { n_o/M_1 + (2 n_o (1 + η) + 2 √(2 n_o log(1/δ)))/M_1 ,  2 n_o η √(c′_δ r)/M_1 + 2 √(n_o c′_δ r log(n_o/δ))/M_1 },   (4)

then (3) holds and U is recovered exactly with probability at least 1 − 7δ, where

    c′_δ = 3 max { 1, √(8 M_1 / ((M_1 − 1) r)), √(8 M_1 log(n_o/δ) / ((M_1 − 1) r)) },
    c″_δ = 3 max { 1, √(8 M_1 / (M_1 − 1)), √(16 M_1 log(n_o/δ) / (M_1 − 1)) },
    η = max { (4/3) log(2 M_1/δ), √(4 (n_o/M_1) log(2 M_1/δ)) }.
Theorem 1 shows that as long as n_i/r is sufficiently larger than n_o/M_1, the proposed approach is guaranteed to detect the randomly distributed outliers exactly. It is important to note that in the sufficient conditions n_i is scaled with 1/r but n_o is scaled with 1/M_1. This shows that if r is sufficiently smaller than M_1, iSearch provably detects the unstructured outliers even if n_o is much larger than n_i. The numerical experiments presented in Section 5 confirm this feature of iSearch and show that if the outliers are unstructured, iSearch can yield exact recovery even if n_o is 100 times larger than n_i. It is important to note that when the outliers are structured, by the definition of outlier, n_o cannot be larger than n_i.
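As a quick numerical illustration of this regime (an illustrative experiment, not one from the paper), the snippet below applies the earlier innovation_values sketch to a data set in which the unstructured outliers vastly outnumber the inliers; with r much smaller than M_1, the inliers should still receive the smallest Innovation Values.

```python
# Unstructured outliers can outnumber inliers: here n_o = 20 * n_i.
# Uses innovation_values() from the earlier sketch (illustrative).
import numpy as np

rng = np.random.default_rng(1)
M1, r, ni = 50, 2, 10
no = 20 * ni                              # far more outliers than inliers

U, _ = np.linalg.qr(rng.standard_normal((M1, r)))
A = U @ rng.standard_normal((r, ni))      # inliers in span(U)
B = rng.standard_normal((M1, no))         # unstructured outliers
D = np.hstack([B, A])
D /= np.linalg.norm(D, axis=0, keepdims=True)

x = innovation_values(D)
smallest = np.argsort(x)[:r]              # columns with the smallest values
print("smallest Innovation Values are inliers:", np.all(smallest >= no))
```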
4.2 Structured Outliers

In this section, we analyze the proposed approach with structured outliers. In contrast to unstructured outliers, structured outliers can form a low-dimensional structure different from the structure of the majority of the data points. Structured outliers are associated with important rare events such as malignant tissues [14] or web attacks [16]. In this section, we assume that the outliers form a cluster outside of U. The following assumption specifies the presumed model for the distribution of the structured outliers.

Assumption 2. A column of B is formed as b_i = (1/√(1 + α^2)) (q + α v_i). The unit ℓ_2-norm vector q does not lie in U, the vectors {v_i}_{i=1}^{n_o} are drawn uniformly at random from S^(M_1−1), and α is a positive number.

According to Assumption 2, the outliers cluster around the vector q, where q ∉ U; the smaller α is, the tighter the cluster.
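For concreteness, here is a small sampler for Assumption 2 (an illustrative sketch; the parameter values are arbitrary): the outliers concentrate around q, with α controlling how tight the cluster is.

```python
# Clustered outliers per Assumption 2:
# b_i = (q + alpha * v_i) / sqrt(1 + alpha^2), with q a fixed unit vector
# outside U and v_i uniform on the unit sphere.
import numpy as np

def sample_clustered_outliers(q, no, alpha, rng):
    M1 = q.shape[0]
    V = rng.standard_normal((M1, no))
    V /= np.linalg.norm(V, axis=0, keepdims=True)   # v_i uniform on S^{M1-1}
    return (q[:, None] + alpha * V) / np.sqrt(1 + alpha**2)

rng = np.random.default_rng(2)
q = rng.standard_normal(20)
q /= np.linalg.norm(q)                              # unit ell_2-norm vector q
B = sample_clustered_outliers(q, no=50, alpha=0.2, rng=rng)  # small alpha: tight cluster
```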
In Algorithm 1, if the dimensionality reduction step is performed, the direction search optimization problem is applied to Q^T D. Thus, (2) is equivalent to

    min_c ‖c^T D‖_1   subject to   c^T d = 1  and  c ∈ Q,   (5)

where c ∈ R^(M_1×1) and D ∈ R^(M_1×M_2). The subspace Q is the column-space of D. In this section, we are interested in studying the performance of iSearch in identifying tightly clustered outliers because some of the existing outlier detection algorithms fail if the outliers form a tight cluster. For instance, the thresholding-based method [13] and the sparse-representation-based algorithm [32] fail when the outliers are close to each other. Therefore, we assume that the span of Q is approximately equal to the column-space of [U q].
The following theorem shows that even if the outliers are close to each other, iSearch successfully identifies the outliers provided that n_i/√r is sufficiently larger than n_o.

Theorem 2. Suppose the distribution of the inliers follows Assumption 1 and the distribution of the outliers follows Assumption 2. Assume that Q is equal to the column-space of [U q]. Define q⊥ = (I − U U^T) q / ‖(I − U U^T) q‖_2, define β = max { 1/|d_i^T q⊥| : d_i ∈ B }, and define c*_i as the optimal point of (5) with d = d_i. If n_i/√r sufficiently dominates both of the following quantities (the exact constants are given in the extended version [30]):

    n_o ‖U^T q‖_2 + √(n_o c′_δ r log(n_o/δ) / M_1)   and   n_o |q^T q⊥| + n_o √(c″_δ log(n_o/δ) / M_1),   (6)

then (3) holds and U is recovered exactly with probability at least 1 − 5δ.

In contrast to (4), in (6) n_o is not scaled with 1/M_1. Theorem 2 shows that, in contrast to the unstructured outliers, the number of the structured outliers should be sufficiently smaller than the number of the inliers for small values of α.