




Understanding Sparse JL for Feature Hashing

Meena Jagadeesan
Harvard University
Cambridge, MA 02138

* I would like to thank Prof. Jelani Nelson for advising this project.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Feature hashing and other random projection schemes are commonly used to reduce the dimensionality of feature vectors. The goal is to efficiently project a high-dimensional feature vector living in $\mathbb{R}^n$ into a much lower-dimensional space $\mathbb{R}^m$, while approximately preserving Euclidean norm. These schemes can be constructed using sparse random projections, for example using a sparse Johnson-Lindenstrauss (JL) transform. A line of work introduced by Weinberger et al. (ICML '09) analyzes the accuracy of sparse JL with sparsity 1 on feature vectors with small $\ell_\infty$-to-$\ell_2$ norm ratio. Recently, Freksen, Kamma, and Larsen (NeurIPS '18) closed this line of work by proving a tight tradeoff between $\ell_\infty$-to-$\ell_2$ norm ratio and accuracy for sparse JL with sparsity 1.

In this paper, we demonstrate the benefits of using sparsity $s$ greater than 1 in sparse JL on feature vectors. Our main result is a tight tradeoff between $\ell_\infty$-to-$\ell_2$ norm ratio and accuracy for a general sparsity $s$, which significantly generalizes the result of Freksen et al. Our result theoretically demonstrates that sparse JL with $s > 1$ can have significantly better norm-preservation properties on feature vectors than sparse JL with $s = 1$; we also empirically demonstrate this finding.

1 Introduction

Feature hashing and other random projection schemes are influential in helping manage large data [11]. The goal is to reduce the dimensionality of feature vectors: more specifically, to project high-dimensional feature vectors living in $\mathbb{R}^n$ into a lower-dimensional space $\mathbb{R}^m$ (where $m \ll n$), while approximately preserving Euclidean distances (i.e. $\ell_2$ distances) with high probability. This dimensionality reduction enables a classifier to process vectors in $\mathbb{R}^m$, instead of vectors in $\mathbb{R}^n$.

In this context, feature hashing was first introduced by Weinberger et al. [29] for document-based classification tasks such as email spam filtering. For such tasks, feature hashing yields a lower-dimensional embedding of a high-dimensional feature vector derived from a bag-of-words model. Since then, feature hashing has become a mainstream approach [28], applied to numerous domains including ranking text documents [4], compressing neural networks [7], and protein sequence classification [5].

Random Projections. Dimensionality reduction schemes for feature vectors fit nicely into the random projection literature. In fact, the feature hashing scheme proposed by Weinberger et al. [29] boils down to uniformly drawing a random $m \times n$ matrix where each column contains one nonzero entry, equal to $1$ or $-1$. The $\ell_2$-norm-preserving objective can be expressed mathematically as follows: for error $\varepsilon > 0$ and failure probability $\delta$, the goal is to construct a probability distribution $\mathcal{A}$ over $m \times n$ real matrices that satisfies the following condition for vectors $x \in \mathbb{R}^n$:

$$\Pr_{A \sim \mathcal{A}}\big[(1-\varepsilon)\|x\|_2 \le \|Ax\|_2 \le (1+\varepsilon)\|x\|_2\big] \ge 1 - \delta. \qquad (1)$$

The result underlying the random projection literature is the Johnson-Lindenstrauss lemma, which gives an upper bound on the dimension $m$ achievable by a probability distribution $\mathcal{A}$ satisfying (1):

Lemma 1.1 (Johnson-Lindenstrauss [16]). For any $n \in \mathbb{N}$ and parameters $0 < \varepsilon, \delta < 1$, there exists a probability distribution $\mathcal{A}$ over $m \times n$ matrices, with $m = \Theta(\varepsilon^{-2}\ln(1/\delta))$, that satisfies (1).

The optimality of the dimension $m$ achieved by Lemma 1.1 was recently proven [17, 15].
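The following minimal NumPy sketch (our illustration, not code from the paper; the function name, parameter values, and trial counts are ours) implements the $s = 1$ scheme just described, one uniformly random row and random sign per coordinate, and empirically estimates how often guarantee (1) fails for a unit vector whose mass is spread over many coordinates:

```python
import numpy as np

def feature_hash(x, m, seed=0):
    """s = 1 projection: coordinate i is sent to a uniformly random row h(i)
    with a uniformly random sign sigma(i), i.e. one nonzero (+1 or -1) per column."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    rows = rng.integers(0, m, size=n)          # h(i): target row of coordinate i
    signs = rng.choice([-1.0, 1.0], size=n)    # sigma(i): random sign of coordinate i
    y = np.zeros(m)
    np.add.at(y, rows, signs * x)              # y[h(i)] += sigma(i) * x[i]
    return y

# Estimate the failure probability in (1) for a spread-out unit vector.
n, m, eps = 10_000, 256, 0.25                  # illustrative sizes, not from the paper
x = np.ones(n) / np.sqrt(n)                    # small l_inf-to-l_2 ratio: 1/sqrt(n)
fails = [abs(np.linalg.norm(feature_hash(x, m, seed=t)) - 1.0) > eps
         for t in range(2000)]
print("estimated failure probability:", np.mean(fails))
```

Vectors whose mass is concentrated on a few coordinates (large $\ell_\infty$-to-$\ell_2$ ratio) drive this failure rate up; quantifying that effect is exactly the tradeoff studied in the rest of the paper.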
To speed up projection time, it is useful to consider probability distributions over sparse matrices (i.e. matrices with a small number of nonzero entries per column). More specifically, for matrices with $s$ nonzero entries per column, the projection time for a vector $x$ goes down from $O(m\|x\|_0)$ to $O(s\|x\|_0)$, where $\|x\|_0$ is the number of nonzero entries of $x$. In this context, Kane and Nelson [19] constructed sparse JL distributions (which we define formally in Section 1.1), improving upon previous work [2, 22, 12]. Roughly speaking, a sparse JL distribution, as constructed in [19], boils down to drawing a random $m \times n$ matrix where each column contains exactly $s$ nonzero entries, each equal to $1/\sqrt{s}$ or $-1/\sqrt{s}$. Kane and Nelson show that sparse JL distributions achieve the same (optimal) dimension as Lemma 1.1, while also satisfying a sparsity property.

Theorem 1.2 (Sparse JL [19]). For $n \in \mathbb{N}$ and $0 < \varepsilon, \delta < 1$, any sparse JL distribution $\mathcal{A}_{s,m,n}$ (defined formally in Section 1.1) over $m \times n$ matrices, with dimension $m = \Theta(\varepsilon^{-2}\ln(1/\delta))$ and sparsity $s = \Theta(\varepsilon^{-1}\ln(1/\delta))$, satisfies (1).

Sparse JL distributions are state-of-the-art sparse random projections. Moreover, these distributions are near-optimal: any distribution satisfying (1) requires a sparsity $\Omega(\varepsilon^{-1}\ln(1/\delta)/\ln(1/\varepsilon))$ at $m = \Theta(\varepsilon^{-2}\ln(1/\delta))$, as shown by Nelson and Nguyen [25].$^2$ In practice, though, it can be necessary to utilize a lower sparsity $s$, since the projection time is linear in $s$. Resolving this issue, Cohen [8] showed that sparse JL distributions can achieve a lower sparsity with an appropriate gain in dimension. He generalized Theorem 1.2 to show the following dimension-sparsity tradeoffs for sparse JL:

Theorem 1.3 (Dimension-Sparsity Tradeoffs [8]). For any $n \in \mathbb{N}$ and $0 < \varepsilon, \delta < 1$, a uniform sparse JL distribution $\mathcal{A}_{s,m,n}$ (defined formally in Section 1.1) satisfies (1) if $s \le \Theta(\varepsilon^{-1}\ln(1/\delta))$ and $m \ge \min\!\big(2\varepsilon^{-2}/\delta,\ \varepsilon^{-2}\ln(1/\delta)\,e^{\Theta(\varepsilon^{-1}\ln(1/\delta)/s)}\big)$.

Connection to Feature Hashing. Sparse JL distributions have particularly close ties to feature hashing. In particular, the feature hashing scheme proposed by Weinberger et al. [29] can be viewed as a special case of sparse JL, namely with $s = 1$. Interestingly, in practice, feature hashing can do much better than theoretical results, such as Theorem 1.2 and Theorem 1.3, would indicate [13]. An explanation for this phenomenon is that the highest error terms in sparse JL stem from vectors with mass concentrated on a very small number of entries, while in practice, the mass on feature vectors may be spread out between many coordinates. This motivates studying the tradeoff space for vectors with low $\ell_\infty$-to-$\ell_2$ ratio. More formally, take $S_v$ to be $\{x \in \mathbb{R}^n \mid \|x\|_\infty / \|x\|_2 \le v\}$, so that $S_1 = \mathbb{R}^n$ and $S_v \subsetneq S_w$ for $0 < v < w \le 1$. For a sparse JL distribution with sparsity $s$, let $v(m,\varepsilon,s)$ denote the largest $\ell_\infty$-to-$\ell_2$ norm ratio $v$ such that (1) holds for every $x \in S_v$; Theorem 1.4, due to Freksen, Kamma, and Larsen [13], tightly characterizes $v(m,\varepsilon,1)$.

Sparse JL constructions with $s > 1$ have been used in practice for projecting feature vectors. For example, sparse JL-like methods (with $s > 1$) have been used to project feature vectors in machine learning domains including visual tracking [27], face recognition [23], and recently in ELM (a type of feedforward neural network) [6]. Now, a variant of sparse JL is even included in the Python sklearn library for machine learning.$^3$

While Theorem 1.4 is restricted to the case of $s = 1$, it is thus natural to explore how constructions with $s > 1$ perform on feature vectors, by studying $v(m,\varepsilon,s)$ for sparse JL with $s > 1$. A related question was considered by Weinberger et al. [29] for "multiple hashing," an alternate distribution over sparse matrices constructed by adding $s$ draws from $\mathcal{A}_{1,m,n}$ and scaling by $1/\sqrt{s}$. More specifically, they show that $v(m,\varepsilon,s) \ge \Omega\big(\min(1, \sqrt{s}\,v(m,\varepsilon,1))\big)$ for multiple hashing.
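For illustration only (not code from the paper), here is a sketch of the multiple-hashing construction just described, under the assumption that each of the $s$ draws is an independent copy of the $s = 1$ scheme:

```python
import numpy as np

def multiple_hashing_matrix(m, n, s, rng):
    """Multiple hashing: add s independent draws of the s = 1 matrix A_{1,m,n}
    and scale by 1/sqrt(s).  Collisions within a column can merge or cancel,
    so a column may end up with fewer than s nonzeros (unlike sparse JL,
    which places exactly s nonzeros per column)."""
    A = np.zeros((m, n))
    cols = np.arange(n)
    for _ in range(s):
        rows = rng.integers(0, m, size=n)         # one random row per column
        signs = rng.choice([-1.0, 1.0], size=n)   # one random sign per column
        A[rows, cols] += signs
    return A / np.sqrt(s)

rng = np.random.default_rng(0)
A = multiple_hashing_matrix(m=128, n=1000, s=4, rng=rng)
print("nonzeros per column (first 5 columns):", np.count_nonzero(A, axis=0)[:5])
```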
However, Kane and Nelson [19] later showed that multiple hashing has worse geometry-preserving properties than sparse JL: that is, multiple hashing requires a larger sparsity than sparse JL to satisfy (1). Thus, finding the behavior of $v(m,\varepsilon,s)$ for sparse JL distributions, which are state-of-the-art, remained an open problem. In this work, we settle the question of how $v(m,\varepsilon,s)$ behaves for sparse JL with a general sparsity $s \ge 1$.

Main Results. We characterize $v(m,\varepsilon,s)$ for sparse JL with $s \ge 1$. Our theoretical result shows that sparse JL with $s > 1$, even if $s$ is a small constant, can achieve significantly better norm-preservation properties for feature vectors than sparse JL with $s = 1$. Moreover, we empirically demonstrate this finding. We show the following tight bounds on $v(m,\varepsilon,s)$ for a general sparsity $s$:

Theorem 1.5. Let $\mathcal{A}_{s,m,n}$ be a uniform sparse JL distribution$^4$ (defined in Section 1.1). For $s \le m/e$ and for small enough$^5$ $\varepsilon$ and $\delta$, the function $v(m,\varepsilon,s)$ is equal to $f_0(m,\varepsilon,\ln(1/\delta),s)$, where $f_0(m,\varepsilon,p,s)$ is$^6$:

$$f_0(m,\varepsilon,p,s) = \begin{cases}
1 & \text{if } m \ge \min\!\big(2\varepsilon^{-2}e^{p},\ \varepsilon^{-2}p\,e^{\Theta(\max(1,\,p\varepsilon^{-1}/s))}\big) \\[4pt]
\Theta\!\Big(\varepsilon\sqrt{s}\,\sqrt{\tfrac{\ln(m\varepsilon^2/p)}{p}}\Big) & \text{else, if } \max\!\big(\Theta(\varepsilon^{-2}p),\ s\,e^{\Theta(\max(1,\,p\varepsilon^{-1}/s))}\big) \le m \le \varepsilon^{-2}e^{\Theta(p)} \\[4pt]
\Theta\!\Big(\varepsilon\sqrt{s}\,\min\!\big(\tfrac{\ln(m\varepsilon/p)}{p},\ \sqrt{\tfrac{\ln(m\varepsilon^2/p)}{p}}\big)\Big) & \text{else, if } \Theta(\varepsilon^{-2}p) \le m \le \min\!\big(\varepsilon^{-2}e^{\Theta(p)},\ s\,e^{\Theta(\max(1,\,p\varepsilon^{-1}/s))}\big) \\[4pt]
0 & \text{if } m \le \Theta(\varepsilon^{-2}p).
\end{cases}$$

Our main result, Theorem 1.5, significantly generalizes Theorem 1.2, Theorem 1.3, and Theorem 1.4. Notice our bound in Theorem 1.5 has up to four regimes. In the first regime, which occurs when $m \ge \min\!\big(2\varepsilon^{-2}/\delta,\ \varepsilon^{-2}\ln(1/\delta)\,e^{\Theta(\max(1,\,\ln(1/\delta)\varepsilon^{-1}/s))}\big)$, Theorem 1.5 shows $v(m,\varepsilon,s) = 1$, so (1) holds on the full space $\mathbb{R}^n$. Notice this boundary on $m$ occurs at the dimension-sparsity tradeoff in Theorem 1.3. In the last regime, which occurs when $m \le \Theta(\varepsilon^{-2}\ln(1/\delta))$, Theorem 1.5 shows that $v(m,\varepsilon,s) = 0$, so there are vectors with arbitrarily small $\ell_\infty$-to-$\ell_2$ norm ratio where (1) does not hold. When $s \le \Theta(\varepsilon^{-1}\ln(1/\delta))$, Theorem 1.5 shows that up to two intermediate regimes exist. One of the regimes, $\Theta\big(\varepsilon\sqrt{s}\min\big(\ln(m\varepsilon/p)/p,\ \sqrt{\ln(m\varepsilon^2/p)/p}\big)\big)$, matches the middle regime of $v(m,\varepsilon,1)$ in Theorem 1.4 with an extra factor of $\sqrt{s}$, much like the bound for multiple hashing in [29] that we mentioned earlier. However, unlike the multiple hashing bound, Theorem 1.5 sometimes has another regime, $\Theta\big(\varepsilon\sqrt{s}\sqrt{\ln(m\varepsilon^2/p)/p}\big)$, which does not arise for $s = 1$ (i.e. in Theorem 1.4).$^7$ Intuitively, we expect this additional regime for sparse JL with $s$ close to $\Theta(\varepsilon^{-1}\ln(1/\delta))$: at $s = \Theta(\varepsilon^{-1}\ln(1/\delta))$ and $m = \Theta(\varepsilon^{-2}\ln(1/\delta))$, Theorem 1.2 tells us $v(m,\varepsilon,s) = 1$, but if $\varepsilon$ is a constant, then the branch $\Theta\big(\varepsilon\sqrt{s}\ln(m\varepsilon/p)/p\big)$ yields $\Theta(1/\sqrt{\ln(1/\delta)})$, while the branch $\Theta\big(\varepsilon\sqrt{s}\sqrt{\ln(m\varepsilon^2/p)/p}\big)$ yields $\Theta(1)$. Thus, it is natural that the first branch disappears for large $m$.

Our result elucidates that $v(m,\varepsilon,s)$ increases approximately as $\sqrt{s}$, thus providing insight into how even small constant increases in sparsity can be useful in practice. Another consequence of our result is a lower bound on dimension-sparsity tradeoffs (Corollary A.1 in Appendix A) that essentially matches the upper bound in Theorem 1.3. Moreover, we required new techniques to prove Theorem 1.5, for reasons that we discuss further in Section 1.2. We also empirically support our theoretical findings in Theorem 1.5.

Footnotes:
3. See /stable/modules/random_projection.html.
4. We prove the lower bound on $v(m,\varepsilon,s)$ in Theorem 1.5 for any sparse JL distribution.
5. By "small enough", we mean the condition that $\varepsilon$ and $\delta$ are below some constant $C_0$.
6. Notice that the function $f_0(m,\varepsilon,p,s)$ is not defined for certain "constant-factor" intervals between the boundaries of regimes (e.g. $C_1\varepsilon^{-2}p \le m \le C_2\varepsilon^{-2}p$). See Appendix A for a discussion.
7. This regime does not arise for $s = 1$, since $e^{\Theta(p\varepsilon^{-1})} \ge \varepsilon^{-2}e^{\Theta(p)}$ for sufficiently small $\varepsilon$.
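For concreteness, here is the arithmetic behind the two branch evaluations in the intuition above (our own calculation, with all $\Theta$-constants suppressed): with $\varepsilon$ a constant, $p = \ln(1/\delta)$, $s = \Theta(\varepsilon^{-1}p)$, and $m = \Theta(\varepsilon^{-2}p)$, we have $\varepsilon\sqrt{s} = \Theta(\sqrt{p})$ and both logarithms $\ln(m\varepsilon/p)$ and $\ln(m\varepsilon^2/p)$ are $\Theta(1)$, so

$$\varepsilon\sqrt{s}\cdot\frac{\ln(m\varepsilon/p)}{p} = \Theta\!\Big(\frac{\sqrt{p}}{p}\Big) = \Theta\!\Big(\frac{1}{\sqrt{\ln(1/\delta)}}\Big),
\qquad
\varepsilon\sqrt{s}\cdot\sqrt{\frac{\ln(m\varepsilon^2/p)}{p}} = \Theta\!\Big(\frac{\sqrt{p}}{\sqrt{p}}\Big) = \Theta(1).$$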
First, we illustrate with real-world datasets the potential benefits of using small constants $s > 1$ for sparse JL on feature vectors. We specifically show that $s = 4, 8, 16$ consistently outperforms $s = 1$ in preserving the $\ell_2$ norm of each vector, and that there can be up to a factor-of-ten decrease in failure probability for $s = 8, 16$ in comparison to $s = 1$. Second, we use synthetic data to illustrate phase transitions and other trends in Theorem 1.5. More specifically, we empirically show that $v(m,\varepsilon,s)$ is not smooth, and that the middle regime(s) of $v(m,\varepsilon,s)$ increase with $s$.

1.1 Preliminaries

We call $\mathcal{A}_{s,m,n}$ a sparse JL distribution if the entries of a matrix $A \sim \mathcal{A}_{s,m,n}$ are generated as follows. Let $A_{r,i} = \sigma_{r,i}\eta_{r,i}/\sqrt{s}$, where $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ and $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are defined as follows:

- The families $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ and $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are independent from each other.
- The variables $\{\sigma_{r,i}\}_{r \in [m], i \in [n]}$ are i.i.d. Rademachers ($\pm 1$ coin flips).
- The variables $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are identically distributed Bernoullis ($\{0,1\}$ random variables) with expectation $s/m$.
- The $\{\eta_{r,i}\}_{r \in [m], i \in [n]}$ are independent across columns but not independent within each column. For every column $1 \le i \le n$, it holds that $\sum_{r=1}^m \eta_{r,i} = s$. Moreover, the random variables are negatively correlated: for every subset $S \subseteq [m]$ and every column $1 \le i \le n$, it holds that $\mathbb{E}\big[\prod_{r \in S} \eta_{r,i}\big] \le \prod_{r \in S} \mathbb{E}[\eta_{r,i}]$.

A common special case is a uniform sparse JL distribution, generated as follows: for every $1 \le i \le n$, we uniformly choose exactly $s$ of the variables in $\{\eta_{r,i}\}_{r \in [m]}$ to be 1. When $s = 1$, every sparse JL distribution is a uniform sparse JL distribution, but for $s > 1$, this is not the case.

Another common special case is a block sparse JL distribution. This produces a different construction for $s > 1$. In this distribution, each column $1 \le i \le n$ is partitioned into $s$ blocks of $\lceil m/s \rceil$ consecutive rows. In each block in each column, the distribution of the variables $\eta_{r,i}$ is defined by uniformly choosing exactly one of these variables to be 1.$^8$

1.2 Proof Techniques

We use the following notation. For any random variable $X$ and value $q \ge 1$, we call $\mathbb{E}|X|^q$ the $q$-th moment of $X$, where $\mathbb{E}$ denotes the expectation. We use $\|X\|_q$ to denote the $q$-norm $(\mathbb{E}|X|^q)^{1/q}$.

For every $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ such that $\|x\|_2 = 1$, we need to analyze tail bounds of an error term, which for the sparse JL construction is the following random variable:

$$\|Ax\|_2^2 - 1 = \frac{1}{s}\sum_{i \ne j}\sum_{r=1}^m \eta_{r,i}\eta_{r,j}\sigma_{r,i}\sigma_{r,j}\,x_i x_j =: R(x_1, \ldots, x_n).$$

An upper bound on the tail probability of $R(x_1, \ldots, x_n)$ is needed to prove the lower bound on $v(m,\varepsilon,s)$ in Theorem 1.5, and a lower bound is needed to prove the upper bound on $v(m,\varepsilon,s)$ in Theorem 1.5.

Footnote 8: Our lower bound in Theorem 1.5 applies to this distribution, though our upper bound does not. An interesting direction for future work would be to generalize the upper bound to this distribution.
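To make the objects above concrete, the following NumPy sketch (ours, not from the paper; sizes are arbitrary) samples a uniform sparse JL matrix as defined in Section 1.1 and evaluates the error term $R(x_1,\ldots,x_n) = \|Ax\|_2^2 - 1$ for a unit vector:

```python
import numpy as np

def uniform_sparse_jl(m, n, s, rng):
    """Uniform sparse JL draw: for each column, choose exactly s distinct rows
    uniformly at random (the eta_{r,i}) and fill them with independent
    Rademacher signs scaled by 1/sqrt(s) (the sigma_{r,i}/sqrt(s))."""
    A = np.zeros((m, n))
    for i in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, i] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

rng = np.random.default_rng(1)
m, n, s = 256, 5000, 8
A = uniform_sparse_jl(m, n, s, rng)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)              # unit vector, ||x||_2 = 1
R = np.dot(A @ x, A @ x) - 1.0      # error term R(x_1, ..., x_n)
print("R(x) =", R)
```

Repeating the draw many times gives an empirical view of the tails of $R(x_1, \ldots, x_n)$, which the moment bounds discussed next control analytically.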
It turns out that it suffices to tightly analyze the random variable moments $\mathbb{E}\big[(R(x_1, \ldots, x_n))^q\big]$. For the upper bound, we use Markov's inequality as in [13, 19, 3, 24], and for the lower bound, we use the Paley-Zygmund inequality as in [13]: Markov's inequality gives a tail upper bound from upper bounds on moments, and the Paley-Zygmund inequality gives a tail lower bound from upper and lower bounds on moments. Thus, the key ingredient of our analysis is a tight bound for $\|R(x_1, \ldots, x_n)\|_q$ on $S_v = \{x \in \mathbb{R}^n \mid \|x\|_\infty/\|x\|_2 \le v\}$ at each threshold value $v$. While the moments of $R(x_1, \ldots, x_n)$ have been studied in previous analyses of sparse JL, we emphasize that it is not clear how to adapt these existing approaches to obtain a tight bound on every $S_v$.

The moment bound that we require and obtain is far more general: the bounds in [19, 9] are limited to $\mathbb{R}^n = S_1$ and the bound in [13] is limited to $s = 1$.$^9$ The non-combinatorial approach in [9] for bounding $\|R(x_1, \ldots, x_n)\|_q$ on $\mathbb{R}^n = S_1$ also turns out to not be sufficiently precise on $S_v$, for reasons we discuss in Section 2.$^{10}$ Thus, we require new tools for our moment bound. Our analysis provides a new perspective, inspired by the probability theory literature, that differs from the existing approaches in the JL literature. We believe our style of analysis is less brittle than combinatorial approaches [13, 19, 3, 24]: in this setting, once the sparsity $s = 1$ case is recovered, it becomes straightforward to generalize to other $s$ values. Moreover, our approach can yield greater precision than the existing non-combinatorial approaches [9, 8, 14], which is necessary for this setting. Thus, we believe that our structural approach to analyzing JL distributions could be of use in other settings.

In Section 2, we present an overview of our methods and the key technical lemmas to analyze $\|R(x_1, \ldots, x_n)\|_q$. We defer the proofs to the Appendix. In Section 3, we prove the tail bounds in Theorem 1.5 from these moment bounds. In Section 4, we empirically evaluate sparse JL.

2 Sketch of Bounding the Moments of $R(x_1, \ldots, x_n)$

Our approach takes advantage of the structure of $R(x_1, \ldots, x_n)$ as a quadratic form of Rademachers (i.e. $\sum_{t_1, t_2} a_{t_1,t_2}\sigma_{t_1}\sigma_{t_2}$) with random variable coefficients (i.e. where $a_{t_1,t_2}$ is itself a random variable). For the upper bound, we need to analyze $\|R(x_1, \ldots, x_n)\|_q$ for general vectors $(x_1, \ldots, x_n)$. For the lower bound, we only need to show $\|R(x_1, \ldots, x_n)\|_q$ is large for a single vector in each $S_v$, and we show we can select the vector in the $\ell_2$-unit ball with $1/v^2$ nonzero entries, all equal to $v$. For ease of notation, we denote this vector by $(v, \ldots, v, 0, \ldots, 0)$ for the remainder of the paper.

We analyze $\|R(x_1, \ldots, x_n)\|_q$ using general moment bounds for Rademacher linear and quadratic forms. Though Cohen, Jayram, and Nelson [9] also view $R(x_1, \ldots, x_n)$ as a quadratic form, we show in the supplementary material that their approach of bounding the Rademachers by gaussians is not sufficiently precise for our setting.$^{11}$ In our approach, we make use of stronger moment bounds for Rademacher linear and quadratic forms, some of which are known to the probability theory community through Latała's work in [21, 20] and some of which are new adaptations tailored to the constraints arising in our setting. More specifically, the Latała bounds [21, 20] target the setting where the coefficients are scalars. In our setting, however, the coefficients are themselves random variables, and we need bounds that are tractable to analyze in this setting, which involves creating new bounds to handle some cases.

Our strategy for bounding $\|R(x_1, \ldots, x_n)\|_q$ is to break down into rows. We define $Z_r($