High-Dimensional Optimization in Adaptive Random Subspaces

Jonathan Lacotte, Department of Electrical Engineering, Stanford University
Mert Pilanci, Department of Electrical Engineering, Stanford University
Marco Pavone, Department of Aeronautics & Astronautics, Stanford University

Abstract

We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlated random matrix ensemble whose statistics mimic the input data. We prove that the improvement in the relative error of the solution can be tightly characterized in terms of the spectrum of the data matrix, and provide probabilistic upper bounds. We then illustrate the consequences of our theory with data matrices of different spectral decay. Extensive experimental results show that the proposed approach offers significant speed-ups in machine learning problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. Our analysis is based on convex analysis and Fenchel duality, and establishes connections to sketching and randomized matrix decomposition.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Random Fourier features, the Nyström method and sketching techniques have been successful in large-scale machine learning problems. The common practice is to employ oblivious sampling or sketching matrices, which are typically randomized and fixed ahead of time. However, it is not clear whether one can do better by adapting the sketching matrices to data. In this paper, we show that adaptive sketching matrices can significantly improve the approximation quality. We characterize the approximation error on the optimal solution in terms of the smoothness of the function and the spectral properties of the data matrix.

Many machine learning problems end up being high-dimensional optimization problems, which typically follow from forming the kernel matrix of a large dataset or mapping the data through a high-dimensional feature map, such as random Fourier features [20] or convolutional neural networks [13]. Such high-dimensional representations induce higher computational and memory complexities, and result in slower training of the models. Random projections are a classical way of performing dimensionality reduction, and are widely used in many algorithmic contexts [25]. Nevertheless, only recently have these methods captured great attention as an effective way of performing dimensionality reduction in convex optimization. In the context of solving a linear system $Ax = b$ and least-squares optimization, the authors of [14] propose a randomized iterative method with linear convergence rate which, at each iteration, performs a proximal update $x^{(k+1)} = \operatorname{argmin}_{x \in T_k} \|x - x^{(k)}\|_2^2$, where the next iterate $x^{(k+1)}$ is restricted to lie within an affine subspace $T_k = x^{(k)} + \operatorname{range}(A^\top S)$, and $S$ is an $n \times m$ dimension-reduction matrix with $m \le \min\{n, d\}$. In the context of kernel ridge regression, the authors of [31] propose to approximate the $n$-dimensional kernel matrix by sketching its columns to a lower $m$-dimensional subspace, chosen uniformly at random.
From the low-dimensional kernel ridge solution $\alpha^* \in \mathbb{R}^m$, they show how to reconstruct an approximation $\tilde x \in \mathbb{R}^n$ of the high-dimensional solution $x^* \in \mathbb{R}^n$. Provided that the sketching dimension $m$ is large enough as measured by the spectral properties of the kernel matrix $K$, the estimate $\tilde x$ retains some statistical properties of $x^*$, e.g., minimaxity. Similarly, in the broader context of classification through convex loss functions, the authors of [32, 33] propose to project the $d$-dimensional features of a given data matrix $A$ to a lower $m$-dimensional subspace, chosen independently of the data. After computing the optimal low-dimensional classifier $\alpha^* \in \mathbb{R}^m$, their algorithm returns an estimate $\tilde x \in \mathbb{R}^d$ of the optimal classifier $x^* \in \mathbb{R}^d$. Even though they provide formal guarantees on the estimation error $\|\tilde x - x^*\|_2$, their results rely on several restrictive assumptions, that is, the data matrix $A$ must be low-rank, or the classifier $x^*$ must lie in the span of the top few left singular vectors of $A$. Further, random subspace optimization has also been explored for large-scale trust region problems [27], also using a subspace chosen uniformly at random.

Our proposed approach draws connections with the Gaussian Kaczmarz method proposed in [14] and the kernel sketching method in [31]. Differently, we are interested in smooth, convex optimization problems with ridge regularization. In contrast to [32, 33], we do not make any assumption on the optimal solution $x^*$. Our work relates to the considerable amount of literature on randomized approximations of high-dimensional kernel matrices $K$. The typical approach consists of building a low-rank factorization of the matrix $K$, using a random subset of its columns [28, 23, 9, 17]. The so-called Nyström method has proven to be effective empirically [30], and many research efforts have been devoted to improving and analyzing the performance of its many variants (e.g., uniform column sub-sampling, leverage-score based sampling), especially in the context of kernel ridge regression [1, 2]. In a related vein, sketching methods have been proposed to reduce the space complexity of storing high-dimensional data matrices [19, 31], by projecting their rows to a randomly chosen lower-dimensional subspace. Our theoretical findings build on known results for low-rank factorization of positive semi-definite (p.s.d.) matrices [15, 5, 12, 29], and show intimate connections with kernel matrix sketching [31]. Lastly, our problem setting also draws connections with compressed sensing [7], where the goal is to recover a high-dimensional structured signal from a small number of randomized, and usually oblivious, measurements.

1.1 Contributions

In this work, we propose a novel randomized subspace optimization method with strong solution approximation guarantees which outperform those of oblivious sampling methods. We derive probabilistic bounds on the approximation error for general convex functions. We show that our method provides a significant improvement over the oblivious version, and we theoretically quantify this observation as a function of the spectral properties of the data matrix. We also introduce an iterative version of our method, which converges to the optimal solution by iterative refinement.

1.2 An overview of our results

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex and $\mu$-strongly smooth function, i.e., $\nabla^2 f(w) \preceq \mu I_n$ for all $w \in \mathbb{R}^n$, and let $A \in \mathbb{R}^{n \times d}$ be a high-dimensional matrix. We are interested in solving the primal problem
$$x^* = \operatorname*{argmin}_{x \in \mathbb{R}^d} \; f(Ax) + \frac{\lambda}{2}\,\|x\|_2^2. \qquad (1)$$
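To fix ideas, the primal problem (1) can be instantiated with the binary logistic loss that the paper later uses in its experiments (Section 2.1). The sketch below is an illustrative baseline only, not the authors' code: it assumes NumPy and SciPy are available, and the helper names `logistic_loss`, `logistic_grad` and `solve_primal` are ours.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss(w, y):
    # f(w) = (1/n) * sum_i [ y_i*log(1+e^{-w_i}) + (1-y_i)*log(1+e^{w_i}) ]
    return np.mean(y * np.logaddexp(0.0, -w) + (1.0 - y) * np.logaddexp(0.0, w))

def logistic_grad(w, y):
    # gradient of f with respect to w, i.e. (sigmoid(w) - y) / n
    return (1.0 / (1.0 + np.exp(-w)) - y) / w.shape[0]

def solve_primal(A, y, lam):
    # direct solver for (1): min_x f(Ax) + (lam/2) * ||x||^2
    d = A.shape[1]
    obj = lambda x: logistic_loss(A @ x, y) + 0.5 * lam * (x @ x)
    jac = lambda x: A.T @ logistic_grad(A @ x, y) + lam * x
    return minimize(obj, np.zeros(d), jac=jac, method="L-BFGS-B").x
```

Any other convex and smooth loss, such as the ReLU relaxation used later, can be swapped in through the same objective/gradient callbacks.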
Given a random matrix $S \in \mathbb{R}^{d \times m}$ with $m \ll d$, we consider instead the sketched primal problem
$$\alpha^* \in \operatorname*{argmin}_{\alpha \in \mathbb{R}^m} \; f(AS\alpha) + \frac{\lambda}{2}\,\alpha^\top S^\top S\,\alpha, \qquad (2)$$
where we effectively restrict the optimization domain to a lower $m$-dimensional subspace. In this work, we explore the following questions: How can we estimate the original solution $x^*$ given the sketched solution $\alpha^*$? Is a uniformly random subspace the optimal choice, e.g., $S$ Gaussian i.i.d.? Or, can we come up with an adaptive sampling distribution that is related to the matrix $A$, which yields stronger guarantees? By Fenchel duality analysis, we exhibit a natural candidate for an approximate solution to $x^*$, given by $\tilde x = -\lambda^{-1} A^\top \nabla f(AS\alpha^*)$. Our main result (Section 2) establishes that, for an adaptive sketching matrix of the form $S = A^\top \tilde S$ where $\tilde S$ is typically Gaussian i.i.d., the relative error satisfies a high-probability guarantee of the form $\|\tilde x - x^*\|_2 / \|x^*\|_2 \le \epsilon$, and that the adaptive sketch $S = A^\top \tilde S$ provides much stronger guarantees than oblivious sketching, where $S$ is independent of $A$. Then, we take advantage of the error contraction (i.e., the fact that the relative error $\|\tilde x - x^*\|_2 / \|x^*\|_2$ is strictly smaller than one) to introduce an iterative version of our method, which converges to the optimal solution by iterative refinement.

For a convex function $f$, we denote by $f^*$ its Fenchel conjugate, $f^*(z) := \sup_{w \in \mathbb{R}^n}\{z^\top w - f(w)\}$, which is convex and whose domain is $\operatorname{dom} f^* := \{z \in \mathbb{R}^n \mid f^*(z) < +\infty\}$.

Proposition 1 (Fenchel Duality on Original Program). Strong duality holds for the primal program
$$\min_{x \in \mathbb{R}^d} f(Ax) + \frac{\lambda}{2}\|x\|_2^2 \;=\; \max_{z \in \mathbb{R}^n} \; -f^*(z) - \frac{1}{2\lambda}\|A^\top z\|_2^2.$$
There exist a unique primal solution $x^*$ and a unique dual solution $z^*$. Further, we have $Ax^* \in \partial f^*(z^*)$, $z^* = \nabla f(Ax^*)$ and $x^* = -\lambda^{-1} A^\top z^*$.

Proposition 2 (Fenchel Duality on Sketched Program). Strong duality holds for the sketched program
$$\min_{\alpha \in \mathbb{R}^m} f(AS\alpha) + \frac{\lambda}{2}\|S\alpha\|_2^2 \;=\; \max_{y \in \mathbb{R}^n} \; -f^*(y) - \frac{1}{2\lambda}\|P_S A^\top y\|_2^2,$$
where $P_S = S (S^\top S)^{\dagger} S^\top$ is the orthogonal projector onto the range of $S$. There exist a sketched primal solution $\alpha^*$ and a unique sketched dual solution $y^*$. Further, for any solution $\alpha^*$, it holds that $AS\alpha^* \in \partial f^*(y^*)$ and $y^* = \nabla f(AS\alpha^*)$.

We define the following deterministic functional $Z_f$, which depends on $f$, the data matrix $A$ and the sketching matrix $S$, and plays an important role in controlling the approximation error,
$$Z_f \equiv Z_f(A, S) = \sup_{\Delta \in (\operatorname{dom} f^* - z^*)} \left( \frac{\Delta^\top A P_S^{\perp} A^\top \Delta}{\|\Delta\|_2^2} \right)^{\!\frac{1}{2}}, \qquad (3)$$
where $P_S^{\perp} = I - P_S$ is the orthogonal projector onto $\operatorname{range}(S)^{\perp}$. The relationship $x^* = -\lambda^{-1} A^\top \nabla f(Ax^*)$ suggests the point $\tilde x = -\lambda^{-1} A^\top \nabla f(AS\alpha^*)$ as a candidate for approximating $x^*$. The Fenchel dual programs of (1) and (2) only differ in their quadratic regularization terms, $\|A^\top z\|_2^2$ and $\|P_S A^\top y\|_2^2$, and this difference is tied to the quantity $\|P_S^{\perp} A^\top (z^* - y^*)\|_2$. As it holds that $\|\tilde x - x^*\|_2 = \lambda^{-1}\|A^\top (z^* - y^*)\|_2$, we show that the error $\|\tilde x - x^*\|_2$ can be controlled in terms of the spectral norm $\|P_S^{\perp} A^\top\|_2$, or, more sharply, in terms of the quantity $Z_f$, which satisfies $Z_f \le \|P_S^{\perp} A^\top\|_2$. We formalize this statement in our next result, whose proof is deferred to Appendix B.1.

Theorem 1 (Deterministic bound). Let $\alpha^*$ be any minimizer of the sketched program (2). Then, under the condition $\lambda \ge 2 Z_f^2$, we have
$$\|\tilde x - x^*\|_2 \le \sqrt{\tfrac{2}{\lambda}}\; Z_f \,\|x^*\|_2, \qquad (4)$$
which further implies
$$\|\tilde x - x^*\|_2 \le \sqrt{\tfrac{2}{\lambda}}\; \|P_S^{\perp} A^\top\|_2 \,\|x^*\|_2. \qquad (5)$$
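A minimal sketch of the recovery rule $\tilde x = -\lambda^{-1} A^\top \nabla f(AS\alpha^*)$, reusing the helpers from the previous snippet. It solves the sketched program (2) once with an adaptive sketch $S = A^\top \tilde S$ and once with an oblivious Gaussian sketch, and compares relative errors on a toy matrix with polynomially decaying spectrum; the matrix construction, problem sizes and inner solver are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from scipy.optimize import minimize

def solve_sketched(A, y, lam, S):
    # sketched program (2): min_alpha f(A S alpha) + (lam/2) * alpha^T S^T S alpha
    AS, StS = A @ S, S.T @ S
    obj = lambda a: logistic_loss(AS @ a, y) + 0.5 * lam * (a @ (StS @ a))
    jac = lambda a: AS.T @ logistic_grad(AS @ a, y) + lam * (StS @ a)
    alpha = minimize(obj, np.zeros(S.shape[1]), jac=jac, method="L-BFGS-B").x
    # dual-based recovery: x_tilde = -(1/lam) * A^T grad_f(A S alpha*)
    return -A.T @ logistic_grad(AS @ alpha, y) / lam

rng = np.random.default_rng(0)
n, d, m = 500, 1000, 64
U, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthonormal factors
V, _ = np.linalg.qr(rng.standard_normal((d, n)))
A = (U * (np.sqrt(n) / np.arange(1, n + 1))) @ V.T   # eigenvalues of AA^T decay like n * j^{-2}
y = rng.integers(0, 2, size=n).astype(float)
lam = 1e-4

x_star = solve_primal(A, y, lam)                      # d-dimensional baseline
S_adaptive = A.T @ rng.standard_normal((n, m))        # adaptive sketch S = A^T S_tilde
S_oblivious = rng.standard_normal((d, m))             # oblivious Gaussian sketch
for name, S in [("adaptive", S_adaptive), ("oblivious", S_oblivious)]:
    x_tilde = solve_sketched(A, y, lam, S)
    print(name, np.linalg.norm(x_tilde - x_star) / np.linalg.norm(x_star))
```

The only difference between the two runs is the distribution of $S$; everything else, including the recovery step, is identical.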
For an adaptive sketching matrix $S = A^\top \tilde S$, we rewrite $\|P_S^{\perp} A^\top\|_2^2 = \|K - K\tilde S (\tilde S^\top K \tilde S)^{\dagger} \tilde S^\top K\|_2$, where $K = AA^\top$ is p.s.d. Combining our deterministic bound (5) with known results [15, 12, 5] for randomized low-rank matrix factorizations of the form $K\tilde S (\tilde S^\top K \tilde S)^{\dagger} \tilde S^\top K$ of p.s.d. matrices $K$, we can give guarantees with high probability (w.h.p.) on the relative error for various types of matrices $\tilde S$. For conciseness, we specialize our next result to adaptive Gaussian sketching, i.e., $\tilde S$ Gaussian i.i.d. Given a target rank $k \ge 2$, we introduce a measure of the spectral tail of $A$ as $R_k(A) = \big(\sigma_{k+1}^2 + \frac{1}{k}\sum_{j=k+1}^{\rho} \sigma_j^2\big)^{1/2}$, where $\rho$ is the rank of the matrix $A$ and $\sigma_1 \ge \sigma_2 \ge \dots$ its singular values. The proof of the next result follows from a combination of Theorem 1 and Corollary 10.9 in [15], and is deferred to Appendix B.2.

Corollary 1 (High-probability bound). Given $k \le \min(n, d)/2$ and a sketching dimension $m = 2k$, let $S = A^\top \tilde S$, with $\tilde S \in \mathbb{R}^{n \times m}$ Gaussian i.i.d. Then, for some universal constant $c_0 \le 36$, provided $\lambda \ge 2 c_0^2 R_k^2(A)$, it holds with probability at least $1 - 12 e^{-k}$ that
$$\|\tilde x - x^*\|_2 \le c_0 \sqrt{\tfrac{2}{\lambda}}\; R_k(A) \,\|x^*\|_2. \qquad (6)$$

Remark 1. The quantity $Z_f := \sup_{\Delta \in (\operatorname{dom} f^* - z^*)} \big(\frac{\Delta^\top A P_S^{\perp} A^\top \Delta}{\|\Delta\|_2^2}\big)^{1/2}$ is the largest singular value of the matrix $P_S^{\perp} A^\top$ restricted to the spherical cap $\mathcal{K} := (\operatorname{dom} f^* - z^*) \cap \mathcal{S}^{n-1}$, where $\mathcal{S}^{n-1}$ is the unit sphere in dimension $n$. Thus, depending on the geometry of $\mathcal{K}$, the deterministic bound (4) might be much tighter than (5), and yield a probabilistic bound better than (6). The investigation of such a result is left for future work.

2.1 Theoretical predictions as a function of spectral decay

We study the theoretical predictions given by (5) on the relative error, for different spectral decays of $A$ and different sketching methods, in particular adaptive Gaussian sketching versus oblivious Gaussian sketching and leverage score column sub-sampling [12]. We denote $\lambda_k = \sigma_k^2$ the eigenvalues of $AA^\top$. For conciseness, we absorb $\lambda$ into the eigenvalues by setting $\lambda_k \leftarrow \lambda_k / \lambda$ and $\lambda \leftarrow 1$. This re-scaling leaves the right-hand side of the bound (5) unchanged, and does not affect the analysis below. Then, we assume that $\lambda_1 = O(1)$, $\lambda \in (\lambda_\rho, \lambda_1)$, and $\lambda \to 0$ as $n \to +\infty$. These assumptions are standard in empirical risk minimization and kernel regression methods [11], which we focus on in Sections 4 and 5. We consider three decaying schemes of practical interest. The matrix $A$ has either a finite rank $\rho$, a $\beta$-exponential decay where $\lambda_j \propto e^{-\beta j}$ and $\beta > 0$, or a $\gamma$-polynomial decay where $\lambda_j \propto j^{-2\gamma}$ and $\gamma > 1/2$. Among other examples, these decays are characteristic of various standard kernel functions, such as the polynomial, Gaussian and first-order Sobolev kernels [3].

Given a precision $\epsilon > 0$ and a confidence level $\delta \in (0, 1)$, we denote by $m_A$ (resp. $m_O$, $m_S$) a sufficient dimension for which adaptive (resp. oblivious, leverage score) sketching yields the following $(\epsilon, \delta)$-guarantee on the relative error. That is, with probability at least $1 - \delta$, it holds that $\|\tilde x - x^*\|_2 / \|x^*\|_2 \le \epsilon$. We determine $m_A$ from our probabilistic regret bound (6). For $m_S$, using our deterministic regret bound (5), it then suffices to bound the spectral norm $\|P_{A^\top \tilde S}^{\perp} A^\top\|_2$ in terms of the eigenvalues $\lambda_k$, when $\tilde S$ is a leverage score column sub-sampling matrix. To the best of our knowledge, the tightest bound has been given by [12] (see Lemma 5). For $m_O$, we leverage results from [32]. The authors provide an upper bound on the relative error $\|\tilde x - x^*\|_2 / \|x^*\|_2$ when $S$ is Gaussian i.i.d. with variance $\frac{1}{d}$. It should be noted that their sketched solution is slightly different from ours: they solve $\alpha^* = \operatorname{argmin}_\alpha f(AS\alpha) + \frac{\lambda}{2}\|\alpha\|_2^2$, whereas we do include the matrix $S$ in the regularization term. One might wonder which regularizer works best when $S$ is Gaussian i.i.d. Through extensive numerical simulations, we observed very similar performance. Further, standard Gaussian concentration results yield that $\|S\alpha\|_2^2 \approx \|\alpha\|_2^2$. Our theoretical findings are summarized in Table 1, and we give the mathematical details of our derivations in Appendix D. For the sake of clarity, we provide in Table 1 lower bounds on the predicted values $m_O$ and $m_S$, and, thus, lower bounds on the ratios $m_O/m_A$ and $m_S/m_A$. Overall, adaptive Gaussian sketching provides stronger guarantees on the relative error $\|\tilde x - x^*\|_2 / \|x^*\|_2$.

Table 1: Sketching dimensions for an $(\epsilon, \delta)$-guarantee on the relative error $\|\tilde x - x^*\|_2 / \|x^*\|_2$.

| | $\rho$-rank matrix ($\rho \ll n \wedge d$) | $\beta$-exponential decay ($\beta > 0$) | $\gamma$-polynomial decay ($\gamma > 1/2$) |
|---|---|---|---|
| Adaptive Gaussian ($m_A$) | $\rho + 1 + \log\frac{12}{\delta}$ | $\beta^{-1}\log\frac{1}{\epsilon} + \log\frac{12}{\delta}$ | $\left(\frac{1}{\epsilon}\right)^{1/(2\gamma)} + \log\frac{12}{\delta}$ |
| Oblivious Gaussian ($m_O$) | $(\rho + 1)\,\epsilon^{-2}\log\frac{2}{\delta}$ | $\beta^{-1}\epsilon^{-2}\log\frac{1}{\epsilon}\,\log\frac{2d}{\delta}$ | $\epsilon^{-2}\left(\frac{1}{\epsilon}\right)^{1/(2\gamma)}\log\frac{2d}{\delta}$ |
| Leverage score ($m_S$) | $(\rho + 1)\log\frac{4\rho}{\delta}$ | $\beta^{-1}\log\frac{1}{\epsilon}\,\log\!\big(\frac{1}{\delta}\beta^{-1}\log\frac{1}{\epsilon}\big)$ | $\left(\frac{1}{\epsilon}\right)^{1/(2\gamma) + \frac{2\gamma}{2\gamma-1}}\log\frac{1}{\delta}$ |
| Lower bound on $m_O/m_A$ | $\epsilon^{-2}\log\frac{2}{\delta}$ | $\epsilon^{-2+h}\log\frac{2d}{\delta}$, $h \to 0$ | $\epsilon^{-2}\log\frac{2d}{\delta}$ |
| Lower bound on $m_S/m_A$ | $\log\rho$ | $\min\!\big(\log\frac{1}{\delta},\, \beta^{-1}\log\frac{1}{\epsilon}\big)$ | $\left(\frac{1}{\epsilon}\right)^{1 + \frac{1}{2\gamma-1}}$ |
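The quantities in Table 1 ultimately come from evaluating the bound of Corollary 1 for a given spectrum. The short sketch below, again an illustration rather than the paper's code, computes $R_k(A)$ and the resulting bound (up to the universal constant $c_0$) for the two decay profiles used in the experiments described next.

```python
import numpy as np

def R_k(sigma, k):
    # spectral tail measure: R_k(A) = ( sigma_{k+1}^2 + (1/k) * sum_{j>=k+1} sigma_j^2 )^(1/2)
    s2 = np.sort(np.asarray(sigma))[::-1] ** 2
    return np.sqrt(s2[k] + s2[k:].sum() / k)

n, lam = 1000, 1e-4
j = np.arange(1, n + 1)
spectra = {
    "exponential": n * np.exp(-0.1 * j),  # lambda_j ~ n * e^{-0.1 j}  (A_exp)
    "polynomial":  n * j ** -2.0,         # lambda_j ~ n * j^{-2}      (A_poly)
}
for name, eigvals in spectra.items():
    sigma = np.sqrt(eigvals)              # singular values of A
    for m in (64, 256, 512):              # sketching dimension m = 2k
        k = m // 2
        bound = np.sqrt(2.0 / lam) * R_k(sigma, k)   # Corollary 1 bound, without c_0
        print(f"{name}: m={m}, predicted relative error <= c0 * {bound:.3e}")
```

Plotting this bound against $m = 2k$ reproduces the qualitative behaviour expected in Figure 1: roughly $m^{-1}$ for the polynomial decay and geometric in $m$ for the exponential decay.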
We illustrate numerically our predictions for adaptive Gaussian sketching versus oblivious Gaussian sketching. With $n = 1000$ and $d = 2000$, we generate matrices $A_{\mathrm{exp}}$ and $A_{\mathrm{poly}}$, with spectral decays satisfying respectively $\lambda_j \propto n e^{-0.1 j}$ and $\lambda_j \propto n j^{-2}$. First, we perform binary logistic regression, with $f(Ax) = n^{-1}\sum_{i=1}^n \psi_{y_i}(a_i^\top x)$, where $\psi_{y_i}(z) = y_i \log(1 + e^{-z}) + (1 - y_i)\log(1 + e^{z})$, $y \in \{0, 1\}^n$ and $a_i$ is the $i$-th row of $A$. For the polynomial (resp. exponential) decay, we expect the relative error $\|\tilde x - x^*\|_2 / \|x^*\|_2$ to follow w.h.p. a decay proportional to $m^{-1}$ (resp. $e^{-0.05 m}$). Figure 1 confirms those predictions. We repeat the same experiments with a second loss function, $f(Ax) = (2n)^{-1}\sum_{i=1}^n (a_i^\top x)_+^2 - 2 (a_i^\top x)\, y_i$. The latter is a convex relaxation of the penalty $\frac{1}{2}\|(Ax)_+ - y\|_2^2$ for fitting a shallow neural network with a ReLU non-linearity. Again, Figure 1 confirms our predictions, and we observe that the adaptive method performs much better than the oblivious sketch.

Figure 1 (panels: (a) ReLU, polynomial decay; (b) ReLU, exponential decay; (c) logistic, polynomial decay; (d) logistic, exponential decay): Relative error versus sketching dimension $m \in \{2^k \mid 3 \le k \le 10\}$ of adaptive Gaussian sketching (red) and oblivious Gaussian sketching (green), for the ReLU and logistic models, and the exponential and polynomial decays. We use $\lambda = 10^{-4}$ for all simulations. Results are averaged over 10 trials. Bar plots show (twice) the empirical standard deviations.

3 Algorithms for adaptive subspace sketching

3.1 Numerical conditioning and generic algorithm

A standard quantity characterizing how efficiently a convex program can be solved is its condition number [6], which, for the primal program (1) and the (adaptive) sketched program (2), is given respectively by
$$\kappa = \frac{\lambda + \sup_x \sigma_1\!\big(A^\top \nabla^2 f(Ax)\, A\big)}{\lambda + \inf_x \sigma_d\!\big(A^\top \nabla^2 f(Ax)\, A\big)}, \qquad \kappa_S = \frac{\sup_\alpha \sigma_1\!\big(\tilde S^\top A\, (\lambda I + A^\top \nabla^2 f(A A^\top \tilde S \alpha)\, A)\, A^\top \tilde S\big)}{\inf_\alpha \sigma_m\!\big(\tilde S^\top A\, (\lambda I + A^\top \nabla^2 f(A A^\top \tilde S \alpha)\, A)\, A^\top \tilde S\big)}.$$
The latter can be significantly larger than $\kappa$, up to $\kappa_S \approx \kappa \,\sigma_1(\tilde S^\top A A^\top \tilde S)\,/\,\sigma_m(\tilde S^\top A A^\top \tilde S)$. A simple change of variables overcomes this issue. With $A_{S,R} = AS (S^\top S)^{-\frac{1}{2}}$, we solve instead the optimization problem
$$\beta^* = \operatorname*{argmin}_{\beta \in \mathbb{R}^m} \; f(A_{S,R}\,\beta) + \frac{\lambda}{2}\|\beta\|_2^2. \qquad (7)$$
It holds that $\tilde x = -\lambda^{-1} A^\top \nabla f(A_{S,R}\,\beta^*)$. The additional complexity induced by this change of variables comes from computing the (square-root) pseudo-inverse of $S^\top S$, which requires $O(m^3)$ flops via a singular value decomposition. When $m$ is small, this additional computation is negligible and numerically stable, and the re-scaled sketched program (7) is actually better conditioned than the original primal program (1), as stated in the next result, which we prove in Appendix C.3.

Proposition 3. Under adaptive sketching, the condition number of the re-scaled sketched program (7) is at most $\kappa$, with probability 1.

Algorithm 1: Generic algorithm for adaptive sketching.
Input: data matrix $A \in \mathbb{R}^{n \times d}$, random matrix $\tilde S \in \mathbb{R}^{n \times m}$ and parameter $\lambda > 0$.
1. Compute the sketching matrix $S = A^\top \tilde S$ and the sketched matrix $A_S = AS$.
2. Compute the re-scaling matrix $R = (S^\top S)^{-\frac{1}{2}}$ and the re-scaled sketched matrix $A_{S,R} = A_S R$.
3. Solve the convex optimization problem (7), and return $\tilde x = -\lambda^{-1} A^\top \nabla f(A_{S,R}\,\beta^*)$.

We observed a drastic practical performance improvement between solving the sketched program as formulated in (2) and its well-conditioned version (7). If the chosen sketch dimension $m$ is itself prohibitively large for computing the matrix $(S^\top S)^{-\frac{1}{2}}$, one…
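A compact sketch of Algorithm 1 with the re-scaling of (7), reusing `logistic_loss` and `logistic_grad` from the earlier snippets; the choice of L-BFGS for step 3 and of `sqrtm`/`pinvh` for the re-scaling matrix are implementation assumptions, not prescriptions of the paper.

```python
import numpy as np
from scipy.linalg import sqrtm, pinvh
from scipy.optimize import minimize

def adaptive_sketch_solve(A, y, lam, m, rng):
    """Algorithm 1 (adaptive sketching) combined with the re-scaled program (7)."""
    n, d = A.shape
    S_tilde = rng.standard_normal((n, m))
    S = A.T @ S_tilde                      # step 1: sketching matrix S = A^T S_tilde
    A_S = A @ S                            # step 1: sketched matrix A_S = A S
    R = np.real(sqrtm(pinvh(S.T @ S)))     # step 2: re-scaling matrix R = (S^T S)^{-1/2}
    A_SR = A_S @ R                         # step 2: re-scaled sketched matrix A_{S,R}
    obj = lambda b: logistic_loss(A_SR @ b, y) + 0.5 * lam * (b @ b)
    jac = lambda b: A_SR.T @ logistic_grad(A_SR @ b, y) + lam * b
    beta = minimize(obj, np.zeros(m), jac=jac, method="L-BFGS-B").x   # step 3: solve (7)
    # step 3: return x_tilde = -(1/lam) * A^T grad_f(A_{S,R} beta*)
    return -A.T @ logistic_grad(A_SR @ beta, y) / lam
```

Compared with solving (2) directly, the only change is the multiplication by $R = (S^\top S)^{-1/2}$, which is precisely what Proposition 3 relies on; forming $R$ costs $O(m^3)$, which is negligible when $m$ is small.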
