On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

Lili Su (CSAIL, MIT)    Pengkun Yang (Department of Electrical Engineering, Princeton University)

Abstract

We consider training over-parameterized two-layer neural networks with Rectified Linear Units (ReLU) using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of the network prediction errors across GD iterations, which can be neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate an integral operator which is determined by the feature vector distribution $\mu$ only. Consequently, the GD method can be viewed as approximately applying the powers of this integral operator to the underlying function $f$ that generates the responses. We show that if $f$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate which is determined by $f$ and $\mu$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f$ has zero low-rank approximation error, then, as long as the width of the neural network is $\Omega(n\log n)$, the empirical risk decreases to $\Theta(1/\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where $\mu$ is the uniform distribution on the sphere and $f$ is a polynomial. Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Neural networks have been successfully applied in many real-world machine learning applications. However, a thorough understanding of the theory behind their practical success, even for two-layer neural networks, is still lacking. For example, although learning optimal neural networks is provably NP-complete [BG17, BR89], in practice even the neural networks found by simple first-order methods perform well [KSH12]. Additionally, in sharp contrast to traditional learning theory, over-parameterized neural networks (networks with more parameters than the size of the training dataset) are observed to enjoy smaller training and even smaller generalization errors [ZBH+16].

In this paper, we focus on training over-parameterized two-layer neural networks with Rectified Linear Units (ReLU) using the gradient descent (GD) method. Our results can be extended to other activation functions that satisfy certain regularity conditions; see [GMMM19, Theorem 2] for an example. The techniques derived and insights obtained in this paper might apply to deep neural networks as well, for which a similar matrix representation exists [DZPS18].

Significant progress has been made in understanding the role of over-parameterization in training neural networks with first-order methods [AZLL18, DZPS18, ADH+19, OS19, MMN18, LL18, ZCZG18, DLL+18, AZLS18, CG19]: with proper random network initialization, (stochastic) GD converges to a (nearly) global minimum provided that the width m of the network is polynomially large in the size n of the training dataset. However, neural networks seem to interpolate the training data as soon as the number of parameters exceeds the size of the training dataset by a constant factor [ZBH+16, OS19]. To the best of our knowledge, a provable justification of why such mild over-parameterization is sufficient for successful gradient-based training is still lacking.

Moreover, the convergence rates derived in much of the existing work approach 0 as $n \to \infty$; see Section A in the Supplementary Material for details. In many applications the volumes of the datasets are huge: the ImageNet dataset [DDS+09] has 14 million images. For such applications, a non-diminishing (i.e., constant w.r.t. n) convergence rate is more desirable. In this paper, our goal is to characterize a constant (w.r.t. n) convergence rate while improving the sufficiency guarantee on the network over-parameterization. Throughout this paper, we focus on the setting where the dimension d of the feature vectors is fixed, leaving the high-dimensional regime as one future direction.

Inspired by a recent line of work [DZPS18, ADH+19], we focus on characterizing the evolution of the neural network prediction errors under the GD method. This focus is motivated by the fact that the neural network representation/approximation of a given function might not be unique [KB18], and it is also validated by experimental neuroscience [MG06, ASCC18].

Contributions. It turns out that the evolution of the network prediction error can be neatly described in a matrix form. When the network is sufficiently over-parameterized, the matrices involved individually approximate an integral operator which is determined by the feature vector distribution $\mu$ only. Consequently, the GD method can be viewed as approximately applying the powers of this integral operator to the underlying/target function f that generates the responses/labels. The advantages of taking such a functional approximation perspective are three-fold:

- We show in Theorem 2 and Corollary 1 that the existing rate characterizations in the influential line of work [DZPS18, ADH+19, DLL+18] approach zero (i.e., $\to 0$) as $n \to \infty$. This is because the spectra of these matrices, as n diverges, concentrate on the spectrum of the integral operator, whose eigenvalues have zero as their unique limit point.
- We show in Theorem 4 that the training convergence rate is determined by how f can be decomposed into the eigenspaces of an integral operator. This observation is also validated by a couple of empirical observations: (1) the spectrum of the MNIST data concentrates on the first few eigenspaces [LBB+98]; and (2) training is slowed down if labels are partially corrupted [ZBH+16, ADH+19].
- We show in Corollary 2 that if f can be decomposed into a finite number of eigenspaces of the integral operator, then $m = \Omega(n\log n)$ is sufficient for the training error to converge to $\Theta(1/\sqrt{n})$ with a constant convergence rate. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization.

Notations. For any $n, m \in \mathbb{N}$, let $[n] := \{1, \dots, n\}$ and $[m] := \{1, \dots, m\}$. For any $d \in \mathbb{N}$, denote the unit sphere as $\mathbb{S}^{d-1} := \{x \in \mathbb{R}^d : \|x\| = 1\}$.

[...]

... see [AZLL18, Lemma 5.4] and [ADH+19, Lemma C.2] for details. The sparsity in sign changes suggests that both $\widetilde{H}^+(t) - H^+(t)$ and $\widetilde{H}^-(t) - H^-(t)$ are approximately PSD.

Theorem 1. For any iteration $t \ge 0$ and any stepsize $\eta > 0$, it is true that

  $\big(I - \eta(\widetilde{H}^+(t+1) + H^-(t+1))\big)\,(y - \hat{y}(t)) \;\le\; y - \hat{y}(t+1) \;\le\; \big(I - \eta(H^+(t+1) + \widetilde{H}^-(t+1))\big)\,(y - \hat{y}(t)),$

where the inequalities are entry-wise.

Theorem 1 says that when the sign changes are sparse, the dynamics of $(y - \hat{y}(t))$ are governed by a sequence of PSD matrices. A similar observation is made in [DZPS18, ADH+19].
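As a quick illustration of this matrix-form view of the prediction-error dynamics, the following is a minimal numerical sketch (not the paper's code): a two-layer ReLU network $\frac{1}{\sqrt{m}}\sum_{j=1}^m a_j[\langle w_j, x\rangle]_+$ trained by full-batch GD, tracking $\|y - \hat{y}(t)\|$. The empirical-risk normalization $\frac{1}{2n}\|y - \hat{y}\|^2$, the stepsize, the linear target, and the symmetrized initialization (one way to realize $\hat{y}(0) = 0$) are assumptions, since the exact setup of Section 2 is omitted from this preview.

```python
# Minimal sketch: full-batch GD on a two-layer ReLU net, tracking y - y_hat(t).
# Assumed model: f_{W,a}(x) = (1/sqrt(m)) * sum_j a_j * relu(<w_j, x>), a_j fixed.
# Assumed risk: (1/(2n)) * ||y_hat - y||^2 (the paper's exact normalization may differ).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, eta, T = 200, 10, 2000, 1.0, 301

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # features on the unit sphere
y = X[:, 0]                                              # a simple linear target (an assumption)

W = rng.standard_normal((m // 2, d))
W = np.vstack([W, W.copy()])                             # paired first-layer weights ...
a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])  # ... with opposite signs => y_hat(0) = 0

def predict(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

for t in range(T):
    err = predict(W) - y                                 # y_hat(t) - y
    if t % 50 == 0:
        print(f"t={t:4d}  ||y - y_hat(t)|| = {np.linalg.norm(err):.4f}")
    active = (X @ W.T > 0).astype(float)                 # activation patterns, shape (n, m)
    # Gradient of the assumed risk w.r.t. w_j: (a_j/(n*sqrt(m))) * sum_i err_i * 1{<w_j, x_i> > 0} * x_i
    grad = (active * err[:, None]).T @ X * (a[:, None] / (n * np.sqrt(m)))
    W -= eta * grad
```

Because the activation patterns change only sparsely along the trajectory, the printed error decays essentially geometrically before flattening out, which is the behaviour Theorem 1 captures through the sandwich of (approximately) PSD matrices.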

3 Main Results

We first show (in Section 3.1) that the existing convergence rates that are derived based on minimum eigenvalues approach 0 as the sample size n grows. Then, towards a non-diminishing convergence rate, we characterize (in Section 3.2) how the target function f affects the convergence rate.

3.1 Convergence rates based on minimum eigenvalues

Let $H := H^+(1) + H^-(1)$. It has been shown in [DZPS18] that when the neural network is sufficiently over-parameterized, i.e., $m = \Omega(n^6)$, the convergence of $\|y - \hat{y}(t)\|$ and the associated convergence rate can, with high probability, be upper bounded as³

  $\|y - \hat{y}(t)\| \le (1 - \eta\lambda_{\min}(H))^t \,\|y - \hat{y}(0)\| = \exp\Big(-t\log\frac{1}{1-\eta\lambda_{\min}(H)}\Big)\|y\|,$   (8)

where $\lambda_{\min}(H)$ is the smallest eigenvalue of H. The equality in (8) holds because $\hat{y}(0) = 0$. In this paper, we refer to $\log\frac{1}{1-\eta\lambda_{\min}(H)}$ as the convergence rate. The convergence rate here is quite appealing at first glance, as it is independent of the target function f: essentially, (8) says that no matter how the training data is generated, via GD we can always find an over-parameterized neural network that perfectly fits/memorizes all the training data tuples exponentially fast! Though the spectrum of the random matrix H can be proved to concentrate as n grows, we observe that $\lambda_{\min}(H)$ converges to 0 as n diverges, as formally shown in Theorem 2.

Theorem 2. For any data distribution $\mu$, there exists a sequence of non-negative real numbers $\lambda_1 \ge \lambda_2 \ge \cdots$ (independent of n) satisfying $\lim_{i\to\infty}\lambda_i = 0$ such that, with probability $1-\delta$,

  $\sup_i |\lambda_i - \widetilde{\lambda}_i| \le \sqrt{\frac{\log(4n^2/\delta)}{m}} + \sqrt{\frac{8\log(4/\delta)}{n}},$   (9)

where $\widetilde{\lambda}_1 \ge \cdots \ge \widetilde{\lambda}_n$ is the spectrum of H. In addition, if $m = \omega(\log n)$, we have

  $\lambda_{\min}(H) \xrightarrow{\;P\;} 0 \quad \text{as } n \to \infty,$   (10)

where $\xrightarrow{\;P\;}$ denotes convergence in probability.

A numerical illustration of the decay of $\lambda_{\min}(H)$ in n is presented in Fig. 1a. Theorem 2 is proved in Appendix D. By Theorem 2, the convergence rate in (8) approaches zero as $n \to \infty$.

Corollary 1. For any $\eta = O(1)$, it is true that $\log\frac{1}{1-\eta\lambda_{\min}(H)} \to 0$ as $n \to \infty$.

³ Though a refined analysis of that in [DZPS18] is given by [ADH+19, Theorem 4.1], the analysis crucially relies on the convergence rate in (8).

Figure 1: The spectra of H, K, and $L_K$ when $\mu$ is the uniform distribution over $\mathbb{S}^{d-1}$. (a) The minimum eigenvalue of one realization of H under different n and d, with network width $m = 2n$. (b) The spectrum of K with $d = 10$ and $n = 500$ concentrates around that of $L_K$.
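In the spirit of Fig. 1a, the following sketch computes $\lambda_{\min}$ of one realization of a Gram matrix at initialization for growing n with $m = 2n$. The specific matrix below, $H_{ii'} = \frac{1}{n}\langle x_i, x_{i'}\rangle \cdot \frac{1}{m}\sum_j \mathbf{1}\{\langle w_j, x_i\rangle > 0\}\mathbf{1}\{\langle w_j, x_{i'}\rangle > 0\}$, is the standard Gram matrix used in this line of work (cf. [DZPS18]); it is an assumption here, since the paper's exact $H^+/H^-$ decomposition is defined in the omitted Section 2. Its expectation over the initialization is the matrix $K = \mathbb{E}[H]$ discussed below.

```python
# Sketch in the spirit of Fig. 1(a): lambda_min of a Gram matrix at initialization,
# for growing n and network width m = 2n. The matrix form is an assumption (cf. DZPS18).
import numpy as np

rng = np.random.default_rng(1)
d = 10
for n in (100, 200, 400, 800):
    m = 2 * n
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # features on the unit sphere
    W = rng.standard_normal((m, d))                   # random first-layer weights
    A = (X @ W.T > 0).astype(float)                   # (n, m) activation indicators
    H = (X @ X.T) * (A @ A.T / m) / n                 # entrywise product of Gram and co-activation
    print(f"n={n:4d}  lambda_min(H) = {np.linalg.eigvalsh(H)[0]:.2e}")
```

The printed minimum eigenvalue shrinks as n grows, which is exactly the effect that drives the vanishing convergence rate in Corollary 1.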

In Corollary 1, we restrict our attention to $\eta = O(1)$. This is because the general analysis of GD [Nes18] adopted by [ADH+19, DZPS18] requires that $1 - \eta\lambda_{\max}(H) > 0$, and, by the spectrum concentration given in Theorem 2, the largest eigenvalue of H concentrates on some strictly positive value as n diverges, i.e., $\lambda_{\max}(H) = \Theta(1)$. Thus, if $\eta = \omega(1)$, then $1 - \eta\lambda_{\max}(H) < 0$ for any sufficiently large n, violating the condition assumed in [ADH+19, DZPS18].

Theorem 2 essentially follows from two observations. Let $K = \mathbb{E}[H]$, where the expectation is taken with respect to the randomness in the network initialization. It is easy to see, by a standard concentration argument, that for a given dataset the spectra of K and H are close with high probability. In addition, the spectrum of K, as n increases, concentrates on the spectrum of the following integral operator $L_K$ on $L^2(\mathbb{S}^{d-1}, \mu)$,

  $(L_K f)(x) := \int_{\mathbb{S}^{d-1}} K(x,s)\, f(s)\, d\mu(s),$   (11)

with the kernel function

  $K(x,s) := \frac{\langle x, s\rangle}{2\pi}\big(\pi - \arccos\langle x, s\rangle\big), \qquad \forall\, x, s \in \mathbb{S}^{d-1},$   (12)

which is bounded over $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$. In fact, $\lambda_1 \ge \lambda_2 \ge \cdots$ in Theorem 2 are the eigenvalues of $L_K$. As $\sup_{x,s\in\mathbb{S}^{d-1}} K(x,s) \le \frac{1}{2}$, it is true that $\lambda_i \le 1$ for all $i \ge 1$. Notably, by definition, $K_{ii'} = \mathbb{E}[H_{ii'}] = \frac{1}{n}K(x_i, x_{i'})$ is the empirical kernel matrix on the feature vectors of the given dataset $\{(x_i, y_i) : 1 \le i \le n\}$. A numerical illustration of the spectrum concentration of K is given in Fig. 1b; see also [XLS17].

Though a generalization bound is given in [ADH+19, Theorem 5.1 and Corollary 5.2], it is unclear how this bound scales in n. In fact, if we do not care about the structure of the target function f and allow $y/\sqrt{n}$ to be arbitrary, this generalization bound might not decrease to zero as $n \to \infty$. A detailed argument and a numerical illustration can be found in Appendix B.
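The relation $K = \mathbb{E}[H]$ and the fast spectral decay of K are easy to probe numerically. The sketch below builds the empirical kernel matrix $K_{ii'} = \frac{1}{n}K(x_i,x_{i'})$ from the closed-form kernel (12) and compares its spectrum with that of one realization of the Gram matrix H (same assumed form as in the previous sketch), in the spirit of Fig. 1b.

```python
# Sketch in the spirit of Fig. 1(b): the spectrum of the empirical kernel matrix K
# built from Eq. (12), compared with one realization of the Gram matrix H (K = E[H]).
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 500, 10, 1000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

G = np.clip(X @ X.T, -1.0, 1.0)                        # inner products <x_i, x_i'>
K = G * (np.pi - np.arccos(G)) / (2 * np.pi) / n       # Eq. (12), scaled by 1/n

W = rng.standard_normal((m, d))
A = (X @ W.T > 0).astype(float)
H = G * (A @ A.T / m) / n                              # one realization; E[H] = K

eig_K = np.sort(np.linalg.eigvalsh(K))[::-1]
eig_H = np.sort(np.linalg.eigvalsh(H))[::-1]
for i in (0, 1, 5, 10, 50, 100):
    print(f"i={i:3d}  eig_K={eig_K[i]:.3e}  eig_H={eig_H[i]:.3e}")
```

The two spectra agree closely for the leading eigenvalues and both decay rapidly, which is the concentration picture behind Theorem 2.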

3.2 Constant convergence rates

Recall that f denotes the underlying function that generates the output labels/responses (i.e., the y's) given the input features (i.e., the x's). For example, f could be a constant function or a linear function. Clearly, the difficulty of learning f by training neural networks should crucially depend on the properties of f itself. We observe that the training convergence rate might be determined by how f can be decomposed into the eigenspaces of the integral operator defined in (11). This observation is also validated by a couple of existing empirical observations: (1) the spectrum of the MNIST data [LBB+98] concentrates on the first few eigenspaces; and (2) training is slowed down if labels are partially corrupted [ZBH+16, ADH+19]. Compared with [ADH+19], we use spectral projection concentration to show how the random eigenvalues and the random projections in [ADH+19, Eq. (8) in Theorem 4.1] are controlled by f and $\mu$. We first present a sufficient condition for the convergence of $\|y - \hat{y}(t)\|$.

Theorem 3 (Sufficiency). Let $0 < \eta \le 1$, and suppose that for some $c_0 \in (0,1)$, $c_1 \ge 0$, and $T$,

  $\Big\|\frac{1}{\sqrt{n}}(I - \eta K)^t y\Big\| \le (1 - c_0)^t + c_1, \qquad \forall\, t \le T.$   (13)

Then, for any $\delta > 0$, if

  $m \ge \frac{32}{c_1^2}\left[\Big(\frac{1}{c_0} + 2Tc_1\Big)^4 + 4\log\frac{4n}{\delta}\Big(\frac{1}{c_0} + 2Tc_1\Big)^2\right],$   (14)

then with probability at least $1-\delta$, the following holds for all $t \le T$:

  $\Big\|\frac{1}{\sqrt{n}}\big(y - \hat{y}(t)\big)\Big\| \le (1 - c_0)^t + 2c_1.$   (15)

Theorem 3 is proved in Appendix E. Theorem 3 says that if $\|\frac{1}{\sqrt{n}}(I - \eta K)^t y\|$ converges to $c_1$ exponentially fast, then $\|\frac{1}{\sqrt{n}}(y - \hat{y}(t))\|$ converges to $2c_1$ with the same convergence rate guarantee, provided that the neural network is sufficiently parameterized. Recall that $y_i \in [-1, 1]$ for each $i \in [n]$. Roughly speaking, in our setup $y_i = \Theta(1)$ and $\|y\| = \sqrt{\sum_{i=1}^n y_i^2} = \Theta(\sqrt{n})$; thus we have the $\frac{1}{\sqrt{n}}$ scaling in (13) and (15) for normalization purposes. Similar results were shown in [DZPS18, ADH+19] with $\eta = \lambda_{\min}(K)/n$, $c_0 = \eta n \lambda_{\min}(K)$, and $c_1 = 0$, but the obtained convergence rate $\log\frac{1}{1-\lambda_{\min}^2(K)} \to 0$ as $n \to \infty$. In contrast, as can be seen later (in Corollary 2), if f lies in the span of a small number of eigenspaces of the integral operator in (11), then we can choose $\eta = \Theta(1)$, choose $c_0$ to be a value determined by the target function f and the distribution $\mu$ only, and choose $c_1 = \Theta(1/\sqrt{n})$. Thus, the resulting convergence rate $\log\frac{1}{1-c_0}$ does not approach 0 as $n \to \infty$. The additive term $c_1 = \Theta(1/\sqrt{n})$ arises from the fact that only finitely many data tuples are available.

Both the proof of Theorem 3 and the proofs in [DZPS18, ADH+19, AZLL18] are based on the observation that when the network is sufficiently over-parameterized, the sign changes (activation pattern changes) of the hidden neurons are sparse. Different from [DZPS18, ADH+19], our proof does not use $\lambda_{\min}(K)$; see Appendix E for details.
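Condition (13) is easy to probe numerically for a concrete target. The sketch below iterates $(I - \eta K)^t y$ for the empirical kernel matrix of Eq. (12) and a linear target, and prints $\|\frac{1}{\sqrt{n}}(I - \eta K)^t y\|$; the geometric decay and the small plateau play the roles of $(1-c_0)^t$ and $c_1$. The target, stepsize, and sample size are illustrative assumptions.

```python
# Sketch: the quantity in condition (13), ||(1/sqrt(n)) (I - eta*K)^t y||, for a linear
# target f(x) = x_1 and the empirical kernel matrix of Eq. (12) (scaled by 1/n).
import numpy as np

rng = np.random.default_rng(3)
n, d, eta = 1000, 10, 1.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0]                                            # f(x) = x_1 (an assumption)

G = np.clip(X @ X.T, -1.0, 1.0)
K = G * (np.pi - np.arccos(G)) / (2 * np.pi) / n       # Eq. (12), scaled by 1/n

r = y.copy()                                           # r = (I - eta*K)^t y
for t in range(501):
    if t % 100 == 0:
        print(f"t={t:4d}  ||(1/sqrt(n)) (I - eta K)^t y|| = {np.linalg.norm(r) / np.sqrt(n):.4e}")
    r -= eta * (K @ r)
```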

It remains to show that, with high probability, (13) in Theorem 3 holds with properly chosen $c_0$ and $c_1$. By the spectral theorem [DS63, Theorem 4, Chapter X.3] and [RBV10], $L_K$ has a spectrum with distinct eigenvalues $\sigma_1 > \sigma_2 > \cdots$⁴ such that

  $L_K = \sum_{i \ge 1} \sigma_i P_i, \qquad \text{with } P_i := \frac{1}{2\pi\mathrm{i}}\oint_{\Gamma_i} (\lambda I - L_K)^{-1}\, d\lambda,$

where $P_i : L^2(\mathbb{S}^{d-1},\mu) \to L^2(\mathbb{S}^{d-1},\mu)$ is the orthogonal projection operator onto the eigenspace associated with the eigenvalue $\sigma_i$; here (1) $\mathrm{i}$ is the imaginary unit, and (2) the integral can be taken over any closed simple rectifiable curve $\Gamma_i$ (with positive direction) containing $\sigma_i$ only and no other distinct eigenvalue. In other words, $P_i f$ is the function obtained by projecting f onto the eigenspace of the integral operator $L_K$ associated with $\sigma_i$. Given an $\ell \in \mathbb{N}$, let $m_\ell$ be the sum of the multiplicities of the first $\ell$ nonzero top eigenvalues of $L_K$; that is, $m_1$ is the multiplicity of $\sigma_1$ and $(m_2 - m_1)$ is the multiplicity of $\sigma_2$. By definition, $\lambda_{m_\ell} = \sigma_\ell \neq \sigma_{\ell+1} = \lambda_{m_\ell+1}$ for all $\ell$.

⁴ The sequence of distinct eigenvalues can possibly be of finite length. In addition, the sequence of the $\sigma_i$'s and the sequence of the $\lambda_i$'s (in Theorem 2) are different; the latter consists of repetitions.

Theorem 4. For any $\ell \ge 1$ such that $\sigma_i > 0$ for $i \le \ell$, let

  $\epsilon(f,\ell) := \sup_{x \in \mathbb{S}^{d-1}} \Big| f(x) - \Big(\sum_{1 \le i \le \ell} P_i f\Big)(x) \Big|$

be the approximation error of the span of the eigenspaces associated with the first $\ell$ distinct eigenvalues. Then, given $\delta \in (0, \frac{1}{4})$ and $T > 0$, if $n \ge \frac{256\log(2/\delta)}{(\lambda_{m_\ell} - \lambda_{m_\ell+1})^2}$ and

  $m \ge \frac{32}{c_1^2}\left[\Big(\frac{1}{c_0} + 2Tc_1\Big)^4 + 4\log\frac{4n}{\delta}\Big(\frac{1}{c_0} + 2Tc_1\Big)^2\right]$

with $c_0 = \frac{3}{4}\eta\lambda_{m_\ell}$ and $c_1 = \epsilon(f,\ell)$, then with probability at least $1 - 3\delta$, for all $t \le T$:

  $\Big\|\frac{1}{\sqrt{n}}\big(y - \hat{y}(t)\big)\Big\| \le \Big(1 - \frac{3}{4}\eta\lambda_{m_\ell}\Big)^t + \frac{16\sqrt{2}\,\sqrt{\log(2/\delta)}}{(\lambda_{m_\ell} - \lambda_{m_\ell+1})\sqrt{n}} + 2\sqrt{2}\,\epsilon(f,\ell).$

Since $\lambda_{m_\ell}$ is determined by f and $\mu$ only, with $\eta = 1$ the convergence rate $\log\frac{1}{1 - \frac{3}{4}\lambda_{m_\ell}}$ is constant w.r.t. n.

Remark 1 (Early stopping). In Theorems 3 and 4, the derived lower bounds on m grow in T. To control m, we need to terminate the GD training at some "reasonable" T. Fortunately, T is typically small. To see this, note that $\eta$, $c_0$, and $c_1$ are independent of t. By (13) and (15) we know that $\|\frac{1}{\sqrt{n}}(y - \hat{y}(t))\|$ decreases to $\Theta(c_1)$ in $\Theta\big(\log\frac{1}{c_1}/\log\frac{1}{1-c_0}\big)$ iterations, provided that $\Theta\big(\log\frac{1}{c_1}/\log\frac{1}{1-c_0}\big) \le T$. Thus, to guarantee $\|\frac{1}{\sqrt{n}}(y - \hat{y}(t))\| = O(c_1)$, it is enough to terminate GD at iteration $T = \Theta\big(\log\frac{1}{c_1}/\log\frac{1}{1-c_0}\big)$. Similar to us, early stopping is adopted in [AZLL18, LSO19], and it is commonly adopted in practice.
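To get a rough feel for the quantities entering Theorem 4 and Remark 1, the sketch below uses the eigenvectors of the empirical kernel matrix as a finite-sample stand-in for the eigenspaces of $L_K$: it measures how much of y is missed by the top-k eigenvectors (a proxy for the low-rank approximation error $\epsilon(f,\ell)$) and evaluates the early-stopping horizon $\log\frac{1}{c_1}/\log\frac{1}{1-c_0}$ for illustrative constants. The proxy and the numerical values of $c_0, c_1$ are assumptions for illustration, not the paper's definitions.

```python
# Sketch: a finite-sample proxy for the low-rank approximation error in Theorem 4,
# plus the early-stopping horizon of Remark 1 for illustrative constants.
import numpy as np

rng = np.random.default_rng(4)
n, d = 1000, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0]                                            # linear target (an assumption)

G = np.clip(X @ X.T, -1.0, 1.0)
K = G * (np.pi - np.arccos(G)) / (2 * np.pi) / n       # empirical kernel matrix, Eq. (12)
_, vecs = np.linalg.eigh(K)                            # eigenvectors, ascending eigenvalues
vecs = vecs[:, ::-1]                                   # top eigenvectors first

for k in (1, 5, 20, 100):
    proj = vecs[:, :k] @ (vecs[:, :k].T @ y)           # projection onto the top-k eigenvectors
    print(f"k={k:3d}  residual (proxy for eps(f, l)) = {np.linalg.norm(y - proj) / np.sqrt(n):.3e}")

c0, c1 = 0.02, 1.0 / np.sqrt(n)                        # illustrative constants only
T = np.log(1.0 / c1) / np.log(1.0 / (1.0 - c0))
print(f"early-stopping horizon ~ log(1/c1)/log(1/(1-c0)) = {T:.0f} iterations")
```

For a target that lies in a few eigenspaces, the residual drops sharply once k exceeds the corresponding multiplicities, which is the regime in which Corollary 2 below applies.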

Corollary 2 (zero approximation error). Suppose there exists $\ell$ such that $\sigma_i > 0$ for $i \le \ell$ and $\epsilon(f,\ell) = 0$. Then let $\eta = 1$ and $T = \frac{\log n}{-\log(1 - \frac{3}{4}\lambda_{m_\ell})}$. For a given $\delta \in (0, \frac{1}{4})$, if $n \ge \frac{256\log(2/\delta)}{(\lambda_{m_\ell} - \lambda_{m_\ell+1})^2}$ and $m \ge$ [...]

[...] correspondingly, the data feature vector is also augmented. In this section, as we are dealing with the distribution $\mu$ on the original feature vector, we explicitly separate out the bias from $w_j$. In particular, let $b_j^0 \sim N(0,1)$. For ease of exposition, and with a little abuse of notation, we use d to denote the dimension of the $w_j$ and x before the above-mentioned augmentation. With bias, (1) can be rewritten as

  $f_{W,b}(x) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j\big[\langle x, w_j\rangle + b_j\big]_+,$

where $b = (b_1, \dots, b_m)$ are the biases of the hidden neurons, and the kernel function in (12) becomes

  $K(x,s) = \frac{\langle x, s\rangle + 1}{2\pi}\Big(\pi - \arccos\big(\tfrac{1}{2}(\langle x, s\rangle + 1)\big)\Big), \qquad \forall\, x, s \in \mathbb{S}^{d-1}.$   (16)

Figure 2: Application to the uniform distribution and polynomials. (a) Plot of $\beta_\ell$ versus the degree $\ell$ under different d; here $\beta_\ell$ is monotonically decreasing in $\ell$. (b) Training with f being randomly generated linear or quadratic functions, with $n = 1000$ and $m = 2000$.

From Theorem 4 we know that the convergence rate is determined by the eigendecomposition of the target function f w.r.t. the eigenspaces of $L_K$. When $\mu$ is the uniform distribution on $\mathbb{S}^{d-1}$, the eigenspaces of $L_K$ are the spaces of homogeneous harmonic polynomials, denoted by $\mathcal{H}_\ell$ for $\ell \ge 0$. Specifically, $L_K = \sum_{\ell \ge 0} \beta_\ell P_\ell$, where $P_\ell$ (for $\ell \ge 0$) is the orthogonal projector onto $\mathcal{H}_\ell$ and $\beta_\ell = \frac{(d-2)/2}{\ell + (d-2)/2}\,\alpha_\ell \ge 0$ is the associated eigenvalue, with $\alpha_\ell$ the coefficient of $K(x,s)$ in its expansion into Gegenbauer polynomials. Note that $\mathcal{H}_\ell$ and $\mathcal{H}_{\ell'}$ are [...]
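In the spirit of Fig. 2a, the decay of the eigenvalues $\beta_\ell$ can be checked numerically via the Funk–Hecke formula, which turns the zonal kernel (16) into one-dimensional Gegenbauer integrals. The sketch below drops all $\ell$-independent normalization constants (including the normalization of the uniform measure), so only the relative decay in $\ell$ is meaningful; treating these values as $\beta_\ell$ up to a common constant is an assumption of this sketch.

```python
# Sketch in the spirit of Fig. 2(a): unnormalized Funk-Hecke coefficients of the kernel
# with bias, Eq. (16), versus the degree l. All l-independent constants are dropped,
# so only the relative decay in l is meaningful.
import numpy as np
from scipy.integrate import quad
from scipy.special import gegenbauer

def kernel(t):
    # Eq. (16) as a function of t = <x, s>
    return (t + 1.0) / (2.0 * np.pi) * (np.pi - np.arccos(0.5 * (t + 1.0)))

d = 10
lam = (d - 2) / 2.0
for ell in range(9):
    C = gegenbauer(ell, lam)                           # Gegenbauer polynomial C_l^(lam)
    integrand = lambda t: kernel(t) * C(t) * (1.0 - t * t) ** ((d - 3) / 2.0)
    val, _ = quad(integrand, -1.0, 1.0, epsabs=1e-13, epsrel=1e-11)
    print(f"l={ell}  beta_l (up to an l-independent constant) = {val / C(1.0):.3e}")
```

Consistent with Fig. 2a, the values decrease monotonically and rapidly in $\ell$, so targets dominated by low-degree harmonics (e.g., the linear or quadratic polynomials of Fig. 2b) correspond to a large $\lambda_{m_\ell}$ and hence a fast, n-independent convergence rate.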
