Splitting Steepest Descent for Growing Neural Architectures

Qiang Liu (UT Austin)    Lemeng Wu* (UT Austin)    Dilin Wang (UT Austin)

Abstract

We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons into multiple off-springs. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a splitting gradient for optimally updating the off-springs. Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an $\infty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent. Our method provides a new practical approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.

1 Introduction

Deep neural networks (DNNs) have achieved remarkable empirical successes recently. However, efficient and automatic optimization of model architectures remains a key challenge. Compared with parameter optimization, which has been well addressed by gradient-based methods (a.k.a. back-propagation), optimizing model structures involves significantly more challenging discrete optimization with large search spaces and high evaluation cost. Despite rapid recent progress, designing the best architectures still requires a lot of expert knowledge and trial-and-error for most practical tasks.

This work targets extending the power of gradient descent to the domain of model structure optimization of neural networks. In particular, we consider the problem of progressively growing a neural network by "splitting" existing neurons into several "off-springs", and develop a simple and practical approach for deciding the best subset of neurons to split and how to split them, adaptively based on the existing model structure. We derive the optimal splitting strategies by considering the steepest descent of the loss when the off-springs are infinitesimally close to the original neurons, yielding a splitting steepest descent that monotonically decreases the loss in the space of model structures.

Our main method, shown in Algorithm 1, alternates between a standard parametric descent phase, which updates the parameters to minimize the loss with a fixed model structure, and a splitting phase, which updates the model structure by splitting neurons. The splitting phase is triggered when no further improvement can be made by only updating parameters, and allows us to escape parametric local optima by augmenting the neural network in a locally optimal fashion. Theoretically, these two phases can be viewed as performing functional steepest descent on an $\infty$-Wasserstein metric space, in which the splitting phase is a second-order descent for escaping saddle points in the functional space, while the parametric gradient descent corresponds to a first-order descent. Empirically, our algorithm is simple and practical, and provides a promising tool for many challenging problems, including progressive training of interpretable neural networks, learning lightweight and energy-efficient neural architectures for resource-constrained settings, and transfer learning.

* Equal contribution.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Related Works  The idea of progressively growing neural networks by node splitting is not new, but previous works are mostly based on heuristic or purely random splitting strategies (e.g., Wynne-Jones, 1992; Chen et al., 2016). A different approach for progressive training is the Frank-Wolfe or gradient-boosting based strategies (e.g., Schwenk & Bengio, 2000; Bengio et al., 2006; Bach, 2017), which iteratively add new neurons derived from a functional conditional gradient while keeping the previous neurons fixed. However, these methods are not suitable for large-scale settings, because adding each neuron requires solving a difficult non-convex optimization problem, and keeping the previous neurons fixed prevents us from correcting the mistakes made in earlier iterations. A practical alternative to Frank-Wolfe is to simply add new randomly initialized neurons and co-optimize the new and old neurons together. However, random initialization does not allow us to leverage the information of the existing model and takes more time to converge. In contrast, splitting neurons from the existing network allows us to inherit the knowledge from the existing model (see Chen et al., 2016), and is faster to converge in settings like continual learning, where the previous model is not far from the optimal solution.

An opposite direction of progressive training is to prune large pre-trained networks (e.g., Han et al., 2016; Li et al., 2017; Liu et al., 2017). In comparison, our splitting method requires no large pre-trained models and can outperform existing pruning methods in terms of learning ultra-small neural architectures, which is of critical importance for resource-constrained settings like mobile devices and the Internet of things. More broadly, there has been a series of recent works on neural architecture search based on various strategies from combinatorial optimization, including reinforcement learning (RL) (e.g., Pham et al., 2018; Cai et al., 2018; Zoph & Le, 2017), evolutionary algorithms (EA) (e.g., Stanley & Miikkulainen, 2002; Real et al., 2018), and continuous relaxation (e.g., Liu et al., 2019a; Xie et al., 2018). However, these general-purpose black-box optimization methods do not leverage the inherent geometric structure of the loss landscape, and are highly computationally expensive due to the need of evaluating the candidate architectures with inner training loops.

Background: Steepest Descent and Saddle Points  Stochastic gradient descent is the workhorse for solving large-scale optimization in machine learning and deep learning. Gradient descent can be viewed as a steepest descent procedure that iteratively improves the solution by following the direction that maximally decreases the loss function within a small neighborhood of the previous solution. Specifically, for minimizing a loss function $L(\theta)$, each iteration of steepest descent updates the parameter via $\theta \leftarrow \theta + \epsilon\delta$, where $\epsilon$ is a small step size and $\delta$ is an update direction chosen to maximally decrease the loss $L(\theta + \epsilon\delta)$ of the updated parameter under a norm constraint $\|\delta\| \le 1$, where $\|\cdot\|$ denotes the Euclidean norm. When $\nabla L(\theta) \ne 0$ and $\epsilon$ is infinitesimal, the optimal descent direction equals the negative gradient direction, that is, $\delta = -\nabla L(\theta)/\|\nabla L(\theta)\|$, yielding a descent of $L(\theta + \epsilon\delta) - L(\theta) \approx -\epsilon\|\nabla L(\theta)\|$. At a critical point with a zero gradient ($\nabla L(\theta) = 0$), the steepest descent direction depends on the spectrum of the Hessian matrix $\nabla^2 L(\theta)$. Denote by $\lambda_{\min}$ the minimum eigenvalue of $\nabla^2 L(\theta)$ and $v_{\min}$ its associated eigenvector. When $\lambda_{\min} \ge 0$, the point is a stable local minimum and no further improvement can be made in the infinitesimal neighborhood. When $\lambda_{\min} < 0$, the point is a saddle point (or local maximum), and the loss can still be decreased by moving along the eigenvector direction $\delta = \pm v_{\min}$, which yields a second-order descent of $L(\theta + \epsilon\delta) - L(\theta) \approx \epsilon^2\lambda_{\min}/2 < 0$.
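To make this background concrete, here is a minimal numerical sketch (not from the paper) of the two regimes above on a hypothetical quadratic toy loss: away from critical points the steepest descent direction is the normalized negative gradient, while at a critical point the sign of the smallest Hessian eigenvalue decides whether a second-order escape direction exists.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta with an indefinite A,
# so the origin is a saddle point (a hypothetical stand-in for L(theta)).
A = np.array([[1.0, 0.0],
              [0.0, -0.5]])

def loss(theta):    return 0.5 * theta @ A @ theta
def grad(theta):    return A @ theta
def hessian(theta): return A

def steepest_direction(theta, tol=1e-8):
    g = grad(theta)
    if np.linalg.norm(g) > tol:
        return -g / np.linalg.norm(g)        # first-order case: normalized negative gradient
    lam, V = np.linalg.eigh(hessian(theta))  # critical point: inspect the Hessian spectrum
    if lam[0] >= 0:
        return None                          # stable local minimum: no descent direction
    return V[:, 0]                           # escape direction: eigenvector of lambda_min < 0

eps = 0.1
theta = np.zeros(2)                          # a critical point (here, a saddle)
delta = steepest_direction(theta)
print(loss(theta + eps * delta) - loss(theta))   # ~ eps^2 * lambda_min / 2 = -0.0025
```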

2 Splitting Steepest Descent

2.1 Splitting a Single Neuron

We first illustrate the idea with the case of splitting a single neuron. Consider a loss function of the form
$$L(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi(\sigma(\theta, x))\big], \qquad (1)$$
where $\theta \in \mathbb{R}^d$ is the parameter of the neuron, $\sigma(\theta, x)$ is its activation on input $x$, $\mathcal{D}$ denotes the data distribution, and $\Phi(\cdot)$ maps the output of the neuron to the final loss. We consider splitting the neuron into $m$ off-springs $\{\theta_i\}_{i=1}^m$, whose outputs are combined with weights $w = \{w_i\}_{i=1}^m$ satisfying $\sum_{i=1}^m w_i = 1$ and $w_i \ge 0$. This yields an augmented loss function on $\boldsymbol{\theta} := \{\theta_i\}_{i=1}^m$ and $w$:
$$L(\boldsymbol{\theta}, w) := \mathbb{E}_{x\sim\mathcal{D}}\Big[\Phi\Big(\sum_{i=1}^m w_i\,\sigma(\theta_i, x)\Big)\Big]. \qquad (2)$$
A key property of this construction is that it introduces a smooth change of the loss function when the off-springs $\{\theta_i\}_{i=1}^m$ are close to the original parameter $\theta$: when $\theta_i = \theta$ for all $i = 1, \ldots, m$, the augmented network and loss are equivalent to the original ones, that is, $L(\theta\mathbf{1}_m, w) = L(\theta)$, where $\mathbf{1}_m$ denotes the $m \times 1$ vector consisting of all ones; when all the $\theta_i$ are within an infinitesimal neighborhood of $\theta$, it yields an infinitesimal change of the loss, from which a steepest descent can be derived.
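As a concrete, purely illustrative instance of (1)-(2) (not code from the paper), the sketch below uses a hypothetical neuron $\sigma(\theta, x) = \tanh(\theta^\top x)$ with a squared loss playing the role of $\Phi$, and checks that splitting $\theta$ into identical copies leaves the loss unchanged, i.e., $L(\theta\mathbf{1}_m, w) = L(\theta)$.

```python
import numpy as np

# Toy instance of Eq. (1)-(2): a single tanh neuron with a squared loss as Phi.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                        # inputs x ~ D
y = np.tanh(X @ np.array([0.5, -1.0, 2.0]))          # regression targets

def sigma(theta):                                    # neuron activation sigma(theta, x)
    return np.tanh(X @ theta)

def Phi(out):                                        # per-example loss Phi(.)
    return (out - y) ** 2

def L(theta):                                        # single-neuron loss, Eq. (1)
    return Phi(sigma(theta)).mean()

def L_split(thetas, w):                              # augmented loss, Eq. (2)
    return Phi(sum(wi * sigma(ti) for wi, ti in zip(w, thetas))).mean()

theta = np.array([0.1, 0.2, -0.3])
# Splitting theta into identical copies leaves the loss unchanged: L(theta * 1_m, w) = L(theta).
print(np.isclose(L(theta), L_split([theta, theta, theta], w=[0.5, 0.3, 0.2])))   # True
```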

Formally, consider the set of splitting schemes $(m, \boldsymbol{\theta}, w)$ whose off-springs are $\epsilon$-close to the original neuron:
$$\Big\{(m, \boldsymbol{\theta}, w) :~ m \in \mathbb{N}_+, ~ \|\theta_i - \theta\| \le \epsilon, ~ \sum_{i=1}^m w_i = 1, ~ w_i \ge 0, ~ \forall i = 1, \ldots, m\Big\}.$$
We want to decide the optimal $(m, \boldsymbol{\theta}, w)$ that maximizes the decrease of loss $L(\boldsymbol{\theta}, w) - L(\theta)$ when the step size $\epsilon$ is infinitesimal. Although this appears to be an infinite-dimensional optimization because $m$ is allowed to be arbitrarily large, we show that the optimal choice is achieved with either $m = 1$ (no splitting) or $m = 2$ (splitting into two off-springs), with uniform weights $w_i = 1/m$. Whether a neuron should be split ($m = 1$ or $2$) and the optimal values of the off-springs $\theta_i$ are decided by the minimum eigenvalue and eigenvector of a splitting matrix, which plays a role similar to the Hessian matrix for deciding saddle points.

Definition 2.1 (Splitting Matrix). For $L(\theta)$ in (1), its splitting matrix $S(\theta)$ is defined as
$$S(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma(\theta, x))\,\nabla_\theta^2\sigma(\theta, x)\big]. \qquad (3)$$
We call the minimum eigenvalue $\lambda_{\min}(S(\theta))$ of $S(\theta)$ the splitting index of $\theta$, and the eigenvector $v_{\min}(S(\theta))$ related to $\lambda_{\min}(S(\theta))$ the splitting gradient of $\theta$.

The splitting matrix $S(\theta)$ is a $d \times d$ symmetric "semi-Hessian" matrix that involves the first derivative $\Phi'(\cdot)$ and the second derivative of $\sigma(\theta, x)$. It is useful to compare it with the typical gradient and Hessian matrix of $L(\theta)$:
$$\nabla L(\theta) = \mathbb{E}_{x\sim\mathcal{D}}\big[\Phi'(\sigma(\theta, x))\,\nabla_\theta\sigma(\theta, x)\big], \qquad \nabla^2 L(\theta) = S(\theta) + \underbrace{\mathbb{E}_{x\sim\mathcal{D}}\big[\Phi''(\sigma(\theta, x))\,\nabla_\theta\sigma(\theta, x)^{\otimes 2}\big]}_{T(\theta)},$$
where $v^{\otimes 2} := vv^\top$ denotes the outer product. The splitting matrix $S(\theta)$ differs from the gradient $\nabla L(\theta)$ in replacing $\nabla_\theta\sigma(\theta, x)$ with the second-order derivative $\nabla_\theta^2\sigma(\theta, x)$, and differs from the Hessian matrix $\nabla^2 L(\theta)$ in missing the extra term $T(\theta)$. We should point out that $S(\theta)$ is the "easier part" of the Hessian matrix, because the second-order derivative $\nabla_\theta^2\sigma(\theta, x)$ of the individual neuron is much simpler than the second-order derivative $\Phi''(\cdot)$ of "everything else", which appears in the extra term $T(\theta)$. In addition, as we show in Section 2.2, $S(\theta)$ is block-diagonal across multiple neurons, which is crucial for enabling a practical computational algorithm.
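For the toy tanh neuron used in the earlier sketch (again an illustration rather than the paper's code, with the setup repeated so the snippet runs on its own), the splitting matrix in (3) can be assembled in a few lines, since $\nabla_\theta^2\tanh(\theta^\top x) = \tanh''(\theta^\top x)\,xx^\top$ and $\Phi'$ is just the derivative of the squared loss.

```python
import numpy as np

# Splitting matrix of the toy neuron: sigma(theta, x) = tanh(theta^T x), Phi(s) = (s - y)^2,
# with the expectation over D replaced by an average over the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = np.tanh(X @ np.array([0.5, -1.0, 2.0]))
theta = np.array([0.1, 0.2, -0.3])

def splitting_matrix(theta):
    """S(theta) = E_x[ Phi'(sigma(theta, x)) * grad^2_theta sigma(theta, x) ],  Eq. (3)."""
    s = np.tanh(X @ theta)
    dPhi = 2.0 * (s - y)                     # Phi'(s) for the squared loss
    d2sigma = -2.0 * s * (1.0 - s ** 2)      # tanh''(u); grad^2_theta sigma = tanh''(u) * x x^T
    coef = dPhi * d2sigma / len(X)
    return (X * coef[:, None]).T @ X         # sum_i coef_i * x_i x_i^T

S = splitting_matrix(theta)
lam, V = np.linalg.eigh(S)                   # eigenvalues in ascending order
splitting_index, splitting_gradient = lam[0], V[:, 0]
print(splitting_index, splitting_gradient)
```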

It is useful to decompose each $\theta_i$ into $\theta_i = \theta + \epsilon(\delta + \delta_i)$, where $\delta$ is an average displacement vector shared by all the copies, and $\delta_i$ is the splitting vector associated with $\theta_i$, which satisfies $\sum_i w_i\delta_i = 0$ (implying $\sum_i w_i\theta_i = \theta + \epsilon\delta$). It turns out that the change of loss $L(\boldsymbol{\theta}, w) - L(\theta)$ naturally decomposes into two terms that reflect the effects of the average displacement and the splitting, respectively.

Theorem 2.2. Assume $\theta_i = \theta + \epsilon(\delta + \delta_i)$ with $\sum_i w_i\delta_i = 0$ and $\sum_i w_i = 1$. For $L(\theta)$ and $L(\boldsymbol{\theta}, w)$ in (1) and (2), assume $L(\boldsymbol{\theta}, w)$ has bounded third-order derivatives w.r.t. $\boldsymbol{\theta}$. We have
$$L(\boldsymbol{\theta}, w) - L(\theta) = \underbrace{\epsilon\,\nabla L(\theta)^\top\delta + \frac{\epsilon^2}{2}\,\delta^\top\nabla^2 L(\theta)\,\delta}_{\mathrm{I}(\delta;\,\epsilon)\,=\,L(\theta+\epsilon\delta)-L(\theta)+O(\epsilon^3)} \;+\; \underbrace{\frac{\epsilon^2}{2}\sum_{i=1}^m w_i\,\delta_i^\top S(\theta)\,\delta_i}_{\mathrm{II}(\bar{\delta},\,w;\,\epsilon)} \;+\; O(\epsilon^3), \qquad (4)$$
where the change of loss is decomposed into two terms: the first term $\mathrm{I}(\delta; \epsilon)$ is the effect of the average displacement $\delta$, and is equivalent to applying the standard parametric update $\theta \leftarrow \theta + \epsilon\delta$ on $L(\theta)$. The second term $\mathrm{II}(\bar{\delta}, w; \epsilon)$ is the change of the loss caused by the splitting vectors $\bar{\delta} := \{\delta_i\}_{i=1}^m$; it depends on $L(\theta)$ only through the splitting matrix $S(\theta)$.
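The expansion (4) is easy to check numerically on the toy tanh neuron used above; the sketch below (a sanity check of ours, not part of the paper, with the setup repeated so it runs standalone) compares the exact change of loss against the right-hand side of (4) for a random average displacement $\delta$ and splitting vectors $\delta_1 = -\delta_2$.

```python
import numpy as np

# Numerical check of the expansion (4) for the toy tanh neuron with a squared loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = np.tanh(X @ np.array([0.5, -1.0, 2.0]))
theta = np.array([0.1, 0.2, -0.3])

act = lambda th: np.tanh(X @ th)
L = lambda th: np.mean((act(th) - y) ** 2)
L_split = lambda ths, w: np.mean((sum(wi * act(ti) for wi, ti in zip(w, ths)) - y) ** 2)

s = act(theta)
grad_L = ((2.0 * (s - y) * (1.0 - s ** 2))[:, None] * X).mean(axis=0)              # E[Phi' grad sigma]
S = (X * (2.0 * (s - y) * (-2.0 * s * (1.0 - s ** 2)) / len(X))[:, None]).T @ X    # Eq. (3)
T = (X * (2.0 * (1.0 - s ** 2) ** 2 / len(X))[:, None]).T @ X                      # E[Phi'' (grad sigma)(grad sigma)^T]
hess_L = S + T                                                                     # Hessian of L

eps = 1e-2
w = [0.5, 0.5]
delta = rng.normal(size=3)                     # average displacement, shared by both copies
d = rng.normal(size=3)
deltas = [d, -d]                               # splitting vectors with sum_i w_i delta_i = 0

lhs = L_split([theta + eps * (delta + di) for di in deltas], w) - L(theta)
rhs = (eps * grad_L @ delta + 0.5 * eps ** 2 * delta @ hess_L @ delta
       + 0.5 * eps ** 2 * sum(wi * di @ S @ di for wi, di in zip(w, deltas)))
print(lhs, rhs)                                # agree up to an O(eps^3) error
```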

Therefore, the optimal average displacement $\delta$ should be decided by standard parametric steepest (gradient) descent, which yields a typical $O(\epsilon)$ decrease of loss at non-stationary points. In comparison, the splitting term $\mathrm{II}(\bar{\delta}, w; \epsilon)$ is always $O(\epsilon^2)$, which is much smaller. Given that introducing new neurons increases the model size, splitting should not be preferred unless it is impossible to achieve an $O(\epsilon^2)$ gain with pure parametric updates that do not increase the model size. Therefore, it is motivated to introduce splitting only at stable local minima, when the optimal $\delta$ equals zero and no further improvement is possible with (infinitesimal) regular parametric descent on $L(\theta)$. In this case, we only need to minimize the splitting term $\mathrm{II}(\bar{\delta}, w; \epsilon)$ to decide the optimal splitting strategy, which is shown in the following theorem.

Theorem 2.3. a) If the splitting matrix is positive definite, that is, $\lambda_{\min}(S(\theta)) > 0$, we have $\mathrm{II}(\bar{\delta}, w; \epsilon) \ge 0$ for any $w \ge 0$ and $\bar{\delta} \ne 0$, and hence no infinitesimal splitting can decrease the loss. We say that $\theta$ is splitting stable in this case.

b) If $\lambda_{\min}(S(\theta)) < 0$, an optimal splitting strategy that minimizes $\mathrm{II}(\bar{\delta}, w; \epsilon)$ subject to $\|\delta_i\| \le 1$ is
$$m = 2, \qquad w_1 = w_2 = 1/2, \qquad \delta_1 = v_{\min}(S(\theta)), \qquad \delta_2 = -v_{\min}(S(\theta)),$$
where $v_{\min}(S(\theta))$, called the splitting gradient, is the eigenvector related to $\lambda_{\min}(S(\theta))$. Here we split the neuron into two copies of equal weights, and update each copy with the splitting gradient (in opposite directions). The change of loss obtained in this case is $\mathrm{II}(\bar{\delta}, w; \epsilon) = \epsilon^2\lambda_{\min}(S(\theta))/2 < 0$.

Remark  The splitting stability ($S(\theta) \succ 0$) does not necessarily ensure the standard parametric stability of $L(\theta)$ (i.e., $\nabla^2 L(\theta) = S(\theta) + T(\theta) \succeq 0$), except when $\Phi(\cdot)$ is convex, which ensures $T(\theta) \succeq 0$ (see Definition 2.1). If both $S(\theta) \succ 0$ and $\nabla^2 L(\theta) \succeq 0$ hold, the loss cannot be improved by any local update or splitting, no matter how many off-springs are allowed. Since stochastic gradient descent is guaranteed to escape unstable stationary points (Lee et al., 2016; Jin et al., 2017), we only need to calculate $S(\theta)$ to decide the splitting stability in practice.
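Theorem 2.3(b) can likewise be checked on the toy tanh neuron (again our own illustration with the setup repeated, not the paper's code): when the splitting index is negative, splitting $\theta$ into $\theta \pm \epsilon v_{\min}$ with equal weights lowers the loss by roughly $\epsilon^2\lambda_{\min}(S(\theta))/2$, and otherwise no infinitesimal split helps.

```python
import numpy as np

# Check of Theorem 2.3(b) on the toy tanh neuron with a squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = np.tanh(X @ np.array([0.5, -1.0, 2.0]))
theta = np.array([0.1, 0.2, -0.3])

act = lambda th: np.tanh(X @ th)
L = lambda th: np.mean((act(th) - y) ** 2)
s = act(theta)
S = (X * (2.0 * (s - y) * (-2.0 * s * (1.0 - s ** 2)) / len(X))[:, None]).T @ X   # Eq. (3)
lam, V = np.linalg.eigh(S)
lam_min, v_min = lam[0], V[:, 0]

eps = 1e-2
if lam_min < 0:                                   # an infinitesimal split helps
    split_out = 0.5 * act(theta + eps * v_min) + 0.5 * act(theta - eps * v_min)
    actual = np.mean((split_out - y) ** 2) - L(theta)
    print(actual, eps ** 2 * lam_min / 2)         # both negative, equal up to O(eps^3)
else:                                             # Theorem 2.3(a): theta is splitting stable
    print("no infinitesimal split can decrease the loss")
```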

2.2 Splitting Deep Neural Networks

In practice, we need to split multiple neurons simultaneously, which may be of different types or located in different layers of a deep neural network. The key questions are whether the optimal splitting strategies of different neurons influence each other in some way, and how to compare the gain of splitting different neurons so as to select the best subset of neurons to split under a budget constraint.

It turns out the answers are simple. We show that the change of loss caused by splitting a set of neurons is simply the sum of the splitting terms $\mathrm{II}(\bar{\delta}, w; \epsilon)$ of the individual neurons. Therefore, we can calculate the splitting matrix of each neuron independently, without considering the other neurons, and compare the "splitting desirability" of the different neurons by their minimum eigenvalues (splitting indexes). This motivates our main algorithm (Algorithm 1, shown below), in which we progressively split the neurons with the most negative splitting indexes following their own splitting gradients. Since the neurons can be in different layers and of different types,

Algorithm 1  Splitting Steepest Descent for Optimizing Neural Architectures

Initialize a neural network with a set of neurons $\theta_{1:n} = \{\theta_\ell\}_{\ell=1}^n$ that can be split, whose loss satisfies (5). Decide a maximum number $m$ of neurons to split at each iteration, and a threshold $\lambda_0$ ($\le 0$) on the splitting index. Choose a step size $\epsilon$.

1. Update the parameters using standard optimizers (e.g., stochastic gradient descent) until no further improvement can be made by only updating the parameters.
2. Calculate the splitting matrices $S_\ell$ of the neurons following (7), as well as their minimum eigenvalues $\lambda_{\min}^\ell$ and the associated eigenvectors $v_{\min}^\ell$.
3. Select the set of neurons to split by picking the top $m$ neurons with the smallest eigenvalues $\lambda_{\min}^\ell$ among those satisfying $\lambda_{\min}^\ell \le \lambda_0$.
4. Split each of the selected neurons into two off-springs with equal weights: update the network by replacing each selected neuron $\sigma(\theta_\ell, \cdot)$ with $\tfrac{1}{2}\sigma(\theta_{\ell 1}, \cdot) + \tfrac{1}{2}\sigma(\theta_{\ell 2}, \cdot)$, where $\theta_{\ell 1} \leftarrow \theta_\ell + \epsilon v_{\min}^\ell$ and $\theta_{\ell 2} \leftarrow \theta_\ell - \epsilon v_{\min}^\ell$. Update the list of neurons.

Go back to Step 1, or stop when a stopping criterion is met.
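As a compact end-to-end illustration of Algorithm 1 (a sketch of ours on a hypothetical one-hidden-layer tanh network, not the authors' implementation; plain full-batch gradient descent stands in for the parametric phase, and `m_max`, `lam0`, `eps` are illustrative parameter names), the loop below alternates parameter optimization with a splitting step that picks the units with the most negative splitting indexes and replaces each by two off-springs moved along $\pm\epsilon v_{\min}$.

```python
import numpy as np

# Toy regression task and a tiny one-hidden-layer tanh network f(x) = sum_l a_l * tanh(theta_l^T x).
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 2))
y = np.sin(X[:, 0]) - 0.5 * np.cos(X[:, 1])

def forward(thetas, a):
    U = X @ thetas.T                           # (n, k) pre-activations
    H = np.tanh(U)                             # (n, k) activations sigma(theta_l, x)
    return H @ a, H, U

def parametric_descent(thetas, a, lr=0.05, steps=2000):
    """Phase 1: plain gradient descent with the structure fixed."""
    for _ in range(steps):
        f, H, U = forward(thetas, a)
        r = 2.0 * (f - y) / len(X)             # d(mean squared error) / d f
        a -= lr * (H.T @ r)
        G = (r[:, None] * a[None, :] * (1.0 - H ** 2)).T @ X   # d loss / d theta_l
        thetas -= lr * G
    return thetas, a

def splitting_index(thetas, a, l):
    """Splitting matrix, index and gradient of unit l (Definition 2.1 adapted to this toy net)."""
    f, H, U = forward(thetas, a)
    dPhi = 2.0 * (f - y) * a[l]                # derivative of the loss w.r.t. unit l's activation
    d2sigma = -2.0 * H[:, l] * (1.0 - H[:, l] ** 2)   # tanh''(u_l)
    coef = dPhi * d2sigma / len(X)
    S = (X * coef[:, None]).T @ X
    lam, V = np.linalg.eigh(S)
    return lam[0], V[:, 0]

def split_step(thetas, a, m_max=1, lam0=0.0, eps=0.1):
    """Phase 2 (Algorithm 1, steps 2-4): split the units with the most negative splitting indexes."""
    stats = [(l, *splitting_index(thetas, a, l)) for l in range(len(a))]
    stats = [s for s in stats if s[1] < lam0]
    for l, lam_min, v_min in sorted(stats, key=lambda s: s[1])[:m_max]:
        # Replace a_l * sigma(theta_l, .) by (a_l/2) * sigma(theta_l + eps*v, .) + (a_l/2) * sigma(theta_l - eps*v, .)
        thetas = np.vstack([thetas, thetas[l] - eps * v_min])
        a = np.append(a, a[l] / 2.0)
        thetas[l] = thetas[l] + eps * v_min
        a[l] = a[l] / 2.0
    return thetas, a

# Grow the network, starting from a single hidden unit.
thetas, a = rng.normal(size=(1, 2)), np.array([0.1])
for it in range(4):
    thetas, a = parametric_descent(thetas, a)
    f, _, _ = forward(thetas, a)
    print(f"round {it}: {len(a)} units, loss {np.mean((f - y) ** 2):.4f}")
    thetas, a = split_step(thetas, a)
```

Here halving the outgoing weight $a_\ell$ of each off-spring plays the role of the uniform weights $w_1 = w_2 = 1/2$ in Theorem 2.3; duplicating a unit and scaling its outgoing weights is equivalent to keeping explicit weights in (2).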
