iccv2019论文全集9640-surfing-iterative-optimization-over-incrementally-trained-deep-networks_第1页
iccv2019论文全集9640-surfing-iterative-optimization-over-incrementally-trained-deep-networks_第2页
iccv2019论文全集9640-surfing-iterative-optimization-over-incrementally-trained-deep-networks_第3页
iccv2019论文全集9640-surfing-iterative-optimization-over-incrementally-trained-deep-networks_第4页
iccv2019论文全集9640-surfing-iterative-optimization-over-incrementally-trained-deep-networks_第5页
已阅读5页,还剩5页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Surfi ng: Iterative Optimization Over Incrementally Trained Deep Networks Ganlin Song Department of Statistics and Data Science Yale University Zhou Fan Department of Statistics and Data Science Yale University John Lafferty Department of Statistics and Data Scie

2、nce Yale University Abstract We investigate a sequential optimization procedure to minimize the empirical risk functionalfb (x) = 1 2kGb(x) yk 2 for certain families of deep networks G(x). The approach is to optimize a sequence of objective functions that use network parameters

3、 obtained during different stages of the training process. When initialized with random parameters0, we show that the objectivef0(x)is “nice” and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networksx 7 Gt(x)and associated risk functionsft(x)

4、, wheretindicates a stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step, the surface evolves slowly and can be incrementally optimized. The algorithm is formalized and analyzed for a family of expansive networks. We call t

5、he procedure surfi ng since it rides along the peak of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. Experiments show how surfi ng can be used to fi nd the global optim

6、um and for compressed sensing even when direct gradient descent on the fi nal learned network fails. 1Introduction Intensive recent research has provided insight into the performance and mathematical properties of deep neural networks, improving understanding of their strong empirical performance on

7、 different types of data. Some of this work has investigated gradient descent algorithms that optimize the weights of deep networks during learning (Du et al., 2018b,a; Davis et al., 2018; Li and Yuan, 2017; Li and Liang, 2018). In this paper we focus on optimization over the inputs to an already tr

8、ained deep network in order to best approximate a target data point. Specifi cally, we consider the least squares objective function fb (x) = 1 2kGb (x) yk 2 whereG(x)denotes a multi-layer feed-forward network andbdenotes the parameters of the network after training. The network is considered to be

9、a mapping from a latent inputx Rkto an output G(x) Rnwithk n. A closely related objective is to minimizef,A(x) = 1 2kAG(x) Ayk 2 where A is a random matrix. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. initial networkpartially trained networkfully train

10、ed networktarget y x1 4 0 4 x2 4 0 4 x1 4 0 4 x2 4 0 4 x1 4 0 4 x2 4 0 4 x1 4 0 4 x2 4 0 4 x1 4 0 4 x2 4 0 4 x1 4 0 4 x2 4 0 4 Figure 1: Behavior of the surfacesx 7 1 2kGt(x) yk 2 for two targetsyshown for three levels of training,from random networks (left) to fully trained networks (right) on Fash

11、ion MNIST data. The network structure has two fully connected layers and two transposed convolution layers with batch normalization, trained as a VAE. Hand and Voroninski (2019) study the behavior of the functionf0,Ain a compressed sensing frame- work wherey = G0(x0)is generated from a random networ

12、k with parameters0= (W1,.,Wd) drawn from Gaussian matrix ensembles; thus, the network is not trained. In this setting, it is shown that the surface is very well behaved. In particular, outside of small neighborhoods around x0and a scalar multiple of x0, the function f0,A(x) always has a descent dire

13、ction. When the parameters of the network are trained, the landscape of the functionfb (x)can be compli- cated; it will in general be nonconvex with multiple local optima. Figure 1 illustrates the behavior of the surfaces as they evolve from random networks (left) to fully trained networks (right) f

14、or 4-layer networks trained on Fashion MNIST using a variational autoencoder. For each of two target valuesy, three surfaces x 7 1 2kGt(x) yk 2 are shown for different levels of training. This paper explores the following simple idea. We incrementally optimize a sequence of objective functionsf0,f1,

15、.,fTwhere the parameters0,1,.,T= b are obtained using stochastic gradient descent induring training. When initialized with random parameters0, we show that the empirical risk functionf0(x) = 1 2kG0(x) yk 2 is “nice” and easy to optimize with gradient descent. As learning is carried out, we obtain a

16、sequence of generative networksx 7 Gt(x)and associated risk functionsft(x), wheretindicates an intermediate stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step (Du et al., 2018a,b), the surface evolves slowly. We initializ

17、exfor the current networkGt(x)at the optimumx t1found for the previous networkGt1(x)and then carry out gradient descent to obtain the updated point x t = argminxft(x). We call this process surfi ng since it rides along the peaks of the evolving (negative) empirical risk function, starting from a smo

18、oth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. We formalize this algorithm in a manner that makes it amenable to analysis. First, when0is initialized so that the weights are random Gaussian matrices, we prove a theorem showing that the s

19、urface has a descent direction at each point outside of a small neighborhood. The analysis of Hand and Voroninski (2019) does not directly apply in our case since the targetyis an arbitrary test point, and not necessarily generated according to the random network. We then give an analysis that descr

20、ibes how projected gradient descent can be used to proceed from the optimum of one network to the next. Our approach is based on the fact that the ReLU network and squared error objective result in a piecewise quadratic surface. Experiments are run to show how surfi ng can be used to fi nd the globa

21、l optimum and for compressed sensing even when direct gradient descent fails, using several experimental setups with networks trained with both VAE and GAN techniques. 2 2Background and Previous Results In this work we treat the problem of approximating an observed vectoryin terms of the outputGb (x

22、) of a trained generative model. Traditional generative processes such as graphical models are statistical models that defi ne a distribution over a sample space. When deep networks are viewed as generative models, the distribution is typically singular, being a deterministic mapping of a low-dimens

23、ional latent random vector to a high-dimensional output space. Certain forms of “reversible deep networks” allow for the computation of densities and inversion (Dinh et al., 2017; Kingma and Dhariwal, 2018; Chen et al., 2018). The variational autoencoder (VAE) approach training a generative (decoder

24、) network is to model the conditional probability ofxgivenyas Gaussian with mean(y)and covariance(y)assuming that a priorix N(0,Ik)is Gaussian. The mean and covariance are treated as the output of a secondary (encoder) neural network. The two networks are trained by maximizing the evidence lower bou

25、nd (ELBO) with coupled gradient descent algorithmsone for the encoder network, the other for the decoder networkG(x) (Kingma and Welling, 2014). Whether fi tting the networks using a variational or GAN approach (Goodfellow et al., 2014; Arjovsky et al., 2017), the problem of “inverting” the network

26、to obtain x= argminf(x) is not addressed by the training procedure. In the now classical compressed sensing framework (Candes et al., 2006; Donoho et al., 2006), the problem is to reconstruct a sparse signal after observing multiple linear measurements, possibly with added noise. More recent work ha

27、s begun to investigate generative deep networks as a replacement for sparsity in compressed sensing. Bora et al. (2017) consider identifyingy = G(x0)from linear measurementsAyby optimizingf(x) = 1 2kAy AG(x)k 2. Since this objective is nonconvex, it is not guaranteed that gradient descent will conve

28、rge to the true global minimum. However, for certain classes of ReLU networks it is shown that so long as a pointb xis found for which f(b x) is suffi ciently close to zero, thenky G(b x)kis also small. For the case whereydoes not lie in the image ofG, an oracle type bound is shown implying that the

29、 solutionb x satisfi es kG(b x) yk2 C infxkG(x) yk2+ for some small error term. The authors observe that in experiments the error seems to converge to zero when b x is computed using simple gradient descent; but an analysis of this phenomenon is not provided. Hand and Voroninski (2019) establish the

30、 important result that for ad-layer random network and random measurement matrixA, the least squares objective has favorable geometry, meaning that outside two small neighborhoods there are no fi rst order stationary points, neither local minima nor saddle points. We describe their setup and result

31、in some detail, since it provides a springboard for the surfi ng algorithm. LetG : Rk Rnbe ad-layer fully connected feedforward generative neural network, which has the form G(x) = (Wd.(W2(W1x).) whereis the ReLU activation function. The matrixWi Rnini1is the set of weights for theith layer andniis

32、number of the neurons in this layer withk = n0 n1 0)W1and Wi,+,x= diag(WiWi1,+,x.W1,+,xx 0)Wi. With this notation, we can rewrite the generative network G0in what looks like a linear form, G0(x) = Wd,+,xWd1,+,x.W1,+,xx, noting that each matrix Wi,+,xdepends on the input x. 3 If fA,0(x) is differenti

33、able at x, we can write the gradient as fA,0(x) = ? 1 Y i=d Wi,+,x ?T ATA ? 1 Y i=d Wi,+,x ? x ? 1 Y i=d Wi,+,x ?T ATA ? 1 Y i=d Wi,+,x0 ? x0. In this expression, one can see intuitively that under the assumption thatAandWiare Gaussian matrices, the gradientf0(x)should concentrate around a determini

34、stic vectorvx,x0. Hand and Voroninski (2019) establish suffi cient conditions for concentration of the random matrices around deterministic quantities, so thatvx,x0has norm bounded away from zero ifx is suffi ciently far from x0or a scalar multiple ofx0 . Their results show that for random networks

35、having a suffi ciently expansive number of neurons in each layer, the objectivefA,0has a landscape favorable to gradient descent. We build on these ideas, showing fi rst that optimizing with respect toxfor a random network and arbitrary signaly can be done with gradient descent. This requires modifi

36、 ed proof techniques, since it is no longer assumed thaty = G0(x0). In fact,ycan be arbitrary and we wish to approximate it asGb (x(y) for somex(y). Second, after this initial optimization is carried out, we show how projected gradient descent can be used to track the optimum as the network undergoe

37、s a series of small changes. Our results are stated formally in the following section. 3Theoretical Results Suppose we have a sequence of networksG0,G1,.,GTgenerated from the training process. For instance, we may take a network with randomly initialized weights asG0, and record the network after ea

38、ch step of gradient descent in training; GT = G is the fi nal trained network. Algorithm 1 Surfi ng Input:Sequence of networks0,1,.,T 1:x1 0 2:for t = 0 to T do 3:x xt1 4:repeat 5:x x ft(x) 6:until convergence 7:xt x Output: xT For a given vectory Rn, we wish to minimize the objectivef(x) = 1 2kAG(x

39、) Ayk 2 with respect tox for the fi nal networkG, where eitherA = I Rnn, orA Rmnis a measurement matrix with i.i.d.N(0,1/m)entries in a compressed sensing context. Write ft(x) = 1 2kAGt(x) Ayk 2, t T.(1) The idea is that we fi rst minimizef0, which has a nicer landscape, to obtain the minimizerx0. W

40、e then apply gradient descent onftfort = 1,2,.,T successively, starting from the minimizer xt1for the previous network. We provide some theoretical analysis in partial support of this algorithmic idea. First, we show that at random initializationG0, all critical points off0(x)are localized to a smal

41、l ball around zero. Second, we show that ifG0,.,GT are obtained from a discretization of a continuous fl ow, along which the global minimizer offt(x)is unique and Lipschitz-continuous, then a projected-gradient version of surfi ng can successively fi nd the minimizers for G1,.,GTstarting from the mi

42、nimizer for G0. We consider expansive feedforward neural networks G : Rk 7 Rngiven by G(x,) = V (Wd.(W2(W1x + b1) + b2). + bd). Here,dis the number of intermediate layers (which we will treat as constant),is the ReLU activation function(x) = max(x,0)applied entrywise, and = (V,W1,.,Wd,b1,.,bd)are th

43、e network parameters. The input dimension isk n0, each intermediate layeri dhas weights Wi Rnini1and biasesbi Rni, and a linear transformV Rnnd is applied in the fi nal layer. For our fi rst result, consider fi xedy Rnand a random initializationG0(x) G(x,0)where0 has Gaussian entries (independent of

44、y ). If the network is suffi ciently expansive at each intermediate layer, then the following shows that with high probability, all critical points off0(x)belong to a small ball around 0. More concretely, the directional derivative Dx/kxkf0 (x) satisfi es Dx/kxkf0(x) lim t0+ f0(x tx/kxk) f0(x) t 0 s

45、uch that for any (0,0), if 1. n ndand ni C(2log1)ni1lognifor all i d, and 2.EitherA = Iandm = n, orA Rmnhas i.i.d.N(0,1/m)entries (independent of V,bi,Wi) where m Ck(1log1)log(n1.nd), then with probability at least1 C(ecm+ ndec 4nd1 + Pd1 i=1 niec 2ni1), every x Rk outside the ball kxk C (1 + kyk) s

46、atisfi es (2). We defer the proof to the supplementary material. Note that if insteadG0were correlated withy, say y = G0(x)for some inputxwithkxk 1, thenxwould be a global minimizer off0(x), and we would havekyk kxdk . kx1k kxk 1in the above network wherexi Rniis the output of theithlayer. The theor

47、em shows that for a random initialization ofG0which is independent of y, the minimizer is instead localized to a ball around 0 which is smaller in radius by the factor . For our second result, consider a network fl ow Gs(x) G(x,(s) fors 0,S, where(s) = (V (s),W1(s),b1(s),.,Wd(s),bd(s)evolve continuo

48、usly in a time parameters. As a model for network training, we assume thatG0,.,GTare obtained by discrete sampling from this fl ow viaGt= Gt, corresponding tos tfor a small time discretization step. We assume boundedness of the weights and uniqueness and Lipschitz-continuity of the global minimizer

49、along this fl ow. Assumption 3.2. There are constants M,L 0, consider the rows given by S(x,) = (i,j) : |w i,jxi1+ bi,j| , where xi1= (Wi1.(W1x + b1). + bi1) is the output of the(i 1)thlayer for this inputx, andv j ,w i,j, andbi,j are respectively thejth row ofV, thejthrow ofWiand thejthentry ofbiin

50、. This setS(x,)represents those neurons that are close to 0 before ReLU thresholding, and hence whose activations may change after a small change of the network input x. Defi ne P(x,) = P0,P1,.,PG as the set of all linear piecesPgwhose activation patterns differ fromP0only in rows belonging to S(x,)

51、. That is, for every x Pg P(x,) and (i,j) / S(x,), we have sign(w i,jx i1+ bi,j) = sign(w i,jxi1+ bi,j) where xi1is the output of the (i 1)thlayer for input x. With this defi nition, we consider a stylized projected-gradient surfi ng procedure in Algorithm 2, where ProjPis the orthogonal projection

52、onto the polytope P. 5 Algorithm 2 Projected-gradient Surfi ng Input: Network fl ow G(,(s) : s 0,S, parameters , 0. 1:Initialize x0= argminxf(x,(0). 2:for t = 1,.,T do 3:for each linear piece Pg P(xt1,(t),) do 4:x xt1 5:repeat 6:x ProjPg(x f(x,(t) 7:until convergence 8:x(g) t x 9:xt x(g) t for g 0,.

53、,G that achieves the minimum value of f(x(g) t ,(t). Output: xT The complexity of this algorithm depends on the number of piecesGto be optimized over in each step. We expect this to be small in practice when the slack parameter is chosen suffi ciently small, and provide a heuristic argument in the s

54、upplement indicating why this may be the case. The following shows that for any 0 , there is a suffi ciently fi ne time discretizationdepending on,M,L such that Algorithm 2 tracks the global minimizer. In particular, for the fi nal objective fT(x) = f(x,(T)corresponding to the networkGT, the outputx

55、Tis the global minimizer of fT(x). We remark that the time discretizationmay need to be smaller for deeper networks, asG(x) corresponding to a deeper network may have a larger Lipschitz constant inx . The specifi c dependence below arises from bounding this Lipschitz constant by Qd i=1kWik, which is

56、 a conservative bound also used and discussed in greater detail in Szegedy et al. (2014); Virmaux and Scaman (2018). Theorem 3.3.Suppose Assumption 3.2 holds. For any 0, if /(Lmax(M,1)d+1)and x0= argminxf(x,(0), then the iteratesxtin Algorithm 2 are given byxt= argminxf(x,(t) for each t = 1,.,T. Pro

57、of. For any fi xed, letx,x Rkbe two inputs toG(x,). Ifxi,xiare the corresponding outputs of theithlayer, using the assumptionkWik Mand the fact that the ReLU activationis 1-Lipschitz, we have kxi xik = k(Wixi1+ bi) (Wixi1+ bi)k k(Wixi1+ bi) (Wixi1+ bi)k Mkxi1 xi1k . Mikx xk. Letx(s) = argminxf(x,(s). By assumption,kx(s ) x(s)k L. For the network with parameter(s)at times, letx,i(s)andx,i(s )be the outputs at theithlayer corresponding to inputs x(s) and x(s ). Then for any i d

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论