
The Implicit Bias of AdaGrad on Separable Data

Qian Qian
Department of Statistics, Ohio State University, Columbus, OH 43210, USA

Xiaoyuan Qian
School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, China

Abstract

We study the implicit bias of AdaGrad on separable linear classification problems. We show that AdaGrad converges to a direction that can be characterized as the solution of a quadratic optimization problem with the same feasible set as the hard SVM problem. We also discuss how different choices of the hyperparameters of AdaGrad may affect this direction. This provides a deeper understanding of why adaptive methods do not seem to generalize as well in practice as gradient descent does.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In recent years, implicit regularization from various optimization algorithms has played a crucial role in the generalization ability of trained deep neural networks (Salakhutdinov and Srebro 2015; Neyshabur et al. 2015; Keskar et al. 2016; Neyshabur et al. 2017; Zhang et al. 2017). For example, in underdetermined problems, where the number of parameters is larger than the number of training examples, many global optima fail to exhibit good generalization properties; nevertheless, a specific optimization algorithm (such as gradient descent) does converge to a particular optimum that generalizes well, although no explicit regularization is enforced when training the model. In other words, the optimization technique itself biases the training towards a certain model in an implicit way (Soudry et al. 2018). This has motivated a line of work investigating the implicit biases of various algorithms (Telgarsky 2013; Soudry et al. 2018; Gunasekar et al. 2017, 2018a,b).

The choice of algorithm affects the implicit regularization introduced in the learned model. In underdetermined least squares problems, where the minimizers are finite, gradient descent yields the minimum-L2-norm solution, whereas coordinate descent might give a different solution. Another example is logistic regression with separable data. While gradient descent converges in the direction of the hard-margin support vector machine solution (Soudry et al. 2018), coordinate descent converges to the maximum-L1-margin solution (Telgarsky 2013; Gunasekar et al. 2018a). Unlike the squared loss, the logistic loss does not admit a finite global minimizer on separable data: the iterates must diverge in order to drive the loss to zero. As a result, instead of characterizing the convergence of the iterates w(t), it is the asymptotic direction of these iterates, i.e., lim_{t->\infty} w(t)/||w(t)||, that is important and has therefore been characterized (Soudry et al. 2018; Gunasekar et al. 2018b).

Moreover, it has attracted much attention that different adaptive variants of gradient descent, and different hyperparameters of a given adaptive method, exhibit different biases and thus lead to different generalization performance in deep learning (Salakhutdinov and Srebro 2015; Keskar et al. 2016; Wilson et al. 2017; Hoffer et al. 2017). Among these findings is the observation that vanilla SGD generalizes better than its adaptive variants (Wilson et al. 2017), such as AdaGrad (Duchi et al. 2010) and Adam (Kingma and Ba 2015). It is therefore important to characterize precisely how different adaptive methods induce different biases. A natural question to ask is: can we explain this observation by characterizing the implicit bias of AdaGrad, a paradigm of adaptive methods, in a binary classification setting with separable data using logistic regression? And how does the implicit bias depend on the choice of the hyperparameters of this specific algorithm, such as the initialization, the step size, etc.?

1.1 Our Contribution

In this work we study AdaGrad applied to logistic regression with separable data. Our contribution is threefold:

• We prove that the directions of the AdaGrad iterates, with a sufficiently small constant step size, always converge.
• We formulate the asymptotic direction as the solution of a quadratic optimization problem. This gives a theoretical characterization of the implicit bias of AdaGrad, and it also provides insight into why and how the factors involved, such as certain intrinsic properties of the dataset, the initialization, and the learning rate, affect the implicit bias.
• We introduce a novel approach to studying the bias of AdaGrad. It is based mainly on a geometric estimate of the directions of the updates and does not depend on any computation of the convergence rate.

1.2 Paper Organization

This paper is organized as follows. In Section 2 we explain our problem setup.

The main theory is developed in Section 3, including convergence of the adaptive learning rates of AdaGrad, existence of the asymptotic direction of the AdaGrad iterates, and the relation between the asymptotic directions of the AdaGrad and gradient descent iterates. We conclude in Section 4 with a review of our results and some questions left to future research.

2 Problem Setup

Let {(x_n, y_n) : n = 1, ..., N} be a training dataset with features x_n \in R^p and labels y_n \in {-1, 1}. To simplify notation, we redefine y_n x_n as x_n (n = 1, ..., N) and consider learning the logistic regression model over the empirical loss

    L(w) = \sum_{n=1}^N l(w^T x_n),  w \in R^p,

where l : R -> R. We focus on the following setting, the same as that proposed in Soudry et al. 2018:

Assumption 1. There exists a vector w_* such that w_*^T x_n > 0 for all n.

Assumption 2. l(u) is continuously differentiable, smooth, and strictly decreasing to zero.

Assumption 3. There exist positive constants a, b, c, and d such that

    |l'(u) + c e^{-au}| <= e^{-(a+b)u}  for u >= d.

It is easy to see that the exponential loss l(u) = e^{-u} and the logistic loss l(u) = log(1 + e^{-u}) both meet these assumptions.

Given two hyperparameters \epsilon, \eta > 0 and an initial point w(0) \in R^p, we consider the diagonal AdaGrad iterates

    w(t+1) = w(t) - \eta h(t) \odot g(t),  t = 0, 1, 2, ...,   (1)

where

    g(t) = (g_1(t), ..., g_p(t))^T,  g_i(t) = (\partial L / \partial w_i)(w(t)),
    h(t) = (h_1(t), ..., h_p(t))^T,  h_i(t) = 1 / (\sqrt{g_i(0)^2 + ... + g_i(t)^2} + \epsilon),  i = 1, ..., p,

and \odot is the element-wise multiplication of two vectors, e.g. a \odot b = (a_1 b_1, ..., a_p b_p)^T for a = (a_1, ..., a_p)^T, b = (b_1, ..., b_p)^T.

To analyze the convergence of the algorithm, we place an additional restriction on the hyperparameter \eta.

Assumption 4. The hyperparameter \eta is not too large; specifically,

    \eta <= 2 \min_{1 <= i <= p} (\sqrt{g_i(0)^2} + \epsilon).   (2)

We are interested in the asymptotic behavior of the AdaGrad iteration scheme (1). The main problem is: does there exist some vector w_A such that lim_{t->\infty} w(t)/||w(t)|| = w_A? We will provide an affirmative answer to this question in the following section.

3 The Asymptotic Direction of AdaGrad Iterates

3.1 Convergence of the Adaptive Learning Rates

We first provide some elementary facts about the AdaGrad iterates (1) under Assumptions 1-4 of Section 2.

Lemma 3.1. L(w(t+1)) <= L(w(t)), t = 0, 1, 2, ....

Theorem 3.1. The sequence {h(t)}_{t=0}^\infty converges as t -> \infty to a vector h_\infty = (h_{\infty,1}, ..., h_{\infty,p})^T satisfying h_{\infty,i} > 0 (i = 1, ..., p).

3.2 Convergence of the Directions of AdaGrad Iterates

In the remainder of the paper we write h_\infty = lim_{t->\infty} h(t) and \xi_n = h_\infty^{1/2} \odot x_n (n = 1, ..., N). Since, by Theorem 3.1, the components of h_\infty have a positive lower bound, we can define \theta(t) = h_\infty^{-1} \odot h(t) (t = 0, 1, 2, ...). Here the square root and the inverse of a vector are defined element-wise. We call the function

    L_ind : R^p -> R,  L_ind(v) = \sum_{n=1}^N l(v^T \xi_n)

the induced loss with respect to AdaGrad (1). Note that

    L(w) = \sum_{n=1}^N l(w^T x_n) = \sum_{n=1}^N l((h_\infty^{-1/2} \odot w)^T (h_\infty^{1/2} \odot x_n)) = \sum_{n=1}^N l((h_\infty^{-1/2} \odot w)^T \xi_n) = L_ind(h_\infty^{-1/2} \odot w).

Thus if we set

    v(t) = h_\infty^{-1/2} \odot w(t)  (t = 0, 1, 2, ...),   (3)

then v(0) = h_\infty^{-1/2} \odot w(0), and

    h_\infty^{1/2} \odot v(t+1) = w(t+1) = w(t) - \eta h(t) \odot \nabla L(w(t))
      = h_\infty^{1/2} \odot v(t) - \eta h(t) \odot h_\infty^{-1/2} \odot \nabla L_ind(v(t))
      = h_\infty^{1/2} \odot (v(t) - \eta \theta(t) \odot \nabla L_ind(v(t))),

or

    v(t+1) = v(t) - \eta \theta(t) \odot \nabla L_ind(v(t))  (t = 0, 1, ...).   (4)

We refer to (4) as the induced form of AdaGrad (1). The following result for the induced form is a simple corollary of Lemma 3.3.

Lemma 3.4. The following statements hold:
(i) ||\nabla L_ind(v(t))|| -> 0 as t -> \infty;
(ii) ||v(t)|| -> \infty as t -> \infty;
(iii) L_ind(v(t)) -> 0 as t -> \infty;
(iv) for every n, lim_{t->\infty} v(t)^T \xi_n = \infty;
(v) there exists t_0 such that v(t)^T \xi_n > 0 for all t >= t_0 and all n.

For the induced loss L_ind, consider the GD iterates

    u(t+1) = u(t) - \eta \nabla L_ind(u(t))  (t = 0, 1, ...).   (5)

According to Theorem 3 in Soudry et al. 2018, we have

    lim_{t->\infty} u(t)/||u(t)|| = \hat{u}/||\hat{u}||,  where \hat{u} = argmin_{u : u^T \xi_n >= 1, \forall n} ||u||^2.

Noting that \theta(t) -> 1 as t -> \infty, we can obtain the GD iterates (5) by taking the limit of \theta(t) in (4). It is therefore reasonable to expect that the two iterative processes have similar asymptotic behavior, and in particular a common limiting direction.
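The iteration (1) is easy to experiment with numerically. The following is a minimal NumPy sketch of diagonal AdaGrad on the logistic loss; the toy dataset, step size, and iteration count are our own illustrative choices, not values from the paper.

```python
import numpy as np

def grad(w, X):
    # Gradient of L(w) = sum_n log(1 + exp(-w^T x_n)); the labels y_n
    # are assumed to be folded into the rows of X, as in the setup above.
    s = 1.0 / (1.0 + np.exp(X @ w))      # equals -l'(w^T x_n) for each n
    return -X.T @ s

def adagrad(X, w0, eta=0.5, eps=1e-3, steps=5000):
    # Diagonal AdaGrad (1): w(t+1) = w(t) - eta * h(t) * g(t), with
    # h_i(t) = 1 / (sqrt(g_i(0)^2 + ... + g_i(t)^2) + eps).
    w = w0.astype(float).copy()
    G = np.zeros_like(w)                 # running sums g_i(0)^2 + ... + g_i(t)^2
    for _ in range(steps):
        g = grad(w, X)
        G += g * g
        w = w - eta * g / (np.sqrt(G) + eps)
    return w

# A separable toy dataset, symmetric about the direction (1, 1).
X = np.array([[2.0, 0.5], [0.5, 2.0]])
w = adagrad(X, np.zeros(2))
print(w / np.linalg.norm(w))             # empirical direction of the iterates
```

On this symmetric instance the iterates separate the data and their direction approaches (1, 1)/sqrt(2); on asymmetric data the limiting direction generally differs from the hard-margin SVM direction.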

Unlike the case of GD discussed in Soudry et al. 2018, however, it is difficult to obtain an effective estimate of the convergence rate of w(t). Instead, we introduce an orthogonal decomposition approach to obtain the asymptotic direction of the original AdaGrad process (1). In the remainder of the paper, we denote by P the projection onto the one-dimensional subspace spanned by \hat{u}, and by Q the projection onto its orthogonal complement. Without any loss of generality we may assume ||\hat{u}|| = 1. Thus we have the orthogonal decomposition

    v = Pv + Qv  (v \in R^p),  where Pv = ||Pv|| \hat{u} = (v^T \hat{u}) \hat{u}.

Moreover, we denote

    \delta(t) = -\nabla L_ind(v(t)),  d(t) = \eta \theta(t) \odot \delta(t).   (6)

Using this notation we can rewrite the iteration scheme (4) as

    v(t+1) = v(t) + d(t)  (t = 0, 1, ...).

By reformulating (6) as d(t) = \eta \delta(t) + \eta (\theta(t) - 1) \odot \delta(t), where \theta(t) - 1 -> 0 as t -> \infty, we regard \delta(t) as the decisive part of d(t) and acquire properties of d(t) through exploring their analogues for \delta(t). First, we can show a basic estimate:

    \delta(t)^T \hat{u} = ||P\delta(t)|| >= ||\delta(t)|| / \max_n ||\xi_n||  (t = 0, 1, 2, ...).

This projection property of \delta(t) is easily passed on to d(t). In fact, for sufficiently large t,

    d(t)^T \hat{u} = ||Pd(t)|| >= ||d(t)|| / (4 \max_n ||\xi_n||).   (7)

Inequality (7) provides a cumulative effect on the projection of v(t) as t increases:

    ||Pv(t)|| >= ||v(t)|| / (8 \max_n ||\xi_n||)

for sufficiently large t. The following lemma reveals a crucial characteristic of the iterative process (4): as t tends to infinity, the contribution of \delta(t) to the increment of the deviation from the direction of \hat{u} becomes more and more insignificant compared with its contribution to the increment in the direction of \hat{u}.

Lemma 3.5. Given \epsilon' > 0, let a, b, c be the positive constants defined in Assumption 3 of Section 2. If ||Qv(t)|| >= 2N(c+1)(ac\epsilon' e)^{-1}, then for sufficiently large t,

    Qv(t)^T \delta(t) <= \epsilon' ||Qv(t)|| ||\delta(t)||.

Lemma 3.6. Given \epsilon' > 0, there exists R > 0 such that, for sufficiently large t with ||Qv(t)|| >= R,

    ||Qv(t+1)|| - ||Qv(t)|| <= \epsilon' ||d(t)||.

Therefore, over a long period, the cumulative increment of v(t) in the direction of \hat{u} will overwhelm the deviation from it, yielding the existence of an asymptotic direction for v(t).

Lemma 3.7.

    lim_{t->\infty} v(t)/||v(t)|| = \hat{u}.   (8)

By the relation (3) between v(t) and w(t), our main result follows directly from (8).
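The orthogonal decomposition underlying Lemmas 3.5-3.7 is elementary and easy to sanity-check numerically. In this sketch, u is an arbitrary unit vector standing in for \hat{u}; it is not computed from any data.

```python
import numpy as np

# v = Pv + Qv, where P projects onto span{u} and Q onto the
# orthogonal complement; u is a placeholder unit vector (||u|| = 1).
u = np.array([0.6, 0.8])
v = np.array([2.0, -1.0])

Pv = (v @ u) * u                  # Pv = (v^T u) u, the component along u
Qv = v - Pv                       # the deviation from the direction of u

print(Pv, Qv)
```

By construction Qv is orthogonal to u and ||v||^2 = ||Pv||^2 + ||Qv||^2, which is exactly the bookkeeping used when comparing increments along and across the direction \hat{u}.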

Theorem 3.2. The AdaGrad iterates (1) have an asymptotic direction:

    lim_{t->\infty} w(t)/||w(t)|| = \tilde{w}/||\tilde{w}||,  where \tilde{w} = argmin_{w : w^T x_n >= 1, \forall n} ||h_\infty^{-1/2} \odot w||^2.   (9)

3.3 Factors Affecting the Asymptotic Direction

Theorem 3.2 confirms that the AdaGrad iterates (1) have an asymptotic direction \tilde{w}/||\tilde{w}||, where \tilde{w} is the solution of the optimization problem (9). Since the objective function ||h_\infty^{-1/2} \odot w||^2 is determined by the limit vector h_\infty, it is easy to see that the asymptotic direction may depend on the choice of the dataset {(x_n, y_n)}_{n=1}^N, the hyperparameters \epsilon, \eta, and the initial point w(0). In the following we discuss this dependency in several respects.

3.3.1 Difference from the Asymptotic Direction of GD Iterates

When the classic gradient descent method is applied to minimize the same loss, it is known (see Theorem 3 of Soudry et al. 2018) that the GD iterates

    w_G(t+1) = w_G(t) - \eta \nabla L(w_G(t))  (t = 0, 1, 2, ...),   (10)

have an asymptotic direction \hat{w}/||\hat{w}||, where \hat{w} is the solution of the hard max-margin SVM problem

    argmin_{w : w^T x_n >= 1, \forall n} ||w||^2.   (11)

The two optimization problems (9) and (11) have the same feasible set {w \in R^p : w^T x_n >= 1 for n = 1, ..., N}, but they take on different objective functions.
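The difference between (9) and (11) can be made concrete with a small numerical experiment (our own construction, not from the paper). We solve both problems on a two-point dataset with SciPy's SLSQP solver, using a hypothetical positive vector h_inf in place of the true limit of the adaptive rates.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 1.0], [1.0, 3.0]])        # separable toy features x_n
h_inf = np.array([0.25, 4.0])                 # hypothetical stand-in for h_infinity
cons = [{"type": "ineq", "fun": lambda w, x=x: w @ x - 1.0} for x in X]

# Problem (9): minimize ||h_inf^{-1/2} * w||^2 over the shared feasible set.
w_ada = minimize(lambda w: np.sum(w**2 / h_inf), x0=np.ones(2),
                 constraints=cons, method="SLSQP").x

# Problem (11): the hard-margin SVM objective ||w||^2, same constraints.
w_gd = minimize(lambda w: w @ w, x0=np.ones(2),
                constraints=cons, method="SLSQP").x

print(w_ada / np.linalg.norm(w_ada))          # predicted AdaGrad direction
print(w_gd / np.linalg.norm(w_gd))            # max-margin (GD) direction
```

With this h_inf the minimizers are w_ada = (0.12, 0.64) and w_gd = (0.25, 0.25): the diagonal weighting tilts the AdaGrad direction away from the max-margin one even though the feasible set is identical.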

It is natural to expect that their solutions \tilde{w} and \hat{w} yield different directions, as shown in the following toy example.

Example 3.1. Let x_1 = (\cos\theta, \sin\theta)^T and L(w) = e^{-w^T x_1}. Suppose 0 < \theta < \pi/2. [...]

Suppose \gamma > 0, x_1 = r_1 (\cos\theta_1, \sin\theta_1)^T, x_2 = r_2 (\cos\theta_2, \sin\theta_2)^T. It is easy to check that if w = (w_1, w_2)^T satisfies w^T x_i >= 1 (i = 1, 2), then w_1 >= \gamma and w_2 >= \gamma. Thus any quadratic form b_1 w_1^2 + b_2 w_2^2 (b_1, b_2 > 0) attains its minimum over the feasible set {w : w^T x_i >= 1 (i = 1, 2)} at (\gamma, \gamma)^T. Hence the asymptotic direction of AdaGrad (1) applied to this problem is always equal to (\gamma, \gamma)^T / ||(\gamma, \gamma)^T||, which is also the asymptotic direction of GD (10). A geometric perspective on this example is given in Figure 3, where the red arrows indicate x_1 = (\cos(5\pi/8), \sin(5\pi/8))^T and x_2 = (\cos(\pi/8), \sin(\pi/8))^T, and the magenta arrow indicates (\gamma, \gamma)^T. It is easy to see that the isoline (the thick ellipse drawn in green) along which the function ||h_\infty^{-1/2} \odot w||^2 equals its minimum must intersect the feasible set (the grey shaded area) at the corner (\gamma, \gamma)^T, no matter what h_\infty is.

Figure 3: A case in which the asymptotic directions of AdaGrad and GD are equal.
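The corner argument can be spot-checked numerically. Below is our own simplified instance: axis-aligned constraints that force w_1 >= gamma and w_2 >= gamma, solved with SciPy's SLSQP. For every choice of positive weights, standing in for the diagonal of h_inf^{-1}, the constrained minimizer is the same corner.

```python
import numpy as np
from scipy.optimize import minimize

gamma = 1.5
# Constraints w^T x_i >= 1 that reduce to w_1 >= gamma and w_2 >= gamma,
# so the feasible set has its corner at (gamma, gamma).
X = np.array([[1.0 / gamma, 0.0], [0.0, 1.0 / gamma]])
cons = [{"type": "ineq", "fun": lambda w, x=x: w @ x - 1.0} for x in X]

# Minimize b_1 w_1^2 + b_2 w_2^2 for several positive weight vectors b.
weights = [np.array([1.0, 1.0]), np.array([0.1, 10.0]), np.array([7.0, 0.3])]
sols = [minimize(lambda w, b=b: b @ w**2, x0=np.array([2.0, 2.0]),
                 constraints=cons, method="SLSQP").x for b in weights]
print(sols)                       # each minimizer sits at the corner
```

The elliptical isolines change with b, but the constrained minimizer stays pinned at the corner, matching the geometric picture described above.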

4 Conclusion

We proved that the basic diagonal AdaGrad, when minimizing a smooth monotone loss function with an exponential tail, has an asymptotic direction, which can be characterized as the solution of a quadratic optimization problem. In this respect AdaGrad is similar to GD, even though their asymptotic directions are usually different. The difference between them also lies in the stability of their asymptotic directions. The asymptotic direction of GD is uniquely determined by the predictors x_n and is independent of the initialization and the learning rate, as well as of rotations of the coordinate system, while the asymptotic direction of AdaGrad is likely to be affected by those factors. In spite of these findings, we still do not know whether the asymptotic direction of AdaGrad will change under various initializations or different learning rates. Furthermore, we hope our approach can be applied to research on the implicit biases of other adaptive methods such as AdaDelta, RMSProp, and Adam.

References

B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422-2430, 2015.

B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2016.

B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning, 2017. arXiv:1705.03071.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
