The Implicit Bias of AdaGrad on Separable Data

Qian Qian
Department of Statistics, Ohio State University, Columbus, OH 43210, USA

Xiaoyuan Qian
School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, China

Abstract

We study the implicit bias of AdaGrad on separable linear classification problems. We show that AdaGrad converges to a direction that can be characterized as the solution of a quadratic optimization problem with the same feasible set as the hard SVM problem. We also discuss how different choices of the hyperparameters of AdaGrad might impact this direction. This provides a deeper understanding of why adaptive methods do not seem to have generalization ability as good as that of gradient descent in practice.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In recent years, implicit regularization from various optimization algorithms has played a crucial role in the generalization abilities of trained deep neural networks (Salakhutdinov and Srebro 2015, Neyshabur et al. 2015, Keskar et al. 2016, Neyshabur et al. 2017, Zhang et al. 2017). For example, in underdetermined problems where the number of parameters is larger than the number of training examples, many global optima fail to exhibit good generalization properties; however, a specific optimization algorithm (such as gradient descent) does converge to a particular optimum that generalizes well, even though no explicit regularization is enforced when training the model. In other words, the optimization technique itself biases the training towards a certain model in an implicit way (Soudry et al. 2018). This motivates a line of work investigating the implicit biases of various algorithms (Telgarsky 2013, Soudry et al. 2018, Gunasekar et al. 2017, 2018a,b).

The choice of algorithm affects the implicit regularization introduced into the learned model. In underdetermined least squares problems, where the minimizers are finite, gradient descent yields the minimum L2-norm solution, whereas coordinate descent might give a different solution. Another example is logistic regression with separable data. While gradient descent converges in the direction of the hard-margin support vector machine solution (Soudry et al. 2018), coordinate descent converges to the maximum L1-margin solution (Telgarsky 2013, Gunasekar et al. 2018a). Unlike the squared loss, the logistic loss does not admit a finite global minimizer on separable data: the iterates must diverge in order to drive the loss to zero. As a result, instead of characterizing the convergence of the iterates w(t), it is the asymptotic direction of these iterates, i.e., lim_{t→∞} w(t)/‖w(t)‖, that is important and has therefore been characterized (Soudry et al. 2018, Gunasekar et al. 2018b).

Moreover, it has attracted much attention that different adaptive variants of gradient descent, and different hyperparameters of a given adaptive method, exhibit different biases and thus lead to different generalization performance in deep learning (Salakhutdinov and Srebro 2015, Keskar et al. 2016, Wilson et al. 2017, Hoffer et al. 2017). Among those findings is that the vanilla SGD algorithm demonstrates better generalization than its adaptive variants (Wilson et al. 2017), such as AdaGrad (Duchi et al. 2010) and Adam (Kingma and Ba 2015). It is therefore important to characterize precisely how different adaptive methods induce different biases. A natural question to ask is: can we explain this observation by characterizing the implicit bias of AdaGrad, which is a paradigm of adaptive methods, in a binary classification setting with separable data using logistic regression? And how does the implicit bias depend on the choice of the hyperparameters of this specific algorithm, such as the initialization, the step size, etc.?
1.1 Our Contribution

In this work we study AdaGrad applied to logistic regression with separable data. Our contribution is three-fold, as follows.

- We prove that the directions of the AdaGrad iterates, with a sufficiently small constant step size, always converge.
- We formulate the asymptotic direction as the solution of a quadratic optimization problem. This gives a theoretical characterization of the implicit bias of AdaGrad, and it also provides insight into why and how the factors involved, such as certain intrinsic properties of the dataset, the initialization, and the learning rate, affect the implicit bias.
- We introduce a novel approach to studying the bias of AdaGrad. It is mainly based on a geometric estimate of the directions of the updates and does not depend on any calculation of the convergence rate.

1.2 Paper Organization

This paper is organized as follows. In Section 2 we explain our problem setup. The main theory is developed in Section 3, including convergence of the adaptive learning rates of AdaGrad, existence of the asymptotic direction of the AdaGrad iterates, and relations between the asymptotic directions of AdaGrad and gradient descent iterates. We conclude our paper in Section 4 with a review of our results and some questions left for future research.
2 Problem Setup

Let {(x_n, y_n) : n = 1, …, N} be a training dataset with features x_n ∈ R^p and labels y_n ∈ {−1, 1}. To simplify the notation, we redefine y_n x_n as x_n, n = 1, …, N, and consider learning the logistic regression model over the empirical loss

    L(w) = Σ_{n=1}^N ℓ(w^T x_n),  w ∈ R^p,

where ℓ : R → R. We focus on the following case, the same as proposed in Soudry et al. 2018.

Assumption 1. There exists a vector w* such that w*^T x_n > 0 for all n.

Assumption 2. ℓ(u) is continuously differentiable, β-smooth, and strictly decreasing to zero.

Assumption 3. There exist positive constants a, b, c, and d such that

    |ℓ′(u) + c e^{−au}| ≤ e^{−(a+b)u}  for u > d.

It is easy to see that the exponential loss ℓ(u) = e^{−u} and the logistic loss ℓ(u) = log(1 + e^{−u}) both meet these assumptions.

Given two hyperparameters ε, η > 0 and an initial point w(0) ∈ R^p, we consider the diagonal AdaGrad iterates

    w(t + 1) = w(t) − h(t) ⊙ g(t),  t = 0, 1, 2, …,    (1)

where

    g(t) = (g_1(t), …, g_p(t))^T,  g_i(t) = (∂L/∂w_i)(w(t)),
    h(t) = (h_1(t), …, h_p(t))^T,  h_i(t) = η / √(g_i(0)² + … + g_i(t)² + ε),  i = 1, …, p,

and ⊙ is the element-wise multiplication of two vectors, e.g. a ⊙ b = (a_1 b_1, …, a_p b_p)^T for a = (a_1, …, a_p)^T, b = (b_1, …, b_p)^T.

To analyze the convergence of the algorithm, we put an additional restriction on the hyperparameter η.

Assumption 4. The hyperparameter η is not too large; specifically,

    η ≤ (2/β) min_{i ∈ {1, …, p}} √(g_i(0)² + ε).    (2)

We are interested in the asymptotic behavior of the AdaGrad iteration scheme (1). The main problem is: does there exist some vector w_A such that lim_{t→∞} w(t)/‖w(t)‖ = w_A?
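Scheme (1) is compact enough to sketch directly in code. The following minimal Python illustration (not the authors' implementation; the two-dimensional dataset and the values of η, ε, and w(0) are hypothetical) runs diagonal AdaGrad with the exponential loss ℓ(u) = e^{−u}:

```python
import math

# Toy linearly separable data, labels already folded in (x_n := y_n x_n).
X = [(1.0, 2.0), (2.0, 0.5), (0.5, 1.5)]

eps, eta = 1e-8, 0.05      # hypothetical choices of the hyperparameters
w = [0.0, 0.0]             # initial point w(0)
G = [0.0, 0.0]             # running sums g_i(0)^2 + ... + g_i(t)^2

def grad(w):
    """Gradient of L(w) = sum_n exp(-w . x_n) (exponential loss)."""
    g = [0.0, 0.0]
    for x in X:
        e = math.exp(-(w[0] * x[0] + w[1] * x[1]))
        g[0] -= e * x[0]
        g[1] -= e * x[1]
    return g

losses = []
for t in range(5000):
    losses.append(sum(math.exp(-(w[0] * x[0] + w[1] * x[1])) for x in X))
    g = grad(w)
    for i in range(2):
        G[i] += g[i] ** 2
        h = eta / math.sqrt(G[i] + eps)   # adaptive step h_i(t)
        w[i] -= h * g[i]                  # update (1)

norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
print(losses[0], losses[-1], norm)
```

Consistent with the analysis below, the loss decreases monotonically toward zero while ‖w(t)‖ keeps growing, so only the direction w(t)/‖w(t)‖ can converge.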
We will provide an affirmative answer to this question in the following section.

3 The Asymptotic Direction of AdaGrad Iterates

3.1 Convergence of the Adaptive Learning Rates

We first provide some elementary facts about the AdaGrad iterates (1) under Assumptions 1–4 of Section 2.

Lemma 3.1. L(w(t + 1)) ≤ L(w(t)), t = 0, 1, 2, ….

Theorem 3.1. The sequence {h(t)}_{t=0}^∞ converges as t → ∞ to a vector h_∞ = (h_{∞,1}, …, h_{∞,p})^T satisfying h_{∞,i} > 0 (i = 1, …, p).

3.2 Convergence of the Directions of AdaGrad Iterates

In the remainder of the paper we denote h_∞ = lim_{t→∞} h(t) and ξ_n = h_∞^{1/2} ⊙ x_n (n = 1, …, N). Since, by Theorem 3.1, the components of h_∞ have a positive lower bound, we can define

    η(t) = h_∞^{−1} ⊙ h(t)  (t = 0, 1, 2, …).

Here the square root and the inverse of a vector are defined element-wise. We call the function

    L_ind : R^p → R,  L_ind(v) = Σ_{n=1}^N ℓ(v^T ξ_n)

the induced loss with respect to AdaGrad (1).
Note that

    L(w) = Σ_{n=1}^N ℓ(w^T x_n) = Σ_{n=1}^N ℓ((h_∞^{−1/2} ⊙ w)^T (h_∞^{1/2} ⊙ x_n)) = Σ_{n=1}^N ℓ((h_∞^{−1/2} ⊙ w)^T ξ_n) = L_ind(h_∞^{−1/2} ⊙ w).

Thus if we set

    v(t) = h_∞^{−1/2} ⊙ w(t)  (t = 0, 1, 2, …),    (3)

then v(0) = h_∞^{−1/2} ⊙ w(0), and

    h_∞^{1/2} ⊙ v(t + 1) = w(t + 1) = w(t) − h(t) ⊙ ∇L(w(t))
                         = h_∞^{1/2} ⊙ v(t) − h(t) ⊙ h_∞^{−1/2} ⊙ ∇L_ind(v(t))
                         = h_∞^{1/2} ⊙ (v(t) − η(t) ⊙ ∇L_ind(v(t))),

or

    v(t + 1) = v(t) − η(t) ⊙ ∇L_ind(v(t))  (t = 0, 1, …).    (4)

We refer to (4) as the induced form of AdaGrad (1). The following result for the induced form is a simple corollary of Lemma 3.3.
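The change of variables behind (3)–(4) is purely algebraic: it holds verbatim when any fixed positive vector is substituted for h_∞ (the specific choice h_∞ = lim_t h(t) matters only because it makes η(t) → 1). A quick numerical check of one step of this identity, with hypothetical values for the data, the step vector h(t), and the positive vector standing in for h_∞:

```python
import math

# Data with labels folded in; exponential loss l(u) = exp(-u).
X = [(1.0, 2.0), (2.0, 0.5)]
h_inf = (0.7, 1.3)     # arbitrary positive vector standing in for h_inf
s = (math.sqrt(h_inf[0]), math.sqrt(h_inf[1]))
xi = [(s[0] * x[0], s[1] * x[1]) for x in X]   # induced features

def grad_L(w):
    """Gradient of L(w) = sum_n exp(-w . x_n)."""
    g = [0.0, 0.0]
    for x in X:
        e = math.exp(-(w[0] * x[0] + w[1] * x[1]))
        g[0] -= e * x[0]
        g[1] -= e * x[1]
    return g

def grad_L_ind(v):
    """Gradient of the induced loss L_ind(v) = sum_n exp(-v . xi_n)."""
    g = [0.0, 0.0]
    for z in xi:
        e = math.exp(-(v[0] * z[0] + v[1] * z[1]))
        g[0] -= e * z[0]
        g[1] -= e * z[1]
    return g

w = [0.2, -0.1]
h_t = (0.05, 0.03)     # some step vector h(t)

# One AdaGrad-style step in w:  w(t+1) = w(t) - h(t) . grad L(w(t))
gw = grad_L(w)
w_next = [w[i] - h_t[i] * gw[i] for i in range(2)]

# The induced step (4) in v = h_inf^{-1/2} . w with eta(t) = h_inf^{-1} . h(t)
v = [w[i] / s[i] for i in range(2)]
eta_t = [h_t[i] / h_inf[i] for i in range(2)]
gv = grad_L_ind(v)
v_next = [v[i] - eta_t[i] * gv[i] for i in range(2)]

# Mapping back must recover the same point: w(t+1) = h_inf^{1/2} . v(t+1)
back = [s[i] * v_next[i] for i in range(2)]
diff = max(abs(back[i] - w_next[i]) for i in range(2))
print(diff)   # zero up to floating-point rounding
```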
Lemma 3.4. The following statements hold:
(i) ‖∇L_ind(v(t))‖ → 0 (t → ∞);
(ii) ‖v(t)‖ → ∞ (t → ∞);
(iii) L_ind(v(t)) → 0 (t → ∞);
(iv) ∀n, lim_{t→∞} v(t)^T ξ_n = ∞;
(v) ∃t_0 such that ∀t ≥ t_0 and ∀n, v(t)^T ξ_n > 0.

For the induced loss L_ind, consider the GD iterates

    u(t + 1) = u(t) − ∇L_ind(u(t))  (t = 0, 1, …).    (5)

According to Theorem 3 in Soudry et al. (2018), we have

    lim_{t→∞} u(t)/‖u(t)‖ = û/‖û‖,  where û = argmin_{u : u^T ξ_n ≥ 1, ∀n} ‖u‖².

Noting that η(t) → 1 (t → ∞), we can obtain the GD iterates (5) by taking the limit of η(t) in (4). Therefore it is reasonable to expect that these two iterative processes have similar asymptotic behaviors, especially a common limiting direction.
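This limiting behavior of the induced GD iterates can be watched numerically. In the sketch below (hypothetical two-point data; a small constant step size replaces the unit step in (5), purely to tame the first iterates), û is obtained from the two active constraints and GD is run on the induced exponential loss; the cosine between u(t) and û slowly approaches 1, as Theorem 3 of Soudry et al. (2018) predicts.

```python
import math

xi = [(1.0, 2.0), (2.0, 0.5)]    # hypothetical induced features xi_n

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# For this dataset both constraints are active at the optimum, so u_hat
# solves u . xi_n = 1 for n = 1, 2; feasibility is checked below.
a, b = xi
det = a[0] * b[1] - a[1] * b[0]
u_hat = ((b[1] - a[1]) / det, (a[0] - b[0]) / det)
assert all(dot(u_hat, z) >= 1 - 1e-9 for z in xi)

# GD on the induced exponential loss, as in (5); step size 0.1 replaces
# the unit step purely for a tamer trajectory.
u = [0.0, 0.0]
for t in range(20000):
    g = [0.0, 0.0]
    for z in xi:
        e = math.exp(-dot(u, z))
        g[0] -= e * z[0]
        g[1] -= e * z[1]
    u[0] -= 0.1 * g[0]
    u[1] -= 0.1 * g[1]

cosine = dot(u, u_hat) / math.sqrt(dot(u, u) * dot(u_hat, u_hat))
print(cosine)
```

The convergence in direction is logarithmically slow, which is one reason the asymptotic direction, rather than any finite iterate, is the right object to study.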
Unlike the case of the GD method discussed in Soudry et al. 2018, however, it is difficult to obtain an effective estimate of the convergence rate of w(t). Instead, we introduce an orthogonal decomposition approach to obtain the asymptotic direction of the original AdaGrad process (1). In the remainder of the paper, we denote by P the projection onto the 1-dimensional subspace spanned by û, and by Q the projection onto its orthogonal complement. Without any loss of generality we may assume ‖û‖ = 1. Thus we have the orthogonal decomposition v = Pv + Qv (v ∈ R^p), where Pv = ‖Pv‖ û = (v^T û) û. Moreover, we denote

    δ(t) = −∇L_ind(v(t)),  d(t) = η(t) ⊙ δ(t).    (6)

Using this notation we can rewrite the iteration scheme (4) as

    v(t + 1) = v(t) + d(t)  (t = 0, 1, …).

By reformulating (6) as d(t) = δ(t) + (η(t) − 1) ⊙ δ(t), where η(t) − 1 → 0 as t → ∞, we regard δ(t) as the decisive part of d(t) and acquire properties of d(t) through exploring analogues of δ(t). First, we can show a basic estimate:

    δ(t)^T û = ‖Pδ(t)‖ ≥ ‖δ(t)‖ / max_n ‖ξ_n‖  (t = 0, 1, 2, …).

This projection property of δ(t) is easily passed on to d(t). In fact, for sufficiently large t,

    d(t)^T û = ‖Pd(t)‖ ≥ ‖d(t)‖ / (4 max_n ‖ξ_n‖).    (7)

Inequality (7) provides a cumulative effect on the projection of v(t) as t increases:

    ‖Pv(t)‖ ≥ ‖v(t)‖ / (8 max_n ‖ξ_n‖)

for sufficiently large t.
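The basic estimate above can be sanity-checked numerically: for the exponential loss, δ(v) = −∇L_ind(v) is a nonnegative combination of the ξ_n, and each ξ_n has margin at least 1 against û, which forces δ to retain a definite component along û. (The features, û, and test points below are hypothetical; û is normalized as in the text.)

```python
import math

xi = [(1.0, 2.0), (2.0, 0.5)]             # hypothetical induced features
u_hat = (3.0 / 7.0, 2.0 / 7.0)            # here u_hat . xi_n = 1 for both n

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

nrm = math.sqrt(dot(u_hat, u_hat))
u_dir = (u_hat[0] / nrm, u_hat[1] / nrm)  # normalized, as in the text
max_xi = max(math.sqrt(dot(z, z)) for z in xi)

# delta(v) = -grad L_ind(v) is a nonnegative combination of the xi_n,
# each of which has margin >= 1 against u_hat, yielding the estimate.
for v in [(0.0, 0.0), (1.0, -2.0), (3.0, 0.5), (-1.0, 4.0)]:
    delta = [0.0, 0.0]
    for z in xi:
        e = math.exp(-dot(v, z))          # -l'(u) = e^{-u} for the exp loss
        delta[0] += e * z[0]
        delta[1] += e * z[1]
    lhs = dot(delta, u_dir)               # component along u_hat
    rhs = math.sqrt(dot(delta, delta)) / max_xi
    assert lhs >= rhs - 1e-12
print("estimate holds on all samples")
```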
The following lemma reveals a crucial characteristic of the iterative process (4): as t tends to infinity, the contribution of δ(t) to the increment of the deviation from the direction of û, compared with its contribution to the increment in the direction of û, becomes more and more insignificant.

Lemma 3.5. Given γ > 0, let a, b, c be the positive constants defined in Assumption 3 in Section 2. If ‖Qv(t)‖ ≥ 2N(c + 1)(ace)^{−1} γ^{−1}, then for sufficiently large t, Qv(t)^T δ(t) ≤ 0.

Lemma 3.6. Given γ > 0, there exists R > 0 such that, for sufficiently large t with ‖Qv(t)‖ ≥ R,

    ‖Qv(t + 1)‖ − ‖Qv(t)‖ ≤ γ ‖d(t)‖.

Therefore, over a long period, the cumulative increment of v(t) in the direction of û will overwhelm the deviation from it, yielding the existence of an asymptotic direction for v(t).

Lemma 3.7.

    lim_{t→∞} v(t)/‖v(t)‖ = û.    (8)

By the relation (3) between v(t) and w(t), our main result follows directly from (8).
Theorem 3.2. The AdaGrad iterates (1) have an asymptotic direction:

    lim_{t→∞} w(t)/‖w(t)‖ = w̃/‖w̃‖,  where  w̃ = argmin_{w : w^T x_n ≥ 1, ∀n} ‖h_∞^{−1/2} ⊙ w‖².    (9)

3.3 Factors Affecting the Asymptotic Direction

Theorem 3.2 confirms that the AdaGrad iterates (1) have an asymptotic direction w̃/‖w̃‖, where w̃ is the solution of the optimization problem (9). Since the objective function ‖h_∞^{−1/2} ⊙ w‖² is determined by the limit vector h_∞, it is easy to see that the asymptotic direction may depend on the choice of the dataset {(x_n, y_n)}_{n=1}^N, the hyperparameters ε, η, and the initial point w(0). In the following we discuss this dependency in several respects.

3.3.1 Difference from the Asymptotic Direction of GD Iterates

When the classic gradient descent method is applied to minimize the same loss, it is known (see Theorem 3, Soudry et al. 2018) that the GD iterates

    w_G(t + 1) = w_G(t) − η ∇L(w_G(t))  (t = 0, 1, 2, …)    (10)

have an asymptotic direction ŵ/‖ŵ‖, where ŵ is the solution of the hard max-margin SVM problem

    argmin_{w : w^T x_n ≥ 1, ∀n} ‖w‖².    (11)

The two optimization problems (9) and (11) have the same feasible set {w ∈ R^p : w^T x_n ≥ 1 for n = 1, …, N}, but they take on different objective functions.
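The two problems can be compared concretely. Substituting u = h_∞^{−1/2} ⊙ w turns (9) into a hard-margin problem with rescaled features h_∞^{1/2} ⊙ x_n, so one solver handles both (9) and (11). The sketch below (hypothetical 2D data; a hypothetical stand-in for h_∞; brute-force active-set enumeration in place of a QP solver) recovers the SVM direction at h = (1, 1) and a visibly different direction for an anisotropic h:

```python
import math

X = [(1.0, 2.0), (2.0, 0.5)]   # hypothetical separable data (labels folded in)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def min_norm_2d(pts):
    """argmin ||u||^2 s.t. u . p >= 1 for both points p, by enumerating
    the possible active sets of this tiny 2D problem."""
    cands = [(p[0] / dot(p, p), p[1] / dot(p, p)) for p in pts]  # one active
    a, b = pts
    det = a[0] * b[1] - a[1] * b[0]
    if abs(det) > 1e-12:                                         # both active
        cands.append(((b[1] - a[1]) / det, (a[0] - b[0]) / det))
    feas = [u for u in cands if all(dot(u, p) >= 1 - 1e-12 for p in pts)]
    return min(feas, key=lambda u: dot(u, u))

def direction_9(h):
    """Normalized solution of problem (9) for a given positive vector h."""
    s = (math.sqrt(h[0]), math.sqrt(h[1]))
    u = min_norm_2d([(s[0] * x[0], s[1] * x[1]) for x in X])
    w = (s[0] * u[0], s[1] * u[1])          # map back: w = h^{1/2} . u
    n = math.sqrt(dot(w, w))
    return (w[0] / n, w[1] / n)

gd_dir = direction_9((1.0, 1.0))    # h = (1, 1): problem (9) reduces to (11)
ada_dir = direction_9((4.0, 0.25))  # anisotropic stand-in for h_inf
print(gd_dir, ada_dir)
```

Here the two printed directions differ substantially, illustrating how the limit vector h_∞ can tilt the AdaGrad direction away from the max-margin one.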
It is natural to expect that their solutions w̃ and ŵ yield different directions, as shown in the following toy example.

Example 3.1. Let x_1 = (cos θ, sin θ)^T and L(w) = e^{−w^T x_1}. Suppose 0 < θ < π/2.

In contrast, suppose r_1, r_2 > 0, x_1 = r_1 (cos θ_1, sin θ_1)^T and x_2 = r_2 (cos θ_2, sin θ_2)^T, with π/2 ≤ θ_1 < π and −π/2 < θ_2 ≤ 0. It is easy to check that if w = (w_1, w_2)^T satisfies w^T x_i ≥ 1 (i = 1, 2), then w_1 ≥ μ and w_2 ≥ ν, where (μ, ν)^T denotes the vertex of the feasible set at which both constraints are active. Thus any quadratic form b_1 w_1² + b_2 w_2² (b_1, b_2 > 0) takes its minimum at (μ, ν)^T over the feasible set {w : w^T x_i ≥ 1 (i = 1, 2)}. Hence the asymptotic direction of AdaGrad (1) applied to this problem is always equal to (μ, ν)^T / ‖(μ, ν)^T‖, which is also the asymptotic direction of GD (10). A geometric perspective of this example is given in Figure 2, where the red arrows indicate x_1 = (cos(5π/8), sin(5π/8))^T and x_2 = (cos(−π/8), sin(−π/8))^T, and the magenta arrow indicates (μ, ν)^T. It is easy to see that the isoline (the thick ellipse drawn in green) along which the function ‖h_∞^{−1/2} ⊙ w‖² equals its minimum must intersect the feasible set (the grey shaded area) at the corner (μ, ν)^T, no matter what h_∞ is.

Figure 3: A case in which the asymptotic directions of AdaGrad and GD are equal.

4 Conclusion

We proved that the basic diagonal AdaGrad, when minimizing a smooth monotone loss function with an exponential tail, has an asymptotic direction, which can be characterized as the solution of a quadratic optimization problem.
In this respect AdaGrad is similar to GD, even though their asymptotic directions are usually different. The difference between them also lies in the stability of their asymptotic directions. The asymptotic direction of GD is uniquely determined by the predictors x_n and is independent of the initialization and the learning rate, as well as of rotations of the coordinate system, while the asymptotic direction of AdaGrad is likely to be affected by those factors. In spite of all these findings, we still do not know whether the asymptotic direction of AdaGrad will change under various initializations or different learning rates. Furthermore, we hope our approach can be applied to research on the implicit biases of other adaptive methods such as AdaDelta, RMSProp, and Adam.

References

B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.

B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro. Geometry of optimization and implicit regularization in deep learning, 2017. arXiv:1705.03071.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.