Online-Within-Online Meta-Learning

Giulia Denevi^{1,2}, Dimitris Stamos^3, Carlo Ciliberto^{3,4} and Massimiliano Pontil^{1,3}
^1 Istituto Italiano di Tecnologia (Italy), ^2 University of Genoa (Italy), ^3 University College London (UK), ^4 Imperial College London (UK)
giulia.denevi@iit.it, c.ciliberto@imperial.ac.uk, {d.stamos.12, m.pontil}@ucl.ac.uk

Abstract

We study the problem of learning a series of tasks in a fully online Meta-Learning setting. The goal is to exploit similarities among the tasks to incrementally adapt an inner online algorithm in order to incur a low averaged cumulative error over the tasks. We focus on a family of inner algorithms based on a parametrized variant of online Mirror Descent. The inner algorithm is incrementally adapted by an online Mirror Descent meta-algorithm using the corresponding within-task minimum regularized empirical risk as the meta-loss. In order to keep the process fully online, we approximate the meta-subgradients by the online inner algorithm. An upper bound on the approximation error allows us to derive a cumulative error bound for the proposed method. Our analysis can also be converted to the statistical setting by online-to-batch arguments. We instantiate two examples of the framework in which the meta-parameter is either a common bias vector or feature map. Finally, preliminary numerical experiments confirm our theoretical findings.

1 Introduction

Humans can quickly adapt knowledge gained when learning past tasks, in order to solve new tasks from just a handful of examples.
In contrast, learning systems are still rather limited when it comes to transferring knowledge over a sequence of learning problems. Overcoming this limitation can have a broad impact in artificial intelligence, as it can save the expensive preparation of large training samples, often humanly annotated, needed by current machine learning methods. As a result, Meta-Learning is receiving increasing attention, both from the applied [15, 32] and the theoretical [5, 40, 17] perspective. Until very recently, Meta-Learning was mainly studied in the batch statistical setting, where data are assumed to be independently sampled from some distribution and are processed in one batch, see [6, 23, 24, 25, 26, 29]. Only recently has interest grown in investigating more efficient methods combining ideas from Online Learning and Meta-Learning, see [1, 12, 13, 30, 3, 21, 16, 8, 11]. In this setting, which is sometimes referred to as Lifelong Learning, the tasks are observed sequentially via corresponding sets of training examples and the broad goal is to exploit similarities across the tasks to incrementally adapt an inner (within-task) algorithm to such a sequence. There are different ways to deal with Meta-Learning in an online framework: the so-called Online-Within-Batch (OWB) framework, where the tasks are processed online but the data within each task are processed in one batch, see [1, 12, 13, 16, 8, 3, 21], or the so-called Online-Within-Online (OWO) framework, where data are processed sequentially both within and across the tasks, see [1, 3, 21, 16, 11]. Previous work mainly analyzed specific settings; see the technical discussion in App. A.

The main goal of this work is to propose an OWO Meta-Learning approach that can be adapted to a broad family of algorithms. We consider a general class of inner algorithms based on primal-dual Online Learning [36, 38, 37, 35, 34].
In particular, we discuss in detail the case of online Mirror Descent on a regularized variant of the empirical risk. The regularizer belongs to a general family of strongly convex functions parametrized by a meta-parameter. The inner algorithm is adapted by a meta-algorithm, which also consists in applying online Mirror Descent on a meta-objective given by the within-task minimum regularized empirical risk. The interplay between the meta-algorithm and the inner algorithm plays a key role in our analysis: the latter is used to compute a good approximation of the meta-subgradient which is supplied to the former. A key novelty of our analysis is to show that, exploiting a closed form expression of the error on the meta-subgradients, we can automatically derive a cumulative error bound for the entire procedure. Our analysis also holds for more aggressive primal-dual online updates and it can be adapted to the statistical setting by online-to-batch arguments.
Contributions. Our contribution is threefold. First, we derive an efficient and theoretically grounded OWO Meta-Learning framework which is inspired by Multi-Task Learning (MTL). Our framework applies to a wide class of within-task algorithms and task relationships. Second, we establish how our analysis can be converted to the statistical setting. Finally, we show how our general analysis can be directly applied to two important families of inner algorithms in which the meta-parameter is either a bias vector or a feature map shared across the tasks.

Paper organization. We start by introducing in Sec. 2 our OWO Meta-Learning setting. In Sec. 3 we recall some background material from primal-dual Online Learning. In Sec. 4 we outline the proposed method and we give a cumulative error bound for it. In Sec. 5 we show how the above analysis can be used to derive guarantees for our method in the statistical setting. In Sec. 6 we specialize our framework to two important examples in which the tasks share a common bias vector or feature map. Finally, in Sec. 7 we report preliminary experiments with our method and in Sec. 8 we draw conclusions. Technical proofs are postponed to the appendix.

2 Setting

In this section we introduce the OWO Meta-Learning problem.
We consider that the learner is facing a sequence of online tasks. Corresponding to each task, there is an input space $\mathcal{X}$, an output space $\mathcal{Y}$ and a dataset $Z = (z_i)_{i=1}^n = (x_i, y_i)_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$, which is observed sequentially. Online Learning aims to design an algorithm that makes predictions through time from past information. More precisely, at each step $i \in \{1, \dots, n\}$: (a) a datapoint $z_i = (x_i, y_i)$ is observed, (b) the algorithm outputs a label $\hat y_i$, (c) the learner incurs the error $\ell_i(\hat y_i)$, where $\ell_i(\cdot) = \ell(\cdot, y_i)$ for a loss function $\ell$. To simplify our presentation, throughout we let $\mathcal{X} \subseteq \mathbb{R}^d$, $\mathcal{Y} \subseteq \mathbb{R}$ and we consider algorithms that perform linear predictions of the form $\hat y_i = \langle x_i, w_i \rangle$, where $(w_i)_{i=1}^n$ is a sequence of weight vectors updated by the algorithm and $\langle \cdot, \cdot \rangle$ denotes the standard inner product in $\mathbb{R}^d$. The goal is to bound the cumulative error of the algorithm, i.e. $\mathcal{E}_{\rm inner}(Z) = \sum_{i=1}^n \ell_i(\langle x_i, w_i \rangle)$, with respect to (w.r.t.) the same quantity incurred by a vector $\hat w \in \mathbb{R}^d$ fixed in hindsight, i.e. $\sum_{i=1}^n \ell_i(\langle x_i, \hat w \rangle)$.
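To make the protocol concrete, the following minimal sketch simulates this single-task online loop and accumulates the error $\mathcal{E}_{\rm inner}(Z)$. The absolute loss, the subgradient step and the $1/\sqrt{i}$ step-size schedule are illustrative assumptions, not the specific algorithm analyzed later in the paper.

```python
import numpy as np

def online_linear_learner(Z, eta=0.1):
    """Run a simple online loop over a data sequence Z = [(x_i, y_i), ...].

    At each step the learner predicts <x_i, w_i>, suffers the absolute
    loss |<x_i, w_i> - y_i| and updates w by a subgradient step.
    Returns the cumulative error (the quantity E_inner(Z) in the text).
    """
    d = len(Z[0][0])
    w = np.zeros(d)                          # w_1 = 0
    cumulative_error = 0.0
    for i, (x, y) in enumerate(Z, start=1):
        y_hat = float(np.dot(x, w))          # (b) predict
        cumulative_error += abs(y_hat - y)   # (c) suffer the loss
        g = np.sign(y_hat - y) * x           # subgradient of |<x, w> - y| in w
        w = w - (eta / np.sqrt(i)) * g       # illustrative update rule
    return cumulative_error

# Example: a toy task with n = 50 points in R^5
rng = np.random.default_rng(0)
Z = [(rng.normal(size=5), rng.normal()) for _ in range(50)]
print(online_linear_learner(Z))
```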
In the OWO Meta-Learning setting, we have a family of inner online algorithms identified by a meta-parameter $\theta$ belonging to a prescribed set $\Theta$, and the goal is to adapt $\theta$ to a sequence of learning tasks, in an online fashion. Throughout this work, $\Theta$ will be a closed, convex, non-empty subset of a Euclidean space $\mathcal{M}$. The broad goal is to "transfer information" gained when learning previous tasks, in order to help learning future tasks. For this purpose, we propose a Meta-Learning procedure, acting across the tasks, which modifies the inner algorithm one task after another. More precisely, we let $T$ be the number of tasks and, for each task $t \in \{1, \dots, T\}$, we let $Z_t = (x_{t,i}, y_{t,i})_{i=1}^n$ be the corresponding data sequence (throughout the paper we use the double subscript notation "$t,i$" to denote the outer task index and the inner within-task index). At each time $t$: (a) the meta-learner incrementally receives a task dataset $Z_t$, (b) it runs the inner online algorithm with meta-parameter $\theta_t$ on $Z_t$, returning the predictor vectors $(w_{t,i})_{i=1}^n$, (c) it incrementally incurs the errors $\ell_{t,i}(\langle x_{t,i}, w_{t,i} \rangle)$, where $\ell_{t,i}(\cdot) = \ell(\cdot, y_{t,i})$, (d) the meta-parameter (and consequently, the inner algorithm) is updated into $\theta_{t+1}$. Denoting by $\mathcal{E}_{\rm inner}(Z_t, \theta_t)$ the cumulative error of the inner algorithm with meta-parameter $\theta_t$ on the dataset $Z_t$, the goal is to bound the error accumulated across the tasks, i.e.

$$\mathcal{E}_{\rm meta}\big((Z_t)_{t=1}^T\big) = \sum_{t=1}^T \mathcal{E}_{\rm inner}(Z_t, \theta_t) = \sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\langle x_{t,i}, w_{t,i} \rangle), \qquad (1)$$

w.r.t. the same quantity incurred by a sequence of task vectors $(\hat w_t)_{t=1}^T$ fixed in hindsight, i.e. $\sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\langle x_{t,i}, \hat w_t \rangle)$.
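The protocol (a)-(d) can be summarized by the schematic loop below, where `inner_algorithm` and `meta_update` are placeholders for the specific within-task algorithm and meta-algorithm introduced later (Alg. 2 and Alg. 3); only the control flow is meant to mirror the text.

```python
def owo_protocol(task_datasets, theta_init, inner_algorithm, meta_update):
    """Schematic Online-Within-Online loop over T task datasets.

    inner_algorithm(Z, theta) -> (cumulative_error, inner_state)
    meta_update(theta, inner_state) -> new theta
    """
    theta = theta_init
    meta_error = 0.0                                   # the quantity in Eq. (1)
    for Z_t in task_datasets:                          # (a) receive the task dataset
        err_t, state_t = inner_algorithm(Z_t, theta)   # (b)-(c) run the inner algorithm
        meta_error += err_t
        theta = meta_update(theta, state_t)            # (d) update the meta-parameter
    return meta_error, theta

# Trivial usage with dummy callables, just to exercise the control flow
dummy_inner = lambda Z, theta: (float(len(Z)), None)
dummy_meta = lambda theta, state: theta
print(owo_protocol([range(3)] * 4, 0.0, dummy_inner, dummy_meta))
```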
The setting we consider in this paper is inspired by previous work on Multi-Task Learning, such as [2, 10, 18]. To describe it, we use extended real-valued functions and, for any data sequence $Z$ and meta-parameter $\theta \in \Theta$, we define the within-task minimum regularized empirical risk

$$\mathcal{L}_Z(\theta) = \min_{w \in \mathbb{R}^d} \; \mathcal{R}_Z(w) + \lambda f(w, \theta), \qquad \mathcal{R}_Z(w) = \frac{1}{n} \sum_{i=1}^n \ell_i(\langle x_i, w \rangle), \qquad (2)$$

where $\lambda \ge 0$ is a regularization parameter and $f$ is an appropriate complexity term ensuring the existence and the uniqueness of the above minimizer. Assuming the entire sequence $(Z_t)_{t=1}^T$ is available in hindsight and introducing the notation $\mathcal{L}_t = \mathcal{L}_{Z_t}$, many MTL methods read as follows

$$\min_{\theta \in \mathcal{M}} \; \sum_{t=1}^T \mathcal{L}_t(\theta) + \gamma F(\theta), \qquad (3)$$

where $\gamma \ge 0$ is a meta-regularization parameter and $F$ is an appropriate meta-regularizer ensuring that the above minimum is attained. We stress that in our OWO Meta-Learning setting, the data are received sequentially, both within and across the tasks. The above formulation inspires us to take a within-task online algorithm that mimics well the (batch) objective in Eq. (2) and to define as meta-objectives for the online meta-algorithm the functions $(\mathcal{L}_t)_{t=1}^T$. Obviously, in this setting, the meta-objectives (and consequently their subgradients used by the meta-algorithm) are computed only up to an approximation error, depending on the specific properties of the inner algorithm we are using. We will show how to control and exploit this approximation error in the analysis.
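As an illustration, for the bias regularizer of Example 1 below ($f(w, \theta) = \frac{1}{2}\|w - \theta\|_2^2$), the meta-objective $\mathcal{L}_Z(\theta)$ of Eq. (2) can be evaluated numerically by solving the inner minimization directly. The sketch below does so with scipy for a squared loss, purely to make the definition concrete; the method of the paper never solves this batch problem exactly, and the loss choice is an assumption for the example.

```python
import numpy as np
from scipy.optimize import minimize

def meta_objective_bias(Z, theta, lam):
    """Evaluate L_Z(theta) of Eq. (2) for the bias regularizer
    f(w, theta) = 0.5 * ||w - theta||^2 and the squared loss."""
    X = np.stack([x for x, _ in Z])          # n x d design matrix
    y = np.array([y for _, y in Z])
    n = len(y)

    def objective(w):
        residual = X @ w - y
        return (residual @ residual) / (2 * n) + lam * 0.5 * np.dot(w - theta, w - theta)

    res = minimize(objective, x0=theta)      # warm-start the solver at the bias
    return res.fun                           # the value L_Z(theta)

# Toy usage: a task whose target vector is close to the common bias theta
rng = np.random.default_rng(1)
theta = np.ones(5)
Z = [(x := rng.normal(size=5), float(x @ (theta + 0.1 * rng.normal(size=5)))) for _ in range(30)]
print(meta_objective_bias(Z, theta, lam=0.5))
```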
In the sequel, for a Euclidean space $\mathcal{V}$, we let $\Gamma_0(\mathcal{V})$ be the set of proper, closed and convex functions over $\mathcal{V}$ and, for any $f \in \Gamma_0(\mathcal{V})$, we denote by $\mathrm{Dom}\, f$ its domain (we refer to App. B and [31] for notions of convex analysis). In this work, we make the following standard assumptions, in which we introduce two norms $\|\cdot\|_\theta$ and $\|\cdot\|_{\mathcal{M}}$ that will be specified in the two applications below.

Assumption 1 (Loss and regularizer). Let $\ell(\cdot, y)$ be a convex and closed real-valued function for any $y \in \mathcal{Y}$ and let $f \in \Gamma_0(\mathbb{R}^d \times \mathcal{M})$ be such that, for any $\theta \in \Theta$, $f(\cdot, \theta)$ is $1$-strongly convex w.r.t. a norm $\|\cdot\|_\theta$ over $\mathbb{R}^d$, $\inf_{w \in \mathbb{R}^d} f(w, \theta) = 0$ and, for any $\theta \notin \Theta$, $\mathrm{Dom}\, f(\cdot, \theta) = \emptyset$.

Assumption 2 (Meta-regularizer). Let $F$ be a closed and $1$-strongly convex function w.r.t. a norm $\|\cdot\|_{\mathcal{M}}$ over $\mathcal{M}$ such that $\inf_{\theta \in \mathcal{M}} F(\theta) = 0$ and $\mathrm{Dom}\, F = \Theta$.

Notice that the norm w.r.t. which the function $f(\cdot, \theta)$ is assumed to be strongly convex may vary with $\theta$. Moreover, under Asm. 1, $\mathrm{Dom}\, \mathcal{L}_Z = \Theta$ and, since $\mathcal{L}_Z$ is defined as the partial minimum of a function in $\Gamma_0(\mathbb{R}^d \times \mathcal{M})$, $\mathcal{L}_Z \in \Gamma_0(\mathcal{M})$. This property supports the choice of this function as the meta-objective for our meta-algorithm. Finally, by Lemma 29 in App. B, Asm. 1 and Asm. 2 ensure the existence and the uniqueness of the minimizers in Eq. (2) and Eq. (3).

We conclude this section by giving two examples included in the framework above. The first one is inspired by the MTL variance regularizer in [14], while the second example, which can be easily extended to more general MTL regularizers such as those in [2, 10, 27, 28], relates to the MTL trace norm regularizer. As we will see in the following, in the first example the task predictors are encouraged to stay close to a common bias vector, while in the second example they are encouraged to lie in the range of a low-rank feature map. In order to describe these examples we require some additional notation. We let $\|\cdot\|_2$, $\|\cdot\|_F$, $\|\cdot\|_{\rm Tr}$, $\|\cdot\|_\infty$ be the Euclidean, Frobenius, trace, and operator norm, respectively. We also let "$\dagger$" denote the pseudo-inverse, $\mathrm{Tr}(\cdot)$ the trace, $\mathrm{Ran}(\cdot)$ the range, and $\mathbb{S}^d$ (resp. $\mathbb{S}^d_+$) the set of symmetric (resp. positive semi-definite) matrices in $\mathbb{R}^{d \times d}$. Finally, $\iota_S$ denotes the indicator function of the set $S$, taking value $0$ when the argument belongs to $S$ and $+\infty$ otherwise.
Example 1 (Bias). We choose $\mathcal{M} = \Theta = \mathbb{R}^d$, $F(\theta) = \frac{1}{2}\|\theta\|_2^2$, satisfying Asm. 2 with $\|\cdot\|_{\mathcal{M}} = \|\cdot\|_2$, and $f(w, \theta) = \frac{1}{2}\|w - \theta\|_2^2$, satisfying Asm. 1 with $\|\cdot\|_\theta = \|\cdot\|_2$ for every $\theta \in \mathbb{R}^d$.

Example 2 (Feature Map). We choose $\mathcal{M} = \mathbb{S}^d$ and $\Theta = \mathcal{S}$, where $\mathcal{S} = \{\theta \in \mathbb{S}^d_+ : \mathrm{Tr}(\theta) \le 1\}$. For a fixed $\theta_0 \in \mathcal{S}$, we set $F(\theta) = \frac{1}{2}\|\theta - \theta_0\|_F^2 + \iota_{\mathcal{S}}(\theta)$, satisfying Asm. 2 with $\|\cdot\|_{\mathcal{M}} = \|\cdot\|_F$, and $f(w, \theta) = \frac{1}{2}\langle w, \theta^\dagger w \rangle + \iota_{\mathrm{Ran}(\theta)}(w) + \iota_{\mathcal{S}}(\theta)$, satisfying Asm. 1 with $\|\cdot\|_\theta = \sqrt{\langle \cdot, \theta^\dagger \cdot \rangle}$ for any $\theta \in \mathcal{S}$. We will return to these examples in Sec. 6, specializing our method and our analysis to these settings.
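To make the two examples concrete, the following sketch evaluates the two regularizers $f(w, \theta)$ numerically, rendering the indicator terms as explicit feasibility checks that return infinity. It is only an illustration of the definitions, not part of the algorithms; the tolerance handling is an implementation assumption.

```python
import numpy as np

def f_bias(w, theta):
    """Example 1: f(w, theta) = 0.5 * ||w - theta||_2^2 with theta in R^d."""
    return 0.5 * float(np.dot(w - theta, w - theta))

def f_feature_map(w, theta, tol=1e-10):
    """Example 2: f(w, theta) = 0.5 * <w, theta^+ w> if theta is PSD with
    Tr(theta) <= 1 and w lies in Ran(theta), and +inf otherwise."""
    theta = 0.5 * (theta + theta.T)                     # symmetrize
    eigvals = np.linalg.eigvalsh(theta)
    if eigvals.min() < -tol or np.trace(theta) > 1 + tol:
        return np.inf                                   # indicator of the set S
    theta_pinv = np.linalg.pinv(theta)
    if np.linalg.norm(theta @ (theta_pinv @ w) - w) > tol * (1 + np.linalg.norm(w)):
        return np.inf                                   # w outside Ran(theta)
    return 0.5 * float(w @ theta_pinv @ w)

# Example: a rank-1 feature map aligned with w keeps f finite
w = np.array([1.0, 0.0, 0.0])
theta = np.outer(w, w)                                  # trace = 1, PSD, Ran = span{w}
print(f_bias(w, np.zeros(3)), f_feature_map(w, theta))
```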
3 Preliminaries: primal-dual Online Learning

Our OWO Meta-Learning method consists in the application of two nested primal-dual online algorithms, one operating within the tasks and another across the tasks. In particular, even though our analysis also holds for more aggressive schemes, in this work we consider the online Mirror Descent algorithm. In this section we briefly recall some material from the primal-dual interpretation of this algorithm that will be used in our subsequent analysis. The material of this section is an adaptation from [36, 38, 37, 35, 34]; we refer to App. C for a more detailed presentation.

Algorithm 1: Primal-dual online algorithm (online Mirror Descent)
Input: $(g_m)_{m=1}^M$, $(A_m)_{m=1}^M$, $(c_m)_{m=1}^M$, $(\epsilon_m)_{m=1}^M$, $r$ as described in the text
Initialization: $\alpha_1 = ()$, $v_1 = \nabla r^*(0) \in \mathrm{Dom}\, r$
For $m = 1$ to $M$:
  Receive $g_m$, $A_m$, $c_{m+1}$, $\epsilon_m$
  Suffer $g_m(A_m v_m)$ and compute $\alpha'_m \in \partial_{\epsilon_m} g_m(A_m v_m)$
  Update $\alpha_{m+1} = (\alpha_m, \alpha'_m)$
  Define $v_{m+1} = \nabla r^*\big(-\tfrac{1}{c_{m+1}} \sum_{j=1}^m A_j^* \alpha_{m+1,j}\big) \in \mathrm{Dom}\, r$
Return $(\alpha_m)_{m=1}^{M+1}$, $(v_m)_{m=1}^{M+1}$
The online Mirror Descent algorithm on a (primal) problem can be derived from the following primal-dual framework, in which we introduce an appropriate dual algorithm. Specifically, at each iteration $m \in \{1, \dots, M\}$, we consider the following instantaneous primal optimization problem

$$\mathcal{P}_{m+1} = \inf_{v \in \mathcal{V}} P_{m+1}(v), \qquad P_{m+1}(v) = \sum_{j=1}^m g_j(A_j v) + c_m r(v), \qquad (4)$$

where $\mathcal{V}$ is a Euclidean space, $c_m \ge 0$, $r \in \Gamma_0(\mathcal{V})$ is a $1$-strongly convex function w.r.t. a norm $\|\cdot\|$ over $\mathcal{V}$ (with dual norm $\|\cdot\|_*$) such that $\inf_{v \in \mathcal{V}} r(v) = 0$ and, for any $j \in \{1, \dots, M\}$, letting $\mathcal{V}_j$ be a Euclidean space, $g_j \in \Gamma_0(\mathcal{V}_j)$ and $A_j : \mathcal{V} \to \mathcal{V}_j$ is a linear operator with adjoint $A_j^*$. As explained in App. C, the corresponding dual problem is given by

$$\mathcal{D}_{m+1} = \inf_{\alpha \in \mathcal{V}_1 \times \cdots \times \mathcal{V}_m} D_{m+1}(\alpha), \qquad D_{m+1}(\alpha) = \sum_{j=1}^m g_j^*(\alpha_j) + c_m\, r^*\Big(-\frac{1}{c_m} \sum_{j=1}^m A_j^* \alpha_j\Big), \qquad (5)$$

where $g_j^*$ and $r^*$ are respectively the conjugate functions of $g_j$ and $r$. After this, we define the dual scheme in which the dual variable $\alpha_{m+1}$ is updated by a greedy coordinate descent approach on the dual, setting $\alpha_{m+1} = (\alpha_m, \alpha'_m)$, where $\alpha'_m \in \partial_{\epsilon_m} g_m(A_m v_m)$ is an $\epsilon_m$-subgradient of $g_m$ at $A_m v_m$ and $v_m$ is the current primal iterate. The primal variable is then updated from the dual one by a variant of the Karush-Kuhn-Tucker (KKT) conditions, see Alg. 1. It is possible to show that the primal iterates generated in this way belong to $\mathrm{Dom}\, r$ and coincide with the lazy variant of online Mirror Descent. We recall that such a scheme includes many well-known algorithms, when one properly specifies the complexity term $r$, see e.g. [34]. The behavior of Alg. 1 is analyzed in the next result, which will be a key tool for our analysis.
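For intuition, the sketch below instantiates the lazy scheme of Alg. 1 in the simplest Euclidean case, $r = \frac{1}{2}\|\cdot\|_2^2$ (so that $\nabla r^*$ is the identity), with exact subgradients ($\epsilon_m = 0$) and $A_m v = \langle x_m, v \rangle$. The absolute loss and the scaling schedule are placeholder assumptions; the point is only to show how the primal iterate is rebuilt from the accumulated dual variables.

```python
import numpy as np

def lazy_mirror_descent(data, loss_subgrad, c):
    """Lazy online Mirror Descent with r(v) = 0.5 * ||v||^2.

    data: list of (x_m, y_m); loss_subgrad(pred, y) -> scalar subgradient
    of the loss at the prediction; c(m) -> the scaling c_{m+1} > 0.
    The primal iterate v_{m+1} is recomputed from the running sum of the
    dual coordinates A_j^* alpha'_j = alpha'_j * x_j, mirroring Alg. 1.
    """
    d = len(data[0][0])
    v = np.zeros(d)                       # v_1 = grad r*(0) = 0
    dual_sum = np.zeros(d)                # running sum of A_j^* alpha'_j
    losses = []
    for m, (x, y) in enumerate(data, start=1):
        pred = float(np.dot(x, v))        # suffer g_m(A_m v_m)
        losses.append(abs(pred - y))      # placeholder loss: absolute loss
        alpha_m = loss_subgrad(pred, y)   # alpha'_m in the subdifferential
        dual_sum += alpha_m * x           # append the new dual coordinate
        v = -dual_sum / c(m)              # v_{m+1} = grad r*(-(1/c_{m+1}) * sum)
    return losses, v

rng = np.random.default_rng(2)
data = [(rng.normal(size=3), rng.normal()) for _ in range(20)]
losses, v = lazy_mirror_descent(data, lambda p, y: float(np.sign(p - y)), c=lambda m: np.sqrt(m + 1))
print(sum(losses))
```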
Theorem 1 (Dual optimality gap for Alg. 1). Let $(v_m)_{m=1}^M$ be the primal iterates returned by Alg. 1 when applied to the generic problem in Eq. (4) and let $\Delta_{\rm Dual} = D_{M+1}(\alpha_{M+1}) - \mathcal{D}_{M+1}$ be the corresponding (non-negative) dual optimality gap at the last dual iterate $\alpha_{M+1}$ of the algorithm.

1. If, for any $m \in \{1, \dots, M\}$, $c_{m+1} \ge c_m$, then

$$\Delta_{\rm Dual} \le -\sum_{m=1}^M g_m(A_m v_m) + \mathcal{P}_{M+1} + \frac{1}{2} \sum_{m=1}^M \frac{1}{c_m} \big\| A_m^* \alpha'_m \big\|_*^2 + \sum_{m=1}^M \epsilon_m.$$

2. If, for any $m \in \{1, \dots, M\}$, $c_m = \sum_{j=1}^m \lambda_j$ for some $\lambda_j \ge 0$, then

$$\Delta_{\rm Dual} \le -\sum_{m=1}^M \big\{ g_m(A_m v_m) + \lambda_m r(v_m) \big\} + \mathcal{P}_{M+1} + \frac{1}{2} \sum_{m=1}^M \frac{1}{c_m} \big\| A_m^* \alpha'_m \big\|_*^2 + \sum_{m=1}^M \epsilon_m.$$

The first (resp. second) inequality in Thm. 1 links the dual optimality gap at the last dual iterate generated by Alg. 1 with the (resp. regularized) cumulative error of the corresponding primal iterates. Note that this result can be readily used to bound the cumulative error (resp. its regularized version) of Alg. 1 by the batch regularized comparator $\mathcal{P}_{M+1}$ and additional terms.
In the following section, we will make use of the above theorem in order to analyze our OWO Meta-Learning method.

Algorithm 2: Within-task algorithm
Input: $\lambda \ge 0$, $\theta \in \Theta$, $Z = (z_i)_{i=1}^n$
Initialization: $s_{\theta,1} = ()$, $w_{\theta,1} = \nabla f(\cdot, \theta)^*(0)$
For $i = 1$ to $n$:
  Receive the datapoint $z_i = (x_i, y_i)$
  Compute $s'_{\theta,i} \in \partial \ell_i(\langle x_i, w_{\theta,i} \rangle) \subseteq \mathbb{R}$
  Define $(s_{\theta,i+1})_i = s'_{\theta,i}$, $\lambda_i = \lambda(i+1)$
  Update $w_{\theta,i+1} = \nabla f(\cdot, \theta)^*\big(-\tfrac{1}{\lambda_i} \sum_{j=1}^i x_j s'_{\theta,j}\big)$
Return $(w_{\theta,i})_{i=1}^{n+1}$, $\bar w_\theta = \frac{1}{n} \sum_{i=1}^n w_{\theta,i}$, $s_{\theta,n+1}$

Algorithm 3: Meta-algorithm
Input: $\gamma \ge 0$, $(Z_t)_{t=1}^T$
Initialization: $\theta_1 \in \Theta$
For $t = 1$ to $T$:
  Receive incrementally the dataset $Z_t$
  Run Alg. 2 with $\theta_t$ over $Z_t$ and compute $s_{\theta_t, n+1}$
  Compute $\nabla'_t$ as in Prop. 3 using $s_{\theta_t, n+1}$
  Update $\theta_{t+1} = \nabla F^*\big(-\tfrac{1}{\gamma} \sum_{j=1}^t \nabla'_j\big)$
Return $(\theta_t)_{t=1}^{T+1}$, $\bar\theta = \frac{1}{T} \sum_{t=1}^T \theta_t$
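As an illustration of how Alg. 2 and Alg. 3 interact, the sketch below instantiates both in the bias setting of Example 1, where $\nabla f(\cdot, \theta)^*(u) = \theta + u$ and $\nabla F^*(u) = u$. The absolute loss and the construction of the approximate meta-subgradient from the dual coefficients returned by the inner loop are illustrative assumptions, not a faithful transcription of Prop. 3.

```python
import numpy as np

def inner_alg_bias(Z, theta, lam):
    """Alg. 2 specialized to Example 1 (bias), with the absolute loss.

    Here f(w, theta) = 0.5 * ||w - theta||_2^2, so the mirror update reads
    w_{i+1} = theta - (1 / (lam * (i + 1))) * sum_j x_j * s'_j.
    Returns the cumulative error and the dual coefficients (s'_1, ..., s'_n).
    """
    d = len(Z[0][0])
    w = theta.copy()                              # w_1 = argmin_w f(w, theta) = theta
    dual_sum = np.zeros(d)
    dual_coeffs, cumulative_error = [], 0.0
    for i, (x, y) in enumerate(Z, start=1):
        pred = float(np.dot(x, w))
        cumulative_error += abs(pred - y)         # suffer the loss
        s = float(np.sign(pred - y))              # subgradient of the absolute loss
        dual_coeffs.append(s)
        dual_sum += s * x
        w = theta - dual_sum / (lam * (i + 1))    # within-task mirror step
    return cumulative_error, np.array(dual_coeffs)

def meta_alg_bias(tasks, lam, gamma):
    """Alg. 3 specialized to Example 1: lazy online gradient steps on theta."""
    d = len(tasks[0][0][0])
    theta, meta_grad_sum, total_error = np.zeros(d), np.zeros(d), 0.0
    for Z_t in tasks:
        err_t, s_t = inner_alg_bias(Z_t, theta, lam)
        total_error += err_t
        X_t = np.stack([x for x, _ in Z_t])
        meta_grad = X_t.T @ s_t / len(Z_t)        # illustrative approximate meta-subgradient
        meta_grad_sum += meta_grad
        theta = -meta_grad_sum / gamma            # lazy meta update, grad F* = identity
    return total_error, theta

rng = np.random.default_rng(3)
bias = np.ones(4)
tasks = [[(x := rng.normal(size=4), float(x @ bias) + 0.1 * rng.normal()) for _ in range(20)]
         for _ in range(10)]
print(meta_alg_bias(tasks, lam=1.0, gamma=5.0))
```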
4 Method

In this section we present the proposed OWO Meta-Learning method and we establish a (regularized) cumulative error bound for it. As anticipated in Sec. 2, the method consists in the application of Alg. 1 both to the (non-normalized) within-task problem in Eq. (2) and to the across-tasks problem in Eq. (3), corresponding, as we will show in the following, to Alg. 2 and Alg. 3, respectively. In order to analyze our method, we start by studying the behavior of the inner Alg. 2.

Proposition 2 (Dual optimality gap for the inner Alg. 2). Let Asm. 1 hold. Then, Alg. 2 coincides with Alg. 1 applied to the non-normalized within-task problem in Eq. (2). As a consequence, introducing the regularized cumulative error of the iterates generated by Alg. 2,

$$\mathcal{E}^{\rm reg}_{\rm inner}(Z, \theta) = \sum_{i=1}^n \Big\{ \ell_i(\langle x_i, w_{\theta,i} \rangle) + \lambda f(w_{\theta,i}, \theta) \Big\}, \qquad (6)$$

where $w_{\theta,i} \in \mathrm{Dom}\, f(\cdot, \theta)$ for any $i \in \{1, \dots, n\}$, the following upper bound for the associated dual optimality gap $\Delta_{\rm Dual}$ introduced in Thm. 1 holds

$$\Delta_{\rm Dual} \le -\Big( \mathcal{E}^{\rm reg}_{\rm inner}(Z, \theta) - n\,\mathcal{L}_Z(\theta) \Big) + \frac{1}{2\lambda} \sum_{i=1}^n \frac{1}{i} \big\| x_i s'_{\theta,i} \big\|_{\theta,*}^2. \qquad (7)$$

Proof. The inner Alg. 2 coincides with Alg. 1 applied to the non-normalized within-task problem in Eq. (2), once one makes the ide