




Reconciling meta-learning and continual learning with online mixtures of tasks

Ghassen Jerfel (Duke University), Erin Grant (UC Berkeley), Thomas L. Griffiths (Princeton University), Katherine Heller (Duke University)
*Equal contribution. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Learning-to-learn or meta-learning leverages data-driven inductive bias to increase the efficiency of learning on a novel task. This approach encounters difficulty when transfer is not advantageous, for instance, when tasks are considerably dissimilar or change over time. We use the connection between gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet process mixture of hierarchical Bayesian models over the parameters of an arbitrary parametric model such as a neural network. In contrast to consolidating inductive biases into a single set of hyperparameters, our approach of task-dependent hyperparameter selection better handles latent distribution shift, as demonstrated on a set of evolving, image-based, few-shot learning benchmarks.

1 Introduction
Meta-learning algorithms aim to increase the efficiency of learning by treating task-specific learning episodes as examples from which to generalize [47]. The central assumption of a meta-learning algorithm is that some tasks are inherently related and so inductive transfer can improve sample efficiency and generalization [8, 9, 5]. In learning a single set of domain-general hyperparameters that parameterize a metric space [53] or an optimizer [40, 14], recent meta-learning algorithms make the assumption that tasks are equally related, and therefore that non-adaptive, mutual transfer is appropriate. This assumption has been cemented in recent few-shot learning benchmarks, which comprise a set of tasks generated in a uniform manner (e.g., [53, 14]).

However, the real world often presents scenarios in which an agent must decide what degree of transfer is appropriate. In some cases, a subset of tasks are more strongly related to each other, and so non-uniform transfer provides a strategic advantage. On the other hand, transfer in the presence of dissimilar or outlier tasks worsens generalization performance [44, 12]. Moreover, when the underlying task distribution is non-stationary, inductive transfer to previously observed tasks should exhibit graceful degradation to address the catastrophic forgetting problem [28]. In these settings, the consolidation of all inductive biases into a single set of hyperparameters is not well posed to deal with changing or diverse tasks. In contrast, in order to account for this degree of task heterogeneity, humans detect and adapt to novel contexts by attending to relationships between tasks [10].

In this work, we learn a mixture of hierarchical models that allows a meta-learner to adaptively select over a set of learned parameter initializations for gradient-based adaptation to a new task. The method is equivalent to clustering task-specific parameters in the hierarchical model induced by recasting gradient-based meta-learning as hierarchical Bayes [21], and generalizes the model-agnostic meta-learning (MAML) algorithm introduced in [14]. By treating the assignment of task-specific parameters to clusters as latent variables, we can directly detect similarities between tasks on the basis of the task-specific likelihood, which may be parameterized by an expressive model such as a neural network. Our approach therefore alleviates the need for explicit geometric or probabilistic modeling assumptions about the weights of a complex parametric model, and provides a scalable method to regulate information transfer between episodes.

We additionally consider the setting of a non-stationary or evolving task distribution, which necessitates a meta-learning method that possesses adaptive complexity. We translate stochastic point estimation in an infinite mixture [39] over model parameters into a gradient-based meta-learning algorithm that is compatible with any differentiable likelihood model and requires no distributional assumptions. We demonstrate the previously unexplored ability of non-parametric priors over neural network parameters to automatically detect and adapt to task distribution shift in a naturalistic image dataset, addressing the non-trivial setting of task-agnostic continual learning in which the task change is unobserved (cf. task-aware settings such as [28]).
2 Gradient-based meta-learning as hierarchical Bayes

Since our approach is grounded in the probabilistic formulation of meta-learning as hierarchical Bayes [4], it can be applied to any probabilistic meta-learner. In this work, we focus on model-agnostic meta-learning (MAML) [14], a gradient-based meta-learning approach that estimates global parameters $\theta$ to be shared among task-specific models as an initialization for a few steps of gradient descent. MAML admits a natural interpretation as parameter estimation in a hierarchical probabilistic model, where the learned initialization acts as data-driven regularization for the estimation of the task-specific parameters $\phi_j$.

In particular, [21] cast MAML as posterior inference for the task-specific parameters $\phi_j$ given some samples of task-specific data $\mathbf{x}_{j,1:N}$ and a prior over $\phi_j$ that is induced by the early stopping of an iterative optimization procedure; truncation at $K$ steps of gradient descent on the negative log-likelihood $-\log p(\mathbf{x}_{j,1:N} \mid \phi_j)$ starting from $\phi_j^{(0)} = \theta$ can then be understood as mode estimation of the posterior $p(\phi_j \mid \mathbf{x}_{j,1:N}, \theta)$. The mode estimates
$$\hat{\phi}_j = \phi_j^{(0)} + \eta \sum_{k=1}^{K} \nabla_{\phi} \log p\big(\mathbf{x}_{j,1:N} \mid \phi_j^{(k-1)}\big),$$
with adaptation rate $\eta$, are then combined to evaluate the marginal likelihood for each task as
$$p\big(\mathbf{x}_{j,N+1:N+M} \mid \theta\big) = \int p\big(\mathbf{x}_{j,N+1:N+M} \mid \phi_j\big)\, p\big(\phi_j \mid \theta\big)\, d\phi_j \approx p\big(\mathbf{x}_{j,N+1:N+M} \mid \hat{\phi}_j\big), \quad (1)$$
where $\mathbf{x}_{j,N+1:N+M}$ is another set of samples from the $j$-th task. A training dataset can then be summarized in an empirical Bayes point estimate of $\theta$ computed by gradient-based optimization of the joint marginal likelihood in (1) across tasks, so that the likelihood of a datapoint sampled from a new task can be computed using only $\theta$ and without storing the task-specific parameters.
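To make the truncated-gradient mode estimate and the plug-in approximation in (1) concrete, the following is a minimal PyTorch sketch (not the authors' code); the logistic-regression likelihood, the step size eta, and all tensor shapes are illustrative placeholders.

```python
# Minimal sketch (not the paper's code) of the K-step mode estimate and the
# plug-in marginal likelihood of eq. (1), using a toy logistic-regression
# likelihood in place of a neural network.
import torch
import torch.nn.functional as F

def adapt(theta, x_train, y_train, num_steps=5, eta=0.1):
    """Truncated gradient ascent on log p(x_train | phi), starting from phi = theta."""
    phi = [p.clone() for p in theta]
    for _ in range(num_steps):
        logits = x_train @ phi[0] + phi[1]
        log_lik = -F.cross_entropy(logits, y_train, reduction="sum")
        grads = torch.autograd.grad(log_lik, phi, create_graph=True)
        phi = [p + eta * g for p, g in zip(phi, grads)]   # phi^(k) = phi^(k-1) + eta * grad
    return phi

def query_log_lik(phi, x_val, y_val):
    """Plug-in approximation p(x_val | theta) ~ p(x_val | phi_hat) from eq. (1)."""
    logits = x_val @ phi[0] + phi[1]
    return -F.cross_entropy(logits, y_val, reduction="sum")

# Toy usage: meta-parameters theta for a 5-way linear classifier on 16-dim features.
theta = [torch.zeros(16, 5, requires_grad=True), torch.zeros(5, requires_grad=True)]
x_tr, y_tr = torch.randn(10, 16), torch.randint(0, 5, (10,))
x_va, y_va = torch.randn(15, 16), torch.randint(0, 5, (15,))
phi_hat = adapt(theta, x_tr, y_tr)
outer_loss = -query_log_lik(phi_hat, x_va, y_va)
outer_loss.backward()   # the meta-gradient w.r.t. theta flows back through the K inner steps
```

Passing create_graph=True keeps the inner-loop computation graph so that the final backward pass yields the full meta-gradient with respect to the initialization, rather than a first-order approximation.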
3 Improving meta-learning by modeling latent task structure

If the task distribution is heterogeneous, assuming a single parameter initialization $\theta$ for gradient-based meta-learning is not suitable, because it is unlikely that the point estimate computed by a few steps of gradient descent will sufficiently adapt the task-specific parameters $\phi$ to a diversity of tasks. Moreover, explicitly estimating relatedness between tasks has the potential to aid the efficacy of a meta-learning algorithm by modulating both positive and negative transfer [52, 60, 45, 61], and by identifying outlier tasks that require a more significant degree of adaptation [56, 23]. Nonetheless, defining an appropriate notion of task relatedness is a difficult problem in the high-dimensional parameter or activation space of models such as neural networks.

Using the probabilistic interpretation of Section 2, we deal with the variability in the tasks by assuming that each set of task-specific parameters $\phi_j$ is drawn from a mixture of base distributions, each of which is parameterized by a hyperparameter $\theta^{(\ell)}$. Accordingly, we capture task relatedness by estimating the likelihood of assigning each task to a mixture component based simply on the task-specific negative log-likelihood after a few steps of gradient-based adaptation. The result is a scalable meta-learning algorithm that jointly learns task-specific cluster assignments and model parameters, and is capable of modulating the transfer of information across tasks by clustering together related task-specific parameter settings.
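As a purely illustrative picture of this latent-mixture assumption (not part of the method itself, which never instantiates an explicit base distribution), the toy NumPy sketch below draws task-specific parameters around one of $L$ cluster initializations; the Gaussian form and all constants are assumptions made here for illustration only.

```python
# Illustrative-only sketch of the latent-mixture assumption: each task's
# parameters phi_j are drawn around one of L cluster initializations theta^(l).
# The explicit Gaussian base distribution and the constants are stand-ins; the
# method itself relies only on the implicit prior induced by early stopping.
import numpy as np

rng = np.random.default_rng(0)
L, dim, sigma = 3, 8, 0.1
theta = rng.normal(size=(L, dim))      # cluster-specific initializations theta^(1:L)
mixing = np.full(L, 1.0 / L)           # uniform mixing weights, for illustration

def sample_task_parameters():
    z = rng.choice(L, p=mixing)                        # latent cluster assignment z_j
    phi = theta[z] + sigma * rng.normal(size=dim)      # task-specific parameters phi_j
    return z, phi

z_j, phi_j = sample_task_parameters()
```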
Algorithm 1: Stochastic gradient-based EM for finite and infinite mixtures (dataset $\mathcal{D}$, meta-learning rate $\beta$, adaptation rate $\eta$, temperature $\tau$, initial cluster count $L_0$, meta-batch size $J$, training batch size $N$, validation batch size $M$, adaptation iteration count $K$, global prior $G_0$)

  Initialize the cluster count $L \leftarrow L_0$ and the meta-level parameters $\theta^{(1)}, \ldots, \theta^{(L)} \sim G_0$
  while not converged do
    Draw tasks $T_1, \ldots, T_J \sim p_{\mathcal{D}}(T)$
    for $j$ in $1, \ldots, J$ do
      Draw task-specific datapoints $\mathbf{x}_{j,1}, \ldots, \mathbf{x}_{j,N+M} \sim p_{T_j}(\mathbf{x})$
      Draw a parameter initialization for a new cluster from the global prior, $\theta^{(L+1)} \sim G_0$
      for $\ell$ in $1, \ldots, L, L+1$ do
        Initialize $\phi_j^{(\ell)} \leftarrow \theta^{(\ell)}$
        Compute the task-specific mode estimate $\phi_j^{(\ell)} \leftarrow \phi_j^{(\ell)} + \eta \sum_k \nabla_{\phi} \log p(\mathbf{x}_{j,1:N} \mid \phi_j^{(\ell)})$
      Compute the assignment of tasks to clusters, $\gamma_j \leftarrow$ E-STEP$(\mathbf{x}_{j,1:N}, \phi_j^{(1:L)})$
    Update each component $\ell$ in $1, \ldots, L$: $\theta^{(\ell)} \leftarrow \theta^{(\ell)} +$ M-STEP$(\mathbf{x}_{j,N+1:N+M}, \phi_j^{(\ell)}, \{\gamma_j\}_{j=1}^{J})$
    Summarize $\theta^{(1)}, \ldots$ to update the global prior $G_0$
  return $\theta^{(1)}, \ldots$

Subroutine 2:
  E-STEP$(\{\mathbf{x}_{ji}\}_{i=1}^{N}, \{\phi_j^{(\ell)}\}_{\ell=1}^{L})$:
    return $\gamma_j \leftarrow \mathrm{softmax}\big(\tau \sum_i \log p(\mathbf{x}_{ji} \mid \phi_j^{(\ell)})\big)$
  M-STEP$(\{\mathbf{x}_{ji}\}_{i=1}^{M}, \phi_j^{(\ell)}, \gamma_j)$:
    return $\beta \nabla_{\theta^{(\ell)}} \sum_{j,i} \gamma_j^{(\ell)} \log p(\mathbf{x}_{ji} \mid \phi_j^{(\ell)})$

Top: Algorithm 1: Stochastic gradient-based expectation maximization (EM) for probabilistic clustering of task-specific parameters in a meta-learning setting. Bottom: Subroutine 2: The E-STEP and M-STEP for a finite mixture of hierarchical Bayesian models implemented as gradient-based meta-learners.
Formally, let $z_j$ be the categorical latent variable indicating the cluster assignment of each set of task-specific parameters $\phi_j$. Direct maximization of the mixture model likelihood is a combinatorial optimization problem that can grow intractable. This intractability is equally problematic for the posterior distribution over the cluster assignment variables $z_j$ and the task-specific parameters $\phi_j$, which are both treated as latent variables in the probabilistic formulation of meta-learning. A scalable approximation involves representing the conditional distribution for each latent variable with a maximum a posteriori (MAP) estimate. In our meta-learning setting of a mixture of hierarchical Bayes (HB) models, this suggests an augmented expectation maximization (EM) procedure [13] alternating between an E-STEP that computes an expectation of the task-to-cluster assignments $z_j$, which itself involves the computation of a conditional mode estimate for the task-specific parameters $\phi_j$, and an M-STEP that updates the hyperparameters $\theta^{(1:L)}$ (see Subroutine 2). To ensure scalability, we use the minibatch variant of stochastic optimization [43] in both the E-STEP and the M-STEP; such approaches to EM are motivated by a view of the algorithm as optimizing a single free energy at both the E-STEP and the M-STEP [37].

In particular, for each task $j$ and cluster $\ell$, we follow the gradients to minimize the negative log-likelihood on the training datapoints $\mathbf{x}_{j,1:N}$, using the cluster parameters $\theta^{(\ell)}$ as the initialization. This allows us to obtain a modal point estimate of the task-specific parameters $\hat{\phi}_j^{(\ell)}$. The E-STEP in Subroutine 2 leverages the connection between gradient-based meta-learning and HB [21] and the differentiability of our clustering procedure to employ the task-specific parameters to compute the posterior probability of cluster assignment. Accordingly, based on the likelihood of the same training datapoints under the model parameterized by $\hat{\phi}_j^{(\ell)}$, we compute the cluster assignment probabilities as
$$\gamma_j^{(\ell)} := p\big(z_j = \ell \mid \mathbf{x}_{j,1:N}, \theta^{(1:L)}\big) \propto \int p\big(\mathbf{x}_{j,1:N} \mid \phi_j\big)\, p\big(\phi_j \mid \theta^{(\ell)}\big)\, d\phi_j \approx p\big(\mathbf{x}_{j,1:N} \mid \hat{\phi}_j^{(\ell)}\big). \quad (2)$$
The cluster means $\theta^{(\ell)}$ are then updated by gradient descent on the validation loss in the M-STEP in Subroutine 2; this M-STEP is analogous to the MAML algorithm of [14] with the addition of the mixing weights $\gamma_j^{(\ell)}$. Note that, unlike other recent approaches to probabilistic clustering (e.g., [3]), we adhere to the episodic meta-learning setup for both training and testing, since only the task support set $\mathbf{x}_{j,1:N}$ is used to compute both the point estimate $\hat{\phi}_j^{(\ell)}$ and the cluster responsibilities $\gamma_j^{(\ell)}$. See Algorithm 1 for the full algorithm, whose high-level structure is shared with the non-parametric variant of our method detailed in Section 5.
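The sketch below puts the pieces of Algorithm 1 and Subroutine 2 together for a toy Gaussian-mean "model" so that every gradient is analytic; it is a simplified, first-order illustration (the meta-gradient is not propagated through the adaptation steps, unlike the full method), and all names and constants are placeholders rather than the paper's implementation.

```python
# Toy, self-contained NumPy sketch of one stochastic EM meta-update for the
# finite mixture (Algorithm 1 with Subroutine 2). The "model" is a Gaussian
# mean estimator so that every gradient is analytic, and the M-STEP is a
# first-order simplification: the meta-gradient is not propagated through the
# adaptation steps. All names and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
L, dim = 3, 2                        # number of components, parameter dimension
K, eta, beta, tau = 5, 0.05, 0.1, 1.0
theta = rng.normal(size=(L, dim))    # cluster initializations theta^(1:L)

def log_lik(x, phi):
    """Sum over datapoints of log N(x_i | phi, I), up to an additive constant."""
    return -0.5 * np.sum((x - phi) ** 2)

def adapt(x_train, phi0):
    """K steps of gradient ascent on the training log-likelihood (mode estimate)."""
    phi = phi0.copy()
    for _ in range(K):
        phi = phi + eta * np.sum(x_train - phi, axis=0)
    return phi

def e_step(x_train, phis):
    """Tempered softmax over per-cluster training log-likelihoods, as in eq. (2)."""
    scores = tau * np.array([log_lik(x_train, phi) for phi in phis])
    scores -= scores.max()                       # numerical stabilization
    gamma = np.exp(scores)
    return gamma / gamma.sum()

# One meta-batch of J tasks; each task is a Gaussian with its own latent mean.
J, N, M = 4, 5, 5
meta_grad = np.zeros_like(theta)
for _ in range(J):
    mu = rng.normal(scale=2.0, size=dim)
    x_train = rng.normal(mu, 1.0, size=(N, dim))
    x_val = rng.normal(mu, 1.0, size=(M, dim))
    phis = [adapt(x_train, theta[l]) for l in range(L)]
    gamma = e_step(x_train, phis)
    for l in range(L):                           # responsibility-weighted M-STEP gradient
        meta_grad[l] += gamma[l] * np.sum(x_val - phis[l], axis=0)
theta = theta + beta * meta_grad                 # ascent on the weighted validation log-likelihood
```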
Table 1: Meta-test set accuracy on the miniImageNet 5-way, 1- and 5-shot classification benchmarks from [53] among methods using a comparable architecture (the 4-layer convolutional network from [53]). For methods on which we report results in later experiments, we additionally report the total number of parameters optimized by the meta-learning algorithm. (a) Results reported by [40]. (b) We report test accuracy for models matching train and test "shot" and "way". (c) We report test accuracy for a comparable base (task-specific network) architecture.

  Model                            Num. param.   1-shot (%)      5-shot (%)
  matching network [53] (a)        –             43.56 ± 0.84    55.31 ± 0.73
  meta-learner LSTM [40]           –             43.44 ± 0.77    60.60 ± 0.71
  prototypical networks [49] (b)   –             46.61 ± 0.78    65.77 ± 0.70
  MAML [14]                        –             48.70 ± 1.84    63.11 ± 0.92
  MT-net [30]                      38,907        51.70 ± 1.84    –
  PLATIPUS [15]                    65,546        50.13 ± 1.86    –
  VERSA [20] (c)                   807,938       48.53 ± 1.84    –
  Our method: 2 components         65,546        49.60 ± 1.50    64.60 ± 0.92
  Our method: 3 components         98,319        51.20 ± 1.52    65.00 ± 0.96
  Our method: 4 components         131,092       50.49 ± 1.46    64.78 ± 1.43
  Our method: 5 components         163,865       51.46 ± 1.68    –
4 Experiment: miniImageNet few-shot classification

Clustering task-specific parameters provides a way for a meta-learner to deal with task heterogeneity, as each cluster can be associated with a subset of the tasks that would benefit most from mutual transfer. While we do not expect existing benchmark tasks to present a significant degree of heterogeneity given the uniform sampling assumptions behind their design, we nevertheless conduct an experiment to validate that our method gives an improvement on a standard benchmark for few-shot learning. We apply Algorithm 1 with Subroutine 2 and $L \in \{2, 3, 4, 5\}$ components to the 1-shot and 5-shot, 5-way miniImageNet few-shot classification benchmarks [53]; Appendix C.2.1 contains additional experimental details.

We demonstrate in Table 1 that a mixture of meta-learners improves the performance of gradient-based meta-learning on this task for any number of components. However, the performance of the parametric mixture does not improve monotonically with the number of components $L$. This leads us to the development of non-parametric clustering for continual meta-learning, where enforcing specialization to subgroups of tasks and increasing model complexity is, in fact, necessary to preserve performance on prior tasks due to significant heterogeneity.
5 Scalable online mixtures for task-agnostic continual learning

The mixture of meta-learners developed in Section 3 addresses a drawback of meta-learning approaches such as MAML that consolidate task-general information into a single set of hyperparameters. However, the method adds another dimension to model selection in the form of identifying the correct number of mixture components. While this may be resolved by cross-validation if the dataset is static and the number of components can therefore remain fixed, adhering to a fixed number of components throughout training is not appropriate in the non-stationary regime, where the underlying task distribution changes as different types of tasks are presented sequentially in a continual learning setting. In this regime, it is important to incrementally introduce more components that can each specialize to the distribution of tasks observed at the time of spawning.

To address this, we derive a scalable stochastic estimation procedure to compute the expectation of task-to-cluster assignments (the E-STEP) for a growing number of task clusters in a non-parametric mixture model [39], the Dirichlet process mixture model (DPMM). The formulation of the DPMM that is most appropriate for incremental learning is the sequential-draws formulation, which corresponds to an instantiation of the Chinese restaurant process (CRP) [39]. A CRP prior over $z_j$ allows some probability to be assigned to a new mixture component while the task identities are inferred in a sequential manner, and has therefore been key to recent online and stochastic learning of the DPMM [31].
A draw from a CRP proceeds as follows: for a sequence of tasks, the first task is assigned to the first cluster, and the $j$-th subsequent task is then assigned to the $\ell$-th cluster with probability
$$p\big(z_j = \ell \mid z_{1:j-1}, \alpha\big) = \begin{cases} n^{(\ell)} / (n + \alpha) & \text{for } \ell \le L \\ \alpha / (n + \alpha) & \text{for } \ell = L + 1, \end{cases} \quad (3)$$
where $L$ is the number of non-empty clusters, $n^{(\ell)}$ is the number of tasks already occupying a cluster $\ell$, $n$ is the total number of tasks assigned so far, and $\alpha$ is a fixed positive concentration parameter. The prior probability associated with a new mixture component is therefore $p(z_j = L + 1 \mid z_{1:j-1}, \alpha) \propto \alpha$.
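For intuition about the sequential-draws view in (3), here is a small NumPy sketch (illustrative only, not from the paper) that computes the CRP assignment probabilities and samples cluster labels for a sequence of tasks; the concentration value is arbitrary.

```python
# Illustrative NumPy sketch of the CRP prior in eq. (3): the j-th task joins an
# existing cluster with probability proportional to its occupancy n^(l), or a
# new cluster with probability proportional to the concentration alpha.
import numpy as np

def crp_probabilities(counts, alpha):
    """Assignment probabilities over the L existing clusters plus one new cluster."""
    counts = np.asarray(counts, dtype=float)
    weights = np.append(counts, alpha)          # [n^(1), ..., n^(L), alpha]
    return weights / weights.sum()              # the normalizer is n + alpha

def sample_crp_sequence(num_tasks, alpha, rng):
    counts = []                                  # occupancy of each non-empty cluster
    labels = []
    for _ in range(num_tasks):
        probs = crp_probabilities(counts, alpha)
        z = rng.choice(len(probs), p=probs)
        if z == len(counts):                     # a new cluster is spawned
            counts.append(0)
        counts[z] += 1
        labels.append(z)
    return labels, counts

rng = np.random.default_rng(0)
labels, counts = sample_crp_sequence(num_tasks=20, alpha=1.0, rng=rng)
```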
In a similar spirit to Section 3, we develop a stochastic EM procedure for the estimation of the latent task-specific parameters $\phi_{1:J}$ and the meta-level parameters $\theta^{(1:L)}$ in the DPMM, which allows the number of observed task clusters to grow in an online manner with the diversity of the task distribution (see Subroutine 3).

  E-STEP$(\mathbf{x}_{j,1:N},\ \phi_j^{(1:L)},\ \text{concentration } \alpha,\ \text{threshold } \epsilon)$:
    Compute the DPMM log-likelihood for each $\ell$ in $1, \ldots, L$: $\Lambda_j^{(\ell)} \leftarrow \sum_i \log p(\mathbf{x}_{ji} \mid \phi_j^{(\ell)}) + \log n^{(\ell)}$
    Compute the DPMM log-likelihood for a new component: $\Lambda_j^{(L+1)} \leftarrow \sum_i \log p(\mathbf{x}_{ji} \mid \phi_j^{(L+1)}) + \log \alpha$
    Compute the DPMM assignments: $\gamma_j \leftarrow \mathrm{softmax}\big(\Lambda_j^{(1)}, \ldots, \Lambda_j^{(L+1)}\big)$
    if $\gamma_j^{(L+1)} > \epsilon$ then
      Expand the model by incrementing $L \leftarrow L + 1$
    else
      Renormalize $\gamma_j \leftarrow \mathrm{softmax}\big(\Lambda_j^{(1)}, \ldots, \Lambda_j^{(L)}\big)$
    return $\gamma_j$
  M-STEP$(\{\mathbf{x}_{ji}\}_{i=1}^{M},\ \phi_j^{(\ell)},\ \gamma_j,\ \text{concentration } \alpha)$:
    return $\beta \nabla_{\theta^{(\ell)}} \sum_{j,i} \gamma_j^{(\ell)} \log p(\mathbf{x}_{ji} \mid \phi_j^{(\ell)}) + \log n^{(\ell)}$

Subroutine 3: The E-STEP and M-STEP for an infinite mixture of hierarchical Bayesian models.
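A simplified NumPy sketch of the expansion logic in the E-STEP of Subroutine 3 follows; the per-cluster log-likelihoods are passed in as precomputed scalars rather than being obtained from adapted network parameters, and the concentration and threshold values are placeholders.

```python
# Simplified NumPy sketch of the infinite-mixture E-STEP (Subroutine 3): score
# existing clusters by adapted log-likelihood plus log occupancy, score a fresh
# component drawn from the global prior by log-likelihood plus log(alpha), and
# expand the model when the new component claims enough responsibility.
import numpy as np

def log_softmax(x):
    x = np.asarray(x, dtype=float)
    return x - (x.max() + np.log(np.exp(x - x.max()).sum()))

def dp_e_step(loglik_existing, loglik_new, counts, alpha, eps):
    """loglik_existing: support-set log p(x_j | phi_j^(l)) for l = 1..L;
    loglik_new: log-likelihood under parameters adapted from a fresh draw theta^(L+1) ~ G0."""
    scores = np.append(np.asarray(loglik_existing) + np.log(counts),
                       loglik_new + np.log(alpha))
    gamma = np.exp(log_softmax(scores))          # responsibilities over L + 1 components
    if gamma[-1] > eps:
        return gamma, True                       # expand: increment L and keep the new component
    gamma = np.exp(log_softmax(scores[:-1]))     # otherwise renormalize over existing clusters
    return gamma, False

# Toy usage with made-up log-likelihoods for L = 2 existing clusters.
gamma, expanded = dp_e_step(loglik_existing=[-40.0, -55.0], loglik_new=-40.0,
                            counts=[12, 3], alpha=1.0, eps=0.2)
```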
While the computation of the mode estimate of the task-specific parameters $\phi_j$ is mostly unchanged from the finite variant, the estimation of the cluster assignment variables $z$ in the E-STEP requires revisiting the Gibbs conditional distributions due to the potential addition of a new cluster at each step. For a DPMM, the conditional distributions for $z_j$ are
$$p\big(z_j = \ell \mid \mathbf{x}_{j,1:M}, z_{1:j-1}\big) \propto \begin{cases} n^{(\ell)} \int\!\!\int p\big(\mathbf{x}_{j,1:M} \mid \phi_j^{(\ell)}\big)\, p\big(\phi_j^{(\ell)} \mid \theta\big)\, d\phi_j\, dG^{(\ell)}(\theta) & \text{for } \ell \le L \\ \alpha \int\!\!\int p\big(\mathbf{x}_{j,1:M} \mid \phi_j^{(0)}\big)\, p\big(\phi_j^{(0)} \mid \theta\big)\, d\phi_j\, dG_0(\theta) & \text{for } \ell = L + 1, \end{cases} \quad (4)$$
with $G_0$ as the base measure or global prior over the components of the CRP, and $G^{(\ell)}$ as the prior over each cluster's parameters, initialized with a draw from a Gaussian centered at $G_0$ with a fixed variance.