Uber ATG 近三年发表的自动驾驶论文合集 Learning to Reweight Examples for Robust Deep Learning_第1页
Uber ATG 近三年发表的自动驾驶论文合集 Learning to Reweight Examples for Robust Deep Learning_第2页
Uber ATG 近三年发表的自动驾驶论文合集 Learning to Reweight Examples for Robust Deep Learning_第3页
Uber ATG 近三年发表的自动驾驶论文合集 Learning to Reweight Examples for Robust Deep Learning_第4页
Uber ATG 近三年发表的自动驾驶论文合集 Learning to Reweight Examples for Robust Deep Learning_第5页
已阅读5页,还剩9页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

Learning to Reweight Examples for Robust Deep Learning Mengye Ren1 2Wenyuan Zeng1 2Bin Yang1 2Raquel Urtasun1 2 Abstract Deep neural networks have been shown to be verypowerfulmodelingtoolsformanysupervised learning tasks involving complex input patterns However they can also easily overfi t to training set biases and label noises In addition to various regularizers example reweighting algorithms are popular solutions to these problems but they require careful tuning of additional hyperparam eters such as example mining schedules and regularization hyperparameters In contrast to past reweighting methods which typically consist of functions of the cost value of each example in this work we propose a novel meta learning algorithm that learns to assign weights to training examples based on their gradient directions To determine the example weights our method performs a meta gradient descent step on the current mini batch example weights which are initialized from zero to minimize the loss on a clean unbiased validation set Our proposed method can be easily implemented on any type of deep network does not require any additional hyperparameter tuning and achieves impressive performance on class imbalance and corrupted label problems where only a small amount of clean validation data is available 1 Introduction Deep neural networks DNNs have been widely used for machine learning applications due to their powerful capacity for modeling complex input patterns Despite their success it has been shown that DNNs are prone to training set biases i e the training set is drawn from a joint distribution p x y that is different from the distributionp xv yv of the evaluation set This distribution mismatch could have many 1Uber Advanced Technologies Group Toronto ON CANADA 2Department of Computer Science University of Toronto Toronto ON CANADA Correspondence to Mengye Ren Proceedings of the35thInternational Conference on Machine Learning Stockholm Sweden PMLR 80 2018 Copyright 2018 by the author s different forms Class imbalance in the training set is a very common example In applications such as object detection in the context of autonomous driving the vast majority of the training data is composed of standard vehicles but models also need to recognize rarely seen classes such as emergency vehicles or animals with very high accuracy This will sometime lead to biased training models that do not perform well in practice Another popular type of training set bias is label noise To train a reasonable supervised deep model we ideally need a large dataset with high quality labels which require many passes of expensive human quality assurance QA Although coarse labels are cheap and of high availability the presence of noise will hurt the model performance e g Zhang et al 2017 has shown that a standard CNN can fi t any ratio of label fl ipping noise in the training set and eventually leads to poor generalization performance Training set biases and misspecifi cation can sometimes be addressed with dataset resampling Chawla et al 2002 i e choosing the correct proportion of labels to train a network on or more generally by assigning a weight to each example and minimizing a weighted training loss The example weights are typically calculated based on the training loss as in many classical algorithms such as AdaBoost Freund Jiang et al 2017 However there exist two contradicting ideas in training loss based approaches In noisy label problems we prefer examples with smaller training losses as they are more likely to be clean images yet in class imbalance problems algorithms such as hard negative mining Malisiewicz et al 2011 prioritize examples with higher training loss since they are more likely to be the minority class In cases when the training set is both imbalanced and noisy these existing methods would have the wrong model assumptions In fact without a proper defi nition of an unbiased test set solving the training set bias problem is inherently ill defi ned As the model cannot distinguish the right from the wrong stronger regularization can usually work surprisingly well in certain synthetic noise settings Here we argue that in order to learn general forms of training set biases it is necessary to have a small unbiased validation to guide training It is actually arXiv 1803 09050v3 cs LG 5 May 2019 Learning to Reweight Examples for Robust Deep Learning not uncommon to construct a dataset with two parts one relatively small but very accurately labeled and another massive but coarsely labeled Coarse labels can come from inexpensive crowdsourcing services or weakly supervised data Cordts et al 2016 Russakovsky et al 2015 Chen to circumvent this we perform validation at every training iteration to dynamically determine the example weights of the current batch Towards this goal we propose an online reweighting method that leverages an additional small validation set and adaptively assigns importance weights to examples in every iteration We experiment with both class imbalance and corrupted label problems and fi nd that our approach signifi cantly increases the robustness to training set biases 2 Related Work The idea of weighting each training example has been well studied in the literature Importance sampling Kahn Ma et al 2017 Jiang et al 2015 Wang et al 2017 proposed a Bayesian method that infers the example weights as latent variables More recently Jiang et al 2017 proposed to use a meta learning LSTM to output the weights of the examples based on the training loss Reweighting examples is also related to curriculum learning Bengio et al 2009 where the model reweights among many available tasks Similar to self paced learning typically it is benefi cial to start with easier examples One crucial advantage of reweighting examples is robust ness against training set bias There has also been a multitude of prior studies on class imbalance problems including using dataset resampling Chawla et al 2002 Dong et al 2017 cost sensitive weighting Ting 2000 Khan et al 2015 and structured margin based objectives Huang et al 2016 Meanwhile the noisy label problem has been thoroughly studied by the learning theory commu nity Natarajan et al 2013 Angluin Sukhbaatar Xiao et al 2015 Azadi et al 2016 Goldberger Li et al 2017 Jiang et al 2017 Vahdat 2017 Hendrycks et al 2018 In addition to corrupted data Koh Mu noz Gonz alez et al 2017 demonstrate the possibility of a dataset adversarial attack i e dataset poisoning Our method improves the training objective through a weighted loss rather than an average loss and is an in stantiation of meta learning Thrun Lake et al 2017 Andrychowicz et al 2016 i e learning to learn better Using validation loss as the meta objective has been explored in recent meta learning literature for few shot learning Ravi Ren et al 2018 Lorraine otherwise we can always add this small validation set into the training set and leverage more information during training Let x be our neural network model and be the model parameters We consider a loss functionC y y to minimize during training where y x In standard training we aim to minimize the expected loss for the training set 1 N PN i 1C yi yi 1 N PN i 1fi where each input example is weighted equally andfi stands for the loss function associating with dataxi Here we aim to learn a reweighting of the inputs where we minimize a weighted loss w arg min N X i 1 wifi 1 withwiunknown upon beginning Note that wi N i 1can be understood as training hyperparameters and the optimal selection of w is based on its validation performance w arg min w w 0 1 M M X i 1 fv i w 2 It is necessary thatwi 0for alli since minimizing the negative training loss can usually result in unstable behavior Online approximationCalculating the optimalwire quires two nested loops of optimization and every single loop can be very expensive The motivation of our approach is to adapt onlinewthrough a single optimization loop For each training iteration we inspect the descent direction of some training examples locally on the training loss surface and reweight them according to their similarity to the descent direction of the validation loss surface For most training of deep neural networks SGD or its variants are used to optimize such loss functions At every steptof training a mini batch of training examples xi yi 1 i n is sampled wherenis the mini batch size n N Then the parameters are adjusted according to the descent direction of the expected loss on the mini batch Let s consider vanilla SGD t 1 t 1 n n X i 1 fi t 3 where is the step size We want to understand what would be the impact of training exampleitowards the performance of the validation set at training stept Following a similar analysis to Koh lf Pn i 1 iC yf i yf i 6 t BackwardAD lf t 7 t t t 8 yg Forward Xg yg t 9 lg 1 m Pm i 1C yg i yg i 10 BackwardAD lg 11 w max 0 w w P j w Pj w 12 lf Pn i 1wiC yi yf i 13 t BackwardAD lf t 14 t 1 OptimizerStep t t 15 end for Training timeOur automatic reweighting method will introduce a constant factor of overhead First it requires two full forward and backward passes of the network on training and validation respectively and then another backward on backwardpass Step5inFigure1 togetthegradientstothe example weights and fi nally a backward pass to minimize the reweighted objective In modern networks a backward on backward pass usually takes about the same time as a forward pass and therefore compared to regular training our method needs approximately 3 training time it is also possible to reduce the batch size of the validation pass for speedup We expect that it is worthwhile to spend the extra time to avoid the irritation of choosing early stopping fi netuning schedules and other hyperparameters 3 4 Analysis convergence of the reweighted training Convergence results of SGD based optimization methods are well known Reddi et al 2016 However it is still meaningful to establish a convergence result about our methodsinceitinvolvesoptimizationoftwo levelobjectives Eq 1 2 rather than one and we further make some fi rst order approximation by introducing Eq 7 Here we show theoretically that our method converges to the critical point of the validation loss function under some mild conditions and we also give its convergence rate More detailed proofs can be found in the Appendix B C Learning to Reweight Examples for Robust Deep Learning Defi nition 1 A functionf x Rd Ris said to be Lipschitz smooth with constant L if k f x f y k Lkx yk x y Rd Defi nition 2 f x has bounded gradients ifk f x k for all x Rd In most real world cases the high quality validation set is really small and thus we could set the mini batch sizemto be the same as the size of the validation setM Under this condition the following lemma shows that our algorithm always converges to a critical point of the validation loss However our method is not equivalent to training a model only on this small validation set Because directly training a model on a small validation set will lead to severe overfi tting issues On the contrary our method can leverage useful information from a larger training set and still converge to an appropriate distribution favored by this clean and balanced validation dataset This helps both generalization and robustness to biases in the training set which will be shown in our experiments Lemma 1 Suppose the validation loss function is Lipschitz smooth with constantL and the train loss functionfi of training dataxihave bounded gradients Let the learning rate t satisfi es t 2n L 2 wherenis the training batch size Then following our algorithm the validation loss always monotonically decreases for any sequence of training batches namely G t 1 G t 13 where G is the total validation loss G 1 M M X i 1 fv i t 1 14 Furthermore in expectation the equality in Eq 13 holds only when the gradient of validation loss becomes 0 at some time stept namelyEt G t 1 G t if and only if G t 0 where the expectation is taking over possible training batches at time step t Moreover we can prove the convergence rate of our method to be O 1 2 Theorem2 SupposeG fiand tsatisfythe aforementioned conditions then Algorithm 1 achieves E k G t k2 in O 1 2 steps More specifi cally min 0 t t t 1 i t i t i t 0 19 1 m m X j 1 fv j t fi t 20 1 m m X j 1 L X l 1 fv j l l l t fi l l l t 21 1 m m X j 1 L X l 1 vec zv j l 1g v j l vec z i l 1g i l 22 1 m m X j 1 L X l 1 D1 X p 1 D2 X q 1 zv j l 1 pg v j l q zi l 1 pgi l q 23 1 m m X j 1 L X l 1 D1 X p 1 zv j l 1 p zi l 1 p D2 X q 1 gv j l qgi l q 24 1 m m X j 1 L X l 1 zv j l 1 zi l 1 gv j l gi l 25 B Convergence of our method This section provides the proof for Lemma 1 Lemma Suppose the validation loss function is Lipschitz smooth with constantL and the train loss functionfi of training dataxihave bounded gradients Let the learning rate t satisfi es t 2n L 2 wherenis the training batch size Then following our algorithm the validation loss always monotonically decreases for any sequence of training batches namely G t 1 G t 26 where G is the total validation loss G 1 M M X i 1 fv i t 1 27 Furthermore in expectation the equality in Eq 26 holds only when the gradient of validation loss becomes 0 at some time stept namelyEt G t 1 G t if and only if G t 0 where the expectation is taking over possible training batches at time step t Proof Suppose we have a small validation set withMclean data x1 x2 xM each associating with a validation loss functionfi where is the parameter of the model The overall validation loss would be G 1 M M X i 1 fi 28 Now suppose we have anotherN Mtraining data xM 1 xM 2 xN and we add those validation data into this set to form our large training datasetT which has N data in total The overall training loss would be F 1 M N X i 1 fi 29 For simplicity sinceM N we assume that the validation data is a subset of the training data During training we take a mini batchBof training data at each step and B n Following some similar derivation as Appendix A we have the following update rules t 1 t t n X i B max G f i 0 fi 30 where tis the learning rate at time stept Since all gradients are taken at t we omit tin our notations Since the validation lossG is Lipschitz smooth we have G t 1 G t G L 2 k k2 31 Plugging our updating rule Eq 30 G t 1 G t I1 I2 32 where I1 t n X i B max G fi 0 G fi t n X i B max G fi 0 2 33 and I2 L 2 t n X i B max G fi 0 fi 2 34 L 2 2 t n2 X i B max G f i 0 fi 2 35 L 2 2 t n2 X i B max G f i 0 2 k fik2 36 L 2 2 t n2 X i B max G f i 0 2 2 37 Learning to Reweight Examples for Robust Deep Learning The fi rst inequality Eq 35 comes from the triangle inequality The second inequality Eq 37 holds since fihas bounded gradients If we denoteTt P i Bmax G fi 0 2 where tstands for the time step t then G t 1 G t t n Tt 1 L t 2 2n 38 Note that by defi nition Ttis non negative and since t 2n L 2 if follows that that G t 1 G t for any t Next we proveEt Tt 0if and only if G 0 and Et Tt 0if and only if G 6 0 where the expectation is taken over all possible training batches at time stept It is obvious that when G 0 Et Tt 0 If G 6 0 from the inequality below we fi rstly know that there must exist a validation example xj 0 j Msuch that G fj 0 0 G 1 M M X i 1 G fi 39 Secondly there is a non zero possibilitypto sample a training batchBsuch that it contains this dataxj Also noticing thatTtis a non negative random variable we have E t Tt p X i B max G fi 0 2 pmax G fj 0 2 p G f j 2 0 40 Therefore if we take expectation over the training batch on both sides of Eq 38 we can conclude that E t G t 1 G t 41 where the equality holds if and only if G 0 Thi

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论