




An Adaptive Empirical Bayesian Method for Sparse Deep Learning

Wei Deng, Department of Mathematics, Purdue University, West Lafayette, IN 47907
Xiao Zhang, Department of Computer Science, Purdue University, West Lafayette, IN 47907
Faming Liang, Department of Statistics, Purdue University, West Lafayette, IN 47907
Guang Lin, Departments of Mathematics, Statistics and School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We propose a novel adaptive empirical Bayesian (AEB) method for sparse deep learning, where sparsity is ensured via a class of self-adaptive spike-and-slab priors. The proposed method works by alternately sampling from an adaptive hierarchical posterior distribution using stochastic gradient Markov chain Monte Carlo (MCMC) and smoothly optimizing the hyperparameters using stochastic approximation (SA). We further prove the convergence of the proposed method to the asymptotically correct distribution under mild conditions. Empirical applications of the proposed method lead to state-of-the-art performance on MNIST and Fashion MNIST with shallow convolutional neural networks (CNNs) and state-of-the-art compression performance on CIFAR10 with Residual Networks. The proposed method also improves resistance to adversarial attacks.

1 Introduction

MCMC, known for its asymptotic properties, has not been fully investigated in deep neural networks (DNNs) due to its unscalability in dealing with big data. Stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011], the first stochastic gradient MCMC (SG-MCMC) algorithm, tackled this issue by adding noise to the stochastic gradient, smoothing the transition between optimization and sampling and making MCMC scalable. Chen et al. [2014] proposed stochastic gradient Hamiltonian Monte Carlo (SGHMC), a second-order SG-MCMC method, which was shown to converge faster. In addition to modeling uncertainty, SG-MCMC also has remarkable non-convex optimization abilities. Raginsky et al. [2017] and Xu et al. [2018] proved that SGLD, the first-order SG-MCMC, is guaranteed to converge to an approximate global minimum of the empirical risk in finite time. Zhang et al. [2017] showed that SGLD hits an approximate local minimum of the population risk in polynomial time. Mangoubi and Vishnoi [2018] further demonstrated that SGLD with simulated annealing has a higher chance of obtaining the global minimum on a wider class of non-convex functions. However, all of these analyses fail when the DNN has too many parameters: the over-specified model tends to have a large prediction variance, resulting in poor generalization and over-fitting. A proper model selection procedure is therefore needed in this situation.

A standard approach to model selection is variable selection. Notably, the best variable selection based on the $L_0$ penalty is conceptually ideal for sparsity detection but is computationally slow. Two alternatives have emerged to approximate it. On the one hand, penalized likelihood approaches, such as Lasso [Tibshirani, 1994], induce sparsity due to the geometry that underlies the $L_1$ penalty. To better handle highly correlated variables, Elastic Net [Zou and Hastie, 2005] was proposed as a compromise between the $L_1$ and $L_2$ penalties. On the other hand, spike-and-slab approaches to Bayesian variable selection originate from probabilistic considerations.
George and McCulloch [1993] proposed building a continuous approximation of the spike-and-slab prior and sampling from the resulting hierarchical Bayesian model using Gibbs sampling. This continuous relaxation inspired the efficient EM variable selection (EMVS) algorithm in linear models [Ročková and George, 2014, 2018]. Despite these advances in model selection for linear systems, model selection in DNNs has received less attention. Ghosh et al. [2018] proposed using variational inference (VI) based on regularized horseshoe priors to obtain a compact model. Liang et al. [2018] presented the theory of posterior consistency for Bayesian neural networks (BNNs) with Gaussian priors, and Ye and Sun [2018] applied a greedy elimination algorithm to conduct group model selection with the group Lasso penalty. Although these works only show the performance of shallow BNNs, the experimental methodologies imply the potential of model selection in DNNs. Louizos et al. [2017] studied scale mixtures of Gaussian priors and half-Cauchy scale priors for the hidden units of VGG models [Simonyan and Zisserman, 2014] and achieved good model compression performance on CIFAR10 [Krizhevsky, 2009] using VI. However, due to the limitations of VI in non-convex optimization, the resulting compression is still not sparse enough and can be further improved.

Over-parameterized DNNs often demand tremendous memory and heavy computational resources, which is impractical for smart devices. More critically, over-parametrization frequently over-fits the data and results in worse performance [Lin et al., 2017]. To ensure the efficiency of the sparse sampling algorithm without over-shrinkage in DNN models, we propose an AEB method that adaptively samples from a hierarchical Bayesian DNN model with spike-and-slab Gaussian-Laplace (SSGL) priors, where the priors are learned through optimization instead of sampling. The AEB method differs from the full Bayesian method in that the priors are inferred from the empirical data and the uncertainty of the priors is no longer considered, which speeds up the inference. In order to optimize the latent variables without affecting convergence to the asymptotically correct distribution, stochastic approximation (SA) [Benveniste et al., 1990], a standard method for adaptive sampling [Andrieu et al., 2005, Liang, 2010], naturally fits the training of the adaptive hierarchical Bayesian model. As a result, the asymptotic property allows us to combine simulated annealing to obtain a better point estimate in non-convex optimization.

In this paper, we propose a sparse Bayesian deep learning algorithm, SG-MCMC-SA, to adaptively learn hierarchical Bayes mixture models in DNNs. This algorithm has four main contributions:

- We propose a novel AEB method to efficiently train hierarchical Bayesian mixture DNN models, where the parameters are learned through sampling while the priors are learned through optimization.
- We prove the convergence of this approach to the asymptotically correct distribution, and it can be further generalized to a class of adaptive sampling algorithms for estimating state-space models in deep learning.
- We are the first to apply this adaptive sampling algorithm to DNN compression problems, with potential extension to a variety of model compression problems.
- It achieves the state of the art in terms of compression rates, reaching 91.68% accuracy on CIFAR10 using only 27K parameters (90% sparsity) with ResNet20 [He et al., 2016].
2 Stochastic Gradient MCMC

We denote the set of model parameters by $\beta$, the learning rate at time $k$ by $\epsilon^{(k)}$, the entire data set by $\mathcal{D} = \{d_i\}_{i=1}^{N}$, where $d_i = (x_i, y_i)$, and the log posterior by $L(\beta)$. The mini-batch of data $\mathcal{B}$ is of size $n$ with indices $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$, where $s_i \in \{1, 2, \ldots, N\}$. The stochastic gradient $\nabla\tilde{L}(\beta)$ computed from a mini-batch $\mathcal{B}$ randomly sampled from $\mathcal{D}$ is used to approximate $\nabla L(\beta)$:

$$\nabla\tilde{L}(\beta) = \nabla\log P(\beta) + \frac{N}{n}\sum_{i\in\mathcal{S}}\nabla\log P(d_i \mid \beta). \qquad (1)$$

SGLD (no momentum) is formulated as follows:

$$\beta^{(k+1)} = \beta^{(k)} + \epsilon^{(k)}\nabla\tilde{L}(\beta^{(k)}) + \mathcal{N}(0, 2\epsilon^{(k)}\tau^{-1}), \qquad (2)$$

where $\tau > 0$ denotes the inverse temperature. It has been shown that SGLD asymptotically converges to the stationary distribution $\pi(\beta \mid \mathcal{D}) \propto e^{\tau L(\beta)}$ [Teh et al., 2016, Zhang et al., 2017]. As $\tau$ increases and $\epsilon$ decreases gradually, the solution tends towards the global optimum with a higher probability. Another variant of SG-MCMC, SGHMC [Chen et al., 2014, Ma et al., 2015], generates samples as follows:

$$d\beta = r\,dt, \qquad dr = \nabla\tilde{L}(\beta)\,dt - Cr\,dt + \mathcal{N}(0, 2B\tau^{-1}dt) + \mathcal{N}(0, 2(C - \hat{B})\tau^{-1}dt), \qquad (3)$$

where $r$ is the momentum term, $\hat{B}$ is an estimate of the stochastic gradient variance, and $C$ is a user-specified friction term. Regarding the discretization of (3), we follow the numerical method proposed by Saatci and Wilson [2017] due to its convenience in importing parameter settings from SGD.

3 Empirical Bayesian via Stochastic Approximation

3.1 A hierarchical formulation with deep SSGL priors

Inspired by the hierarchical Bayesian formulation for sparse inference [George and McCulloch, 1993], we assume that the weight $\beta_{lj}$ in sparse layer $l$ with index $j$ follows the SSGL prior

$$\beta_{lj} \mid \sigma^2, \gamma_{lj} \sim (1 - \gamma_{lj})\,\mathcal{L}(0, \sigma v_0) + \gamma_{lj}\,\mathcal{N}(0, \sigma^2 v_1), \qquad (4)$$

where $\gamma_{lj} \in \{0, 1\}$, $\beta_l \in \mathbb{R}^{p_l}$, $\sigma^2 \in \mathbb{R}^{+}$, $\mathcal{L}(0, \sigma v_0)$ denotes a Laplace distribution with mean $0$ and scale $\sigma v_0$, and $\mathcal{N}(0, \sigma^2 v_1)$ denotes a normal distribution with mean $0$ and variance $\sigma^2 v_1$. The sparse layers can be the fully connected (FC) layers in a shallow CNN or the convolutional layers in a ResNet. If $\gamma_{lj} = 0$, the prior behaves like Lasso, which leads to a shrinkage effect; when $\gamma_{lj} = 1$, the $L_2$ penalty dominates. The likelihood follows

$$\pi(\mathcal{B} \mid \beta, \sigma^2) = \begin{cases} \dfrac{\exp\big\{-\sum_{i\in\mathcal{S}}(y_i - \psi(x_i; \beta))^2 / (2\sigma^2)\big\}}{(2\pi\sigma^2)^{n/2}} & \text{(regression)}, \\[1ex] \prod_{i\in\mathcal{S}} \dfrac{\exp\{\psi_{y_i}(x_i; \beta)\}}{\sum_{t=1}^{K}\exp\{\psi_t(x_i; \beta)\}} & \text{(classification)}, \end{cases} \qquad (5)$$

where $\psi(x_i; \beta)$ is a linear or non-linear mapping and $y_i \in \{1, 2, \ldots, K\}$ is the response value of the $i$-th example. In addition, the variance $\sigma^2$ follows an inverse gamma prior $\pi(\sigma^2) = \mathrm{IG}(\nu/2, \lambda/2)$. An i.i.d. Bernoulli prior is used for $\gamma$, namely $\pi(\gamma_l \mid \delta_l) = \delta_l^{|\gamma_l|}(1 - \delta_l)^{p_l - |\gamma_l|}$, where $\delta_l$ follows the Beta distribution $\pi(\delta_l) \propto \delta_l^{a-1}(1 - \delta_l)^{b-1}$. The use of a self-adaptive penalty enables the model to learn the level of sparsity automatically. Finally, our posterior follows

$$\pi(\beta, \sigma^2, \delta \mid \mathcal{B}) \propto \pi(\mathcal{B} \mid \beta, \sigma^2)^{N/n}\,\pi(\beta \mid \sigma^2, \gamma)\,\pi(\sigma^2 \mid \gamma)\,\pi(\gamma \mid \delta)\,\pi(\delta). \qquad (6)$$

3.2 Empirical Bayesian with approximate priors

To speed up the inference, we propose the AEB method, which samples $\beta$ and optimizes $\sigma^2$, so that the uncertainty of the hyperparameters is not considered. Because the binary variable $\gamma$ is hard to optimize directly, we consider optimizing the adaptive posterior $\mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}\big[\pi(\beta, \sigma^2, \delta \mid \mathcal{D})\big]$ instead. Due to the limited memory, which restricts us from sampling directly from $\mathcal{D}$, we choose to sample from $\mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}\big[\mathbb{E}_{\mathcal{B}}\big[\pi(\beta, \sigma^2, \delta \mid \mathcal{B})\big]\big]$. By Fubini's theorem and Jensen's inequality, we have

$$\log \mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}\big[\mathbb{E}_{\mathcal{B}}[\pi(\beta, \sigma^2, \delta \mid \mathcal{B})]\big]
= \log \mathbb{E}_{\mathcal{B}}\big[\mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}[\pi(\beta, \sigma^2, \delta \mid \mathcal{B})]\big]
\ge \mathbb{E}_{\mathcal{B}}\big[\log \mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}[\pi(\beta, \sigma^2, \delta \mid \mathcal{B})]\big]
\ge \mathbb{E}_{\mathcal{B}}\big[\mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}[\log \pi(\beta, \sigma^2, \delta \mid \mathcal{B})]\big]. \qquad (7)$$

Here $\mathbb{E}_{\gamma\mid\beta,\delta,\mathcal{D}}$ is short for $\mathbb{E}_{\gamma\mid\beta^{(k)},\sigma^{(k)},\delta^{(k)},\mathcal{D}}$, and $\mathbb{E}_{\mathcal{B}}\big[\pi(\beta, \sigma^2, \delta \mid \mathcal{B})\big]$ denotes $\int_{\mathcal{D}} \pi(\beta, \sigma^2, \delta \mid \mathcal{B})\,d\mathcal{B}$. Instead of tackling $\pi(\beta, \sigma^2, \delta \mid \mathcal{D})$ directly, we propose to iteratively update the lower bound

$$Q(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = \mathbb{E}_{\mathcal{B}}\big[\mathbb{E}_{\gamma\mid\beta^{(k)},\sigma^{(k)},\delta^{(k)},\mathcal{D}}\big[\log \pi(\beta, \sigma^2, \delta \mid \mathcal{B})\big]\big]. \qquad (8)$$

Given $\beta^{(k)}$, $\sigma^{(k)}$ and $\delta^{(k)}$ at the $k$-th iteration, we first sample $\beta^{(k+1)}$ from $Q$, and then optimize $Q$ with respect to $\sigma$, $\delta$ and $\mathbb{E}_{\gamma_l\mid\cdot,\mathcal{D}}[\gamma]$ via SA, where $\mathbb{E}_{\gamma_l\mid\cdot,\mathcal{D}}[\gamma]$ is used since $\gamma$ is treated as an unobserved variable.
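The sampling step above is carried out with the SG-MCMC methods of Section 2. As a rough illustration, the following is a minimal NumPy sketch of one SGLD update (2) using the mini-batch gradient estimate (1); the function and argument names are ours and not part of the paper.

```python
import numpy as np

def sgld_step(beta, grad_log_prior, grad_log_lik, batch, N, eps, tau=1.0, rng=None):
    """One SGLD update, Eq. (2), with the mini-batch gradient estimate of Eq. (1).

    beta           : current parameter vector beta^(k)
    grad_log_prior : callable, beta -> gradient of log P(beta)
    grad_log_lik   : callable, (beta, d_i) -> gradient of log P(d_i | beta)
    batch          : mini-batch of data points d_i = (x_i, y_i)
    N              : size of the full data set
    eps            : learning rate epsilon^(k)
    tau            : inverse temperature
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(batch)
    # Eq. (1): stochastic gradient of the log posterior
    grad = grad_log_prior(beta) + (N / n) * sum(grad_log_lik(beta, d) for d in batch)
    # Eq. (2): gradient step plus Gaussian noise with variance 2 * eps / tau
    return beta + eps * grad + rng.normal(0.0, np.sqrt(2.0 * eps / tau), size=beta.shape)
```

In SGLD-SA the same update is applied with the gradient of the lower bound $Q$ in (8) playing the role of $\nabla\tilde{L}(\beta)$, so the adaptive prior terms derived below enter through the gradient.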
To make the computation easier, we decompose our $Q$ as follows:

$$Q(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = Q_1(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) + Q_2(\delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) + C. \qquad (9)$$

Denote by $\mathcal{X}$ and $\mathcal{C}$ the sets of indices of the sparse and non-sparse layers, respectively. We have

$$Q_1(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = \frac{N}{n}\log \pi(\mathcal{B} \mid \beta, \sigma) - \sum_{l\in\mathcal{C}}\sum_{j}^{p_l}\frac{\beta_{lj}^2}{2\sigma_0^2} - \frac{p + \nu + 2}{2}\log(\sigma^2) - \sum_{l\in\mathcal{X}}\sum_{j}^{p_l}\bigg(\frac{|\beta_{lj}|\,\kappa_{lj0}}{\sigma} + \frac{\beta_{lj}^2\,\kappa_{lj1}}{2\sigma^2}\bigg) - \frac{\lambda}{2\sigma^2}, \qquad (10)$$

$$Q_2(\delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = \sum_{l\in\mathcal{X}}\bigg[\sum_{j}^{p_l}\log\Big(\frac{\delta_l}{1 - \delta_l}\Big)\rho_{lj} + (a - 1)\log(\delta_l) + (p_l + b - 1)\log(1 - \delta_l)\bigg], \qquad (11)$$

where the first term of (10) is the log likelihood, the second term comes from the Gaussian priors on the non-sparse layers $\mathcal{C}$, and the bracketed sum corresponds to the deep SSGL priors in the sparse layers $\mathcal{X}$. Here $\kappa_{lj0} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{D}}\big[(1 - \gamma_{lj})/v_0\big]$, $\kappa_{lj1} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{D}}\big[\gamma_{lj}/v_1\big]$ and $\rho_{lj} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{D}}[\gamma_{lj}]$ are to be estimated in the next section.

3.3 Empirical Bayesian via stochastic approximation

To simplify the notation, we denote the vector of adaptive hyperparameters by $\theta$. Our interest is to obtain the optimal $\theta^*$ based on the asymptotically correct distribution $\pi(\beta, \theta^*)$. This implies that we need an estimate $\theta^*$ that solves the fixed-point formulation $\int g_{\theta}(\beta)\,\pi(\beta, \theta)\,d\beta = \theta$ [Shimkin, 2011], where $g_{\theta}(\beta)$ is inspired by EMVS to obtain the optimal $\theta$ based on the current $\beta$. Define the random output $H(\theta, \beta) := g_{\theta}(\beta) - \theta$ and the mean-field function $h(\theta) := \mathbb{E}[H(\theta, \beta)]$. The stochastic approximation algorithm can be used to solve the fixed-point iterations:

(1) Sample $\beta^{(k+1)}$ from the transition kernel $\Pi_{\theta^{(k)}}(\beta^{(k)}, \cdot)$, which yields the distribution $\pi(\beta, \theta^{(k)})$;
(2) Update $\theta^{(k+1)} = \theta^{(k)} + \omega^{(k+1)} H(\theta^{(k)}, \beta^{(k+1)}) = \theta^{(k)} + \omega^{(k+1)}\big(h(\theta^{(k)}) + \Omega^{(k)}\big)$,

where $\omega^{(k+1)}$ is the step size. The equilibrium point $\theta^*$ is obtained when the distribution of $\beta$ converges to the invariant distribution $\pi(\beta, \theta^*)$. This form of stochastic approximation [Benveniste et al., 1990] differs from the Robbins-Monro algorithm in that sampling $\beta$ from a transition kernel instead of a distribution introduces a Markov state-dependent noise $\Omega^{(k)}$ [Andrieu et al., 2005]. In addition, since the variational technique is only used to approximate the priors and the exact likelihood does not change, the algorithm falls into the class of adaptive SG-MCMC rather than variational inference.

Regarding the updates of $g_{\theta}(\beta)$ with respect to $\rho$, we denote the optimal $\rho$ based on the current $\beta$ and $\delta$ by $\tilde{\rho}$. Then $\tilde{\rho}^{(k+1)}_{lj}$, the probability of $\beta_{lj}$ being dominated by the $L_2$ penalty, is

$$\tilde{\rho}^{(k+1)}_{lj} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{B}}[\gamma_{lj}] = P(\gamma_{lj} = 1 \mid \beta^{(k)}_{lj}, \delta^{(k)}_{l}) = \frac{a_{lj}}{a_{lj} + b_{lj}}, \qquad (12)$$

where $a_{lj} = \pi(\beta^{(k)}_{lj} \mid \gamma_{lj} = 1)\,P(\gamma_{lj} = 1 \mid \delta^{(k)}_{l})$ and $b_{lj} = \pi(\beta^{(k)}_{lj} \mid \gamma_{lj} = 0)\,P(\gamma_{lj} = 0 \mid \delta^{(k)}_{l})$. The choice of the Bernoulli prior enables us to use $P(\gamma_{lj} = 1 \mid \delta^{(k)}_{l}) = \delta^{(k)}_{l}$. Similarly, for $g_{\theta}(\beta)$ with respect to $\kappa$, the optimal $\kappa_{lj0}$ and $\kappa_{lj1}$ based on the current $\rho_{lj}$ are given by

$$\tilde{\kappa}_{lj0} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{B}}\Big[\frac{1 - \gamma_{lj}}{v_0}\Big] = \frac{1 - \rho_{lj}}{v_0}, \qquad \tilde{\kappa}_{lj1} = \mathbb{E}_{\gamma_l\mid\cdot,\mathcal{B}}\Big[\frac{\gamma_{lj}}{v_1}\Big] = \frac{\rho_{lj}}{v_1}. \qquad (13)$$

To optimize $Q_1$ with respect to $\sigma$, denoting $\mathrm{diag}\{\kappa_{lj0}\}_{j=1}^{p_l}$ by $V_{0l}$ and $\mathrm{diag}\{\kappa_{lj1}\}_{j=1}^{p_l}$ by $V_{1l}$, we have

$$\tilde{\sigma}^{(k+1)} = \begin{cases} \dfrac{R_b + \sqrt{R_b^2 + 4 R_a R_c}}{2 R_a} & \text{(regression)}, \\[1ex] \dfrac{C_b + \sqrt{C_b^2 + 4 C_a C_c}}{2 C_a} & \text{(classification)}, \end{cases} \qquad (14)$$

where $R_a = N + \sum_{l\in\mathcal{X}} p_l + \nu$, $C_a = \sum_{l\in\mathcal{X}} p_l + \nu + 2$, $R_b = C_b = \sum_{l\in\mathcal{X}} \|V_{0l}\beta^{(k+1)}_{l}\|_1$, $R_c = I + J + \lambda$, $C_c = J + \lambda$, $I = \frac{N}{n}\sum_{i\in\mathcal{S}}\big(y_i - \psi(x_i; \beta^{(k+1)})\big)^2$ and $J = \sum_{l\in\mathcal{X}} \|V_{1l}^{1/2}\beta^{(k+1)}_{l}\|^2$. Here $\|\cdot\|$ refers to the $L_2$ norm and $\|\cdot\|_1$ to the $L_1$ norm, and the quadratic equation behind (14) has a unique positive root. To optimize $Q_2$, a closed-form update can be derived from Eq. (11) and Eq. (12) given the batch data $\mathcal{B}$:

$$\tilde{\delta}^{(k+1)}_{l} = \operatorname*{arg\,max}_{\delta_l} Q_2(\delta_l \mid \beta^{(k)}_{l}, \delta^{(k)}_{l}) = \frac{\sum_{j=1}^{p_l}\rho_{lj} + a - 1}{a + b + p_l - 2}. \qquad (15)$$
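To make these closed-form solutions concrete, the following minimal NumPy sketch applies the updates (12), (13) and (15) to a single sparse layer and smooths them with the SA recursion $\theta^{(k+1)} = (1 - \omega^{(k+1)})\theta^{(k)} + \omega^{(k+1)}\tilde{\theta}^{(k+1)}$. The $\sigma$ update (14) is omitted for brevity, the spike and slab densities follow the SSGL prior (4), and all function and variable names are ours rather than the authors' implementation.

```python
import numpy as np

def sa_update_sparse_layer(beta_l, rho_l, kappa0_l, kappa1_l, delta_l,
                           sigma, v0, v1, a, b, omega):
    """SA-smoothed updates of rho (Eq. 12), kappa (Eq. 13) and delta (Eq. 15)
    for one sparse layer.  A sketch, not the authors' implementation."""
    p_l = beta_l.size
    # Eq. (12): posterior probability that beta_lj is drawn from the Gaussian slab,
    # with spike L(0, sigma*v0) and slab N(0, sigma^2*v1) as in the SSGL prior (4)
    slab = np.exp(-beta_l**2 / (2 * sigma**2 * v1)) / np.sqrt(2 * np.pi * sigma**2 * v1)
    spike = np.exp(-np.abs(beta_l) / (sigma * v0)) / (2 * sigma * v0)
    a_lj = slab * delta_l
    b_lj = spike * (1.0 - delta_l)
    rho_tilde = a_lj / (a_lj + b_lj)
    # Eq. (13): optimal kappa given the current rho
    kappa0_tilde = (1.0 - rho_tilde) / v0
    kappa1_tilde = rho_tilde / v1
    # Eq. (15): closed-form update of the layer-wise inclusion rate delta_l
    delta_tilde = (rho_tilde.sum() + a - 1.0) / (a + b + p_l - 2.0)
    # SA smoothing: theta^(k+1) = (1 - omega) * theta^(k) + omega * theta_tilde
    rho_l    = (1.0 - omega) * rho_l    + omega * rho_tilde
    kappa0_l = (1.0 - omega) * kappa0_l + omega * kappa0_tilde
    kappa1_l = (1.0 - omega) * kappa1_l + omega * kappa1_tilde
    delta_l  = (1.0 - omega) * delta_l  + omega * delta_tilde
    return rho_l, kappa0_l, kappa1_l, delta_l
```

In Algorithm 1 below these updates are interleaved with the SGLD step on $\nabla_{\beta} Q$ and, optionally, with magnitude-based pruning.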
3.4 Pruning strategy

There are quite a few methods for pruning neural networks, including oracle pruning and the easy-to-use magnitude-based pruning [Molchanov et al., 2017]. Although magnitude-based unit pruning shows more computational savings [Gomez et al., 2018], it does not demonstrate robustness under coarser pruning [Han et al., 2016, Gomez et al., 2018]. Pruning based on the probability $\rho$ is also popular in the Bayesian community, but achieving the target sparsity in sophisticated networks requires extra fine-tuning. We instead apply magnitude-based weight pruning in our ResNet compression experiments and refer to the resulting algorithm as SGLD-SA, which is detailed in Algorithm 1. The corresponding variant of SGHMC with SA is referred to as SGHMC-SA.

Algorithm 1: SGLD-SA with SSGL priors
  Initialize: $\beta^{(1)}$, $\rho^{(1)}$, $\kappa^{(1)}$, $\sigma^{(1)}$ and $\delta^{(1)}$ from scratch; set target sparse rates D, f and S
  for k = 1 : k_max do
    Sampling:
      $\beta^{(k+1)} \leftarrow \beta^{(k)} + \epsilon^{(k)}\nabla_{\beta} Q(\beta^{(k)} \mid \mathcal{B}^{(k)}) + \mathcal{N}(0, 2\epsilon^{(k)}\tau^{-1})$
    Stochastic approximation for latent variables:
      SA: $\rho^{(k+1)} \leftarrow (1 - \omega^{(k+1)})\rho^{(k)} + \omega^{(k+1)}\tilde{\rho}^{(k+1)}$ following Eq. (12)
      SA: $\kappa^{(k+1)} \leftarrow (1 - \omega^{(k+1)})\kappa^{(k)} + \omega^{(k+1)}\tilde{\kappa}^{(k+1)}$ following Eq. (13)
      SA: $\sigma^{(k+1)} \leftarrow (1 - \omega^{(k+1)})\sigma^{(k)} + \omega^{(k+1)}\tilde{\sigma}^{(k+1)}$ following Eq. (14)
      SA: $\delta^{(k+1)} \leftarrow (1 - \omega^{(k+1)})\delta^{(k)} + \omega^{(k+1)}\tilde{\delta}^{(k+1)}$ following Eq. (15)
    if Pruning then
      Prune the bottom-$s$% lowest-magnitude weights
      Increase the sparse rate $s \rightarrow S(1 - D^{k/f})$
    end if
  end for

4 Convergence Analysis

The key to guaranteeing the convergence of the adaptive SGLD algorithm is to use Poisson's equation to analyze additive functionals. By decomposing the Markov state-dependent noise $\Omega$ into martingale difference sequences and perturbations, where the latter can be controlled by the regularity of the solution of Poisson's equation, we can guarantee the consistency of the latent variable estimators.

Theorem 1 ($L_2$ convergence rate). For any $\alpha \in (0, 1]$, under the assumptions in Appendix B.1, the algorithm satisfies: there exist a constant $C_0$ and an optimum $\theta^*$ such that

$$\mathbb{E}\big[\|\theta^{(k)} - \theta^*\|^2\big] \le C_0\, k^{-\alpha}.$$

SGLD with adaptive latent variables forms a sequence of inhomogeneous Markov chains, and the weak convergence of $\beta$ to the target posterior is equivalent to the weak convergence of SGLD with biased gradient estimates. Inspired by Chen et al. [2015], we have:

Corollary 1. Under the assumptions in Appendix B.2, the random vector $\beta^{(k)}$ from the adaptive transition kernel $\Pi_{\theta^{(k-1)}}$ converges weakly to the invariant distribution $e^{\tau L(\beta, \theta^*)}$ as $\epsilon \rightarrow 0$ and $k \rightarrow \infty$.

The smooth optimization of the priors makes the algorithm robust to bad initialization and avoids entrapment in poor local optima. In addition, the convergence to the asymptotically correct distribution allows us to combine simulated annealing to obtain better point estimates in non-convex optimization.

5 Experiments

5.1 Simulation of Large-p-Small-n Regression

We conduct linear regression experiments with a dataset containing $n = 100$ observations and $p = 1000$ predictors. $\mathcal{N}_p(0, \Sigma)$ is chosen to simulate the predictor values $X$ (training set), where $\Sigma = (\Sigma_{i,j})_{i,j=1}^{p}$ with $\Sigma_{i,j} = 0.6^{|i-j|}$. Response values $y$ are generated from $X\beta + \eta$, where $\beta = (\beta_1, \beta_2, \beta_3, 0, 0, \ldots, 0)^{\top}$ and $\eta \sim \mathcal{N}_n(0, 3 I_n)$. We assume $\beta_1 \sim \mathcal{N}(3, \sigma_c^2)$, $\beta_2 \sim \mathcal{N}(2, \sigma_c^2)$, $\beta_3 \sim \mathcal{N}(1, \sigma_c^2)$ with $\sigma_c = 0.2$.

We introduce some hyperparameters, but most of them are uninformative. We fix $\tau = 1$, $\nu = 1$, $\lambda = 1$, $v_1 = 10$, $\delta = 0.5$, $b = p$ and set $a = 1$. The learning rate follows $\epsilon^{(k)} = 0.001 \times k^{-1/3}$, and the step size is given by $\omega^{(k)} = 10(k + 1000)^{-0.7}$. We vary $v_0$ and $\sigma$ to show the robustness of SGLD-SA to different initializations. In addition, to show the superiority of the adaptive update, we compare SGLD-SA with the intuit

Table 1: Predictive errors in linear regression based on a test set considering different $v_0$ and $\sigma$

MAE / MSE    $v_0$=0.01, $\sigma$=2    $v_0$=0.1, $\sigma$=2    $v_0$=0.01, $\sigma$=1    $v_0$=0.1, $\sigma$=1
SGLD-SA      1.89 / 5.56               1.72 / 5.64              1.48 / 3.51               1.54 / 4.42
SGLD-EM      3.49 / 19.31              2.23 / 8.22              2.23 / 19.28              2.07 / 6.94
SGLD         15.85 / 416.39            15.85 / 416.39           11.86 / 229.38            7.72 / 88.90
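For reference, the simulated data described in Section 5.1 can be generated with a few lines of NumPy. This is a sketch of the stated design only ($n = 100$, $p = 1000$, $\Sigma_{i,j} = 0.6^{|i-j|}$, three nonzero coefficients, $\eta \sim \mathcal{N}_n(0, 3 I_n)$); the function and variable names are ours.

```python
import numpy as np

def simulate_large_p_small_n(n=100, p=1000, sigma_c=0.2, seed=0):
    """Generate the large-p-small-n regression data of Section 5.1 (a sketch)."""
    rng = np.random.default_rng(seed)
    # Covariance Sigma_{i,j} = 0.6^{|i-j|} for the predictors
    idx = np.arange(p)
    Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # True coefficients: beta_1 ~ N(3, sigma_c^2), beta_2 ~ N(2, sigma_c^2),
    # beta_3 ~ N(1, sigma_c^2); the remaining p - 3 coefficients are exactly zero
    beta = np.zeros(p)
    beta[:3] = rng.normal([3.0, 2.0, 1.0], sigma_c)
    # Response y = X beta + eta with eta ~ N_n(0, 3 I_n)
    y = X @ beta + rng.normal(0.0, np.sqrt(3.0), size=n)
    return X, y, beta

X_train, y_train, beta_true = simulate_large_p_small_n()
```

A test set for evaluating the predictive errors reported in Table 1 can be drawn in the same way with a fresh seed.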