You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle

Dinghuai Zhang* (Peking University), Tianyuan Zhang* (Peking University), Yiping Lu* (Stanford University), Zhanxing Zhu† (School of Mathematical Sciences, Peking University; Center for Data Science, Peking University; Beijing Institute of Big Data Research), Bin Dong† (Beijing International Center for Mathematical Research, Peking University; Center for Data Science, Peking University; Beijing Institute of Big Data Research)

*Equal contribution. †Corresponding authors.

Abstract

Deep learning achieves state-of-the-art results in many tasks in computer vision and natural language processing. However, recent works have shown that deep networks can be vulnerable to adversarial perturbations, which raises a serious robustness issue for deep networks. Adversarial training, typically formulated as a robust optimization problem, is an effective way of improving the robustness of deep networks. A major drawback of existing adversarial training algorithms is the computational overhead of generating adversarial examples, which is typically far greater than that of the network training itself and leads to an unbearable overall computational cost of adversarial training. In this paper, we show that adversarial training can be cast as a discrete time differential game. Through analyzing the Pontryagin Maximum Principle (PMP) of the problem, we observe that the adversary update is only coupled with the parameters of the first layer of the network. This inspires us to restrict most of the forward and back propagation to the first layer of the network during adversary updates, which effectively reduces the total number of full forward and backward propagations to only one for each group of adversary updates. We therefore refer to this algorithm as YOPO (You Only Propagate Once). Numerical experiments demonstrate that YOPO can achieve comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time of the projected gradient descent (PGD) algorithm [15].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Deep neural networks achieve state-of-the-art performance on many tasks [16, 8, 4, 42]. However, recent works show that deep networks are often sensitive to adversarial perturbations [33, 25, 47], i.e., changes to the input that are imperceptible to humans can cause a neural network to output an incorrect prediction. This poses significant concerns when applying deep neural networks to safety-critical problems such as autonomous driving and medical domains. To defend against adversarial attacks effectively, [24] proposed adversarial training, which can be formulated as a robust optimization problem [36]:

$$\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}} \max_{\|\eta\|\le\varepsilon} \ell(\theta; x+\eta, y), \tag{1}$$

where $\theta$ is the network parameter, $\eta$ is the adversarial perturbation, and $(x,y)$ is a pair of data and label drawn from a certain distribution $\mathcal{D}$. The magnitude of the adversarial perturbation $\eta$ is restricted by $\varepsilon \ge 0$. For a given pair $(x,y)$, we refer to the value of the inner maximization of (1), i.e. $\max_{\|\eta\|\le\varepsilon} \ell(\theta; x+\eta, y)$, as the adversarial loss, which depends on $(x,y)$.

A major issue of current adversarial training methods is their significantly high computational cost. In adversarial training
, we need to solve the inner maximization, i.e. obtain the optimal adversarial attack on the input, in every iteration. Such an optimal adversary is usually obtained with multi-step gradient descent, so the total time for learning a model with standard adversarial training is much greater than that of standard training. For instance, applying 40 inner iterations of projected gradient descent (PGD [15]) to obtain the adversarial examples makes the cost of solving problem (1) about 40 times that of regular training.

[Figure 1: Our proposed YOPO exploits the structure of the neural network. Previous work treats the network as a black box, requiring a heavy gradient calculation for every adversary update; YOPO restricts the calculation of the adversary to the first layer.]

The main objective of this paper is to reduce the computational burden of adversarial training by limiting the number of forward and backward propagations without hurting the performance of the trained network. To this end, we exploit the structure that arises when the min-max objective is composed with a deep neural network. We formulate the adversarial training problem (1) as a differential game, from which we derive the Pontryagin Maximum Principle (PMP) of the problem. From the PMP, we discover a key fact: the adversarial perturbation is only coupled with the weights of the first layer. This motivates a novel adversarial training strategy that decouples the adversary update from the training of the network parameters, effectively reducing the total number of full forward and backward propagations to only one for each group of adversary updates and significantly lowering the overall computational cost without hampering the performance of the trained network. We name this new adversarial training algorithm YOPO (You Only Propagate Once). Our numerical experiments show that YOPO achieves an approximately 4-5 times speedup over the original PGD adversarial training with comparable accuracy on MNIST/CIFAR10. Furthermore, we apply
 our algorithm to TRADES [44], a recently proposed min-max optimization objective, and achieve better clean and robust accuracy in less than half the time TRADES needs.

1.1 Related Works

Adversarial Defense. To improve the robustness of neural networks to adversarial examples, many defense strategies and models have been proposed, such as adversarial training [24], orthogonal regularization [6, 21], Bayesian methods [43], TRADES [44], rejecting adversarial examples [41], Jacobian regularization [14, 27], generative-model-based defenses [12, 31], pixel defense [29, 23], the ordinary differential equation (ODE) viewpoint [45], ensembles via an intriguing stochastic differential equation perspective [37], and feature denoising [40, 32], etc. Among all these approaches, adversarial training and its variants tend to be the most effective, since they largely avoid the obfuscated gradient problem [2]. Therefore, in this paper we choose adversarial training to achieve model robustness.

Neural ODEs. Recent works have built up the relationship between ordinary differential equations and neural networks [38, 22, 10, 5, 46, 35, 30], observing that each residual block of a ResNet can be written as $u_{n+1} = u_n + \Delta t\, f(u_n)$, one step of the forward Euler method approximating the ODE $u_t = f(u)$. Based on this, [19, 39] proposed an optimal control framework for deep learning, and [5, 19, 20] utilize the adjoint equation and the maximum principle to train neural networks.

Decouple Training. Training neural networks requires forward and backward propagation in a sequential manner. Different ways have been proposed to decouple this sequential process via parallelization, including ADMM [34], synthetic gradients [13], delayed gradient [11], and lifted machines [1, 18, 9]. Our work can also be understood as a decoupling method based on a splitting technique; however, we decouple the adversary update rather than the gradient with respect to the network parameters.

1.2 Contribution

- To the best of our knowledge, this is the first attempt to design an NN-specific algorithm for adversarial defense. To achieve this, we recast the adversarial training problem as a discrete time differential game. From optimal control theory, we derive an optimality condition, the Pontryagin Maximum Principle, for the differential game.

- Through the PMP, we observe that the adversarial perturbation is only coupled with the first layer of the neural network. The PMP motivates a new adversarial training algorithm, YOPO: we split the adversary computation from the weight update, and the adversary computation is focused on the first layer. Relations between YOPO and the original PGD algorithm are discussed.

- We achieve an approximately 4-5 times speedup over the original PGD training with comparable results on MNIST/CIFAR10.
Combining YOPO with TRADES [44], we achieve both higher clean and robust accuracy in less than half of the time TRADES needs.

1.3 Organization

This paper is organized as follows. In Section 2, we formulate the robust optimization of neural network adversarial training as a differential game and propose the gradient-based YOPO. In Section 3, we derive the PMP of the differential game, study the relationship between the PMP and back-propagation based gradient descent methods, and propose a general version of YOPO. Finally, all experimental details and results are given in Section 4.

2 Differential Game Formulation and Gradient Based YOPO

2.1 The Optimal Control Perspective and Differential Game

Inspired by the link between deep learning and optimal control [20], we formulate the robust optimization (1) as a differential game [7]. A two-player, zero-sum differential game is a game in which each player controls a dynamics, and one tries to maximize, the other to minimize, a payoff functional. In the context of adversarial training, one player is the neural network, which controls the weights of the network to fit the labels, while the other is the adversary, which is dedicated to producing false predictions by modifying the input. The robust optimization problem (1) can be written as a differential game as follows:

$$\min_\theta \max_{\|\eta_i\|\le\varepsilon} J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^N \ell_i(x_{i,T}) + \frac{1}{N}\sum_{i=1}^N\sum_{t=0}^{T-1} R_t(x_{i,t}; \theta_t)$$
$$\text{subject to}\quad x_{i,1} = f_0(x_{i,0} + \eta_i, \theta_0),\quad i = 1, 2, \dots, N,$$
$$x_{i,t+1} = f_t(x_{i,t}, \theta_t),\quad t = 1, 2, \dots, T-1. \tag{2}$$

Here, the dynamics $f_t(x_t, \theta_t)$, $t = 0, 1, \dots, T-1$, represent a deep neural network, $T$ denotes the number of layers, $\theta_t \in \Theta_t$ denotes the parameters in layer $t$ (denote $\theta = \{\theta_t\} \in \Theta$), the function $f_t: \mathbb{R}^{d_t} \times \Theta_t \to \mathbb{R}^{d_{t+1}}$ is a nonlinear transformation for one layer of the neural network, where $d_t$ is the dimension of the $t$-th feature map, and $\{x_{i,0},\ i = 1, \dots, N\}$ is the training dataset. The variable $\eta = (\eta_1, \dots, \eta_N)$ is the adversarial perturbation, which we constrain to an $\varepsilon$-ball. The function $\ell_i$ is a data fitting loss and $R_t$ is a regularization on the weights $\theta_t$, such as the $L_2$-norm. By casting the problem of adversarial training as the differential game (2), we regard $\theta$ and $\eta$ as two competing players, each trying to minimize/maximize the loss function $J(\theta, \eta)$, respectively.

2.2 Gradient Based YOPO

The Pontryagin Maximum Principle (PMP) is a fundamental tool in optimal control that characterizes optimal solutions of the corresponding control problem [7]. PMP is a rather general framework that inspires a variety of optimization algorithms. In this paper, we derive the PMP of the differential game (2), which motivates the proposed YOPO in its most general form. However, to better illustrate the essential idea of YOPO and to better address its relations with existing methods such as PGD, we present in this section a special case of YOPO based on gradient descent/ascent, and postpone the introduction of the PMP and the general version of YOPO to Section 3.

Let us first rewrite the original robust optimization problem (1) in a mini-batch form as

$$\min_\theta \max_{\|\eta_i\|\le\varepsilon} \sum_{i=1}^B \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)),\, y_i\big),$$

where $f_0$ denotes the first layer, $g_{\tilde\theta} = f_{T-1} \circ f_{T-2} \circ \cdots \circ f_1$ denotes the network without the first layer, and $B$ is the batch size. Here $\tilde\theta$ is defined as $\{\theta_1, \dots, \theta_{T-1}\}$. For simplicity we omit the regularization term $R_t$.

The simplest way to solve this problem is to perform gradient ascent on the input data and gradient descent on the weights of the neural network, as shown below; this alternating optimization is essentially the popular PGD adversarial training [24]. We summarize PGD-r (for each update on $\theta$) as follows, i.e. performing $r$ iterations of gradient ascent for the inner maximization:

- For $s = 0, 1, \dots, r-1$, perform

$$\eta_i^{s+1} = \eta_i^{s} + \alpha_1 \nabla_{\eta_i} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)),\, y_i\big), \quad i = 1, \dots, B,$$

where by the chain rule

$$\nabla_{\eta_i} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)),\, y_i\big) = \nabla_g \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)),\, y_i\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i^{s}, \theta_0)\big) \cdot \nabla_{\eta_i} f_0(x_i + \eta_i^{s}, \theta_0).$$

- Perform the SGD weight update (momentum SGD can also be used here):

$$\theta \leftarrow \theta - \alpha_2 \nabla_\theta \left( \sum_{i=1}^B \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{r}, \theta_0)),\, y_i\big) \right).$$

Note that this
 method conducts $r$ sweeps of forward and backward propagation for each update of $\theta$. This is the main reason why adversarial training with PGD-type algorithms can be very slow. To reduce the total number of forward and backward propagations, we introduce a slack variable

$$p = \nabla_g \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)),\, y_i\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i, \theta_0)\big)$$

and freeze it as a constant within the inner loop of the adversary update. The modified algorithm, which we shall refer to as YOPO-m-n, is given below.

- Initialize $\eta_i^{1,0}$ for each input $x_i$. For $j = 1, 2, \dots, m$:
  - Calculate the slack variable $p$:
    $$p = \nabla_g \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,0}, \theta_0)),\, y_i\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i^{j,0}, \theta_0)\big).$$
  - Update the adversary for $s = 0, 1, \dots, n-1$ with $p$ fixed:
    $$\eta_i^{j,s+1} = \eta_i^{j,s} + \alpha_1\, p \cdot \nabla_{\eta_i} f_0(x_i + \eta_i^{j,s}, \theta_0), \quad i = 1, \dots, B.$$
  - Let $\eta_i^{j+1,0} = \eta_i^{j,n}$.
- Calculate the weight update
  $$U = \sum_{j=1}^m \nabla_\theta \left( \sum_{i=1}^B \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,n}, \theta_0)),\, y_i\big) \right)$$
  and update the weight $\theta \leftarrow \theta - \alpha_2 U$. (Momentum SGD can also be used here.)

Intuitively, YOPO freezes the values of the derivatives of the network at layers $1, 2, \dots, T-1$ during the $s$-loop of the adversary updates. Figure 2 shows a conceptual comparison between YOPO and PGD. YOPO-m-n accesses the data $m \times n$ times while requiring only $m$ full forward and backward propagations; PGD-r, on the other hand, propagates the data $r$ times, requiring $r$ full forward and backward propagations. YOPO-m-n therefore has the flexibility of increasing $n$ and reducing $m$ to achieve approximately the same level of attack at a much lower computational cost. For example, suppose one applies PGD-10 (i.e. 10 steps of gradient ascent for solving the inner maximization) to calculate the adversary; an alternative is YOPO-5-2, which also accesses the data 10 times but requires only 5 full forward propagations. Empirically, YOPO-m-n achieves comparable results when $m \times n$ is set slightly larger than $r$. Another benefit of YOPO is that it takes full advantage of every forward and backward propagation to update the weights: the intermediate perturbations $\eta_i^{j}$, $j = 1, \dots, m-1$, are not wasted as in PGD-r. This allows multiple weight updates per iteration, which potentially drives YOPO to converge
 faster in terms of the number of epochs. Combining these two factors, YOPO can significantly accelerate standard PGD adversarial training.

We would like to point out a concurrent paper [28] that is related to YOPO. Their proposed method, called Free-m, can also significantly speed up adversarial training. In fact, Free-m is essentially YOPO-m-1, except that YOPO-m-1 delays the weight update until the whole mini-batch is processed, in order to make proper use of momentum.⁴

⁴Momentum should be accumulated between mini-batches rather than between different adversarial examples from one mini-batch; otherwise overfitting becomes a serious problem.

3 The Pontryagin Maximum Principle for Adversarial Training

In this section, we present the PMP of the discrete time differential game (2). From the PMP, we can observe that the adversary update and its associated back-propagation process can be decoupled. Furthermore, back-propagation based gradient descent can be understood as an iterative algorithm solving the PMP, and with that, the version of YOPO presented in the previous section can be viewed as an algorithm solving the PMP. However, the PMP facilitates a much wider class of algorithms than gradient descent [19]. Therefore, we will present a general version of YOPO based on the PMP for the discrete differential game.

3.1 PMP

Pontryagin-type maximum principles [26, 3] provide necessary conditions for optimality with a layer-wise maximization requirement on the Hamiltonian function. For each layer $t \in \mathcal{T} := \{0, 1, \dots, T-1\}$, we define the Hamiltonian function $H_t: \mathbb{R}^{d_t} \times \mathbb{R}^{d_{t+1}} \times \Theta_t \to \mathbb{R}$ as

$$H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t) - \frac{1}{B} R_t(x, \theta_t).$$

[Figure 2: Pipeline of YOPO-m-n described in Algorithm 1. The yellow and olive blocks represent feature maps, while the orange blocks represent the gradients of the loss w.r.t. the feature maps of each layer. A PGD adversarial training iteration repeats its full loop r times; a YOPO outer iteration runs m times, each containing n inner (first-layer-only) iterations.]

The PMP for continuous time differential games has been well studied in the literature [7]. Here, we present the PMP for our discrete time differential game (2).

Theorem 1. (PMP for adversarial training) Assume $\ell_i$ is twice continuously differentiable, and $f_t(\cdot, \theta)$, $R_t(\cdot, \theta)$ are twice continuously differentiable with respect to $x$; $f_t(\cdot, \theta)$, $R_t(\cdot, \theta)$, together with their $x$ partial derivatives, are uniformly bounded in $t$ and $\theta$; and the sets $\{f_t(x, \theta): \theta \in \Theta_t\}$ and $\{R_t(x, \theta): \theta \in \Theta_t\}$ are convex for every $t$ and $x \in \mathbb{R}^{d_t}$. Denote $\theta^*$ as the solution of problem (2). Then there exist co-state processes $p_i^* := \{p_{i,t}^*: t \in \mathcal{T}\}$ such that the following holds for all $t \in \mathcal{T}$ and $i \in [B]$:

$$x_{i,t+1}^* = \nabla_p H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \quad x_{i,0}^* = x_{i,0} + \eta_i^*, \tag{3}$$
$$p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \quad p_{i,T}^* = -\frac{1}{B} \nabla \ell_i(x_{i,T}^*). \tag{4}$$

At the same time, the parameters of the first layer $\theta_0^* \in \Theta_0$ and the optimal adversarial perturbation $\eta_i^*$ satisfy

$$\sum_{i=1}^B H_0(x_{i,0} + \eta_i, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^B H_0(x_{i,0} + \eta_i^*, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^B H_0(x_{i,0} + \eta_i^*, p_{i,1}^*, \theta_0), \tag{5}$$
$$\forall \theta_0 \in \Theta_0, \quad \|\eta_i\| \le \varepsilon, \tag{6}$$

and the parameters of the other layers $\theta_t^* \in \Theta_t$, $t \ge 1$, maximize the Hamiltonian functions:

$$\sum_{i=1}^B H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^B H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \quad \forall \theta_t \in \Theta_t. \tag{7}$$

Proof. The proof is given in the supplementary materials.

From the theorem, we can observe that the adversary $\eta$ is only coupled with the parameters of the first layer $\theta_0$. This key observation inspires the design of YOPO.

3.2 PMP and Back-Propagation Based Gradient Descent
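The connection named in this section title can be previewed directly from the definitions above. The following is only a brief sketch, not the paper's full argument, and it drops the regularization $R_t$ for brevity.

```latex
% With R_t \equiv 0, the Hamiltonian is H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t),
% so the co-state equation (4) becomes a linear backward recursion:
p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*)
          = \big(\nabla_x f_t(x_{i,t}^*, \theta_t^*)\big)^{\top} p_{i,t+1}^*,
\qquad p_{i,T}^* = -\tfrac{1}{B}\, \nabla \ell_i(x_{i,T}^*).

% Unrolling the recursion from T down to t gives
p_{i,t}^* = -\tfrac{1}{B}
  \Big(\tfrac{\partial x_{i,T}}{\partial x_{i,t}}\Big)^{\top}
  \nabla \ell_i(x_{i,T}^*)
  = -\tfrac{1}{B}\, \nabla_{x_{i,t}} \ell_i(x_{i,T}^*).
```

That is, up to the factor $-1/B$, the co-state $p_{i,t}^*$ is exactly the gradient of the loss with respect to the layer-$t$ activation that back-propagation computes; in particular, $p_{i,1}^*$ is the quantity that YOPO freezes as the slack variable $p$ in Section 2.2.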
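To make the propagation-count argument of Section 2.2 concrete, here is a minimal, self-contained numerical sketch. It is not the authors' implementation: the one-dimensional "network" and all names (`slack_p`, `pgd`, `yopo`, `W0`, `W1`) are illustrative. The first layer is $f_0(x) = \tanh(w_0 x)$, the scalar multiplication by `W1` plays the role of $g_{\tilde\theta}$, and `slack_p` computes the frozen $p$. Only the adversary loops are compared; the weight update is omitted.

```python
import math

# Toy 1-D "network": first layer f0(x) = tanh(W0 * x),
# rest of the network g(z) = W1 * z, squared loss against label y.
W0, W1 = 1.5, 2.0
EPS, ALPHA = 0.3, 0.1        # perturbation budget and step size

def f0(x):        return math.tanh(W0 * x)
def df0_dx(x):    return W0 * (1.0 - math.tanh(W0 * x) ** 2)
def adv_loss(x, eta, y):
    return (W1 * f0(x + eta) - y) ** 2

def slack_p(x, eta, y):
    """p = d(loss)/dz at z = f0(x + eta): needs one full forward/backward pass."""
    z = f0(x + eta)
    return 2.0 * (W1 * z - y) * W1

def clip(eta):    return max(-EPS, min(EPS, eta))

def pgd(x, y, r):
    """PGD-r: one full propagation per adversary step -> r full passes."""
    eta, full_passes = 0.0, 0
    for _ in range(r):
        p = slack_p(x, eta, y); full_passes += 1
        eta = clip(eta + ALPHA * p * df0_dx(x + eta))   # gradient ascent on eta
    return eta, full_passes

def yopo(x, y, m, n):
    """YOPO-m-n: freeze p, then n cheap first-layer-only steps -> m full passes."""
    eta, full_passes = 0.0, 0
    for _ in range(m):
        p = slack_p(x, eta, y); full_passes += 1        # the only full pass
        for _ in range(n):                              # first layer only
            eta = clip(eta + ALPHA * p * df0_dx(x + eta))
    return eta, full_passes

x, y = 0.5, -1.0
eta_pgd, cost_pgd = pgd(x, y, r=4)
eta_yopo, cost_yopo = yopo(x, y, m=2, n=2)
print(cost_pgd, cost_yopo)   # PGD-4 uses 4 full passes, YOPO-2-2 only 2
```

Both loops access the data four times here, but YOPO-2-2 needs half the full forward/backward passes of PGD-4, and both drive the adversarial loss above its unperturbed value, mirroring the m·n versus r accounting above.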