




Ouroboros: On Accelerating Training of Transformer-Based Language Models

Qian Yang¹, Zhouyuan Huo², Wenlin Wang¹, Heng Huang², Lawrence Carin¹
Department of Electrical and Computer Engineering
¹ Duke University   ² University of Pittsburgh

Abstract

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds up the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce the experiments can be found at https://github.com/LaraQianYang/Ouroboros.
1 Introduction

Natural language processing (NLP) tasks, such as machine translation [1,2,3,4,5], text summarization [6,7,8,9,10], and paraphrase generation [11,12,13], have achieved great success with the development of neural networks. It has been demonstrated recently that Transformer networks obtain superior performance [14,15,16] relative to recurrent neural networks or convolutional neural networks. BERT [17] trains a deep bidirectional Transformer with 340M parameters and obtains state-of-the-art results on 11 NLP tasks. Recently, OpenAI GPT-2 [18], a Transformer-based language model with 1.5B parameters, achieved state-of-the-art results on 7 out of 8 tested language modeling datasets, presenting impressive performance across many domains and datasets. Empirical results demonstrate the superiority of Transformer networks and show that a larger model tends to yield better performance. However, when a model is so large that it has to be allocated across multiple GPUs, data parallelism over these GPUs is not applicable, because it requires each GPU to hold one copy of the whole model. Meanwhile, model parallelization remains an open question when the model is too large to fit in a single device during training.

When a model becomes too large to fit on a single computing device, the simplest solution is to distribute its layers across multiple devices. In [19], the authors parallelize the model by splitting the filters or parameters of a layer across multiple GPUs. However, both of these methods suffer from backward locking of the backpropagation algorithm, and cannot parallelize the computations between layers. Backward locking means that the backpropagation algorithm requires gradients to be computed from the top layers to the bottom layers sequentially. When networks are very deep, all other devices are idle while the backpropagation computation is performed on one device. Jaderberg et al. [20] propose the Decoupled Neural Interface to remove backward locking, by employing additional neural networks to approximate error gradients. However, this approach works poorly on deep neural networks [21]. In [21], the authors use stale gradients from previous computations and successfully accelerate the training of deep networks like ResNet-110. Subsequently, Huo et al. [22] revise the memory issue in [21] and obtain better generalization error. Both of these methods only work on feed-forward networks that are separable between layers; neither approach can parallelize Transformer-based language models, because the shared embeddings make the networks non-separable.

To address the above challenges, we make the following contributions. (i) We present the first model-parallel algorithm to parallelize the training of Transformer-based language models, going beyond data parallelism. (ii) We analyze the convergence rate of the proposed algorithm and prove that it is guaranteed to converge to critical points for non-convex problems. (iii) We evaluate the proposed algorithm by training two Transformer-based language models; experimental results verify our theoretical analysis and demonstrate convergence much faster than previous methods, with comparable or better accuracy. The source code will be made publicly accessible to encourage further research.
2 Preliminary and Related Works

Self-attention architectures like the Transformer [14] have recently become popular for language modeling [15,16,17,18]. Consider training a Transformer-based language model with L layers. We may represent the computations in the network as follows:

h_1 = F_1(h_0; w_1, V_i),   (1)
h_l = F_l(h_{l-1}; w_l),  l \in \{2, \dots, L-1\},   (2)
h_L = F_L(h_{L-1}; w_L, V_o),   (3)

where h_{l-1} denotes the input of layer l, F_l(\cdot; w_l) denotes the computation of layer l with weight w_l, V_i is the input embedding, and V_o is the output projection. In particular, h_0 denotes the input data x, and h_L = F(x; \bar{w}) represents the output of the network. For the sake of performance, V_i and V_o are typically tied in language modeling and machine translation tasks, so that V = V_i = V_o [23,24]. Defining the network weights w = \{w_1, w_2, \dots, w_L\}, the embedding layer V, and \bar{w} = \{w, V\}, the loss function for language modeling can be represented as:

\min_{\bar{w}} f(F(x; \bar{w}), y),   (4)

where y denotes the target. In the following, we write f(\bar{w}) for simplicity.
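To make the modular notation in (1)-(4) concrete, here is a minimal PyTorch-style sketch of an L-layer language model with the input embedding and output projection tied to one weight matrix. The class and parameter names (TiedLM, d_model, n_layer) are illustrative assumptions, not the paper's released code; tying is done by sharing the Parameter object, the usual PyTorch idiom for V_i = V_o.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal L-layer language model with tied input/output embedding (V_i = V_o)."""
    def __init__(self, vocab_size, d_model, n_layer, n_head=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)            # V_i (used inside F_1)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_head) for _ in range(n_layer)]
        )                                                          # F_1, ..., F_L
        self.proj = nn.Linear(d_model, vocab_size, bias=False)     # V_o (used inside F_L)
        self.proj.weight = self.embed.weight                       # tie V_i and V_o

    def forward(self, x):                                          # x = h_0 (token ids)
        h = self.embed(x)
        for layer in self.layers:                                  # h_l = F_l(h_{l-1}; w_l)
            h = layer(h)
        return self.proj(h)                                        # h_L = F_L(h_{L-1}; w_L, V_o)

# Loss (4): cross-entropy between the network output F(x; w) and the target y.
model = TiedLM(vocab_size=1000, d_model=64, n_layer=4)
x = torch.randint(0, 1000, (16, 8))     # (sequence length, batch size)
y = torch.randint(0, 1000, (16, 8))
loss = nn.functional.cross_entropy(model(x).view(-1, 1000), y.view(-1))
```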
2.1 Gradient-Based Method

Gradient-based methods are widely employed for training deep neural networks; important examples include stochastic gradient descent (SGD) [25] and its variants AdaGrad [26], RMSProp [27], Adam [28], and AdamW [29]. With SGD, the weights of the network are updated as:

w_l^{t+1} = w_l^t - \gamma_t \nabla f_{l, x_{i(t)}}(\bar{w}^t)  and  V^{t+1} = V^t - \gamma_t \nabla f_{V, x_{i(t)}}(\bar{w}^t),   (5)

for any l \in \{1, \dots, L\}, where \gamma_t is the stepsize, i(t) is the index of the data sample at iteration t, and \nabla f_{l, x_{i(t)}}(\bar{w}^t) is the gradient of the loss function (4) with respect to the weights at layer l for data sample x_{i(t)}.
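As a literal reading of update (5), the following sketch applies one plain SGD step to every layer weight and to the shared embedding. The names sgd_step and gamma_t are hypothetical, and this is only a sketch of the update rule, not the optimizer used in the paper's experiments (which is Adam, Section 5.2).

```python
import torch

def sgd_step(model, loss, gamma_t):
    """One SGD update following (5): each layer weight w_l and the embedding V
    move against their stochastic gradient with stepsize gamma_t."""
    model.zero_grad()
    loss.backward()                      # backpropagation fills p.grad for all parameters
    with torch.no_grad():
        for p in model.parameters():     # covers w_1, ..., w_L and the tied embedding V
            if p.grad is not None:
                p -= gamma_t * p.grad    # w^{t+1} = w^t - gamma_t * gradient
```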
2.2 Backpropagation

If the loss function is differentiable, the gradients of the network parameters can be computed with the backpropagation algorithm [30]. Backpropagation consists of two passes through the network: forward computation and backward computation. In the forward computation, the activations of all layers are calculated from l = 1 to L following equations (1), (2) and (3). In the backward computation, we apply the chain rule and propagate error gradients through the network, from the output layer l = L to the input layer l = 1:

\frac{\partial f(\bar{w}^t)}{\partial w_l^t} = \frac{\partial f(\bar{w}^t)}{\partial h_l^t} \cdot \frac{\partial h_l^t}{\partial w_l^t}  and  \frac{\partial f(\bar{w}^t)}{\partial h_{l-1}^t} = \frac{\partial f(\bar{w}^t)}{\partial h_l^t} \cdot \frac{\partial h_l^t}{\partial h_{l-1}^t},   (6)

where \bar{w} = \{w, V\} and \nabla f_{l, x_{i(t)}}(\bar{w}^t) = \frac{\partial f(\bar{w}^t)}{\partial w_l^t}. For Transformer-based language models, the gradients with respect to the input embedding and the output projection layer are computed as:

\frac{\partial f(\bar{w}^t)}{\partial V_i} = \frac{\partial f(\bar{w}^t)}{\partial h_1^t} \cdot \frac{\partial h_1^t}{\partial V_i}  and  \frac{\partial f(\bar{w}^t)}{\partial V_o} = \frac{\partial f(\bar{w}^t)}{\partial h_L^t} \cdot \frac{\partial h_L^t}{\partial V_o}.   (7)

From (6), it is evident that the computation in layer l depends on the error gradient \frac{\partial f(\bar{w}^t)}{\partial h_l^t} coming from layer l+1. The sequential chain rule therefore prevents any layer from updating before it receives error gradients from the layers it depends on. When Transformer-based networks are very deep and the computation in each layer is significant, breaking this sequential chain rule to accelerate training is a challenge.
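To make the backward locking implied by (6) and (7) concrete, here is a small self-contained PyTorch illustration on a toy stack of modules (plain linear layers standing in for groups of Transformer layers): the error gradient grad_h must arrive from the module above before the module below can run its backward pass, so the backward computations of different modules cannot overlap.

```python
import torch
import torch.nn as nn

# Toy stack of modules operating on float activations.
modules = [nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(8, 16)

# Forward pass: detach each module's input so its backward pass can be run on its own.
inputs, h = [], x
for m in modules:
    h_in = h.detach().requires_grad_(True)
    inputs.append(h_in)
    h = m(h_in)
loss = h.pow(2).mean()

# Backward pass: strictly top-to-bottom, one module at a time (backward locking).
grad_h = None
for m, h_in in zip(reversed(modules), reversed(inputs)):
    if grad_h is None:
        torch.autograd.backward(loss)                          # top module starts from the loss
    else:
        torch.autograd.backward(m(h_in), grad_tensors=grad_h)  # must wait for grad from above
    grad_h = h_in.grad                                         # error gradient handed downward
```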
3 Accelerating Training of Transformer-Based Language Models

We propose the first model-parallel algorithm that can speed up the training of Transformer-based language models. We then take stochastic gradient descent as an example to show that our algorithm can readily be combined with any gradient-based method.

3.1 Ouroboros Algorithm

We split an L-layer network into K modules, so that the weights of the network are divided into K groups and each group is placed on one GPU. Therefore, we have w = \{w_{G(1)}, w_{G(2)}, \dots, w_{G(K)}\}, where G(k) denotes the layer indices in group k. We again denote V_i and V_o as the input embedding and output projection, located in the first and last module of the network. In [23], it is shown that a shared embedding always performs better than separate embeddings for language modeling and machine translation, where V_i and V_o are tied and V_i = V_o. In the following, we let V = \{V_i, V_o\}. Because of this tying, the first module and the last module must be placed on the same device, as visualized in Figure 1. The model is connected end-to-end and curls back on itself like a snake when grouped, so we name it "Ouroboros."

Figure 1: Communication between GPUs of the proposed Ouroboros algorithm. The first and last module of a Transformer-based language model are located on the same device.

Figure 2: Forward and backward computation of the proposed algorithm. We split a Transformer-based language model into four modules and allocate them onto three GPUs, where the first and the last module are placed on the same GPU. In the figure, h denotes activations, w denotes weights, V represents embedding layers, and "T layer" represents a Transformer layer. The input embedding and output projection are tied together.

In the backward computation of the backpropagation algorithm, the computations of module 1 depend on the computations of the later modules. In the Ouroboros algorithm, all modules are independent of each other at each iteration, through the use of delayed gradients. Let \bar{w} = \{w, V\}; the gradient of the weights in G(k) is

\nabla f_{G(k), x_{i(t-K+k)}}(\bar{w}^{t-K+k}) = \sum_{l \in G(k)} \frac{\partial f_{x_{i(t-K+k)}}(\bar{w}^{t-K+k})}{\partial w_l^{t-K+k}},  if t - K + k \ge 0,   (8)

and 0 otherwise, for any k \in \{1, 2, \dots, K\}. The gradient of V is the average of the gradients of the output projection and the input embedding:

\nabla f_{V, x_{i(t)}}(\bar{w}^t) = \frac{1}{2}\nabla f_{V_o, x_{i(t)}}(\bar{w}^t) + \frac{1}{2}\nabla f_{V_i, x_{i(t-K+1)}}(\bar{w}^{t-K+1}) = \frac{1}{2}\left( \frac{\partial f(\bar{w}^t)}{\partial V_o^t} + \frac{\partial f(\bar{w}^{t-K+1})}{\partial V_i^{t-K+1}} \right),   (9)

and 0 otherwise if t - K + 1 < 0.

Algorithm 1 Ouroboros + SGD
Require: initial weights w^0 = \{w^0_{G(1)}, \dots, w^0_{G(K)}\}; initial word embedding V^0_i = V^0_o; stepsize sequence \{\gamma_t\}.
1: for t = 0, 1, 2, \dots, T-1 do
2:   for k = 1, \dots, K in parallel do
3:     Compute the delayed gradient g^t_k for module k following (8);
4:     Compute the mixed gradient g^t_V for the embedding layer following (9);
5:     Update the weights and the embedding layer following SGD:
         w^{t+1}_{G(k)} = w^t_{G(k)} - \gamma_t \cdot g^t_k;
         V^{t+1}_i = V^{t+1}_o = V^t_i - \gamma_t \cdot g^t_V;
6:   end for
7: end for
8: Output w^s, V^s_i and V^s_o chosen randomly from \{w^t\}_{t=0}^{T-1}, \{V^t_i\}_{t=0}^{T-1} and \{V^t_o\}_{t=0}^{T-1}.
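Algorithm 1 can be read as the following single-process sketch, which simulates the update rule with a buffer of per-iteration gradient snapshots rather than the paper's parallel multi-GPU implementation; the names (ouroboros_sgd_step, grad_history) are illustrative assumptions. Here loss is assumed to be the language-modeling loss of the minibatch at iteration t evaluated at the current weights, and embed and proj hold separate copies of V that the update keeps equal (as in step 5 of Algorithm 1), so that their gradients can be mixed as in (9). Module k is updated with the gradient recorded at iteration t-K+k, as in (8).

```python
import torch

def ouroboros_sgd_step(modules, embed, proj, grad_history, t, K, gamma_t, loss):
    """One Ouroboros + SGD iteration (Algorithm 1), simulated in a single process."""
    # Backward pass at the current weights; snapshot all gradients for later reuse.
    for m in list(modules) + [embed, proj]:
        for p in m.parameters():
            p.grad = None
    loss.backward()
    grad_history[t] = {
        "modules": [[p.grad.detach().clone() for p in m.parameters()] for m in modules],
        "V_i": embed.weight.grad.detach().clone(),   # gradient of the input embedding
        "V_o": proj.weight.grad.detach().clone(),    # gradient of the output projection
    }

    with torch.no_grad():
        # Update module k with the delayed gradient from iteration t - K + k, as in (8).
        for k, m in enumerate(modules, start=1):
            s = t - K + k
            if s >= 0:
                for p, g in zip(m.parameters(), grad_history[s]["modules"][k - 1]):
                    p -= gamma_t * g
        # Mixed gradient for the tied embedding, as in (9).
        s = t - K + 1
        if s >= 0:
            g_V = 0.5 * grad_history[t]["V_o"] + 0.5 * grad_history[s]["V_i"]
            embed.weight -= gamma_t * g_V
            proj.weight.copy_(embed.weight)          # keep V_i^{t+1} = V_o^{t+1}
    grad_history.pop(t - K, None)                    # only the last K snapshots are needed
```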
4 Convergence Analysis

We analyze the convergence of the proposed algorithm under two commonly used assumptions.

Assumption 1 (Lipschitz-continuous gradient). The gradient of f(w) is Lipschitz continuous with Lipschitz constant L > 0, such that for any w, v it is satisfied that:

\|\nabla f(w) - \nabla f(v)\|_2 \le L \|w - v\|_2.   (12)

Assumption 2 (Bounded variance). We assume that the second moment of the stochastic gradient is upper bounded, such that there exists a constant M \ge 0 with, for any sample x_i and any w:

\|\nabla f_{x_i}(w)\|_2^2 \le M.   (13)

Because of the variance identity \mathbb{E}\|\nabla f_{x_i}(w) - \nabla f(w)\|_2^2 = \mathbb{E}\|\nabla f_{x_i}(w)\|_2^2 - \|\nabla f(w)\|_2^2, the following inequality is also satisfied:

\|\nabla f_{x_i}(w) - \mathbb{E}\nabla f_{x_i}(w)\|_2^2 \le M.   (14)

Under Assumptions 1 and 2, we obtain Lemma 1 about the iterations of the objective function.

Lemma 1. With Assumptions 1 and 2, let \sigma := \max_t \gamma_{\max\{0, t-K+1\}} / \gamma_t and M_K := (K + \frac{3}{4})M + (\frac{\sigma K^2}{2} + \sigma K^3)(K + 4)M. For all t \in \mathbb{N}, the iterations of Algorithm 1 satisfy the inequality:

\mathbb{E}\left[f(w^{t+1})\right] - f(w^t) \le -\frac{\gamma_t}{2} \|\nabla f(w^t)\|_2^2 + \gamma_t^2 L M_K.   (15)

From Lemma 1, we observe that the expected decrease of the objective function is controlled by the stepsize \gamma_t and M_K. We can therefore guarantee that the value of the objective function decreases as long as the stepsizes \gamma_t are small enough that the right-hand side of (15) is less than zero. Based on Lemma 1, we analyze the convergence guarantee of Algorithm 1.
4.1 Fixed Stepsize \gamma_t

We first analyze the convergence of Algorithm 1 when \gamma_t is fixed, and prove that the learned model converges sub-linearly to a neighborhood of the critical points.

Theorem 1. With Assumptions 1 and 2, and a fixed stepsize sequence \gamma_t satisfying \gamma_t = \gamma and \gamma L \le 1 for all t \in \{0, 1, \dots, T-1\}, let w^* be the optimal solution to f(w). The output of Algorithm 1 satisfies:

\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w^t)\|_2^2 \le \frac{2\left(f(w^0) - f(w^*)\right)}{\gamma T} + 2\gamma L M_K,   (16)

where M_K = (K + \frac{3}{4})M + (\frac{K^2}{2} + K^3)(K + 4)M.

According to Theorem 1, the average norm of the gradients converges to a neighborhood of the critical points. As T \to \infty, it remains upper bounded by 2\gamma L M_K.

Remark 1. With Assumptions 1 and 2, and following the notation of Theorem 1, let \gamma = \sqrt{\frac{f(w^0) - f(w^*)}{T L M_K}}. Then \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w^t)\|_2^2 \le 4 \sqrt{\frac{\left(f(w^0) - f(w^*)\right) L M_K}{T}}.

According to the above analysis, Algorithm 1 admits a convergence rate of O(1/\sqrt{T}) for non-convex problems, which is similar to the result for SGD [32].
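For completeness, the bound in Remark 1 follows by substituting the stated stepsize into (16); the short calculation below works out that step:

\frac{2\left(f(w^0) - f(w^*)\right)}{\gamma T} = 2\sqrt{\frac{\left(f(w^0) - f(w^*)\right) L M_K}{T}}  and  2\gamma L M_K = 2\sqrt{\frac{\left(f(w^0) - f(w^*)\right) L M_K}{T}},

so the two terms are equal, and their sum gives the factor of 4 in Remark 1, hence the O(1/\sqrt{T}) rate.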
4.2 Diminishing Stepsize \gamma_t

We prove that Algorithm 1 with diminishing stepsizes is guaranteed to converge to critical points for non-convex problems.

Theorem 2. With Assumptions 1 and 2, and a diminishing stepsize sequence \gamma_t satisfying \gamma_t = \frac{\gamma_0}{1+t} and \gamma_t L \le 1 for all t \in \{0, 1, \dots, T-1\}, assume w^* to be the optimal solution to f(w), and let \sigma = K, so that M_K = (K + \frac{3}{4})M + (\frac{K^3}{2} + K^4)(K + 4)M. Setting \Gamma_T = \sum_{t=0}^{T-1} \gamma_t, the output of Algorithm 1 satisfies:

\frac{1}{\Gamma_T} \sum_{t=0}^{T-1} \gamma_t \mathbb{E}\|\nabla f(w^t)\|_2^2 \le \frac{2\left(f(w^0) - f(w^*)\right)}{\Gamma_T} + \frac{2 \sum_{t=0}^{T-1} \gamma_t^2 L M_K}{\Gamma_T}.

Since \gamma_t = \frac{\gamma_0}{t+1}, the following limits hold:

\lim_{T\to\infty} \sum_{t=0}^{T-1} \gamma_t = \infty  and  \lim_{T\to\infty} \sum_{t=0}^{T-1} \gamma_t^2 < \infty.   (17)

Therefore, when T \to \infty, the right-hand side of the bound in Theorem 2 converges to 0.

Remark 2. Suppose w^s is chosen randomly from \{w^t\}_{t=0}^{T-1} with probabilities proportional to \{\gamma_t\}_{t=0}^{T-1}. According to Theorem 2, we can prove that Algorithm 1 guarantees convergence to critical points for the non-convex problem: \lim_{s\to\infty} \mathbb{E}\|\nabla f(w^s)\|_2^2 = 0.
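As a brief justification of (17): with \gamma_t = \gamma_0/(1+t), the stepsize sum is a divergent harmonic series, while the sum of squares is a convergent p-series:

\sum_{t=0}^{T-1} \frac{\gamma_0}{1+t} = \gamma_0 \sum_{n=1}^{T} \frac{1}{n} \ge \gamma_0 \ln(T+1) \to \infty,  while  \sum_{t=0}^{\infty} \frac{\gamma_0^2}{(1+t)^2} = \frac{\gamma_0^2 \pi^2}{6} < \infty.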
5 Experimental Setup

We evaluate the proposed method by training two Transformer-based language models. When a model is too large to fit in a single GPU, its layers have to be distributed across multiple GPUs. In this case, data parallelism over multiple GPUs does not work, because it requires each GPU to hold one copy of the whole model. Mini-batch computation on one GPU is regarded as data parallelism in this paper. To simulate this case, we distribute the layers of a model across K GPUs. Experimental results demonstrate that the proposed method obtains further speedup beyond data parallelism.

5.1 Datasets

Following [16], three publicly available datasets are used for training and evaluation: (i) enwiki8, containing 100M bytes of unprocessed Wikipedia text [33]; (ii) text8, containing 100M processed lower-case Wikipedia characters, with any character other than the 26 letters a through z and space removed [33]; and (iii) WikiText-103, the largest available word-level language modeling benchmark with long-term dependency [34]. All training datasets are preprocessed following [16].
5.2 Training Details

Our implementation is based on Transformer-XL, using PyTorch. All experiments are performed on a machine with 4 TESLA V100 GPUs. Parallelization between modules is handled via the subprocess library in Python 3. We use two language models in the paper: a 12-layer Transformer (44M parameters) [15] and Transformer-XL (41M parameters) [16]. In all experiments, we split a Transformer-based language model into K modules and allocate them sequentially onto K GPUs (backpropagation algorithm) or K-1 GPUs (Ouroboros). Due to limited resources, we validate our proposed algorithm by varying K from 3 to 5. Following [16], we use the Adam optimizer with \beta_1 = 0.9, \beta_2 = 0.999 and \epsilon = 10^{-8} [28]. For comparison, we use Ouroboros+Adam (see Appendix) in the experiments. The learning rate is set to 0.00025 and decreases following a cosine learning rate schedule [35].
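The module placement described above can be sketched as follows. This is a simplified single-process illustration of how K modules might be assigned to K-1 devices with the first and last module sharing a GPU; it is not the paper's subprocess-based implementation, and the helper name place_modules is hypothetical.

```python
import torch.nn as nn

def place_modules(layers, K):
    """Illustrative split of an L-layer stack into K modules placed on K-1 GPUs,
    with the first and last module sharing cuda:0 (the Ouroboros placement)."""
    L = len(layers)
    per = (L + K - 1) // K                           # roughly even number of layers per module
    modules = [nn.Sequential(*layers[i:i + per]) for i in range(0, L, per)]
    devices = [f"cuda:{k}" for k in range(len(modules) - 1)] + ["cuda:0"]
    for module, dev in zip(modules, devices):
        module.to(dev)
    return modules, devices

# Example: 12 Transformer layers split into K = 4 modules on 3 GPUs.
# modules, devices = place_modules([nn.TransformerEncoderLayer(512, 8) for _ in range(12)], K=4)
```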
Figure 3: Convergence of the compared methods, with respect to training steps and to computational time, on enwiki8, text8, and WikiText-103. We evaluate our algorithm on both Transformer and Transformer-XL language models.
Dataset         Transformer                       Transformer-XL
                Adam        Ouroboros + Adam      Adam        Ouroboros + Adam
enwiki8         -           -                     -           1.05
text8           -           -                     -           1.13
WikiText-103    28.32       28.29                 24.00       24.10

Table 1: Comparison of test bpc (bits per character) or test PPL (perplexity). We use bpc on the enwiki8 and text8 datasets, and PPL on the WikiText-103 dataset. Our algorithm achieves speedup with comparable or better performance.
Warm-up Training. In the early stages of training, stale gradient information may affect the convergence of Ouroboros. Following [36], we use a gradual warm-up approach in all experiments. This avoids a sudden increase of the learning rate, decreasing the error caused by stale gradients. In all experiments, we set the number of warm-up steps to 5000. After the warm-up, we use the cosine learning rate schedule (sketched at the end of this section).

Repeatable Dropout. According to [37], dropout randomly ignores weights in a fully-connected layer
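The warm-up-then-cosine schedule described in the Warm-up Training paragraph might be implemented as in the sketch below; the 5000 warm-up steps, the 0.00025 peak rate, and the Adam hyperparameters come from this section, while total_steps, the stand-in model, and the function name lr_at_step are illustrative assumptions.

```python
import math
import torch

def lr_at_step(step, peak_lr=0.00025, warmup_steps=5000, total_steps=200000):
    """Gradual warm-up to the peak learning rate, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                       # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))    # cosine decay

# Adam with the hyperparameters of Section 5.2; the model here is a stand-in.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.00025,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at_step(step) / 0.00025)
# Call scheduler.step() once per training step, after optimizer.step().
```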