




Simple Recurrent Units for Highly Parallelizable Recurrence

Tao Lei^1, Yu Zhang^2, Sida I. Wang^{1,3}, Hui Dai^1, Yoav Artzi^{1,4}
^1 ASAPP Inc.   ^2 Google Brain   ^3 Princeton University   ^4 Cornell University

arXiv:1709.02755v5 [cs.CL] 7 Sep 2018

Abstract

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.^1

1 Introduction

Recurrent neural networks (RNN) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018). Key to many of these advancements are architectures of increased capacity and computation. For instance, the top-performing models for semantic role labeling and translation use eight recurrent layers, requiring days to train (He et al., 2017; Wu et al., 2016b). The scalability of these models has become an important problem that impedes NLP research.

^1 Our code is available at https://github.com/taolei87/sru.

The difficulty of scaling recurrent networks arises from the time dependence of state computation. In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step. This sequential dependency makes recurrent networks significantly slower than other operations, and limits their applicability. For example, recent translation models consist of non-recurrent components only, such as attention and convolution, to scale model training (Gehring et al., 2017; Vaswani et al., 2017).

In this work, we introduce the Simple Recurrent Unit (SRU), a unit with light recurrence that offers both high parallelization and sequence modeling capacity. The design of SRU is inspired by previous efforts, such as Quasi-RNN (QRNN; Bradbury et al., 2017) and Kernel NN (KNN; Lei et al., 2017), but enjoys additional benefits:

- SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. Figure 1 compares our architecture's runtime to common architectures.
- SRU replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity, while using less computation (and hyper-parameters).

- SRU improves the training of deep recurrent models by employing highway connections (Srivastava et al., 2015) and a parameter initialization scheme tailored for gradient propagation in deep architectures.

Figure 1: Average processing time in milliseconds of a batch of 32 samples using cuDNN LSTM, word-level convolution conv2d (with filter width k = 2 and k = 3), and the proposed SRU. We vary the number of tokens per sequence (l) and feature dimension (d).

We evaluate SRU on a broad set of problems, including text classification, question answering, translation and character-level language modeling. Our experiments demonstrate that light recurrence is sufficient for various natural language tasks, offering a good trade-off between scalability and representational power. On classification and question answering datasets, SRU outperforms common recurrent and non-recurrent architectures, while achieving 5-9x speed-up compared to cuDNN LSTM. Stacking additional layers further improves performance, while incurring relatively small costs owing to the cheap computation of a single layer. We also obtain an average improvement of 0.7 BLEU score on the English to German translation task by incorporating SRU into Transformer (Vaswani et al., 2017).

2 Related Work

Improving on common architectures for sequence processing has recently received significant attention (Greff et al., 2017; Balduzzi and Ghifary, 2016; Miao et al., 2016; Zoph and Le, 2016; Lee et al., 2017). One area of research involves incorporating word-level convolutions (i.e. n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function, and achieves significant speed-up over LSTM. While Bradbury et al. (2017) focus on the speed advantages of the network, Lei et al. (2017) study the theoretical characteristics of such computation and possible extensions. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. This finding motivates the design of SRU for both high parallelization and representational power. SRU also relates to IRNN (Le et al., 2015), which uses an identity diagonal matrix to initialize hidden-to-hidden connections. SRU uses point-wise multiplication for hidden connections, which is equivalent to using a diagonal weight matrix. This can be seen as a constrained version of diagonal initialization.

Various strategies have been proposed to scale network training (Goyal et al., 2017) and to speed up recurrent networks (Diamos et al., 2016; Shazeer et al., 2017; Kuchaiev and Ginsburg, 2017). For instance, Diamos et al. (2016) utilize hardware infrastructures by stashing RNN parameters on cache (or fast memory). Shazeer et al. (2017) and Kuchaiev and Ginsburg (2017) improve the computation via conditional computing and matrix factorization respectively.
Our implementation for SRU is inspired by the cuDNN-optimized LSTM (Appleyard et al., 2016), but enables more parallelism: while cuDNN LSTM requires six optimization steps, SRU achieves more significant speed-up via two optimizations.

The design of recurrent networks, such as SRU and related architectures, raises questions about representational power and interpretability (Chen et al., 2018; Peng et al., 2018). Balduzzi and Ghifary (2016) apply type-preserving transformations to discuss the capacity of various simplified RNN architectures. Recent work (Anselmi et al., 2015; Daniely et al., 2016; Zhang et al., 2016; Lei et al., 2017) relates the capacity of neural networks to deep kernels. We empirically demonstrate that SRU can achieve compelling results by stacking multiple layers.

3 Simple Recurrent Unit

We present and explain the design of the Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

    f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)        (1)
    c_t = f_t ⊙ c_{t-1} + (1 - f_t) ⊙ (W x_t)     (2)
    r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)        (3)
    h_t = r_t ⊙ c_t + (1 - r_t) ⊙ x_t             (4)

where W, W_f and W_r are parameter matrices and v_f, v_r, b_f and b_r are parameter vectors to be learnt during training. The complete architecture decomposes into two sub-components: a light recurrence (Equations 1 and 2) and a highway network (Equations 3 and 4).

The light recurrence component successively reads the input vectors x_t and computes the sequence of states c_t capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN (Lee et al., 2017). Specifically, a forget gate f_t controls the information flow (Equation 1), and the state vector c_t is determined by adaptively averaging the previous state c_{t-1} and the current observation W x_t according to f_t (Equation 2).

One key design decision that differs from previous gated recurrent architectures is the way c_{t-1} is used in the sigmoid gate. Typically, c_{t-1} is multiplied with a parameter matrix to compute f_t, e.g., f_t = σ(W_f x_t + V_f c_{t-1} + b_f). However, the inclusion of V_f c_{t-1} makes it difficult to parallelize the state computation: each dimension of c_t and f_t depends on all entries of c_{t-1}, and the computation has to wait until c_{t-1} is fully computed. To facilitate parallelization, our light recurrence component uses a point-wise multiplication v_f ⊙ c_{t-1} instead. With this simplification, each dimension of the state vectors becomes independent and hence parallelizable.

The highway network component (Srivastava et al., 2015) facilitates gradient-based training of deep networks. It uses the reset gate r_t (Equation 3) to adaptively combine the input x_t and the state c_t produced from the light recurrence (Equation 4), where (1 - r_t) ⊙ x_t is a skip connection that allows the gradient to directly propagate to the previous layer. Such connections have been shown to improve scalability (Wu et al., 2016a; Kim et al., 2016; He et al., 2016; Zilly et al., 2017).

The combination of the two components makes the overall architecture simple yet expressive, and easy to scale due to enhanced parallelization and gradient propagation.

3.1 Parallelized Implementation

Despite the parallelization-friendly design of SRU, a naive implementation which computes equations (1)-(4) for each step t sequentially would not achieve SRU's full potential. We employ two optimizations to enhance parallelism. The optimizations are performed in the context of GPU / CUDA programming, but the general idea can be applied to other parallel programming models.
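For reference, the per-step computation can be written down directly from Equations (1)-(4). The following is a minimal, unoptimized sketch in PyTorch-style Python; it is our own illustration rather than the authors' released implementation, and the tensor names, shapes, and function signature are assumptions made for clarity.

```python
import torch

def naive_sru_forward(x, W, W_f, W_r, v_f, v_r, b_f, b_r, c0):
    """Sequential SRU layer following Equations (1)-(4).
    x: (L, B, d) input sequence; W, W_f, W_r: (d, d); v_f, v_r, b_f, b_r: (d,); c0: (B, d)."""
    c, outputs = c0, []
    for t in range(x.size(0)):
        x_t = x[t]                                         # (B, d)
        f_t = torch.sigmoid(x_t @ W_f.T + v_f * c + b_f)   # Eq. (1): forget gate
        r_t = torch.sigmoid(x_t @ W_r.T + v_r * c + b_r)   # Eq. (3): reset gate
        c   = f_t * c + (1 - f_t) * (x_t @ W.T)            # Eq. (2): light recurrence
        h_t = r_t * c + (1 - r_t) * x_t                    # Eq. (4): highway connection
        outputs.append(h_t)
    return torch.stack(outputs), c                         # (L, B, d) outputs, final state
```

Each time step here launches several small matrix multiplications and element-wise kernels, which is exactly the overhead the two optimizations below avoid.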
We re-organize the computation of equations (1)-(4) into two major steps. First, given the input sequence x_1, ..., x_L, we batch the matrix multiplications across all time steps. This significantly improves the computation intensity (e.g. GPU utilization). The batched multiplication is:

    U^⊤ = [ W ; W_f ; W_r ] [x_1, x_2, ..., x_L],

where L is the sequence length, U ∈ R^{L×3d} is the computed matrix and d is the hidden state size. When the input is a mini-batch of B sequences, U would be a tensor of size (L, B, 3d).

The second step computes the remaining point-wise operations. Specifically, we compile all point-wise operations into a single fused CUDA kernel and parallelize the computation across each dimension of the hidden state. Algorithm 1 shows the pseudocode of the forward function. The complexity of this step is O(L·B·d) per layer, where L is the sequence length and B is the batch size. In contrast, the complexity of LSTM is O(L·B·d^2) because of the hidden-to-hidden multiplications (e.g. V h_{t-1}), and each dimension cannot be independently parallelized. The fused kernel also reduces overhead. Without it, operations such as sigmoid activation would each invoke a separate function call, adding kernel launching latency and more data moving costs.

The implementation of a bidirectional SRU is similar: the matrix multiplications of both directions are batched, and the fused kernel handles and parallelizes both directions at the same time.

Algorithm 1: Mini-batch version of the forward pass defined in Equations (1)-(4).
  Indices: sequence length L, mini-batch size B, hidden state dimension d.
  Input: input sequence batch x[l, i, j]; grouped matrix multiplication U[l, i, j'];
         initial state c0[i, j]; parameters vf[j], vr[j], bf[j] and br[j].
  Output: output states h[., ., .] and internal states c[., ., .].

  Initialize h[., ., .] and c[., ., .] as two L x B x d tensors.
  for i = 1, ..., B; j = 1, ..., d do        // parallelize over each example i and dimension j
      c = c0[i, j]
      for l = 1, ..., L do
          f = σ(U[l, i, j + d] + vf[j] * c + bf[j])
          c = f * c + (1 - f) * U[l, i, j]
          r = σ(U[l, i, j + d*2] + vr[j] * c + br[j])
          h = r * c + (1 - r) * x[l, i, j]
          c[l, i, j] = c
          h[l, i, j] = h
  return h[., ., .] and c[., ., .]

3.2 Initialization

Proper parameter initialization can reduce gradient propagation difficulties and hence have a positive impact on the final performance. We now describe an initialization strategy tailored for SRU.

We start by adopting common initializations derived for feed-forward networks (Glorot and Bengio, 2010; He et al., 2015). The weights of parameter matrices are drawn with zero mean and 1/d variance, for instance, via the uniform distribution [-√(3/d), +√(3/d)]. This ensures the output variance remains approximately the same as the input variance after the matrix multiplication.

However, the light recurrence and highway computation would still reduce the variance of hidden representations by a factor of 1/3 to 1/2:

    1/3 ≤ Var[h_t] / Var[x_t] ≤ 1/2,

and the factor converges to 1/2 in deeper layers (see Appendix A). This implies the output h_t and the gradient would vanish in deep models. To offset the problem, we introduce a scaling correction constant α in the highway connection

    h_t = r_t ⊙ c_t + (1 - r_t) ⊙ x_t · α,

where α is set to √3 such that Var[h_t] ≈ Var[x_t] at initialization. When the highway network is initialized with a non-zero bias b_r = b, the scaling constant α can be accordingly set as:

    α = √(1 + exp(b) · 2).

Figure 2 compares the training progress with and without the scaling correction. See Appendix A for the derivation and more discussion.
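To make the two-step reorganization concrete, here is a minimal PyTorch-style sketch under our own assumptions; it mirrors Algorithm 1 in plain Python rather than as a fused CUDA kernel, so the per-dimension parallelism is left to the tensor operations, and the names and shapes are illustrative. Step 1 performs the single batched multiplication producing U of size (L, B, 3d); step 2 is the cheap element-wise recurrence, shown with the scaling correction α applied in the highway connection.

```python
import math
import torch

def batched_sru_forward(x, W_all, v_f, v_r, b_f, b_r, c0, alpha=math.sqrt(3)):
    """x: (L, B, d); W_all: (3d, d) stacking [W; W_f; W_r]; v_*, b_*: (d,); c0: (B, d)."""
    L, B, d = x.shape
    # Step 1: batch the matrix multiplications over all time steps -> U of size (L, B, 3d).
    U = x @ W_all.T
    Wx, Ufx, Urx = U.split(d, dim=-1)             # blocks for W x_t, W_f x_t, W_r x_t
    # Step 2: element-wise recurrence, O(L * B * d); the fused CUDA kernel additionally
    # parallelizes this loop body across examples and hidden dimensions.
    c, hs = c0, []
    for t in range(L):
        f = torch.sigmoid(Ufx[t] + v_f * c + b_f)   # Eq. (1)
        r = torch.sigmoid(Urx[t] + v_r * c + b_r)   # Eq. (3)
        c = f * c + (1 - f) * Wx[t]                 # Eq. (2)
        h = r * c + (1 - r) * x[t] * alpha          # Eq. (4) with scaling correction
        hs.append(h)
    return torch.stack(hs), c
```

Under the initialization described above, W_all would be drawn uniformly from [-√(3/d), +√(3/d)], e.g. W_all.uniform_(-math.sqrt(3/d), math.sqrt(3/d)) in this illustrative setup.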
Figure 2: Training curves of SRU on classification, for 5-layer and 20-layer models with and without the scaling correction. The x-axis is the number of training steps and the y-axis is the training loss. Scaling correction improves the training progress, especially for deeper models with many stacked layers.

4 Experiments

We evaluate SRU on several natural language processing tasks and perform additional analyses of the model. The set of tasks includes text classification, question answering, machine translation, and character-level language modeling. Training time on these benchmarks ranges from minutes (classification) to days (translation), providing a variety of computation challenges.

Model                      | Size | CR       | SUBJ     | MR       | TREC     | MPQA     | SST      | Time
Best reported results:
Wang and Manning (2013)    | -    | 82.1     | 93.6     | 79.1     | -        | 86.3     | -        | -
Kalchbrenner et al. (2014) | -    | -        | -        | -        | 93.0     | -        | 86.8     | -
Kim (2014)                 | -    | 85.0     | 93.4     | 81.5     | 93.6     | 89.6     | 88.1     | -
Zhang and Wallace (2017)   | -    | 84.7     | 93.7     | 81.7     | 91.6     | 89.6     | 85.5     | -
Zhao et al. (2015)         | -    | 86.3     | 95.5     | 83.1     | 92.4     | 93.3     | -        | -
Our setup (default Adam, fixed word embeddings):
CNN                        | 360k | 83.1±1.6 | 92.7±0.9 | 78.9±1.3 | 93.2±0.8 | 89.2±0.8 | 85.1±0.6 | 417
LSTM                       | 352k | 82.7±1.9 | 92.6±0.8 | 79.8±1.3 | 93.4±0.9 | 89.4±0.7 | 88.1±0.8 | 2409
QRNN (k=1)                 | 165k | 83.5±1.9 | 93.4±0.6 | 82.0±1.0 | 92.5±0.5 | 90.2±0.7 | 88.2±0.4 | 345
QRNN (k=1) + highway       | 204k | 84.0±1.9 | 93.4±0.8 | 82.1±1.2 | 93.2±0.6 | 89.6±1.2 | 88.9±0.2 | 371
SRU (2 layers)             | 204k | 84.9±1.6 | 93.5±0.6 | 82.3±1.2 | 94.0±0.5 | 90.1±0.7 | 89.2±0.3 | 320
SRU (4 layers)             | 303k | 85.9±1.5 | 93.8±0.6 | 82.9±1.0 | 94.8±0.5 | 90.1±0.6 | 89.6±0.5 | 510
SRU (8 layers)             | 502k | 86.4±1.7 | 93.7±0.6 | 83.1±1.0 | 94.7±0.5 | 90.2±0.8 | 88.9±0.6 | 879

Table 1: Test accuracies on classification benchmarks (Section 4.1). The first block presents best reported results of various methods. The second block compares SRU and other baselines given the same setup. For the SST dataset, we report average results of 5 runs. For other datasets, we perform 3 independent trials of 10-fold cross validation (3x10 runs). The last column compares the wall clock time (in seconds) to finish 100 epochs on the SST dataset.

The main question we study is the performance-speed trade-off SRU provides in comparison to other architectures. We stack multiple layers of SRU to directly substitute other recurrent, convolutional or feed-forward modules. We minimize hyper-parameter tuning and architecture