
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Song Han(1,2), Junlong Kang(2), Huizi Mao(1,2), Yiming Hu(2,3), Xin Li(2), Yubin Li(2), Dongliang Xie(2), Hong Luo(2), Song Yao(2), Yu Wang(2,3), Huazhong Yang(3) and William J. Dally(1,4)
(1) Stanford University, (2) DeePhi Tech, (3) Tsinghua University, (4) NVIDIA

ABSTRACT
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation intensive and memory intensive. Deploying such bulky models results in high power consumption and leads to a high total cost of ownership (TCO) of a data center.

To speed up the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model to multiple PEs for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE), that works directly on the sparse LSTM model.

Implemented on a Xilinx XCKU060 FPGA running at 200 MHz, ESE has a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

Keywords
Deep Learning; Speech Recognition; Model Compression; Hardware Acceleration; Software-Hardware Co-Design; FPGA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from .
FPGA '17, February 22-24, 2017, Monterey, CA, USA
(c) 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4354-1/17/02 ... $15.00. DOI: https://doi.org/10.1145/3020078.3021745

Figure 1: Proposed efficient DNN deployment flow: model compression + accelerated inference.
Figure 2: ESE optimizes LSTM computation across the algorithm, software and hardware stack.

1. INTRODUCTION
Deep neural network is becoming the state-of-the-art method for speech recognition [6, 13]. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two popular types of recurrent neural networks (RNNs) used for speech recognition. In this work, we evaluate the most complex one: LSTM [14]. A similar methodology could easily be applied to other types of recurrent neural network.

Despite its high prediction accuracy, LSTM is hard to deploy because of its high computation complexity and high memory footprint, which lead to high power consumption. A memory reference consumes more than two orders of magnitude more energy than an ALU operation, so our focus narrows down to optimizing the memory footprint.

To reduce the memory footprint, we design a novel method that optimizes across the algorithm, software and hardware stack: we first optimize the algorithm by compressing the LSTM model to 5% of its original size (10% density and 2x narrower weights) while retaining similar accuracy; then we develop a software mapping strategy to represent the compressed model in a hardware-friendly way; finally we design specialized hardware to work directly on the compressed LSTM model.

The proposed flow for efficient deep learning inference is illustrated in Fig. 1. It shows a new paradigm for efficient deep learning inference, from Training -> Inference to Training -> Compression -> Accelerated Inference, which

has the advantage of inference speed and energy efficiency compared with the conventional method. Using LSTM as a case study for the proposed paradigm, the design flow is illustrated in Fig. 2.

Figure 3: Speech recognition engine: LSTM is used in the acoustic model. LSTM takes more than 90% of the time in the whole computation pipeline.

The main contributions of this work are:
1. We present an effective model compression algorithm for LSTM, composed of pruning and quantization. We highlight our load-balance-aware pruning and an automatic flow for dynamic-precision data quantization.
2. The recurrent nature of RNN and LSTM produces complicated data dependency, which is more challenging than feedforward neural nets. We design a scheduler that can efficiently schedule the complex LSTM operations with memory references overlapped with computation.
3. The irregular computation pattern after compression poses a challenge to hardware. We design a hardware architecture that can work directly on the sparse model. ESE achieves high efficiency by load balancing and partitioning both the computation and storage. ESE also supports processing multiple speech data streams concurrently.
4. We present an in-depth study of the LSTM and speech recognition system and optimize across the algorithm, software and hardware boundary. We jointly analyze the trade-off between prediction accuracy and prediction latency.
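As a rough illustration of the load-balance-aware pruning named in the first contribution (and detailed in Section 3.2), the sketch below prunes each PE's submatrix to the same non-zero quota, so no PE idles. This is an illustrative NumPy sketch under assumed conventions (row-interleaved PE assignment, magnitude-based saliency), not the authors' implementation.

```python
import numpy as np

def load_balance_prune(w, num_pes, density):
    """Prune so every PE keeps the same number of non-zero weights
    (an equal sparsity quota per submatrix)."""
    pruned = np.zeros_like(w)
    # Assumed mapping: row r belongs to PE (r % num_pes).
    for pe in range(num_pes):
        sub = w[pe::num_pes, :]                 # this PE's submatrix
        k = int(round(density * sub.size))      # identical quota per PE
        if k == 0:
            continue
        thresh = np.sort(np.abs(sub), axis=None)[-k]  # k-th largest magnitude
        pruned[pe::num_pes, :] = sub * (np.abs(sub) >= thresh)
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
p = load_balance_prune(w, num_pes=4, density=0.1)
# Every PE now holds the same non-zero count.
counts = [np.count_nonzero(p[pe::4, :]) for pe in range(4)]
```

Running this on an 8x16 matrix with four PEs and 10% density leaves every PE with exactly three non-zeros, matching the balanced case that Fig. 5 later illustrates.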

2. BACKGROUND
Speech recognition is the process of converting speech signals to a sequence of words. As shown in Fig. 3, the speech recognition system contains a front-end and a back-end: the front-end unit extracts features from the speech signals, and the back-end processes the features and outputs the text. The back-end includes the acoustic model (AM), the language model (LM), and the decoder. Here, a Long Short-Term Memory (LSTM) recurrent neural network is used in the acoustic model.

The feature vectors extracted by the front-end unit are processed by the acoustic model; the decoder then uses both the acoustic and language models to generate the sequence of words by maximum a posteriori probability (MAP) estimation, which can be described as

    W^ = arg max_W P(W|X) = arg max_W P(X|W) P(W) / P(X)

where, for the given feature vector X = X1 X2 ... Xn, speech recognition is to find the word sequence W = W1 W2 ... Wm with maximum posterior probability P(W|X). Because X is fixed, the above equation can be rewritten as

    W^ = arg max_W P(X|W) P(W)

where P(X|W) and P(W) are the probabilities computed by the acoustic and language models respectively in Fig. 3 [20].

In modern speech recognition systems, the LSTM architecture is often used in large-scale acoustic modeling and for computing the acoustic output probabilities. In the speech recognition pipeline, LSTM is the most computation and memory intensive part, so we focus on accelerating the LSTM.

Figure 4: Data flow of the LSTM model.

The LSTM architecture is shown in Fig. 4, which is the same as the standard LSTM implementation [19]. LSTM is one type of RNN, where the input at time T depends on the output at time T-1. Compared to the traditional RNN, LSTM contains special memory blocks in the recurrent hidden layer. The memory cells with self-connections in the memory blocks can store the temporal state of the network. The memory blocks also contain special multiplicative units called gates: the input gate, output gate and forget gate. As in Fig. 4, the input gate i controls the flow of input activations into the memory cell. The output gate o controls the output flow into the rest of the network. The forget gate f scales the internal state of the cell before adding it as input to the cell, which can adaptively forget the cell's memory.

An LSTM network accepts an input sequence x = (x_1, ..., x_T) and computes an output sequence y = (y_1, ..., y_T) by applying the following equations iteratively from t = 1 to T:

    i_t = σ(W_ix x_t + W_ir y_{t-1} + W_ic c_{t-1} + b_i)    (1)
    f_t = σ(W_fx x_t + W_fr y_{t-1} + W_fc c_{t-1} + b_f)    (2)
    g_t = σ(W_cx x_t + W_cr y_{t-1} + b_c)                   (3)
    c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t                          (4)
    o_t = σ(W_ox x_t + W_or y_{t-1} + W_oc c_t + b_o)        (5)
    m_t = o_t ⊙ h(c_t)                                       (6)
    y_t = W_ym m_t                                           (7)

Here the ⊙ operator denotes element-wise multiplication, the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input to the input gate), and W_ic, W_fc, W_oc are diagonal weight matrices for peephole connections. The b terms denote bias vectors, while σ is the logistic sigmoid function.
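One time step of Eqs. (1)-(7) can be written out directly. The following NumPy sketch is illustrative only: the dimensions are made up, h is taken as tanh, and the diagonal peephole matrices W_ic, W_fc, W_oc are stored as vectors and applied element-wise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(7). Peephole weights are
    diagonal matrices, so they are kept as vectors (element-wise product)."""
    i = sigmoid(W['ix'] @ x_t + W['ir'] @ y_prev + W['ic'] * c_prev + b['i'])  # (1)
    f = sigmoid(W['fx'] @ x_t + W['fr'] @ y_prev + W['fc'] * c_prev + b['f'])  # (2)
    g = sigmoid(W['cx'] @ x_t + W['cr'] @ y_prev + b['c'])                     # (3)
    c = f * c_prev + g * i                                                     # (4)
    o = sigmoid(W['ox'] @ x_t + W['or'] @ y_prev + W['oc'] * c + b['o'])       # (5)
    m = o * np.tanh(c)                                    # (6), h assumed tanh
    y = W['ym'] @ m                                                            # (7)
    return y, c

# Tiny example: hidden size 4, input size 3 (shapes are illustrative).
rng = np.random.default_rng(0)
n, d = 4, 3
W = {k: rng.normal(size=(n, d)) for k in ('ix', 'fx', 'cx', 'ox')}
W.update({k: rng.normal(size=(n, n)) for k in ('ir', 'fr', 'cr', 'or', 'ym')})
W.update({k: rng.normal(size=n) for k in ('ic', 'fc', 'oc')})
b = {k: np.zeros(n) for k in 'ifco'}
y, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, b)
```

Iterating `lstm_step` over t = 1..T, feeding each step's y and c back in, yields the full sequence computation.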

The symbols i, f, o, c and m are respectively the input gate, forget gate, output gate, cell activation and cell output activation vectors, all of which are the same size. The symbols g and h are the cell input and cell output activation functions.

3. MODEL COMPRESSION
It has been widely observed that deep neural networks usually have a lot of redundancy [11, 12]. Getting rid of this redundancy doesn't hurt prediction accuracy. From the hardware perspective, model compression is critical for saving computation as well as memory footprint, which means lower latency and better energy efficiency. We discuss the two steps of model compression, pruning and quantization, in the next three subsections.

Figure 5: Load-balance-aware pruning and its benefit for parallel processing. (Without load balancing, the four PEs need 5, 2, 4 and 1 cycles, for an overall latency of 5 cycles; with load balancing, each PE needs 3 cycles, for an overall latency of 3 cycles.)

3.1 Pruning
In the pruning phase we first train the model to learn which weights are necessary, then prune away the weights that do not contribute to the prediction accuracy; finally, we retrain the model under the sparsity constraint. The process is the same as [12]. In step two, the saliency of a weight is determined by its absolute value: if the weight's absolute value is smaller than a threshold, we prune it away. The pruning threshold is empirical: pruning too much hurts accuracy, while pruning at the right level doesn't.

Our pruning experiments are performed with the Kaldi speech recognition toolkit [17]. The trade-off curve between the percentage of parameters pruned away and the phone error rate (PER) is shown in Fig. 6. The LSTM is evaluated on the TIMIT dataset [8]. Not until we pruned away more than 93% of the parameters did the PER begin to increase dramatically. We further experimented on a proprietary dataset that is much larger: it has 1000 hours of training speech data, 100 hours of validation speech data, and 10 hours of test speech data. We find that we can prune away 90% of the parameters without hurting the word error rate (WER), which aligns with our result on the TIMIT dataset. In our later discussions, we use 10% density (90% sparsity).

Figure 6: Accuracy curve of load-balance-aware pruning and original pruning (phone error rate, 19.0%-28.0%, versus percentage of parameters pruned away, 0%-100%).

3.2 Load-Balance-Aware Pruning
On top of the basic deep compression method, we highlight our practical design consideration for hardware efficiency. To execute sparse matrix multiplication in parallel, we propose the load-balance-aware pruning method, which is critical for better load balancing and higher utilization of the hardware.

Pruning can lead to an unbalanced distribution of non-zero weights. The resulting workload imbalance over the PEs may cause a gap between the real performance and the peak performance. This problem is further addressed in Section 4. Load-balance-aware pruning is designed to solve this problem and obtain a hardware-friendly sparse network that has the same sparsity ratio in all submatrices. During pruning, we make efforts to avoid the scenario where the density of one submatrix is 5% while another's is 15%. Although the overall density is about 10%, the submatrix with a density of 5% has to wait for the one with more computation, which leads to idle cycles. Load-balance-aware pruning assigns the same sparsity quota to all submatrices, thus ensuring an even distribution of non-zero weights.

As illustrated in Fig. 5, the matrix is divided into four colors, and each color belongs to one PE for parallel processing. With conventional pruning, PE0 might have five non-zero weights while PE3 may have only one. The total processing time is set by the longest one, which is five cycles. With load-balance-aware pruning, all PEs have three non-zero weights, so only three cycles are necessary to carry out the operation. Both cases have the same number of non-zero weights in total, but load-balance-aware pruning needs fewer cycles.

The difference in prediction accuracy with and without load-balance-aware pruning is very small, as shown in Fig. 6. There is some noise around 70% sparsity, so we ran more experiments around 90% sparsity, which is the sweet spot, and found the performance to be very similar. To show that load-balance-aware pruning still obtains comparable prediction accuracy, we compare it with the original pruning on the TIMIT dataset. As demonstrated in Fig. 6, the accuracy margin between the two methods is within the variance of the pruning process itself.

3.3 Weight and Activation Quantization
We further compress the model by quantizing the 32-bit floating-point weights into 12-bit integers, using a linear quantization strategy on both the weights and the activations. In the weight quantization phase, the dynamic ranges of the weights for all matrices in each LSTM layer are analyzed first; then the length of the fractional part is initialized to avoid data overflow. The activation quantization phase aims to figure out the

optimal solution for the activation functions and the intermediate results. We build lookup tables and use linear interpolation for the activation functions, such as sigmoid and tanh, and analyze the dynamic range of their inputs to decide the sampling strategies. We also investigated how many bits are enough for the intermediate results of the matrix operations.

We explore different data quantization strategies for the weights with a real network trained on the TIMIT corpus. The sparsity of the LSTM layers after the pruning and fine-tuning procedure is about 88.8% (i.e. a density of 11.2%). Performing the weight and activation quantization, we can achieve 12-bit quantization without any accuracy loss. The data quantization strategies are shown in Tables 1, 2 and 3. For the lookup tables of the activation functions sigmoid and tanh, the sampling ranges are [-64, 64] and [-128, 128] respectively, both use 2048 sampling points, and the outputs are 16-bit with 15-bit decimals. All the results are obtained under the Kaldi framework. For TIMIT, as shown in Table 4, the PER is 20.4% for the original network, and changes to 20.7% after the pruning and fine-tuning procedure when 32-bit floating-point numbers are used.

Table 1: Weight quantization under different bits. (1) (2)
Table 2: Activation function lookup table.
Table 3: Other activation quantization.
Table 4: PER before and after compression.
(1) Only weights in LSTM layers are quantized.
(2) In Kaldi, Wcx, Wix, Wfx and Wox are saved together as W_gifo_x, and likewise for W_gifo_r.

We design a data flow scheduler to make full use of the hardware accelerator. Data is divided into n blocks by row, where n is the number of PEs in one channel of our hardware accelerator. The first n rows are put into n different PEs; the (n+1)-th row is put into the first PE again. This ensures that the first part of the matrix is read in the first reading cycle and can be used in the next computation step immediately.

Because of the sparsity of the pruned matrices, we only store the nonzero numbers of the weight matrices, to save memory. We use a relative row index and a column pointer to store the sparse matrix. The relative row index for each weight gives its relative position from the last nonzero weight. The column pointer indicates where a new column begins in the matrix; the accelerator reads the weights according to the column pointer.

Considering the byte-aligned bit-width limitation of DDR, we use 16-bit data to store each weight: the quantized weight and the relative row index are put together (i.e. 12 bits for the quantized weight and 4 bits for the relative row index). Fig. 7 shows an example of the compressed sparse column (CSC) storage format and the zero-padding method.
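A minimal sketch of the 16-bit packing just described: a 12-bit quantized weight combined with a 4-bit relative row index, padding in a zero weight whenever a gap exceeds the 4-bit range. The bit order, fixed-point scale, and padding convention here are assumptions for illustration; the paper's Fig. 7 defines the actual format.

```python
import numpy as np

def encode_csc_column(col, frac_bits=8):
    """Encode one sparse column as 16-bit entries: a 12-bit quantized
    weight packed with a 4-bit relative row index (the number of zeros
    since the previous stored entry). When a gap exceeds 15, a zero
    weight is padded in so the 4-bit index never overflows."""
    entries = []
    last_row = -1
    for row in np.flatnonzero(col):
        gap = int(row) - last_row - 1
        while gap > 15:               # zero-padding method for long gaps
            entries.append((0, 15))   # the padded zero occupies one slot
            gap -= 16
        q = int(round(float(col[row]) * (1 << frac_bits)))   # fixed-point
        q = max(-2048, min(2047, q)) & 0xFFF   # clamp to signed 12 bits
        entries.append((q, gap))
        last_row = int(row)
    # Assumed layout: 12-bit weight in the upper bits, 4-bit index below.
    return [np.uint16((w << 4) | idx) for w, idx in entries]

col = np.zeros(40)
col[2], col[25] = 0.5, -0.25
packed = encode_csc_column(col)   # 3 entries: two weights plus one pad
```

Decoding walks the entries, advancing the row position by (index + 1) per entry, which recovers rows 2 and 25 here with one padded zero at row 18.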
