
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 4, APRIL 2015

State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition

Pan Zhou, Hui Jiang, Senior Member, IEEE, Li-Rong Dai, Yu Hu, and Qing-Feng Liu

Abstract—The hybrid deep neural network (DNN) and hidden Markov model (HMM) has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its learning process is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition, where the posterior probabilities of HMM states are computed from multiple DNNs (mDNN), instead of a single large DNN, for the purpose of parallel training towards faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters based on data-driven methods. Next, several hierarchically structured DNNs are trained separately in parallel for these clusters using multiple computing units (e.g., GPUs). In decoding, the posterior probabilities of HMM states can be calculated by combining outputs from multiple DNNs. In this work, we have shown that the training procedure of the mDNN under popular criteria, including both frame-level cross-entropy and sequence-level discriminative training, can be parallelized efficiently to yield significant speedup. The training speedup is mainly attributed to the fact that multiple DNNs are parallelized over multiple GPUs and each DNN is smaller in size and trained by only a subset of training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to the conventional DNN, a 4-cluster mDNN model with similar size can yield comparable recognition performance on Switchboard (only about 2% performance degradation) with a greater than 7 times speed improvement in CE training and a 2.9 times improvement in sequence training, when 4 GPUs are used.

Index Terms—Cross-entropy training, data partition, deep neural networks (DNN), model parallelism, multiple DNNs (mDNN), parallel training, sequence training, speech recognition, state clustering.

Manuscript received November 27, 2013; revised August 09, 2014; accepted January 12, 2015. Date of publication January 15, 2015; date of current version March 06, 2015. This work was supported in part by the National Nature Science Foundation of China under Grant 61273264 and in part by the National 973 Program of China under Grant 2012CB326405. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mei-Yuh Hwang. P. Zhou, L.-R. Dai, Y. Hu, and Q.-F. Liu are with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China. H. Jiang is with the Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON M3J 1P3, Canada (e-mail: hj@cse.yorku.ca). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier 10.1109/TASLP.2015.2392944. 2329-9290 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

I. INTRODUCTION

State-of-the-art automatic speech recognition (ASR) systems rely on acoustic models of context-dependent phones in a language, each of which is usually modeled by a 3-state hidden Markov model (HMM) to handle the temporal variability of speech signals. Throughout decades of research investigation, and until recently, the Gaussian mixture model (GMM) has normally been regarded as the dominant model for the probability density function of acoustic observations in each HMM state. In the literature, many successful training methods have been developed for GMM-HMMs, including maximum likelihood estimation (MLE) using the expectation-maximization (EM) algorithm as well as a number of discriminative training methods, as summarized in [1]. The GMM is quite suitable for modeling the probability distribution of acoustic feature vectors because a GMM can approximate any probabilistic distribution as long as a sufficient number of Gaussian components is used. However, one major limitation of the GMM framework is modeling highly correlated feature vectors, since this requires estimating full covariance matrices, which may be ill-conditioned due to strong correlation among some feature dimensions. On the other hand, artificial neural networks (ANN) are a powerful discriminative model for mapping speech feature vectors to HMM states, by directly modeling the classification surface in feature space. Since the 1990s, the so-called multi-layer perceptron (MLP) has been used as an alternative to the GMM to compute posterior probabilities of mono-phone HMM states based on a fixed number of speech frames within a long context window [2]. These probabilities may be directly used by HMMs to decode input speech utterances for recognition. At that time, however, the computing hardware was not adequate to learn deeper neural networks with more hidden layers from big data sets. As a result, the performance of neural networks as acoustic models in ASR was not good enough to compete with GMMs. The main application of neural networks at that time was to extract features, as in tandem [3] or bottleneck [4] configurations. More recently, as triggered by some generative pre-training strategies that are used to initialize neural networks prior to the standard supervised training procedure [5], neural networks have revived, under the name of deep learning, as a very competitive acoustic model for ASR [6], [7], [8], [9], where the so-called deep neural networks (DNN) are used to calculate scaled likelihoods directly for all context-dependent triphone HMM tied-states. Dramatic performance improvements have been reported in a number of challenging, large-vocabulary

ASR tasks; see [10], [11], [12]. The performance gain is mainly attributed to the deeper structure due to more hidden layers as well as a wider configuration per layer (many more nodes are used for the hidden and output layers). The deep structure enables a DNN to deal with variations in speech signals in a more reliable way. According to [13], speech features learned in the upper layers become more invariant, which may lead to better model generalization and eventually improved recognition performance. Moreover, recent advances in computing hardware, particularly the general-purpose GPU computing framework, make it possible to train these large DNNs on a huge amount of training data. Accordingly, DNNs have attracted massive research interest in the past few years [14], [15], [16], [17], [18], [19], [20]. In these works, DNNs are normally initialized by random initialization or layer-by-layer pre-training. Then DNNs are typically trained with the concatenated speech features as well as the corresponding HMM state labels based on the frame-level cross-entropy (CE) training criterion. In this stage, the well-known back-propagation (BP) method is used to fine-tune DNNs according to the stochastic gradient descent (SGD) algorithm. Furthermore, since speech recognition is a sequence classification problem in nature, it is well known that GMM-HMM based speech recognizers typically obtain notable performance gains after adjusting parameters with sequence-level discriminative training criteria [1], such as maximum mutual information (MMI) [21], minimum phone error (MPE) [22], minimum Bayes risk (MBR) [23], or large margin estimation (LME) [24]. Similarly, as shown in [25], [26], it may yield further improvement (about 10-15% relative error reduction) if DNN parameters are refined based on a sequence-level discriminative criterion closely related to sequence-level decision making in speech recognition.

However, regardless of training criteria, training DNNs is always a very slow and time-consuming process, especially from big data sets. Even though the process has been remarkably accelerated by modern GPUs, it is still not possible to achieve a reasonable turnaround time in DNN training. For example, it normally takes a very long time to train a typical six-hidden-layer DNN from thousands of hours of speech data [27]. The underlying reason is that the basic DNN learning algorithm, i.e., stochastic gradient descent (SGD), is relatively slow in convergence, and even worse, it is difficult to parallelize SGD because it is inherently a serial learning method. In recent years, researchers have been pursuing more efficient DNN training methods. The first idea is to simplify the model structure by exploring sparseness in DNN models. As reported in [28], there is almost no performance loss when zeroing 80% of the small weights in a large DNN model. This method is good for reducing total DNN model size, but it gives no gain in training speed due to the highly random memory accesses introduced by sparse matrices. Along this line, as in [29], [30], it is proposed to factorize each weight matrix in a DNN into a product of two smaller matrices, which is reported to achieve about 30-50% speedup in DNN computation. A more recent work in [31] uses a shrinking-hidden-layers structure to simplify the DNN model, producing up to a 50% speedup in DNN computation. On the other hand, a more effective way to speed up DNN training is to parallelize it using multiple CPUs or GPUs. The most straightforward way for parallelization is the so-called "data partition" strategy, i.e., splitting all training data into disjoint subsets that are processed by different computing units independently, with all individual results combined afterwards by a merger. Recently, some methods have been proposed to conduct parallel training of ANNs by taking advantage of a number of CPUs in a large computing cluster. In [32], [33], parallelization is achieved using multiple computers based on the server-client mode, where training data is split into many bunch-sized training patterns for several client computers that learn and update weight matrices, and all clients and servers communicate through computer networks synchronously or asynchronously. Similarly, in [34], the training set is divided into N disjoint subsets in each epoch and a separate MLP is trained on each subset. Then these sub-MLPs are combined by a merger network that is trained by another subset of data. More recently, there has been interest in conducting parallel training of DNNs using multiple GPUs. In [35], the so-called asynchronous SGD is implemented for training DNNs by using a number of GPUs. In contrast, another possible parallelization strategy is "model parallelism", i.e., splitting the model into several parts, each of which is computed by one computing unit. In [36], a pipelined implementation of back-propagation (BP) is proposed for parallel training of DNNs using multiple GPUs, where computation related to different layers of a DNN is distributed to several GPUs. The limitation of this pipelined method is how to balance the computation load among different GPUs, particularly when the output layer dominates the computation. In [37], training is scaled up to a cluster of thousands of CPUs by implementing both model parallelism and asynchronous SGD. However, no matter whether data partition or model parallelism is implemented, all of the above-mentioned parallel training methods suffer from the communication overhead problem when collecting gradients, redistributing updated model parameters over different units, and delivering model outputs into another unit. This cross-unit communication is poised to become the major performance bottleneck, especially when parallel training is scaled up to a large number of computing units.

In this paper, we propose to use cluster-based multiple deep neural networks, called mDNN, to replace a single large DNN, with parallelization of DNN training across multiple GPUs. In our method, we first automatically cluster a large training set into several subsets. Different from previous data-division methods, we use a data-driven unsupervised clustering method to group the training data so that all subsets are disjoint in terms of HMM state labels. In this way, we can independently train a smaller DNN on each subset of training data to distinguish the different states within its cluster, and another even smaller DNN to distinguish the different clusters. Therefore, this approach nicely combines both parallelism strategies of "data partition" and "model parallelism", while minimizing the communication traffic among different computing units during the training process. This method can achieve significant training speedup for some of the popular DNN training criteria in speech recognition. For example, we have constructed the mDNN for frame-level CE-based training, which can be done in parallel by using multiple GPUs with zero communication data traffic among GPUs. The proposed mDNN method can effectively take advantage of multiple GPUs to significantly reduce the total training time in many large-scale ASR tasks. Moreover, we have also extended the mDNN method to full sequence training for better recognition performance. The mDNN also helps to speed up sequence training, though to a lesser degree. In this paper, we not only investigate how sequence-level discriminative training can be applied to the mDNN modeling framework, but also implement mDNN sequence training in parallel by using multiple GPUs for faster training speed. In this work, we have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Experimental results have shown that the new method using cluster-based multiple DNNs can achieve comparable recognition performance with an over 7 times reduction in total training time for cross-entropy training when 4 GPUs are used. Furthermore, experiments on the Switchboard task have revealed that even one epoch of MMI-based sequence training can improve the CE-trained mDNN from 15.9% to 14.5% in word error rate (about 8.8% relative error reduction). Meanwhile, sequence training of mDNN can be expedited by 2.9 times by using a parallel implementation on 4 GPUs. When taking the initial CE learning into account, we can achieve an over 6 times training speedup with mDNN when 4 GPUs are used. In this paper, we consolidate and expand our previous research work scattered in various conference

papers [38], [39] for better accessibility and a more coherent exposition of the topic. Moreover, we have included additional technical details to illustrate the main steps to construct cluster-based multiple DNN acoustic models, shown its potential in accelerating DNN training by multiple GPUs for both frame-level cross-entropy training and full sequence-level discriminative training, and provided additional experimental results to investigate how different mDNN configurations may affect the final ASR performance as well as training efficiency.

In the remainder of this paper, we briefly review the standard hybrid DNN-HMM model and its training method in Section II. In Section III, we introduce the proposed cluster-based multiple DNNs modeling framework in detail, including the basic …

… observation vector. The hidden layers normally use the sigmoid activation function, denoted as $\sigma(\cdot)$, to compute the output vector $\mathbf{x}^{(l)}$ from the activation vector $\mathbf{a}^{(l)}$ in the $l$-th layer, which is in turn fed to the next layer as input. Meanwhile, the top output layer uses the softmax operation to compute posteriors for all class labels. The main equations related to DNN modeling are listed as follows:

$\mathbf{a}^{(l)} = \mathbf{W}^{(l)} \mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}$  (1)
$\mathbf{x}^{(l)} = \sigma(\mathbf{a}^{(l)})$  (2)
$\sigma(a_j^{(l)}) = 1 / (1 + e^{-a_j^{(l)}})$  (3)
$\mathbf{y} = \mathrm{softmax}(\mathbf{a}^{(L)})$  (4)
$y_s = \exp(a_s^{(L)}) / \sum_{s'} \exp(a_{s'}^{(L)})$  (5)

where $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ represent the weight matrix and bias vector for layer $l$, $\mathbf{x}^{(l)}$ and $\mathbf{a}^{(l)}$ are that layer's output and activation vectors, and $x_j^{(l)}$ and $a_j^{(l)}$ denote the $j$-th component of $\mathbf{x}^{(l)}$ and $\mathbf{a}^{(l)}$, respectively. For notational simplicity, we may expand the input vector $\mathbf{x}^{(l-1)}$ in each layer by adding an additional dimension of constant 1 to absorb the bias $\mathbf{b}^{(l)}$ as part of the weight matrix $\mathbf{W}^{(l)}$. As a result, we only consider how to learn the weight matrices $\mathbf{W}^{(l)}$ in the following sections.

A DNN is often trained with stochastic gradient descent (SGD) in a well-formed error back-propagation (BP) procedure:

$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \epsilon \, \partial \mathcal{F} / \partial \mathbf{W}^{(l)}$  (6)

where $\mathcal{F}$ is the loss function to be minimized, $\epsilon$ is the learning rate, and the gradient $\partial \mathcal{F} / \partial \mathbf{W}^{(l)}$ is what we need to calculate for each parameter update. In the following sections, we will show how to derive this gradient for several popular objective functions for DNN training in speech recognition.

Lastly, when using the DNN-HMM for decoding, the state posteriors $y_s$ generated from the DNN output layer are converted to scaled likelihoods by dividing by the state priors, as in [10], [11], and then these scaled likelihoods are used to replace the GMM likelihoods in a Viterbi decoder to search for the best path for recognition.

B. Initialization with Pre-training

It is well known that error back-propagation can easily get stuck in a poor local optimum in a deep network if the model parameters are not properly initialized. In many cases, it is important to alleviate this problem with a layer-wise pre-training algorithm. The pre-training procedure treats each pair of consecutive layers in a DNN as a restricted Boltzmann machine (RBM). An RBM is a two-layer undirected neural network that associates each configuration of visible and hidden state vectors with an energy function. The one-step contrastive divergence (CD) algorithm [5] can efficiently learn the RBM parameters.
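The feed-forward pass of Eqs. (1)-(5) and the posterior-to-scaled-likelihood conversion used in decoding can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the layer sizes, weights, and uniform priors below are made up, and the bias is absorbed into each weight matrix by appending a constant 1 to the input, as the text describes.

```python
import math

def sigmoid(a):
    # Eq. (3): element-wise logistic activation
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def softmax(a):
    # Eqs. (4)-(5): posteriors over all classes (max-shifted for stability)
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def affine(W, x):
    # Eq. (1): a = W x, with the bias absorbed into W by appending a 1 to x
    x1 = x + [1.0]
    return [sum(w * v for w, v in zip(row, x1)) for row in W]

def dnn_forward(x, hidden_weights, out_weights):
    # Eqs. (1)-(2) for each hidden layer, softmax at the top layer
    for W in hidden_weights:
        x = sigmoid(affine(W, x))
    return softmax(affine(out_weights, x))

def scaled_likelihoods(posteriors, priors):
    # Decoding: p(x|s) is proportional to p(s|x) / p(s) in the hybrid DNN-HMM
    return [p / q for p, q in zip(posteriors, priors)]

# toy sizes: 3-dim input, one 4-unit hidden layer, 2 tied states
W_h = [[0.1, -0.2, 0.3, 0.0]] * 4       # 4 x (3 inputs + bias)
W_o = [[0.5, -0.5, 0.2, 0.1, 0.0]] * 2  # 2 x (4 hidden + bias)
post = dnn_forward([1.0, 2.0, -1.0], [W_h], W_o)
print(sum(post))  # posteriors sum to 1
```

In a real hybrid system, the priors passed to `scaled_likelihoods` would be the tied-state frequencies counted from the training alignment, not the uniform values used here.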

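For the proposed mDNN itself, this excerpt only states that state posteriors are "calculated by combining outputs from multiple DNNs" over disjoint state clusters. One natural reading, sketched below purely as an assumption, is to multiply the cluster posterior from a small cluster-selection network by the within-cluster state posterior from that cluster's own network; all networks, weights, and the two-cluster layout here are hypothetical stand-ins, not the paper's configuration.

```python
import math

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def combine_mdnn_posteriors(x, cluster_net, state_nets, state_maps):
    """Hypothetical mDNN combination: P(s|x) = P(c(s)|x) * P(s|c(s), x).

    cluster_net(x)   -> posterior over the disjoint state clusters (small DNN)
    state_nets[c](x) -> posterior over the tied states inside cluster c
    state_maps[c]    -> global tied-state indices covered by cluster c
    """
    n_states = sum(len(m) for m in state_maps)
    p = [0.0] * n_states
    p_cluster = cluster_net(x)
    for c, net in enumerate(state_nets):
        within = net(x)
        for i, s in enumerate(state_maps[c]):
            p[s] = p_cluster[c] * within[i]
    return p

def linear_softmax(W):
    # toy stand-in for a trained DNN: one linear layer followed by softmax
    return lambda x: softmax([sum(w * v for w, v in zip(row, x)) for row in W])

# two hypothetical clusters of three tied states each
cluster_net = linear_softmax([[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]])
state_nets = [linear_softmax([[0.1, 0.2, 0.3]] * 3),
              linear_softmax([[-0.2, 0.0, 0.1]] * 3)]
posterior = combine_mdnn_posteriors([1.0, -2.0, 0.5],
                                    cluster_net, state_nets,
                                    [[0, 1, 2], [3, 4, 5]])
print(sum(posterior))  # sums to 1 because the clusters partition the states
```

Because the clusters partition the tied states, the combined vector still sums to one, so it could be divided by state priors and fed to a Viterbi decoder exactly like the output of a single large DNN.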