




Graduation Project (Literature Translation)

Speech Recognition

In computer technology, speech recognition refers to the recognition of human speech by computers for the performance of speaker-initiated, computer-generated functions (e.g., transcribing speech to text; data entry; operating electronic and mechanical devices; automated processing of telephone calls), a main element of so-called natural language processing through computer speech technology. Speech derives from sounds created by the human articulatory system, including the lungs, vocal cords, and tongue. Through exposure to variations in speech patterns during infancy, a child learns to recognize the same words or phrases despite different modes of pronunciation by different people, for example pronunciation differing in pitch, tone, emphasis, or intonation pattern. The cognitive ability of the brain enables humans to achieve that remarkable capability. As of this writing (2008), we can reproduce that capability in computers only to a limited degree, but in many ways that are still useful.

The Challenge of Speech Recognition

Writing systems are ancient, going back as far as the Sumerians of 6,000 years ago. The phonograph, which allowed the analog recording and playback of speech, dates to 1877. Speech recognition, however, had to await the development of the computer, owing to multifarious problems with the recognition of speech.

First, speech is not simply spoken text, in the same way that Miles Davis playing "So What" can hardly be captured by a note-for-note rendition as sheet music. What humans understand as discrete words, phrases, or sentences with clear boundaries are actually delivered as a continuous stream of sounds: "iwenttothestoreyesterday" rather than "I went to the store yesterday." Words can also blend together, with "whaddayawant?" representing "What do you want?"

Second, there is no one-to-one correlation between sounds and letters. In English there are slightly more than five vowel letters: a, e, i, o, u, and sometimes y and w. There are more than twenty different vowel sounds, though, and the exact count can vary depending on the accent of the speaker. The reverse problem also occurs, where more than one letter can represent a given sound: the letter c can have the same sound as the letter k, as in "cake," or as the letter s, as in "citrus."

In addition, people who speak the same language do not use the same sounds; that is, languages vary in their phonology, or patterns of sound organization. There are different accents: the word "water" could be pronounced "watter," "wadder," "woader," "wattah," and so on. Each person has a distinctive pitch when speaking: men typically have the lowest pitch, while women and children have a higher pitch, though there is wide variation and overlap within each group. Pronunciation is also colored by adjacent sounds, by the speed at which the speaker is talking, and even by the speaker's health; consider how pronunciation changes when a person has a cold.

Lastly, not all sounds consist of meaningful speech. Regular speech is filled with interjections that have no meaning in themselves but serve to break up discourse and convey subtle information about the speaker's feelings or intentions: "oh," "like," "you know," "well." There are also sounds that are part of speech but are not considered words: "er," "um," "uh." Coughing, sneezing, laughing, sobbing, and even hiccupping can be part of what is spoken. And the environment adds its own noises; speech recognition is difficult even for humans in noisy places.

[Figure: waveform of "I went to the store yesterday"]
[Figure: spectrogram of "I went to the store yesterday"]

History of Speech Recognition

Despite the manifold difficulties, speech recognition has been attempted for almost as long as there have been digital computers. As early as 1952, researchers at Bell Labs had developed an automatic digit recognizer named Audrey. Audrey attained an accuracy of 97 to 99 percent, provided the speaker was male, paused 350 milliseconds between words, limited his vocabulary to the digits from one to nine plus "oh," and allowed the machine to be adjusted to his speech profile. Results dipped as low as 60 percent if the recognizer was not adjusted. Audrey worked by recognizing phonemes, individual sounds considered distinct from one another, which were matched against reference models of phonemes generated by training the recognizer.

Over the next two decades, researchers spent large amounts of time and money trying to improve on this concept, with little success. Computer hardware improved by leaps and bounds, speech synthesis improved steadily, and Noam Chomsky's idea of generative grammar suggested that language could be analyzed programmatically. None of this, however, seemed to improve the state of the art in speech recognition. Chomsky and Halle's generative work in phonology also led mainstream linguistics to abandon the concept of the phoneme altogether, in favor of breaking the sound patterns of language down into smaller, more discrete features.

In 1969, John R. Pierce wrote a forthright letter to the Journal of the Acoustical Society of America, where much of the research on speech recognition was published. Pierce was one of the pioneers of satellite communications and an executive vice president at Bell Labs, which was a leader in speech recognition research. Pierce said everyone involved was wasting time and money: "It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. ... The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn't attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamor." Pierce's 1969 letter marked the end of official speech recognition research at Bell Labs for nearly a decade.

The defense research agency ARPA, however, chose to persevere. In 1971 it sponsored a research initiative to develop a speech recognizer that could handle at least 1,000 words and understand connected speech, that is, speech without clear pauses between words. The recognizer was allowed to assume a low-background-noise environment, and it did not need to work in real time. By 1976, three contractors had developed six systems. The most successful, developed by Carnegie Mellon University, was called Harpy. Harpy was slow: a four-second sentence took more than five minutes to process. It also required speakers to train it by reading sentences to build up a reference model. Nonetheless, it did recognize a thousand-word vocabulary, and it did support connected speech.

Research continued along several paths, but Harpy became the model for future success. It used hidden Markov models and statistical modeling to extract meaning from speech. In essence, speech was broken up into small overlapping chunks of sound, probabilistic models inferred the most likely words or parts of words in each chunk, and the same model was then applied to the aggregate of the overlapping chunks. The procedure is computationally intensive, but it has proven the most successful.

Research continued throughout the 1970s and 1980s, and by the 1980s most researchers were using hidden Markov models, which underlie all contemporary speech recognizers. In the late 1980s and the 1990s, DARPA (the renamed ARPA) funded several initiatives. The first was similar to the earlier ARPA challenge: the requirement was still a one-thousand-word vocabulary, but this time a rigorous performance standard was devised. That initiative produced systems that lowered the word error rate from ten percent to a few percent. Later initiatives focused on improving algorithms and computational efficiency.

In 2001, Microsoft released a speech recognition system that worked with Office XP. It neatly encapsulated how far the technology had come in fifty years, and what its limitations still were. The system had to be trained to a specific user's voice, using provided works of great authors such as Edgar Allan Poe's "The Fall of the House of Usher" and Bill Gates's "The Road Ahead." Even after training, the system was fragile enough to ship with a warning: "If you change the room in which you use Microsoft speech recognition and your accuracy drops, run the Microphone Wizard again." On the plus side, the system did work in real time, and it did recognize connected speech.

Speech Recognition Today

Technology

Current speech recognition technologies work by mathematically analyzing, through resonance and spectrum analysis, the sound waves formed by our voices. A computer system first records the sound waves spoken into a microphone through an analog-to-digital converter. The analog, continuous sound wave we produce when we say a word is sliced into small time fragments, and these fragments are then measured by their amplitude levels, the level of compression of the air released from the speaker's mouth. To measure the amplitudes and convert a sound wave to digital format, the industry has commonly relied on the Nyquist-Shannon theorem.

The Nyquist-Shannon Theorem

The Nyquist-Shannon theorem, developed in 1928, shows that a given analog frequency is most accurately recreated when it is sampled at a digital frequency twice the original analog frequency. Nyquist proved that this is so because an audible frequency must be sampled at least once for compression and once for rarefaction. For example, a 20 kHz audio signal can be accurately represented by digital samples taken at 44.1 kHz. A short sketch of this sampling-and-framing front end follows.
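As a rough illustration of this front end, here is a minimal Python sketch, not part of the original article, that slices a digitized waveform into small overlapping time fragments and measures each fragment's amplitude level; the 25 ms frame and 10 ms hop sizes are illustrative assumptions, not values from the text.

```python
import numpy as np

SAMPLE_RATE = 44100        # Hz; per Nyquist-Shannon, enough for audio up to ~20 kHz
FRAME_MS, HOP_MS = 25, 10  # assumed frame/hop sizes, chosen for illustration

def frame_signal(signal: np.ndarray, sample_rate: int = SAMPLE_RATE,
                 frame_ms: int = FRAME_MS, hop_ms: int = HOP_MS) -> np.ndarray:
    """Slice a 1-D waveform into overlapping time fragments (frames)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def frame_amplitudes(frames: np.ndarray) -> np.ndarray:
    """Measure each fragment by its root-mean-square amplitude level."""
    return np.sqrt((frames ** 2).mean(axis=1))

if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone stands in for microphone input.
    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
    wave = np.sin(2 * np.pi * 440 * t)
    frames = frame_signal(wave)
    print(frames.shape, frame_amplitudes(frames)[:5])
```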
How It Works

Speech recognition programs commonly use statistical models to account for variations in dialect, accent, background noise, and pronunciation. These models have progressed to such an extent that, in a quiet environment, accuracy of over 90% can be achieved. While every company has its own proprietary technology for processing spoken input, there are four common themes in how speech is recognized.

1. Template-based: This model uses a database of speech patterns built into the program. After voice input is received, the recognizer works by matching the input against the database, using dynamic programming algorithms to do so. The downfall of this type of speech recognition is that the model is not flexible enough to understand voice patterns unlike those in its database. A sketch of the dynamic programming idea appears below.
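The article names dynamic programming as the matching tool but gives no algorithm. Dynamic time warping (DTW) is the classic dynamic programming method for template matching, so the sketch below is a hedged illustration of that idea, with toy one-dimensional feature sequences standing in for real per-frame speech features.

```python
import numpy as np

def dtw_distance(template: np.ndarray, query: np.ndarray) -> float:
    """Dynamic-time-warping distance between two feature sequences.

    Classic dynamic programming recurrence: each cell holds the cheapest
    cost of aligning prefixes, allowing stretches and compressions in time.
    """
    n, m = len(template), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - query[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the template
                                 cost[i, j - 1],      # stretch the query
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])

# Toy "database" of templates; real systems store per-frame feature vectors.
templates = {
    "yes": np.array([1.0, 3.0, 4.0, 2.0]),
    "no":  np.array([4.0, 2.0, 1.0, 1.0]),
}
spoken = np.array([1.0, 2.9, 4.2, 4.0, 2.1])  # hypothetical input features
best = min(templates, key=lambda w: dtw_distance(templates[w], spoken))
print(best)  # the template with the cheapest alignment wins: "yes"
```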
2. Knowledge-based: Knowledge-based speech recognition analyzes the spectrograms of the speech to gather data and create rules that return values equaling the commands or words the user said. Knowledge-based recognition does not make use of linguistic or phonetic knowledge about speech.

3. Stochastic: Stochastic speech recognition is the most common today. Stochastic methods of voice analysis use probability models to represent the uncertainty of the spoken input. The most popular probability model is the hidden Markov model (HMM), used as shown below:

    W* = argmax_W P(W | Y_T) = argmax_W P(Y_T | W) P(W)

where Y_T is the observed acoustic data, P(W) is the a priori probability of a particular word string, P(Y_T | W) is the probability of the observed acoustic data given the acoustic models, and W is the hypothesized word string. In analyzing spoken input, the HMM has proven successful because the algorithm takes into account a language model, an acoustic model of how humans speak, and a lexicon of known words. A toy decoder along these lines is sketched below.
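Here is a minimal decoder in the spirit of the equation above, under assumptions of my own: each word gets a toy two-state, discrete-observation HMM, the forward algorithm supplies P(Y_T | W), and a unigram prior supplies P(W). Real recognizers use continuous acoustic features, phoneme-level models, and far richer language models.

```python
import numpy as np

def forward_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(observation sequence | word HMM)."""
    alpha = start * emit[:, obs[0]]           # initialize with first symbol
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # propagate states, then emit
    return alpha.sum()

# Toy two-state HMMs over a 3-symbol acoustic alphabet (all numbers assumed):
# (start probabilities, transition matrix, emission matrix) per word.
word_models = {
    "yes": (np.array([1.0, 0.0]),
            np.array([[0.6, 0.4], [0.0, 1.0]]),
            np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])),
    "no":  (np.array([1.0, 0.0]),
            np.array([[0.5, 0.5], [0.0, 1.0]]),
            np.array([[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]])),
}
prior = {"yes": 0.5, "no": 0.5}               # P(W): unigram language model

observed = [0, 0, 2, 2]                       # quantized acoustic data Y_T
scores = {w: prior[w] * forward_likelihood(observed, *m)
          for w, m in word_models.items()}
print(max(scores, key=scores.get), scores)    # argmax_W P(Y_T|W) P(W) -> "yes"
```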
4. Connectionist: In connectionist speech recognition, knowledge about a spoken input is gained by analyzing the input and storing it in a variety of network forms, ranging from simple multi-layer perceptrons to time-delay neural nets to recurrent neural nets (a minimal perceptron sketch follows this list).

As stated above, programs that use stochastic models to analyze spoken language are the most common today and have proven to be the most successful.
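As a hedged illustration of the simplest connectionist variant named above, the sketch below pushes one frame's feature vector through a tiny multi-layer perceptron. The architecture and the random weights are stand-ins; a real system would learn its weights from labeled speech frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP: 5 input features per frame -> 8 hidden units -> 3 output classes.
# The weights are random stand-ins; training (e.g., backpropagation) is omitted.
W1, b1 = rng.normal(size=(5, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def mlp_classify(frame_features: np.ndarray) -> np.ndarray:
    """Forward pass: frame features -> class probabilities (softmax)."""
    hidden = np.tanh(frame_features @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

frame = rng.normal(size=5)                     # one hypothetical acoustic frame
probs = mlp_classify(frame)
print(probs, "-> class", int(probs.argmax()))  # e.g., a phoneme-class posterior
```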
Recognizing Commands

The most important goal of current speech recognition software is to recognize commands, as this increases the functionality of speech software. Software such as Microsoft Sync is built into many new vehicles, supposedly allowing users to access all of the car's electronic accessories hands-free. The software is adaptive: it asks the user a series of questions and uses the pronunciation of common words to derive voice constants, which then become part of the recognition algorithm so that better recognition can be provided in the future. Technology critics hold that the technology has come a long way since the 1990s, but that it will not replace manual controls any time soon.

Dictation

Second to command recognition is dictation. Today's market values dictation software for transcribing medical records and student papers, and as a more practical way of getting one's thoughts onto paper. In addition, many companies see value in dictation as part of translation, in which a user's spoken words would be rendered into a form that speakers of other native languages can receive; software of this kind already exists on the market.

Errors in Interpreting Speech

When speech recognition technologies process your statements, their accuracy is measured by their ability to reduce errors. The standards by which they are judged are the single word error rate (SWER) and the command success rate (CSR). A single word error occurs when one word in a sentence is misrecognized; while SWERs appear in command recognition systems, they are most common in dictation software. The command success rate is determined by accurate interpretation of commands: a command statement may not be interpreted with complete accuracy, but the recognition system can use mathematical models to deduce the command the user wanted to issue. A small word-error-rate calculation is sketched below.
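The article defines SWER only informally. A common convention, assumed here, scores errors as the word-level edit distance (substitutions, insertions, deletions) between the reference sentence and the recognizer's output, divided by the length of the reference; the example reuses the article's sample sentence.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between word sequences divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edits turning the first i reference words into the first j output words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                         # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                         # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,              # substitution or match
                             dist[i - 1][j] + 1,   # deletion
                             dist[i][j - 1] + 1)   # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i went to the store yesterday",
                      "i went to this door yesterday"))  # 2 errors / 6 words
```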
Commercial

Major Speech Technology Companies

As the speech technology industry grows, more companies enter the field with new products and ideas. Below is a list, by no means exhaustive, of leading companies in speech recognition.

NICE Systems (NASDAQ: NICE; Tel Aviv: NICE), founded in 1986 and headquartered in Israel, specializes in digital recording and archiving technologies. Its 2007 revenue was $523 million.

Verint Systems (OTC: VRNT), headquartered in Melville, New York, and founded in 1994, positions itself as "a leading provider of intelligence solutions for workforce optimization, IP video, communications interception, and public safety."

Nuance (NASDAQ: NUAN), headquartered in Burlington, develops speech and imaging technologies for business and customer service applications.

Vlingo, headquartered in Cambridge, develops speech recognition technology that interfaces with wireless and mobile technologies. Vlingo recently partnered with Yahoo!, providing the speech recognition capability for the push-to-talk feature of Yahoo!'s mobile search service.

Other major companies in the speech technology arena include Unisys, ChaCha, SpeechCycle, Sensory, Microsoft's Tellme, and Klausner Technologies, among others.

Patent Infringement Lawsuits

Considering how competitive both the business and the technology are, it is not surprising that there have been numerous patent infringement lawsuits among these companies. Every element involved in developing a speech recognition device can be patented as a separate technology. Using technology already patented by another company or individual, even technology you developed independently, can expose you to damages and may, fairly or not, bar you from using that technology in the future. Politics and business in the speech industry are tightly bound up with the development of the technology itself, so the political and legal obstacles that could hinder the industry's further growth must be recognized. Many such lawsuits are currently on file, and many have been brought to court.

The Future of Speech Recognition: Trends and Applications

The Medical Industry. The medical industry has been touting electronic medical records (EMR) for years. Unfortunately, the industry has been slow to adopt EMRs, and some companies conclude that the reason is data entry: there are not enough personnel to enter the mass of existing patient information into electronic form, so paper records persist. One company, Nuance (mentioned above, and known elsewhere as the developer of the Dragon dictation software), believes it can find a market selling its speech recognition software to physicians who would rather dictate patient information than enter it by hand.

The Military. The defense industry has researched speech recognition software in an attempt to make its complex applications more efficient and approachable. Speech recognition is currently being tested in cockpit displays so that pilots can call up needed databases faster and more easily. Military command centers are likewise trying to use speech recognition as a quick and easy way of reaching their mass of databases at critical moments. In addition, the military has turned to EMR for patient care, announcing efforts to use speech recognition software to convert dictated speech into patient records.

Source: /wiki/speech_recognition