




Speech Acoustics Project
OCE 471 Underwater Acoustics
Jesse Hansen

Abstract:

In this paper, basic methods for analyzing recorded speech are presented. The spectrogram is introduced and subsequently utilized in a Matlab environment to reveal patterns in recorded voice data. Several examples of speech are recorded, analyzed, and compared. A model for voice production is introduced in order to explain the variety of time-frequency patterns in the waveforms. Specifically, a single-tube and then a multi-tube model of the vocal tract are considered and related to resonances in the speech spectrum. It is shown that a series of connected acoustic tubes produces resonances similar to those that occur in speech.

Introduction

Motivation:

Consider the problem of speech recognition. When two different people speak the same phrase (or when one person utters the same phrase twice), a human listener will generally have no trouble understanding each instance of that phrase. This leads us to believe that even though the two speakers may have different vocal qualities (different pitch, different accents, etc.), there must be some invariant quality shared by the two instances of the spoken phrase.

Thinking about the problem a bit further, we realize that when two different people articulate the same phrase, they perform essentially the same mechanical motions. In other words, they move their mouths, tongues, and lips in roughly the same way. We hypothesize that, as a result of these similarities in speech mechanics from person to person, there should be features in the recorded speech waveform that are similar across multiple instances of a spoken phrase.

One such set of speech features is called formants, which are resonances in the vocal tract. The frequencies at which these resonances occur are a direct result of the particular configuration of the vocal tract.
As words are spoken, the speaker moves his or her tongue, mouth, and lips, changing the resonant frequencies with time. Analysis of these time-varying frequency patterns forms the basis for all modern speech recognition systems.

Organization:

This paper is broadly divided into two parts. Part 1 is concerned with analysis of voice waveforms. In Part 2, we delve into models of voice production and relate them to the data presented in Part 1.

Part 1 is organized as follows. In Section 1.1 we briefly describe the spectrogram, a widely used tool for time-frequency analysis of acoustic data, and illustrate its benefits with an example. A Matlab program for recording sounds and viewing their spectrograms is presented in Section 1.2. In Section 1.3 we divide speech sounds into two broad categories, voiced and unvoiced speech, restricting our analysis to voiced speech. Finally, in Section 1.4, several speech waveforms are presented and analyzed.

Part 2 is organized as follows. Section 2.1 briefly describes the vocal tract, and Section 2.2 presents a single acoustic tube model for the vocal tract. Section 2.3 presents a multi-tube model and discusses various ways that the model can be analyzed. Closing remarks are made in Section 2.4.

PART I: Data Analysis

1.1 The Spectrogram

The spectrogram of a waveform shows the signal's content as a function of frequency and time. It is computed as follows:

1. The original waveform is first broken into smaller blocks of equal size. The choice of block size depends on the frequency content of the underlying data; for speech, a width of 20 to 30 ms is often used. Blocks are allowed to overlap, and an overlap of 50% is typical.

2. Each block is multiplied by a window function. Most window functions have a value of 1 in the middle and taper toward 0 at the edges. Windowing a block of data diminishes the magnitude of the samples at the edges while maintaining the magnitude of the samples in the middle.

3. The Discrete Fourier Transform (DFT) of each windowed block is computed, and only its magnitude is retained. The result is a set of frequency vectors (the DFT magnitudes), one for each block of the original waveform. Each frequency vector is localized in time at the location of the block from which it was computed.

A simple example will help to illustrate the point. Below we have the waveform and spectrogram of a bird chirping. This sound was borrowed from a Matlab demonstration.

The waveform and spectrogram of a chirping bird

The upper plot shows the time-domain waveform of a bird chirping. Below it is the spectrogram, which shows frequency content as a function of time: frequency is on the vertical axis and time is on the horizontal. Blue indicates larger magnitude while red indicates smaller magnitude.

The beauty of the spectrogram is that it clearly illustrates how the frequency of a signal varies with time. In this example we can see that each chirp starts at a high frequency, usually between 3 and 4 kHz, and over the course of about 0.1 seconds decreases to about 2 kHz. This type of detail would be lost if we took the DFT of the entire waveform.

Technical details:
- The sampling rate is 8 kHz.
- The block size is 25 ms, or 200 samples.
- There is 87.5% overlap between blocks.
- The blocks are multiplied by a Hamming window.
- There are 2^11 = 2048 points in the DFT of each block.

1.2 Matlab Recording & Analysis Program

Here we present a Matlab program, Record, written by the author in the summer of 2001.
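Before turning to the program's functionality, the three-step spectrogram recipe of Section 1.1 can also be sketched in code. The following is a minimal Python/NumPy version (not the author's Matlab Record program) using the technical details listed for the chirp example: 8 kHz sampling, 25 ms blocks, 87.5% overlap, a Hamming window, and a 2048-point DFT. The test signal here is a synthetic 1 kHz tone standing in for the bird recording.

```python
import numpy as np

def spectrogram(x, fs=8000, block_ms=25, overlap=0.875, nfft=2048):
    """Magnitude spectrogram following steps 1-3 of Section 1.1."""
    n = int(fs * block_ms / 1000)        # 200 samples per block
    hop = int(n * (1 - overlap))         # 25-sample hop gives 87.5% overlap
    win = np.hamming(n)                  # step 2: Hamming window
    frames = []
    for start in range(0, len(x) - n + 1, hop):   # step 1: overlapping blocks
        block = x[start:start + n] * win
        frames.append(np.abs(np.fft.rfft(block, nfft)))  # step 3: |DFT|
    return np.array(frames).T            # rows = frequency bins, cols = time

# One second of a 1 kHz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
S = spectrogram(np.sin(2 * np.pi * 1000 * t))

# The peak of the first column should sit near 1 kHz.
peak_hz = np.argmax(S[:, 0]) * fs / 2048
print(S.shape, peak_hz)
```

Each column of S is the magnitude spectrum of one windowed block, so plotting 20*log10(S) as an image yields the dB spectrograms shown throughout this paper.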
The program is intended to simplify the recording and basic editing of speech waveforms, and to present the spectrogram and the time waveform in a side-by-side format for ease of analysis. The remainder of this section describes the program, its inner workings, and its functionality.

Running the program: The program can be run by typing record at the Matlab prompt, or by opening the program in the Matlab editor and selecting Run from the Debug menu.

Recording:

Sound recording is initiated through the Matlab graphical user interface (GUI) by clicking the Record button. The duration of the recording can be adjusted anywhere from 1 to 6 seconds. (These are the GUI defaults, but the code can be modified to record for longer durations if desired.) When clicked, the Record button executes a function that reads mono data from the microphone jack on the sound card and stores it in a Matlab vector.

Most of the important information in a typical voice waveform is found below a frequency of about 4 kHz. Accordingly, we should sample at least twice this frequency, or 8 kHz. (Note that all sound cards have a built-in pre-filter to limit the effects of aliasing.) Since there is at least some valuable information above 4 kHz, the Record GUI has a default sampling rate of 11.025 kHz; this can be modified in the code. A sampling rate of 16 kHz was used in the past, but the data acquisition toolbox in Matlab 6.0 does not support this rate.

Once recorded, the time data is normalized to a maximum amplitude of 0.99 and displayed on the upper plot in the GUI window. In addition to the time-domain waveform, a spectrogram is computed using Matlab's built-in specgram function (part of the Signal Processing Toolbox).

An example recording of the sentence "We were away a year ago" is shown below.

"We were away a year ago"

Zooming in on the Waveform:

One can examine a region of interest in the waveform using the Zoom in button.
When Zoom in is clicked, the cursor changes to a crosshair. Clicking the left mouse button and dragging a rectangle around the region of interest in the time-domain waveform selects a sub-section of data. In the example below we have zoomed in on the region from about 1 to 1.2 seconds.

Zoomed in on the waveform

Zooming out: The Zoom out button restores the axes to what they were before Zoom in was used. If you zoom in multiple times, zooming out returns you to the previous axis limits.

Listening to the Waveform: The Play button uses Matlab's sound function to play back (send to the speakers) the waveform that appears in the GUI. If you have zoomed in on a particular section of the waveform, only that portion is sent to the speakers.

Save is used to write the waveform to a wave file. If you have zoomed in on a segment of data, only that portion of the waveform is saved. Click Load to import any mono wave file into the Record GUI for analysis.

1.3 Voiced and Unvoiced Speech

Speech is segmented into two broad categories, voiced and unvoiced. We will differentiate the two classes by their method of production and by the time and frequency patterns that we observe in the recorded data. This project is primarily concerned with voiced speech, for reasons we'll explain in a moment.

Voiced Speech:

All voiced speech originates as vibrations of the vocal cords. Its primary characteristic is its periodic nature. Voiced speech is created by pushing air from the lungs up the trachea to the vocal folds (cords), where pressure builds until the folds part, releasing a puff of air. The folds then return to their original position as the pressure on each side equalizes. Muscles controlling the tension and elasticity of the folds determine the rate at which they vibrate; see [3]. The puffs of air from the vocal cords subsequently pass through the vocal tract and then through the air to our ears.
The periodicity of the vocal cord vibrations is directly related to the perceived pitch of the sound. We will examine the effects of the vocal tract in more detail later on.

Vowel sounds are one example of voiced speech. Consider the /aa/ sound in "father", or the /o/ sound in "boat". In the segment of voiced speech below, note the periodicity of the waveform.

Segment of voiced speech, /aa/ in "father"

Unvoiced Speech:

Unvoiced speech does not have the periodicity associated with voiced speech. In many kinds of unvoiced speech, a noise-like sound is produced at the front of the mouth using the tongue, lips, and/or teeth; the vocal folds are held open for these sounds. Consider the sounds /f/ as in "fish" and /s/ as in "sound". The /f/ sound is created by forcing air between the lower lip and the teeth, while /s/ is created by forcing air through a constriction between the tongue and the roof of the mouth or the teeth.

The waveform below shows a small segment of unvoiced speech. Note its distinguishing characteristics: it is low in amplitude, noise-like, and changes more rapidly than voiced speech.

Segment of unvoiced speech, /sh/ in "she"

Let's further examine the waveform and spectrogram of a word containing both voiced and unvoiced speech. A recording of the word "sky" was made with the Matlab program. The /s/ sound, we now know, is unvoiced, while the /eye/ sound is voiced. (The /k/ is also unvoiced, but not noise-like.) The plots are shown below.

Looking at the spectrogram, we note that /s/ contains a broad range of frequencies but is concentrated at higher frequencies. The resonances, or formants, in the speech waveform can be seen as blue horizontal stripes in the spectrogram. These formants, mentioned in the introduction, aren't particularly clear in the unvoiced /s/, but are quite obvious in the voiced /eye/.
It is for this reason that we restrict our analysis to voiced speech.

1.4 Data Analysis

Several speech waveforms are analyzed here, and various features in the waveforms and spectrograms are noted. The most prominent features in the spectrograms are the dark (blue) horizontal bands, called formants, corresponding to frequencies of greater energy. It will be shown in Part II that these formants result from resonances in the vocal tract.

In addition to the waveforms and spectrograms, we will analyze the spectra of small segments of the waveforms. These 2-D spectra will help us compare formant frequencies from sound to sound.

Waveform 1: "Already"

The word "already" contains several different sounds, all of them voiced. Notice how the formants change with time. One significant feature that is easy to identify in the spectrogram is the /d/ sound. This sound is called a stop, for obvious reasons: when the /d/ is pronounced, the tongue temporarily stops air from leaving the oral cavity. This action leads to a small amplitude in the waveform and a hole in the spectrogram. Can you spot it?

Waveform and spectrogram of the word "already"

We'll now take a look at a small sub-section of the waveform. We've isolated about 0.07 seconds of sound toward the beginning of the word. By itself, this portion of the waveform sounds a bit like the /au/ in the word "caught". The plot below shows the waveform and spectrum of the sub-signal.

/au/ in "already" from about 0.09 to 0.16 sec

The primary item of interest is the bottom-most plot, which shows the spectrum of the signal in decibels (dB). The approximate locations of the formants are indicated by the vertical dotted lines. Note how this plot matches up with the spectrogram: both indicate a wide first formant and four more formants between 2 and 4 kHz (although the fourth isn't as well defined as the other three).

Moving on, we isolate a different portion of the same waveform, the /ee/ sound at the end of the word "already".
By looking at either the waveform or the spectrum, we can see that /ee/ contains more high-frequency information than /au/. /ee/ has a narrow first formant and three additional formants at higher frequencies.

/ee/ in "already" from about 0.418 to 0.457 sec

Waveform 2: "Cool"

The word "cool" contains three distinct sounds. The /c/ at the beginning is unvoiced. The /oo/ and the /L/ are both voiced, but each has different properties. For the sake of time we'll stick to the /L/ sound at the end of the word. An /L/ is created by arching the tongue in the mouth, which leads to a sound that differs greatly from most of the vowel sounds.

Waveform and spectrogram of the word "cool"

Below you'll see a segment of the /L/ sound. It differs from the last two sounds we've analyzed in that it contains virtually no high-frequency components. There are two high-frequency formants at about 3000 and 3500 Hz, but they are a great deal smaller in magnitude than the first formant.

/L/ in "cool" from about 0.3 to 0.33 sec

Waveform 3: "Cot"

We chose the word "cot" to get a look at the /ah/ sound in the middle. You'll notice from the spectrogram that this sound is quite frequency-rich, containing several large formants between 0 and 5 kHz.

Waveform and spectrogram of the word "cot"

Compare the /ah/ waveform (below) to the previous /L/ waveform. There is a great deal more activity in this waveform, which explains the variety of frequencies in the spectrogram. The spectrum of the signal reveals five (possibly six) significant formants, each with a sizable bandwidth.

/ah/ in "cot" from about 0.17 to 0.22 sec

Conclusion from Data Analysis:

It is clear from the data that different vocal sounds have widely varying spectral content. However, each sound contains similar features, like formants, that arise from the methods of speech production. We'll now begin to talk a bit about the vocal tract and the attempts that have been made to model its functionality.
The predictions of these models will then be related to the empirical results from this section.

PART II: Speech Production Models

2.1 The Vocal Tract

Here we examine a bit of the physiology behind the production of speech. Voiced speech is created by simultaneously exciting our vocal cords into vibration and configuring our vocal tract into a particular shape. The vocal tract consists of everything between the vocal cords and the lips: the pharyngeal cavity, tongue, oral cavity, etc. Pressure waves travel from the vocal cords along this non-uniform path to the mouth and, eventually, to our ears.

Figure from [5], which is in turn from [1]

The figure above shows a schematic diagram of the vocal tract on the left and, on the right, a plot of the area of the vocal tract as a function of distance (in centimeters) from the vocal cords. The area-versus-distance function plotted here is for the sound /i/, as in "bit". The configuration of the vocal tract, and hence the plot, is different for different sounds.

Looking at the area-versus-distance function in the plot above, you'll notice that there are two major resonant chambers in the vocal tract: the first, the pharyngeal cavity, extends from about 1 to 8 cm, and the second, the oral cavity, from about 14 to 16 cm. This manner of thinking about the vocal tract (identifying resonant chambers) leads us to model it as an acoustic tube, or a concatenation of acoustic tubes, an idea we'll examine further in subsequent sections.

Aside: Some sounds, called nasals, also use the nasal cavity as an additional resonant chamber and path to the outside world. However, these sounds are a small subset of the set of voiced sounds, and we will ignore them in this presentation.

More detail about the vocal cords can be found in [3], and an in-depth analysis of the vocal tract can be found in [4].

Here come the models

We'll now examine some models of speech production. All of these models make a couple of simplifying assumptions:

1. The vocal cords are independent of the vocal tract.
2.
The vocal tract is a linear system.

Neither of these assumptions is strictly true,
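As a preview of the single-tube model discussed in Section 2.2, consider the simplest case: a uniform acoustic tube closed at the vocal cords and open at the lips. Such a tube resonates at odd quarter-wavelength frequencies, f_k = (2k-1)c/(4L). The sketch below (an illustrative calculation, not code from the paper) assumes a vocal tract length of 17.5 cm and a sound speed of 350 m/s:

```python
def tube_formants(length_cm=17.5, c_cm_per_s=35000.0, n=3):
    """First n resonances f_k = (2k-1)*c/(4L), in Hz, of a uniform tube
    closed at one end (glottis) and open at the other (lips)."""
    return [(2 * k - 1) * c_cm_per_s / (4 * length_cm) for k in range(1, n + 1)]

print(tube_formants())  # [500.0, 1500.0, 2500.0]
```

The resulting resonances at 500, 1500, and 2500 Hz fall in the same range as the formants measured in Part I, which is why the acoustic tube is a plausible starting point for modeling the vocal tract.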