Environmental Sound Segmentation Utilizing Mask U-Net

Yui Sudo (1), Katsutoshi Itoyama (1), Kenji Nishida (1), Kazuhiro Nakadai (1, 2)

(1) Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan (sudo@ra.sc.e.titech.ac.jp)
(2) Honda Research Institute Japan Co., Ltd., 8-1 Honcho, Wako, Saitama 351-0188, Japan

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019

Abstract: This paper proposes an environmental sound segmentation method using Mask U-Net. Recent research in robot audition has addressed noise reduction, section detection, and sound source separation for use in real-world environments with many noises and overlaps. Conventional methods, however, apply the respective functions in cascade, and the biggest problem of cascade systems is the accumulation of errors generated at each function block. Although many methods for separating human voices have been proposed, robots operating in a real-world environment must be able to separate not only human voices but also other environmental sounds. Unlike traditional sound source separation based on spatial information, environmental sound segmentation must simultaneously detect sections and separate sound sources based on pre-trained features. One such method, U-Net, originally proposed for semantic segmentation of images, has been applied to the separation of singing voices, but it deals with only a limited number of sound classes. The current study proposes an environmental sound segmentation method using Mask U-Net, which combines segmentation using U-Net with sound event detection using a CNN, and applies it to 75 classes of environmental sounds. Experiments confirmed that this method improves learning speed and sound source separation performance compared with the conventional method.

I. INTRODUCTION

Recent research in robot audition has addressed noise reduction, section detection, and sound source separation for use in real-world environments containing many noises and overlaps [1], [2], [3]. Conventional methods, however, apply the respective functions in cascade [4]. The biggest drawback of cascade systems is the accumulation of errors generated at each function block. For example, in a sound source separation system configured as shown in Fig. 1, an error that occurs in the noise reduction block propagates to the sound event detection and sound source separation blocks, reducing the accuracy of the entire system. Moreover, since each block in a cascade system is optimized independently of the performance of the entire task, its output may not be optimal for the following blocks. It is therefore necessary to optimize the entire task as one block.

Although many human voice separation methods have been proposed for robotic sound source separation, such as separation of speech and singing voices [5], [6], robots in a real-world environment should be able to separate not only human voices but also other environmental sounds. It is therefore necessary to configure an optimized system that can handle the separation of both human voices and other environmental sounds at the same time.

Figure 1. Proposed framework of an environmental sound recognition system. The upper panel shows the conventional sequence and the lower panel shows the proposed framework.

Spatial information about sound sources has been used for the separation and extraction of sounds. Many methods identify the dominant sound source in each section under the assumption of no temporal overlap, or identify and extract separated signals. In contrast, a high-performance U-Net-based sound source separation method using deep learning, analogous to semantic segmentation of images, has been applied to the separation of singing voices [7], [8].
Few studies to date, however, have assessed sound source separation targeting multiple classes in a real-world environment. In addition, the performance of semantic segmentation of images has been reported to deteriorate when the regions of the target classes are very large or very small [9]. The performance of conventional methods applied to multi-class environmental sounds may therefore be degraded when the sound data are temporally sparse.

This study proposes an environmental sound segmentation method using Mask U-Net. The system combines a pre-trained sound event detection model with a conventional source separation method using U-Net. The effectiveness of the method was evaluated by comparing it with the conventional method on a newly created multi-class environmental sound dataset.

II. RELATED WORK

Sound source separation by non-negative matrix factorization (NMF) has been shown to be effective for the separation of monaural sound sources and singing voices [10]. More recently, sound source separation methods using deep learning models such as U-Net have been reported to achieve high performance [11]. Most of these studies, however, targeted a small number of classes, such as speech and singing voice separation, and applying such segmentation methods to multi-class segmentation may result in poor performance.

In contrast, many image segmentation methods have been proposed for large numbers of classes. These include Mask R-CNN [14], which combines a conventional segmentation method, FCN [12], with an object detector, Faster R-CNN [13], to improve segmentation performance. Concretely, Mask R-CNN applies FCN, a semantic segmentation method, inside the bounding boxes detected by Faster R-CNN.

Using an approach similar to Mask R-CNN, this study evaluates an environmental sound segmentation method that combines sound event detection using a CNN with the conventional method using U-Net. This proposed method, Mask U-Net, was tested experimentally.

III. PROPOSED METHOD

Figure 2. Complete architecture of the Mask U-Net, which consists of sound event detection, segmentation, and reconstruction. The waveforms of mixed environmental sounds are transformed into spectrograms. Using this input, the Mask U-Net, consisting of sound event detection (CNN) and segmentation (U-Net), predicts masks for separating out each class from the input spectrogram. The inverse STFT is applied to reconstruct the time-domain signals.

Fig. 2 shows the overall structure of the Mask U-Net proposed in this paper. Using the short-time Fourier transform (STFT), the waveforms of mixed environmental sounds were transformed into spectrograms. These spectrograms were regarded as images and input into the deep learning model. The model predicted a mask for separating out each class from the input spectrogram, and the inverse STFT was applied to reconstruct the time-domain signal. Because the spectrogram of each environmental sound obtained with a mask contains only the amplitude, the phase of the original mixed sound was used. The segmentation U-Net section shown in Fig. 1 corresponds to the conventional segmentation method; the proposed method adds a sound event detection section in front of the segmentation U-Net. This section first describes sound source separation using U-Net as the conventional method and the sound event detection method using a CNN. This is followed by an explanation of the proposed Mask U-Net.
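To make the overall flow concrete, the following minimal Python sketch outlines the inference pipeline just described: STFT of the mixture, per-class mask prediction, masking of the mixture magnitude, and inverse STFT reusing the mixture phase. The function predict_masks is a hypothetical stand-in for the trained Mask U-Net, librosa is used only for the transforms, and the window and hop sizes follow the values given in Section III-A; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
import librosa

def predict_masks(magnitude):
    """Hypothetical stand-in for the trained Mask U-Net.
    Returns one soft mask per class, each with the same (freq, time)
    shape as the input magnitude spectrogram."""
    raise NotImplementedError

def separate(mixture, n_fft=256, hop_length=128):
    """Separate a 16 kHz mixture waveform into per-class waveforms."""
    # Complex spectrogram of the mixture; the phase is kept for reconstruction.
    spec = librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length)
    magnitude, phase = np.abs(spec), np.angle(spec)

    sources = []
    for mask in predict_masks(magnitude):
        # The network predicts amplitude only, so the mixture phase is reused.
        masked = mask * magnitude * np.exp(1j * phase)
        sources.append(librosa.istft(masked, hop_length=hop_length))
    return sources
```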
A. Architecture of Mask U-Net

Fig. 3 shows the structure of the U-Net [8] used in this study. The U-Net consists of convolutional encoder and decoder layers. The encoder layers are repeated 2D convolution blocks that halve the image size through strided convolutions while doubling the number of channels. All encoder layers have 3 × 3 kernels, a stride of 2, and padding of 1, and use batch normalization and leaky ReLU with a negative slope of 0.2 [15]. The decoder layers are repeated 2D deconvolution blocks that double the image size and halve the number of channels, with 3 × 3 kernels, a stride of 2, and padding of 1. All decoder layers use batch normalization and ReLU, with the first three decoder layers applying dropout with a probability of 0.5. In addition, the U-Net adds skip connections between layers at the same hierarchical level in the encoder and decoder, which allows low-level information to flow directly from the high-resolution input to the high-resolution output. The final layer uses softmax activation. The model was trained for 100 epochs with the Adam optimizer [16] at a learning rate of 0.001. To reduce the computational cost, the input audio was downsampled to 16 kHz. The STFT was calculated with a window size of 256 samples and a hop length of 128 samples to obtain the amplitude spectrogram, and the magnitude spectrograms were subsequently normalized to the range [0, 1].

Figure 3. Architecture of the segmentation U-Net, which consists of encoder and decoder layers and has skip connections.

B. Loss function

Equation (1) expresses the loss function used for learning. If X denotes the magnitude spectrogram of the original mixed signal and Y denotes the magnitude spectrograms of the target audio, that is, each environmental component of the input signal, then the loss function used to train the model is the L1,1 norm of the difference between the target spectrogram and the masked input spectrogram:

L(X, Y; Θ) = || f(X; Θ) ⊙ X - Y ||_{1,1}    (1)

where f(X; Θ) is the output of the network model applied to the input X with parameters Θ, i.e., the mask generated by the model, and ⊙ denotes element-wise multiplication.

C. Sound event detection utilizing CNN

Although many methods have been proposed for sound event detection, this study performs sound event detection by applying a CNN to spectrograms because of its ease of coupling to the U-Net [17]. Details of the CNN are shown in Fig. 4. Except for the last layer, every layer applies convolution with a 3 × 3 kernel, ReLU, max pooling, batch normalization, and dropout with a probability of 0.25. To maintain temporal resolution, max pooling is not applied along the time axis; dimensions are reduced by pooling only along frequency, a process called frequency max pooling [18]. The final layer applies convolution with a 1 × 1 kernel.

Figure 4. Architecture of sound event detection using CNN.

D. Application of Sound Event Detection to Environmental Sound Segmentation

The output of the sound event detection is concatenated with the input spectrogram after postprocessing. Because the output of the sound event detection and the input spectrogram differ in dimensions, the detection output is postprocessed as shown in Fig. 5. Concretely, the output of the sound event detection is a vector indicating the presence or absence of each sound event at each time frame, whereas the input spectrogram is a matrix spanning both time and frequency. The output vector of the sound event detection is therefore duplicated along the frequency axis to the same dimensions as the input spectrogram and then concatenated with it. Normally, during sound event detection, a threshold is set on the output and the occurrence or non-occurrence of a sound event is represented by two classes. In this method, however, the output of the sound event detection is regarded as the probability of each environmental sound. A performance improvement was expected from using the result of the sound event detection for segmentation.

Figure 5. Concatenation of predicted sound events to produce the input of the U-Net. The output vector of the sound event detection is duplicated to the same dimensions as the input spectrogram and concatenated with the input spectrogram.
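The duplication and concatenation step described above can be illustrated with the short numpy sketch below. It tiles the per-frame event probabilities along the frequency axis and stacks them with the spectrogram as additional input channels; the exact tensor layout used by the authors is not specified in the paper, so this is one plausible arrangement rather than the implementation.

```python
import numpy as np

def build_unet_input(spectrogram, event_probs):
    """Concatenate per-frame event probabilities with the input spectrogram.

    spectrogram : (freq, time) magnitude spectrogram, normalized to [0, 1]
    event_probs : (num_classes, time) per-frame probabilities from the CNN
    returns     : (1 + num_classes, freq, time) array fed to the segmentation U-Net
    """
    freq_bins = spectrogram.shape[0]
    # Duplicate each class's probability vector across all frequency bins so that
    # it has the same (freq, time) shape as the spectrogram.
    tiled = np.repeat(event_probs[:, np.newaxis, :], freq_bins, axis=1)
    return np.concatenate([spectrogram[np.newaxis], tiled], axis=0)
```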
E. Reconstruction

The deep learning model used in this study predicts only the spectrogram of each environmental sound, that is, its amplitude, and therefore cannot by itself reconstruct the time-domain source signal. The sound signals were therefore reconstructed in the time domain by using the phase information of the mixed sound before separation.

IV. EVALUATION

A. Dataset

To train the system to separate sound sources, it is necessary to prepare pairs of mixed sounds and the corresponding ground-truth source signals. Tab. I shows the dry-source corpora used to create the datasets. Training pairs are created by selecting sound sources and combining them randomly, as shown in Fig. 6. The upper spectrograms show the dry sources, which are used as the ground truth; the lower spectrogram shows the result of combining them, which is used as the input to the neural network. In this study, each sound was 4.192 s long, each mixed sound consisted of dry sources from three or four classes, and the dry sources were randomly mixed with a maximum overlap of 0.5 s. Only clean dry sources not containing sounds of other classes were used. Even when the corpora differed, sounds of similar classes were merged, creating a 75-class dataset of environmental sounds. The training set consisted of 10,000 synthetic mixed sounds and their ground truths, and the evaluation set consisted of 1,000 signals created using dry sources not used in the training data.

TABLE I. CORPUS USED FOR THE DATASET

Database | Contents of classes | Classes
ATR words | Male, female | 2
RWCP | Bell, coin, buzzer, clock, phone, pinpong, whistle, rap, castanet, maracas, alarm, bottle, claps, air pump, book, phone, spray, tear | 19
RWC Music Database | Timpani, cembalo, electric guitar, violin | 4
Japan wild bird science, Bird research | Bird | 1
Grasshopper / cricket singing voice | Insect | 1
Sound database | Cat, baby, bird, footstep, frog, bathroom, clock, golf, tennis, trampoline, dog | 11
Freesound General-Purpose Audio Tagging Challenge (Kaggle) | Tearing, shatter, gunshot, fireworks, writing, computer keyboard, scissors, microwave oven, keys jangling, drawer open or close, knock, phone, saxophone, oboe, flute, clarinet, acoustic guitar, tambourine, gong, glockenspiel, snare drum, bass drum, hi-hat, electric piano, harmonica, trumpet, violin, double bass, cello, chime, cough, laughter, applause, finger snapping, fart, burping, cowbell, bark, meow | 43
Japanese cicada | Cicada | 1
DCASE 2016 Task 2 dataset | Clearthroat, cough, doorslam, drawer, keyboard, keysdrop, knock, laughter, pageturn, phone, speech | 11
Total (similar classes were merged and target-domain-related sounds were excluded; 16 kHz, 16 bit) | | 75

Figure 6. Dataset creation procedure. Each dry source is synthesized at a random time frame. The mixed sound is used as the input data, and the sound sources of each class before mixing are used as the ground truth.
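A simplified version of this mixing procedure is sketched below: it places the selected dry sources one after another in a 4.192 s clip, allowing consecutive sources to overlap by at most 0.5 s, and returns both the mixture and the per-class ground-truth signals. Gain handling, class selection, and other details of the authors' generation script are not given in the paper, so this is an illustrative reconstruction only.

```python
import numpy as np

SR = 16000                      # sampling rate
CLIP_LEN = int(4.192 * SR)      # each mixed sound is 4.192 s long
MAX_OVERLAP = int(0.5 * SR)     # consecutive sources overlap by at most 0.5 s

def make_mixture(dry_sources, num_classes, rng=None):
    """Create one (mixture, ground_truth) training pair.

    dry_sources : list of (class_id, waveform) pairs, three or four per clip
    """
    rng = rng or np.random.default_rng()
    mixture = np.zeros(CLIP_LEN)
    ground_truth = np.zeros((num_classes, CLIP_LEN))

    onset = 0
    for class_id, wav in dry_sources:
        end = min(onset + len(wav), CLIP_LEN)
        segment = wav[: end - onset]
        mixture[onset:end] += segment
        ground_truth[class_id, onset:end] += segment
        # The next source starts near the end of this one, overlapping by <= 0.5 s.
        onset = max(0, end - int(rng.integers(0, MAX_OVERLAP + 1)))
        if onset >= CLIP_LEN:
            break
    return mixture, ground_truth
```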
Figure 7. An example of sound event detection. The upper panel shows the input spectrogram, the center panel shows the ground truth, and the lower panel shows the prediction.

B. Pre-trained result of sound event detection

The pre-trained sound event detection unit, whose outputs are used as prior information, was evaluated first. The F-score F was calculated using (2), (3), and (4), with TP, FP, and FN representing true positives, false positives, and false negatives, respectively, and P and R representing precision and recall:

P = TP / (TP + FP)    (2)
R = TP / (TP + FN)    (3)
F = 2PR / (P + R)    (4)

The F-score of the pre-trained sound event detection model was 0.72. An example of a sound event detection result is shown in Fig. 7: the upper panel shows an input spectrogram, the middle panel shows the ground truth, and the lower panel shows the result of sound event detection. As described in Section III, the sound event detection outputs the probability of each event. A comparison with the labeled data showed that sound events could be detected on the dataset described in this study. The probability of each event, i.e., the result of event detection, is input to the segmentation U-Net as prior information.
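Equations (2)-(4) correspond directly to the short helper below, written for framewise multi-label predictions. The 0.5 decision threshold is an assumption for illustration; the paper does not state how the probabilistic outputs were binarized for this evaluation.

```python
import numpy as np

def f_score(pred_probs, labels, threshold=0.5):
    """Framewise F-score following eqs. (2)-(4).

    pred_probs : (num_classes, time) event probabilities from the CNN
    labels     : (num_classes, time) binary ground-truth activity
    """
    pred = pred_probs >= threshold
    truth = labels.astype(bool)

    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
```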
C. Results of the sound source separation

Next, the segmentation results of the Mask U-Net, which combines sound event detection using a CNN with U-Net, were compared with those of the conventional U-Net. The overall root mean squared error (RMSE) and the per-class RMSE were calculated using (5), with the results shown in Tab. II and Fig. 9. The horizontal axis in Fig. 9 represents the number of TF bins having components in the spectrogram, i.e., the area occupied in the spectrogram image. The red and blue lines show the moving average of the nearest five points with respect to the area occupied by each point. The RMSE was also calculated over the range containing only the overlapping portions:

RMSE = sqrt( (1/N) * sum_{t,f} ( Ŷ(t, f) - Y(t, f) )^2 )    (5)

where Ŷ and Y denote the separated and ground-truth spectrograms, respectively, and N is the number of TF bins.

Figure 8. Examples of segmentation results. The left column shows the ground truth, the center column shows the results of U-Net, and the right column shows the results of Mask U-Net.

The RMSE of the proposed method was smaller than that of the conventional method (Tab. II), indicating that the Mask U-Net can segment each sound source with high accuracy. Likewise, the RMSE of the proposed method was smaller across all classes overall. Because the reduction in RMSE was large for classes occupying a large area, the size of each class appears to be learned efficiently from the results of sound event detection provided in advance. Fig. 8 shows sample spectrogram images separated by class, with the colors of the segmentation results corresponding to the classes. With the conventional method, some data are locally classified into the wrong classes, some single-class environmental sounds are assigned to multiple classes, and overlaps tend to be biased toward the larger class, whereas the proposed method separates each class correctly. Because the pre-trained sound event detection model supplies the U-Net-based sound source separation model with advance information on the occurrence of each environmental sound, the boundaries between classes are more likely to be clearly separated. Fig. 10 shows the transition of the loss function over 100 training epochs; compared with the conventional method, the proposed method converges more quickly. The faster learning enabled by the prior information also improves performance. However, for completely overlapping TF bins, neither method was able to correctly separate the sound sources, suggesting that the proposed method does not improve the separation of completely overlapping sounds.

TABLE II. COMPARISON OF RMSE

RMSE (dB)         | U-Net | Mask U-Net
Total RMSE        | 24.85 | 22.34
RMSE (overlapped) | 29.82 | 27.40

Figure 9. Relationship between RMSE and the area occupied in the spectrogram image. The red and blue lines show the moving average of the nearest five points with respect to the area occupied by each point.
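For completeness, a minimal sketch of the metric in (5) is given below, covering both the overall RMSE and the variant restricted to overlapping TF bins. The paper reports RMSE in dB but does not spell out whether the spectrograms are dB-scaled before the computation, and the overlap criterion used here (two or more active ground-truth classes per TF bin) is an assumption made for illustration.

```python
import numpy as np

def rmse(pred, truth, mask=None):
    """RMSE between predicted and ground-truth spectrograms (illustrating eq. (5)).

    pred, truth : (num_classes, freq, time) arrays
    mask        : optional (freq, time) boolean array restricting the evaluated TF bins
    """
    err = (pred - truth) ** 2
    if mask is not None:
        err = err[:, mask]  # keep only the selected TF bins for every class
    return float(np.sqrt(err.mean()))

def overlap_mask(truth, eps=1e-6):
    """TF bins where two or more ground-truth classes are active (assumed criterion)."""
    return (truth > eps).sum(axis=0) >= 2
```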
