A Novel Approach for Outlier Detection and Robust Sensory Data Model Learning

Francesco Cursi, Student Member, IEEE, Guang-Zhong Yang, Fellow, IEEE

Abstract — In the past few decades, machine learning and data analysis have grown enormously and have been applied to many different problems in the field of robotics. Data are usually the result of sensor measurements and, as such, may be subject to noise and outliers. The presence of outliers has a huge impact on modelling the acquired data, resulting in inappropriate models. In this work, a novel approach for outlier detection and rejection for input/output mapping in regression problems is presented. The robustness of the method is shown both through simulated data for linear and nonlinear regression, and through real sensory data. Despite being validated using artificial neural networks, the method can be generalized to any other regression method.

I. INTRODUCTION

Machine learning and data analysis have grown enormously over the past few decades. This is due to the falling cost of large data storage devices, the increasing ease of collecting data over networks, the development of robust and efficient machine learning algorithms to process these data, and the falling cost of computational power [1]. Machine learning is defined as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data and perform decision making under uncertainty [2]. In the field of robotics, machine learning has been widely used to accurately approximate models of robots without the need for analytical models, which may be hard to obtain due to the complexity of the system [3]. A large variety of problems have been addressed using machine learning, ranging from kinematic and dynamic modelling to manipulation and navigation. A thorough review is presented in [3].
Most of these problems fall into the field of supervised machine learning and, specifically, regression [4]. In order to build the desired model, data are needed; they are acquired by the many different kinds of sensors robots are equipped with (force sensors, cameras, encoders, motors, and many others). Many algorithms have been developed for solving the regression problem in robot modelling; they are reviewed in detail in [5]. However, in order to build good models, the data must be carefully analysed, since outliers and noise may affect the results. Most of the proposed methods do not address the problem of outlier rejection and usually assume that data points follow a known distribution (typically Gaussian).

(The authors are with the Hamlyn Center, Imperial College London, Exhibition Road, London, UK. Email: {f.cursi17, g.z.yang}@imperial.ac.uk)

An outlier is defined as a data point significantly different from the others [6]. There are many reasons why outliers may arise in a dataset, such as human error, changes in the behaviour of the system, instrument error, or faults in the system [7]. The presence of unwanted data points may lead to a wrong model describing the relationship between the input and the output values [8].

Different approaches exist to detect outliers in datasets; a review of the different methodologies can be found in [7], [9], [10]. Classification-based methods learn classifiers on the dataset that can distinguish between normal and anomalous data. Different methods can be used for identifying the classes (which affects the computational efficiency), but labels of the normal classes are needed. Proximity-based methods identify outliers as those points which are isolated from the other data [10]. These are among the most popular and can be divided into clustering, density-based, and nearest-neighbour methods. Clustering methods group data points into similar clusters and define outliers as those points that belong to small, sparse clusters [9].
Density-based methods are very similar, but they operate on regions of the data space rather than on the points themselves. Nearest-neighbour approaches, instead, define outliers as those points whose distance from the $k$-th nearest neighbour is larger than a certain threshold. Dimensionality reduction and principal component analysis [11] can also be used for outlier detection: data are mapped to a lower-dimensional subspace, and outliers are identified as those points having a larger reconstruction error after the projection to that subspace. Probabilistic and statistical methods assume that the data points follow a defined probability distribution. Outliers are those points lying in low-probability regions, e.g. points whose distance from the mean value is larger than three times the data standard deviation [12].

Despite being very powerful in discerning "good" from "bad" data points, all these outlier detection methods either rely on only one particular technique or on knowledge of the statistical distribution of the data. Moreover, they can be regarded as preprocessing approaches: first, outliers have to be detected with one of these methods, and only afterwards can regression methods be applied to the good data points to find the appropriate model. Robust regression approaches, such as iteratively reweighted least squares [13] or random sample consensus (RANSAC) [14] for linear regression, instead allow outliers to be neglected while the regression process takes place.

(2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019)

In this work a novel approach for robust data modelling and input/output mapping is presented. The method does not require any preprocessing to identify outliers, since they are found automatically while learning the model. Moreover, no assumption on the data distribution is required.
The proposed method has been validated using neural networks for regression, yet it can be generalized to any other regression method, such as linear regression, Gaussian process regression, etc. The work is organised as follows. Section II describes the problem addressed in this work, namely regression analysis. In Section III the novel robust learning method is presented. Results on simulated and real datasets are shown in Section IV and, finally, conclusions are drawn in Section V.

II. PROBLEM FORMULATION

Given a dataset of input points $x \in \mathbb{R}^{n_{in}}$ and output points $y \in \mathbb{R}^{n_{out}}$, the goal of regression is to find the best relationship between the two, meaning

$$y \approx f(x) \qquad (1)$$

where $f(\cdot)$ can be any linear or nonlinear function. A very popular approach for nonlinear regression is the use of artificial neural networks (ANNs). It has been shown that neural networks can be regarded as universal approximators, meaning they can model any suitably smooth function, given enough hidden units, to any desired level of accuracy [15], [16]. They are thus capable of representing complicated behaviours without the need of knowing any mathematical or physical model.

Feedforward networks consist of different layers of neurons. The first layer is the input layer, the last one is the output layer, and all the others in between are called hidden layers. Each layer has several neurons, each one receiving inputs from the neurons of the previous layer and sending an output to the neurons of the following layer. Given a network with one input, one hidden, and one output layer, with $n_{in}$ inputs, $M$ nodes in the hidden layer, and $n_{out}$ outputs, the input of the $j$-th hidden node is obtained as [17], [18]:

$$a_j = \sum_{i=1}^{n_{in}} w^{(1)}_{i,j} x_i + w^{(1)}_{j,0} \qquad (2)$$

Then, the node applies an activation function $h(\cdot)$, providing the output $z_j = h(a_j)$.
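As an illustration, the layer computation above can be sketched in NumPy (a minimal sketch, not the authors' implementation; the output-layer step mirrors the hidden one, and tanh is an assumed choice of activation):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh):
    """One-hidden-layer feedforward pass.
    W1 (M x n_in), b1 (M,): hidden-layer weights and biases w^(1).
    W2 (n_out x M), b2 (n_out,): output-layer weights and biases w^(2).
    h: activation function, here assumed tanh for both layers."""
    a = W1 @ x + b1      # Eq. (2): hidden pre-activations a_j
    z = h(a)             # hidden outputs z_j = h(a_j)
    a_out = W2 @ z + b2  # output-layer pre-activations a_k
    return h(a_out)      # network outputs y_k
```

The nested application of `h` over weighted sums is exactly what makes the network a nonlinear function of the inputs, controlled by the weights.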
These values are passed to the output layer, where the input for each output node $k = 1 \ldots n_{out}$ is computed as

$$a_k = \sum_{j=1}^{M} w^{(2)}_{k,j} z_j + w^{(2)}_{k,0} \qquad (3)$$

Finally, the output value is computed as $y_k = h(a_k)$, where $h(\cdot)$ can be a different activation function from that of the hidden units. The overall network function is thus:

$$y_k(x, \mathbf{w}) = h\!\left( \sum_{j=1}^{M} w^{(2)}_{k,j} \, h\!\left( \sum_{i=1}^{n_{in}} w^{(1)}_{i,j} x_i + w^{(1)}_{j,0} \right) + w^{(2)}_{k,0} \right) \qquad (4)$$

Fig. 1: Flow chart describing the presented learning method.

Equation (4) shows that the neural network is a nonlinear function of the input values, controlled by the weights of the nodes. These weights can be retrieved by minimizing a desired cost function [17]. Despite their great function approximation capabilities, it has been shown that NN behaviour is influenced by outliers [19], [20]. Indeed, NNs perform well when presented with data similar to those used for training, but when substantially novel data are presented, the NN misbehaves [21].

III. METHOD

In this Section the proposed method is presented for the case of single-input/single-output mapping. The method consists of three processes, described in the following subsections:
- Learning Algorithm: the procedure for learning the input/output mapping;
- Inliers/Outliers Detection and Model Refinement: how outliers are detected and how the model is refined on its inliers;
- Selection Function Computation: the computation of the function used for model selection.

A. Learning Algorithm

The presented algorithm has been developed for offline data modelling, yet it takes inspiration from online model learning.
In online learning, continuous streams of data, consisting of input and output values, are stored and a model describing the data is continuously built. The model may need adjustments when new data are acquired and, if outliers are present, they need to be neglected so as not to produce a wrong model estimate. Nevertheless, the data that are considered outliers for one model may be inliers for another. Since input and output values stream continuously, sticking to a fixed model may lead to wrong results that discard too many data points. Repeatedly rebuilding a model from the whole amount of data acquired may not be the best option either, since it might be computationally expensive due to the large amount of data to analyse, and also because possible outliers may be included. In [22] the authors proposed a method for online calibration of force/torque sensors. That problem can be seen as multivariate linear regression, since the multi-dimensional input/multi-dimensional output mapping is linear. The method presented in this work is an extension to more general nonlinear modelling.

For offline data modelling it is supposed that N data points have been stored. Each data point consists of an input vector $x \in \mathbb{R}^{n_{in}}$ and an output vector $y \in \mathbb{R}^{n_{out}}$. Of these points, packages of n samples, taken randomly from the whole dataset, are iteratively analysed and sent to the algorithm, which simulates an online data acquisition procedure. At the first iteration no pre-existing model exists; the first model is built when the first package of data arrives. After that, at each iteration it is first checked whether the model built so far has enough inliers in the newly arrived data package. If the percentage of inliers for the model in the new dataset is high enough (greater than a certain value), then the model computed so far is considered good and it is refined on all its inliers from the whole dataset acquired (old data and new data).
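The package-wise procedure can be sketched as a loop (a simplified, hypothetical skeleton: `build_model`, `refine`, and `inlier_fraction` are placeholder callables, `min_inliers` an assumed threshold, and the model-selection branch of the full method is reduced to a comment):

```python
import numpy as np

def learn_offline(X, y, n, build_model, refine, inlier_fraction, min_inliers=0.6):
    """Skeleton of the package-wise learning loop. Packages of n samples
    are drawn in random order to simulate online data acquisition."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(X))
    model, seen = None, []
    for start in range(0, len(X), n):
        batch = order[start:start + n]        # incoming data package
        seen.extend(batch.tolist())           # all data acquired so far
        if model is None:
            model = build_model(X[batch], y[batch])       # first package
        elif inlier_fraction(model, X[batch], y[batch]) >= min_inliers:
            model = refine(model, X[seen], y[seen])       # refine on all data
        else:
            model = build_model(X[batch], y[batch])       # candidate model_new
        # (the full method also builds model_mixed and keeps the best f)
    return model
```

With a linear least-squares fit plugged into the placeholders, the loop reduces to refitting on the growing dataset; the value of the scheme comes from the inlier check and model comparison.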
The refinement of the model and the outlier/inlier detection are described in Section III-B. If, instead, the number of inliers on the new dataset is not enough, it may mean either that the package of data is "bad" or that the model is not good enough. For this reason a new model (model_new) is built from the new dataset. This new model will have a certain percentage of inliers ($In_{new}$) on the whole dataset acquired so far. These inliers will have a certain population ($Pop_{new}$) and spread ($Spr_{new}$), which define a certain value of the selection function $f = f_{new}$ (as defined in Section III-C). The same applies to the old model, which is associated with $In_{old}$, $Pop_{old}$, $Spr_{old}$, defining $f = f_{old}$. Another model (model_mix) is also built from n samples, randomly taken with a certain percentage $p_1$ from the old dataset and with a percentage $p_2 = 1 - p_1$ from the new data package. The percentage is computed as follows:

$$p_1 = \begin{cases} 0.5, & \text{if } In_{old}, In_{new} < 0.5 \\ 0.5 + \dfrac{f_{old} - f_{new}}{2}, & \text{otherwise} \end{cases} \qquad (5)$$

This allows more points to be taken from the dataset providing the better model. This third model is associated with $In_{mix}$, $Pop_{mix}$, $Spr_{mix}$, defining $f = f_{mix}$. Once the three models have been built, $f_{old}$, $f_{new}$, $f_{mix}$ are compared and the one with the highest value of f is taken as the best model. The process then continues, sending new data packages, until the whole dataset has been analysed. Figure 1 shows the flow chart of the presented learning method, whereas Figure 2 shows an example of a single iteration of the algorithm.

Fig. 2: Single iteration of the learning algorithm. The blue dotted points are the data acquired so far; the large blue dots are the data points from which each model is computed; the red dots are the estimated outputs; the yellow dots define the inliers threshold; the black line is the desired mapping.
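The mixing percentage of Eq. (5) can be sketched directly (a minimal sketch; the branch condition follows the reading that both inlier percentages must be below 0.5):

```python
def mixing_fraction(in_old, in_new, f_old, f_new):
    """p1, the fraction of the mixed package drawn from the old dataset (Eq. 5).
    in_old, in_new: inlier percentages of the old and new models;
    f_old, f_new: their selection-function values (all in [0, 1])."""
    if in_old < 0.5 and in_new < 0.5:
        return 0.5                        # neither model is trustworthy: split evenly
    return 0.5 + (f_old - f_new) / 2.0    # bias sampling toward the better model
```

Since $f \in [0,1]$, the second branch keeps $p_1 \in [0,1]$: a clearly better old model pushes $p_1$ toward 1, a better new model toward 0.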
B. Outliers Detection and Model Refinement

In order to build a model which is not affected by bad data, outliers must be detected and neglected. In the proposed method, data points are considered outliers if their residual from the model value (i.e. the difference between the output value and the estimated one) is above a certain threshold. For a certain model, given n data points, the estimated output for each output component is $\hat{y}_{i,j}$, for $i = 1 \ldots n_{out}$ and $j = 1 \ldots n$. For each output dimension the vector of residuals is computed as $r_i = \left[ y_{i,1} - \hat{y}_{i,1}, \; \ldots, \; y_{i,n} - \hat{y}_{i,n} \right]$. The median $m_i$ and the median absolute deviation $MAD_i$ of each residual vector are calculated, and the threshold is then set to $t_i = \alpha \, MAD_i$, with $\alpha$ being a positive constant. The MAD is used because it is a measure robust to outliers [23]. Once the medians and the thresholds have been retrieved, each data sample is assigned an output weight as follows:

$$v = \frac{r_{i,j} - m_i}{t_i}, \qquad W_{i,j} = e^{-7 v^8} \qquad (6)$$

A data point for a certain output component is thus an outlier if its residual is too far from the median of the residuals or, equivalently, if its weight is too small (below $e^{-7}$). Otherwise, the data point is an inlier. Different weighting functions can, however, be assigned.

Fig. 3: Example of population and spread computation for one-dimensional input/one-dimensional output mapping.

In order to build a robust model from a given dataset, an iterative re-weighting process is performed. At first, each sample for each output component is assigned a unitary weight and a first model is built. Then the inliers and their weights are computed, and the model is refined with the new weights. The process continues until a desired number of refinements is reached.
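The MAD-based weighting of Eq. (6) can be sketched for one output component (a minimal sketch; the value of the positive constant $\alpha$ is an assumption, and a small epsilon is added to avoid division by zero when all residuals coincide):

```python
import numpy as np

def residual_weights(y, y_hat, alpha=3.0):
    """Weights of Eq. (6) for one output component.
    alpha is the (assumed) positive constant scaling the MAD threshold."""
    r = y - y_hat                      # residuals r_{i,j}
    m = np.median(r)                   # median m_i of the residuals
    mad = np.median(np.abs(r - m))     # median absolute deviation MAD_i
    t = alpha * mad + 1e-12            # threshold t_i (epsilon avoids /0)
    v = (r - m) / t
    w = np.exp(-7.0 * v**8)            # Eq. (6): W_{i,j} = e^{-7 v^8}
    inliers = w > np.exp(-7.0)         # weight above e^{-7}  <=>  |v| < 1
    return w, inliers
```

Because the exponent is even and steep, weights stay close to 1 for residuals well inside the threshold and collapse sharply once $|v|$ exceeds 1, which is what makes the iterative re-weighting effectively ignore outliers.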
C. Selection Function Computation

In order to choose the best model among those computed, a selection function f is introduced:

$$f = w_p \, Pop + w_{spr} \, Spr + w_{in} \, In \qquad (7)$$

The selection function f returns a real value in the interval [0, 1]: the closer f is to 1, the better the model. The weights $w_p, w_{spr}, w_{in} \in [0,1]$ in Eq. (7) are such that $w_p + w_{spr} + w_{in} = 1$; they are user defined and can be tuned depending on the importance given to each of the three components. $Pop \in [0,1]$ is an index measuring the population of the inliers for the given model, $Spr \in [0,1]$ defines the spread of the inliers, and $In \in [0,1]$ is the percentage of inliers for the model. Algorithm 1 shows how these values are computed.

In a dataset of N samples, a certain model will have $n_i$ inliers for each output component $i = 1 \ldots n_{out}$. The input and output values of the inlier data points are stored in $X^{inl}$ and $Y^{inl}$, which are multidimensional arrays of $n_{out}$ rows. Each row of $Y^{inl}$ stores the inlier output vector $y^{inl} \in \mathbb{R}^{1 \times n_i}$, and to each collection of inlier output values corresponds an array of input values $x^{inl} \in \mathbb{R}^{n_{in} \times n_i}$, stored in each row of $X^{inl}$.

Algorithm 1: Inliers Population, Spread, and Percentage computation.
 1: function [Pop, Spr, In] = GETPSI(N, Xinl, Yinl, xmin, xmax, ymin, ymax, din, dout)
 2:   Pop = Spr = In = 0                          ▷ initialize population, spread, percentage
 3:   for i = 1 … nout do
 4:     ni ← size(Yinl_i)                         ▷ number of inliers for output i
 5:     if ni = 0 then                            ▷ no inliers
 6:       Pop_i = Spr_i = In_i = 0
 7:     else
 8:       In_i ← ni / N
 9:       rngout_i ← ymax,i − ymin,i
10:       nbins_out,i ← rngout_i / dout,i
11:       nint_out,i ← FindPopulatedInts(Yinl_i, nbins_out,i)   ▷ populated output intervals
12:       Popout_i ← nint_out,i / nbins_out,i
13:       yinl_max ← max(Yinl_i),  yinl_min ← min(Yinl_i)
14:       Sprout_i ← (yinl_max − yinl_min) / rngout_i
15:       Popin_i = Sprin_i = 0                   ▷ initialize input population, spread
16:       xinl ← Xinl_i                           ▷ inlier input values ∈ R^(nin × ni)
17:       for j = 1 … nin do
18:         rngin_j ← xmax,j − xmin,j
19:         nbins_in,j ← rngin_j / din,j
20:         invect ← getRow(xinl, j)              ▷ returns the j-th row
21:         nint_in,j ← FindPopulatedInts(invect, nbins_in,j)   ▷ populated input intervals
22:         Pin_j ← nint_in,j / nbins_in,j
23:         xinl_max ← max(invect),  xinl_min ← min(invect)
24:         Sin_j ← (xinl_max − xinl_min) / rngin_j
25:         Sprin_i ← Sprin_i + Sin_j
26:         Popin_i ← Popin_i + Pin_j
27:       end for
28:       Sprin_i ← Sprin_i / nin
29:       Popin_i ← Popin_i / nin
30:       Spr_i ← (Sprin_i + Sprout_i) / 2
31:       Pop_i ← (Popin_i + Popout_i) / 2
32:     end if
33:     Spr ← Spr + Spr_i
34:     Pop ← Pop + Pop_i
35:     In ← In + In_i
36:   end for
37:   Spr ← Spr / nout
38:   Pop ← Pop / nout
39:   In ← In / nout
40:   return Pop, Spr, In
41: end function

It is also assumed that each input and output component is defined within
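The core of Algorithm 1 can be sketched for one dimension (a simplified, hypothetical rendering of `GETPSI`: bin counting over a known range, with equal selection-function weights assumed for Eq. (7)):

```python
import numpy as np

def pop_spr(vals, vmin, vmax, d):
    """Population and spread of inlier values over [vmin, vmax],
    discretized into bins of width d (one dimension of Algorithm 1)."""
    n_bins = int(np.ceil((vmax - vmin) / d))
    idx = np.clip(((np.asarray(vals) - vmin) / d).astype(int), 0, n_bins - 1)
    pop = len(set(idx.tolist())) / n_bins          # fraction of populated bins
    spr = (max(vals) - min(vals)) / (vmax - vmin)  # extent of the inliers
    return pop, spr

def selection_f(pop, spr, inl, wp=1/3, wspr=1/3, win=1/3):
    """Selection function of Eq. (7); equal weights are an assumed choice."""
    return wp * pop + wspr * spr + win * inl
```

Pop rewards inliers that fill many bins, Spr rewards inliers that cover a wide extent of the range, and In rewards models that explain many points; f trades the three off.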