Module Six: Outlier Detection for the Two-Sample Case

The two-sample plot, also known as Youden's plot, is a scatter plot with a confidence region. Youden used it to detect labs with unusual test results when two samples are tested in n different labs. The Youden plot is a special case of the bivariate control chart, and the idea behind it is principal component analysis. In this module, we discuss principal component analysis, how it is applied to construct bivariate control charts, and how to interpret the resulting plot.

There are three types of laboratory testing where two-sample plots can be applied:

1. A group of labs participates in testing two similar materials using the same method. It is important to identify labs, if any, whose results differ extremely from the rest on either one or both materials. A two-sample plot is used to detect the extreme labs. This is the case studied by Youden (1954) and later extended by Mandel and Lashof (1974).

2. A participating lab tests a variety of materials using the same method as the standard lab, and its results are compared with those of the standard lab to see whether any particular test differs extremely from the standard lab. A paired bivariate control chart is used to determine how well the participating lab performs relative to the standard lab. Tracy, Young and Mason (1995) used a similar approach, a bivariate control chart, for studying paired measurements in quality control.

3. A lab tests two or more similar materials using a standard procedure on a regular basis over many days. A bivariate or multivariate control chart is used for process control. In recent decades, multivariate control charts have been developed for process control of two or more quality characteristics simultaneously.
A similar situation may occur in laboratory testing process control.

The classical Youden's plot for two-sample cases

The following inter-laboratory study reports the percent of insoluble residue in cement from 29 laboratories.

Row   %residue A   %residue B
1     0.31         0.22
2     0.08         0.12
3     0.24         0.14
4     0.14         0.07
5*    0.52         0.37
6     0.38         0.19
7     0.22         0.14
8*    0.46         0.23
9     0.26         0.05
10    0.28         0.14
11    0.10         0.18
12    0.20         0.09
13    0.26         0.10
14    0.28         0.14
15    0.25         0.13
16    0.25         0.11
17    0.26         0.17
18    0.26         0.18
19    0.12         0.05
20    0.29         0.14
21    0.22         0.11
22    0.13         0.10
23*   0.56         0.42
24    0.30         0.30
25    0.24         0.06
26*   0.25         0.35
27    0.24         0.09
28    0.28         0.23
29    0.14         0.10

Using a one-sample box plot, we have:

(Figure: one-sample box plots.)

A quick check suggests that Labs 5, 8, and 23 are likely outliers. They are excluded from the two-sample plot.

(Figure: Youden plot, from the NIST website.)

The vertical line passes through the median of the X variable. The horizontal line passes through the median of the Y variable. The center of the graph is the intersection of the two median lines. This intersection is called the Manhattan median.

What is the 45° line for? What is the practical meaning of this line? Why and how does the Youden plot work?

Under absolutely perfect conditions, the two results in a pair would be equal when the same material is tested twice using the same testing procedure. However, this is never the case. In fact, there are two major components of uncertainty: systematic error and random error. From common experience, the systematic error of a given lab should, in theory, be the same or very close when the test is performed under the same conditions using the same procedure, and each lab differs from the others. This suggests that, if there is no random or unexpected error but lab differences are allowed, the pair of results should be equal (x = y) for a given lab but may differ between labs. In terms of the plot, the pairs of data then lie on the 45° line passing through the Manhattan median.
Therefore, for a pair (x, y) on the 45° line, its distance from the center is the systematic error component for the given lab.

How can we identify the random error component from the plot?

In reality, the data pairs are scattered on the graph. Only very few pairs lie on the 45° line, which happens only when x and y are really the same. Consider a data point that is away from the center and not on the 45° line. The distance from the point (x, y) to the center is the total error:

Total error = Systematic error + Random error

For the example point (-2, -7) with center (0, 0):

Random error component (RaE): the distance from (-2, -7) to (-4.5, -4.5).
Systematic error component (SyE): the distance from (-4.5, -4.5) to (0, 0).
The total error = $\sqrt{\mathrm{SyE}^2 + \mathrm{RaE}^2} = \sqrt{40.5 + 12.5} \approx 7.28$, the distance from (-2, -7) to (0, 0).

How to determine (x3, y3)? For simplicity, assume the center is (0, 0). Then x3 = (x2 + y2)/2. That is, the systematic component is located at ((x2 + y2)/2, (x2 + y2)/2). Hence, the systematic error component is the distance from this point to (0, 0):

$\mathrm{SyE} = \sqrt{2}\,\frac{x_2 + y_2}{2} = \frac{x_2 + y_2}{\sqrt{2}}$

If the center is at (x1, y1), then the systematic error component is

$\mathrm{SyE} = \frac{(x_2 - x_1) + (y_2 - y_1)}{\sqrt{2}}$

The distance is allowed to be negative here to reflect the quadrant. The magnitude is positive.

NOTE: p in the formula is the random component; in our notation, we use RaE.

If there is no systematic error, the uncertainty involves only the random error component. Therefore, the standard deviation of the random error components is an estimate of the random error variation. The 95% coverage region is given by the radius = 2.45(s). The value 2.45 is based on the assumption that (X, Y) follows a bivariate normal distribution and that X and Y are independent.

Youden's calculation of the standard deviation and the radius

The Youden plot is applied to identify labs with unusually high systematic error as well as labs with unusually high random error. When a lab has both, its total error is large and its data point will be far from the center. These are the first group of labs that should be closely investigated.
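
As a quick numerical check of the decomposition and the 2.45 multiplier described above, here is a minimal sketch. The function names `decompose` and `coverage_multiplier` are illustrative and not from the source; the multiplier assumes independent normal errors with equal standard deviation in both directions, in which case the coverage of a circle of radius b·s is 1 - exp(-b²/2).

```python
import math


def decompose(x, y, center=(0.0, 0.0)):
    """Split a point into (signed systematic, random, total) error.

    The systematic part is measured along the 45-degree line through
    the center; the random part is the perpendicular distance to it.
    """
    dx, dy = x - center[0], y - center[1]
    # Coordinate of the foot of the perpendicular on the 45-degree line.
    m = (dx + dy) / 2
    systematic = math.sqrt(2) * m              # signed, negative toward lower left
    random_part = abs(dx - dy) / math.sqrt(2)  # distance from the point to the line
    total = math.hypot(dx, dy)                 # distance from the point to the center
    return systematic, random_part, total


def coverage_multiplier(p):
    """Radius multiplier b solving 1 - exp(-b**2 / 2) = p (assumed model)."""
    return math.sqrt(-2 * math.log(1 - p))


sye, rae, total = decompose(-2, -7)
print(f"SyE = {sye:.3f}, RaE = {rae:.3f}, total = {total:.3f}")
# SyE = -6.364, RaE = 3.536, total = 7.280
print(f"95% multiplier = {coverage_multiplier(0.95):.3f}")  # 2.448
print(f"99% multiplier = {coverage_multiplier(0.99):.3f}")  # 3.035
```

The three distances form a right triangle, so the total error is the vector sum of the two components rather than the sum of their lengths.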
It may happen that one material is less sensitive to different environments than the other. When this happens, the data points tend to lie parallel to the X-axis or the Y-axis, with a large variation due to one material. This can also be quickly identified. When labs have unusually high systematic errors but small random components, their points scatter along the 45° line, and more data points fall in the upper-right and lower-left quadrants. When a lab has a large random component, its point is far from the 45° line; that is, it has a higher probability of falling in the upper-left or lower-right quadrant.

In order to identify these unusual labs, that is, labs with large systematic error, Youden suggested drawing a circle centered at the Manhattan median with a radius equal to a multiple of the variation due to random error. The variation due to random error is the standard deviation s of the random errors of the labs.

According to Youden (1959), a circle with 95% coverage probability has radius = 2.448(s), and a circle with 99% coverage probability has radius = 3.035(s). The relationship between the coverage probability and the multiple b is given in Youden's original paper in the Journal of Industrial Quality Control, 1959, pp. 24-28:

Coverage probability = $1 - \exp(-b^2/2)$

An alternative approach to computing the systematic error components, the random error components, and the corresponding variations

For a given data point (x, y), the corresponding coordinate of the systematic component is ((x+y)/2, (x+y)/2). The difference between (x, y) and ((x+y)/2, (x+y)/2) along the X-axis is x - (x+y)/2 = (x-y)/2, and the difference along the Y-axis is y - (x+y)/2 = (y-x)/2. This suggests that the random error for the data point (x, y) is (x-y)/2.

1. For each lab, compute the systematic component (x+y)/2 - (x0+y0)/2, where (x0, y0) is the median origin.
2. Compute the random error component (x-y)/2.
3. Compute the variance and s.d. for each component.
$s_B$ measures the variation of between-lab systematic errors; $s_e$ measures the variation due to random errors.
4. To construct the circle with 95% coverage probability, use radius = 2.45($s_e$).

Activity: construct a classical Youden's plot for the two-sample case

The data are the same inter-laboratory study of the percent of insoluble residue in cement, reported by 29 laboratories, that is listed earlier in this module.

Variable    N    N*   Mean      StDev
(A-B)/2     25   4    0.04760   0.03473
(A+B)/2     25   4    0.1816    0.0570

The radius of the circle for the 95% coverage region is 0.03473 x 2.45 = 0.085. Hence the circle, centered at the medians of the two materials, has the form:

$(x - 0.25)^2 + (y - 0.13)^2 = (0.085)^2$

Variable     N    Mean     Median   StDev
Material A   25   0.2292   0.2500   0.0731
Material B   25   0.1340   0.1300   0.0597

Hands-on activities using Mandel and Lashof's data

Principal Component Analysis: the concept behind two-sample plots

The idea behind the two-sample plot is principal components and the bivariate normal distribution. The following scatter plot illustrates the principal components for the bivariate case.

(Figure: scatter plot with the principal axes crossing at (79.81, 92.81).)

The scatter plot is from an inter-laboratory study in Mandel & Lashof (1974). The data are tensile strengths of rubber for two different materials, tested in 16 laboratories.

Laboratory   Strength-E (X2)   Strength-H (X1)
1            94                80
2            103               82
3            94                77
4            99                83
5            97                86
6            91                76
7            91                81
8            102               98
9            98                83
10           91                81
11           93                82
12           82                69
13           93                81
14           82                72
15           83                73
16           92                73

Y1 and Y2 are the new coordinates. Y1 represents the direction in which the data values have the largest variability. Y2 is perpendicular to Y1.
They intersect at the sample averages $(\bar{X}_1, \bar{X}_2) = (79.81, 92.81)$. To find Y1 and Y2, we need to transform from X1 and X2. To simplify the discussion, we move the origin to $(\bar{X}_1, \bar{X}_2)$ and redefine the (X1, X2) coordinates as $x_1 = X_1 - \bar{X}_1$, $x_2 = X_2 - \bar{X}_2$, so that the origin is (0, 0). The relationship is illustrated in the following graph. We would like to express the data of a given lab, p = (x1, x2), as p = (y1, y2). From basic geometry, we see:

$y_1 = (\cos\theta)\,x_1 + (\sin\theta)\,x_2$
$y_2 = (-\sin\theta)\,x_1 + (\cos\theta)\,x_2$

(Figure: rotation of the (x1, x2) axes to the (y1, y2) axes.)

The angle $\theta$ is determined so that the observations along the Y1 axis have the largest variability. But how?

The transformation from (x1, x2) to (y1, y2) has several nice properties:
1. The variability along y1 is the largest.
2. y1 and y2 are uncorrelated, that is, orthogonal.
3. The confidence region based on (y1, y2) is easy to construct and provides useful interpretations of two-sample plots.

Questions that remain unanswered:
1. How do we determine the angle $\theta$ so that the variability of the observations along the y1 axis is maximized?
2. How do we construct the ellipse for confidence regions with different levels of confidence?
3. How do we interpret the two-sample plots?

How do we determine the y1 and y2 axes so that the variability of the observations along the y1 axis is maximized and y2 is orthogonal to y1?

Rewrite the linear relation between (y1, y2) and (x1, x2) in matrix notation:

$Y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = AX$

NOTE: X is bivariate, and so is Y, with

$V(X) = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad V(Y) = A\,V(X)\,A' = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}$

$\lambda_1$ and $\lambda_2$ are called the eigenvalues, which are the solutions of $|V(X) - \lambda I| = 0$. Moreover, $V(Y_1) = \lambda_1$, $V(Y_2) = \lambda_2$, and the correlation between Y1 and Y2 is 0.

The angle is $\theta = \arctan\left(\frac{\lambda_1 - \sigma_1^2}{\rho\sigma_1\sigma_2}\right)$ if $\rho \neq 0$; when $\sigma_1 = \sigma_2$, $\theta = 45°$.

Note that the angle depends on the correlation between X1 and X2 as well as on the variances of X1 and X2. When $\rho$ is close to zero, the angle is also close to zero.
If V(X1) and V(X2) are close, then the points are scattered like a circle; that is, there is no clear major principal component. When $\rho$ is close to zero and V(X1) is much larger than V(X2), the angle will be close to zero, and the data points tend to lie parallel to the X-axis. On the other hand, if V(X1) is much smaller than V(X2), the angle will be close to 90°, and the data points tend to lie parallel to the Y-axis.

Consider, now, that we actually observe the following two-sample data: $(X_{11}, X_{21}), (X_{12}, X_{22}), \ldots, (X_{1n}, X_{2n})$.

The sample means are given by

$\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ji}, \quad j = 1, 2$

The sample variance-covariance matrix is given by

$S = \begin{pmatrix} S_1^2 & rS_1S_2 \\ rS_1S_2 & S_2^2 \end{pmatrix}$

r is Pearson's correlation coefficient, $S^2$ is the sample variance, and S is the sample standard deviation.

The eigenvalues are the solutions of $|S - \lambda I| = 0$. The solutions for $\lambda$ are given by

$\lambda_{1,2} = \frac{1}{2}\left[S_1^2 + S_2^2 \pm \sqrt{(S_1^2 - S_2^2)^2 + 4r^2S_1^2S_2^2}\right]$

NOTE: $V(Y_1) + V(Y_2) = \lambda_1 + \lambda_2 = S_1^2 + S_2^2 = V(X_1) + V(X_2)$

Using the sample data, the angle is estimated by

$\hat{\theta} = \arctan\left(\frac{\lambda_1 - S_1^2}{rS_1S_2}\right)$

Case example

We now use the tensile strength data to demonstrate the computation of principal components and related sample information. For the tensile strength example, X1 is material H and X2 is material E. The number of labs is n = 16. Using Minitab, we obtain the following information:

Variance-covariance matrix:
      H         E
H     46.4292   35.0292
E     35.0292   40.9625

Correlation = 0.8031

Principal Component Analysis: Tensile Strength-H, Tensile Strength-E

The eigenvalues are 78.831 and 8.560, the solutions of $|S - \lambda I| = 0$.

Linear coefficients between (Y1, Y2) and (X1, X2):
Variable   Y1      Y2
H          0.734   -0.679
E          0.679    0.734

These are the coefficients for
$y_1 = (\cos\theta)\,x_1 + (\sin\theta)\,x_2$
$y_2 = (-\sin\theta)\,x_1 + (\cos\theta)\,x_2$

The angle is $\theta = \arctan\left(\frac{78.831 - 46.4292}{35.0292}\right) = 42.77°$.

The sample means from the sample data are:
Variable   N    Mean
H          16   79.81
E          16   92.81

In terms of (Y1, Y2), the means are:
Variable   N    Mean
Y1         16   121.61
Y2         16   13.937

(Figure: two-sample scatter plot.)

Confidence region for two-sample plots

Each of X1 and X2 can be treated as a univariate variable. In most cases, we consider that each variable follows a normal distribution.
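
As a numerical aside before the confidence region is developed, the Minitab figures reported for the tensile strength example can be reproduced from the 16 laboratory pairs with the Python standard library. This is a sketch rather than the module's own procedure: the helper names are illustrative, the eigenvalues come from the closed-form roots of a symmetric 2x2 matrix, and the angle is taken as arctan((λ1 - S1²)/S12), the form used in the worked arithmetic of this module (it assumes a nonzero covariance).

```python
import math

# Tensile strength data (Mandel & Lashof, 1974) as listed in this module.
# H is X1 and E is X2, one value per laboratory.
H = [80, 82, 77, 83, 86, 76, 81, 98, 83, 81, 82, 69, 81, 72, 73, 73]
E = [94, 103, 94, 99, 97, 91, 91, 102, 98, 91, 93, 82, 93, 82, 83, 92]


def mean(values):
    return sum(values) / len(values)


def sample_cov(u, v):
    """Sample covariance with the n - 1 divisor (the variance when u is v)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def principal_components_2d(x1, x2):
    """Return (lambda1, lambda2, theta in degrees) for two paired samples."""
    s11, s22, s12 = sample_cov(x1, x1), sample_cov(x2, x2), sample_cov(x1, x2)
    # Roots of |S - lambda * I| = 0 for a symmetric 2x2 matrix.
    half_trace = (s11 + s22) / 2
    root = math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Angle of the first principal axis; assumes s12 != 0.
    theta = math.degrees(math.atan((lam1 - s11) / s12))
    return lam1, lam2, theta


lam1, lam2, theta = principal_components_2d(H, E)
print(f"eigenvalues: {lam1:.3f}, {lam2:.3f}")
print(f"angle: {theta:.2f} degrees")

# Sample means expressed in the rotated (Y1, Y2) coordinates.
c, s = math.cos(math.radians(theta)), math.sin(math.radians(theta))
y1_mean = c * mean(H) + s * mean(E)
y2_mean = -s * mean(H) + c * mean(E)
print(f"Y1 mean: {y1_mean:.2f}, Y2 mean: {y2_mean:.2f}")
```

The printed values can be compared against the variance-covariance matrix, eigenvalues, angle, and (Y1, Y2) means reported above.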
The rules we introduced for the one-variable case assume that each variable follows a normal distribution. We can apply outlier detection methods to each variable. When we consider the two variables simultaneously, (X1, X2) is bivariate, and its distribution is taken to be bivariate normal. Because of this extension, we are able to construct ellipses that work similarly to the empirical rule. We can construct several ellipses so that the probability of a pair of data falling inside the ellipse is 0.95, 0.99, and so on. The construction of the ellipse is simplified when we use the principal components described above, and the interpretations based on the principal components are very useful.

Bivariate normal distribution and its application in two-sample plots

Because the ellipse region relies on the bivariate normal distribution, we briefly introduce the bivariate normal here. The bivariate normal density of X1 and X2 has the form:

$f(x_1, x_2) = (2\pi\sigma_1\sigma_2)^{-1}(1-\rho^2)^{-1/2}\exp(-Q/2)$

where

$Q = \frac{1}{1-\rho^2}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]$

We usually use the notation $X \sim N_2(\mu, \Sigma)$, where

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ is the mean vector, and

$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$ is the variance-covariance matrix.

An ellipse Q = c, c > 0, centered at $(\mu_1, \mu_2)$ can then be drawn in the (X1, X2) coordinates. The shape and orientation of the ellipse are determined by the values of $\sigma_1^2$, $\sigma_2^2$ and $\rho$, and its size is determined by the choice of the constant c. The constant c can be chosen based on the level of confidence using the bivariate normal distribution. When we collect two samples, the sample data provide the sample means and the sample variance-covariance matrix.

The sample means are given by $\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ji}$, j = 1, 2. The sample variance-covariance matrix is given by

$S = \begin{pmatrix} S_1^2 & rS_1S_2 \\ rS_1S_2 & S_2^2 \end{pmatrix}$

r is Pearson's correlation coefficient, $S^2$ is the sample variance, and S is the sample standard deviation.

When the population parameters are replaced by the corresponding sample information, we obtain Hotelling's $T^2$:

$T^2 = (X - \bar{X})'\,S^{-1}\,(X - \bar{X})$

$T^2$ is distributed as $\frac{2(n-1)(n+1)}{n(n-2)}\,F_{2,\,n-2}$, which is a multiple of an F-distribution.
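
The multiple of F can be evaluated numerically. The sketch below assumes the multiplier is 2(n-1)(n+1)/(n(n-2)) with (2, n-2) degrees of freedom, which is what the n = 16 arithmetic in this module works out to. It uses the closed-form F quantile that is available only when the numerator degrees of freedom equal 2, so no statistics library is needed. The function names are illustrative, not from the source.

```python
import math


def f_quantile_df1_2(p, df2):
    """Quantile of F(2, df2); closed form valid only for numerator df = 2."""
    # The CDF of F(2, m) is 1 - (1 + 2f/m)**(-m/2); solve it for f.
    return (df2 / 2) * ((1 - p) ** (-2 / df2) - 1)


def t2_cutoff(n, p=0.95):
    """c* = 2(n-1)(n+1) / (n(n-2)) * F_p(2, n-2), under the assumed scaling."""
    scale = 2 * (n - 1) * (n + 1) / (n * (n - 2))
    return scale * f_quantile_df1_2(p, n - 2)


def t2_statistic(point, means, s11, s22, s12):
    """T^2 = (x - xbar)' S^{-1} (x - xbar) for a 2x2 covariance matrix S."""
    d1, d2 = point[0] - means[0], point[1] - means[1]
    det = s11 * s22 - s12 ** 2
    # Quadratic form using the explicit inverse of a symmetric 2x2 matrix.
    return (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det


print(f"F(0.95; 2, 14) = {f_quantile_df1_2(0.95, 14):.2f}")  # 3.74
print(f"c* for n = 16  = {t2_cutoff(16):.2f}")               # 8.51
```

A lab's pair falls outside the 95% ellipse when `t2_statistic(...)` for that pair exceeds `t2_cutoff(n)`. The unrounded cutoff is slightly below the hand-computed value because the hand calculation rounds the F percentile to 3.74 first.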
The ellipse region is now given by $T^2 = c^*$. The constant $c^*$ is determined using the F-distribution described above: the corresponding 100(1-α)th percentile is taken from the F distribution with (2, n-2) degrees of freedom. For example, when the number of participating labs is n = 16, the 95th percentile of F(2, n-2) = F(2, 14) is 3.74. Therefore, $c^*$ = [2(255)/224] x 3.74 = 8.515. A 95% ellipse region can then be constructed using $T^2$ = 8.515.

How to construct a 100(1-α)% region in a two-sample plot, the Youden's plot?

Under the (X1, X2) coordinates, a general form of an ellipse is given by:
Under the principal component coordinate and m