Correlation Analysis (Correlate)

Correlation and dependence
In statistics, correlation and dependence refer to any of a broad class of statistical relationships between two or more random variables or observed data values. Correlation is summarized by what is known as the correlation coefficient, which ranges between -1 and +1. Perfect positive correlation (a correlation coefficient of +1) implies that as one security moves, either up or down, the other security moves in lockstep, in the same direction. Conversely, perfect negative correlation means that if one security moves in either direction, the security that is perfectly negatively correlated with it moves by an equal amount in the opposite direction. If the correlation is 0, the movements of the securities are said to have no correlation; they are completely random.

There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (a relationship that may exist even if one variable is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation, or more sensitive to nonlinear relationships. Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (tau), measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as one variable increases, the other decreases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions. However, this view has little mathematical basis: rank correlation coefficients measure a different type of relationship than the Pearson product-moment correlation coefficient and are best seen as measures of a different type of association, rather than as an alternative measure of the population correlation coefficient.
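For reference, in the standard textbook notation (stated here in addition to the text above), the sample Pearson correlation coefficient for paired observations (x_i, y_i), i = 1, ..., n, is

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Spearman's rank correlation coefficient is the same quantity computed on the ranks of the observations; when there are no ties it reduces to

\rho_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the difference between the ranks of x_i and y_i.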
Common misconceptions

Correlation and causality
The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables.

Correlation and linearity
Figure: four sets of data (Anscombe's quartet) with the same correlation of 0.816.
The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y|X). The figure shows scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe. The four y variables have the same mean (7.5), standard deviation (4.12), correlation (0.816), and regression line (y = 3 + 0.5x). However, as can be seen in the plots, the distributions of the variables are very different. The first pair (top left) seems to be distributed normally and corresponds to what one would expect when considering two correlated variables that follow the assumption of normality. The second pair (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that there is an exact functional relationship, only the extent to which that relationship can be approximated by a linear one. In the third case (bottom left), the linear relationship is perfect except for one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows another case in which a single outlier is enough to produce a high correlation coefficient even though the relationship between the two variables is not linear. (An outlier can lower or raise the measured correlation of a data set, but either way it distorts the actual correlation in the data, and this is a situation we should do our best to avoid.) These examples indicate that the correlation coefficient, as a summary statistic, cannot replace individual examination of the data.

Correlation analysis is a statistical method for studying the "correlation relationship" between random variables. A correlation relationship is a non-deterministic relationship. For example, let X and Y denote a person's height and weight, or the amount of fertilizer applied to a field and its yield; X and Y are clearly related, yet not so exactly that the value of one determines the value of the other precisely. This is a correlation relationship. When two variables X and Y are correlated, knowing the value x of X is not enough to determine the value of Y, but it does determine the conditional distribution of Y given X = x; conversely, a value y of Y determines the conditional distribution of X given Y = y. This kind of dependence is the essence of a correlation relationship.

Correlation analysis and regression analysis are closely related in practice. Regression analysis, however, is concerned with the functional form of the dependence of a random variable Y on another random variable (or set of variables) X. In the language of prediction, X is the predictor and Y is the quantity being predicted, so X and Y do not play equal roles. In correlation analysis the variables under discussion have equal status, and the analysis focuses on the correlation characteristics of the random variables themselves. For example, if X and Y denote a pupil's mathematics and language scores, the interest lies in how the two are related, not in using X to predict Y.

The main task of correlation analysis is to estimate ρ_XY from a set of observations (X_i, Y_i), i = 1, 2, ..., n, and to test hypotheses about ρ_XY, in particular H0: ρ_XY = 0. In statistics, r is called the sample correlation coefficient and is used to estimate ρ_XY. In 1915, Fisher derived the sampling distribution of r for the case where the joint distribution of (X, Y) is bivariate normal, which made it possible to test the hypothesis ρ_XY = 0. This was a major advance and marked the establishment of correlation analysis as a statistical method.

Multiple correlation: the correlation coefficient above involves only two variables X and Y. If there are several variables X1, X2, ..., Xk, one can consider the correlation between one of them (say X1) and the remaining variables (X2, X3, ..., Xk); the basic index is the multiple correlation coefficient R of X1 on (X2, X3, ..., Xk). Take arbitrary constants a2, a3, ..., ak and compute the correlation coefficient between X1 and the linear combination a2X2 + a3X3 + ... + akXk; varying a2, a3, ..., ak so that this correlation coefficient reaches its maximum, the maximum value is R. (Multiple correlation describes the strongest relationship between one variable and several variables taken together.)

Partial correlation: this is another important concept in correlation analysis. Let X, Y, and Z denote a person's monthly basic expenses, entertainment expenses, and income, respectively. Analysis may reveal a high positive correlation between X and Y; the reason is that both X and Y are influenced by Z. If the influence of Z on both is removed, the correlation of the remaining parts will change and may even become negative. The latter is the partial correlation of X and Y given Z, which can be measured by the partial correlation coefficient. (Partial correlation describes the relationship between two variables while controlling for other variables.)

Sometimes it is necessary to consider the relationship between one group of variables and another group. For this, the canonical correlation coefficient was introduced, and the corresponding method is called canonical correlation analysis, which belongs to multivariate statistical analysis. Canonical correlation analysis seeks the pair of linear functions, one formed from each group of variables, whose correlation coefficient is maximal; this is called the first pair of canonical variables. A second pair, a third pair, and so on can also be obtained, and these pairs of variables are mutually uncorrelated. The correlation coefficient of each pair is called a canonical correlation coefficient. From the practical meaning represented by the canonical variables, one can find some of the intrinsic connections between the two groups of variables. Canonical correlation analysis appeared as early as the 1930s. (Canonical correlation describes the relationship between two sets of variables, that is, between several variables on one side and several variables on the other.)

1. Differences between correlation analysis and regression analysis
Data: correlation analysis requires both X and Y to follow normal distributions; regression analysis requires the dependent variable Y to follow a normal distribution, while X is a variable that can be measured precisely and controlled strictly (this is usually called Model I regression; regression analysis of bivariate normal data is called Model II regression).
Application: correlation analysis is used to describe the correlation between two variables; regression analysis is used to describe the quantitative dependence of one variable on the other.

2. Connections between correlation analysis and regression analysis
For linear correlation and regression (y = a + bx), the correlation coefficient R and the slope b of the regression equation have the same sign. A positive R means the two variables change in the same direction; a positive b means that when x increases (decreases) by one unit, y increases (decreases) by b units on average.
For linear correlation and regression, the hypothesis tests on the parameters R and b are equivalent. Because the test on R can be carried out by looking up a table, while the test on b involves more computation, in practice the test on the former is often used in place of the latter. The hypotheses are H0: β = 0 versus H1: β ≠ 0, at significance level α = 0.05. The test statistic is t = b / S_b with n - 2 degrees of freedom, where the regression coefficient b and its standard error S_b are both provided by SPSS's linear regression output. From this t value, the two-sided probability under H0 can be computed in SPSS as
P = 2(1 - CDF.T(t, n - 2))   when t >= 0
P = 2 CDF.T(t, n - 2)        when t < 0
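As a minimal sketch of this computation in SPSS syntax (the variable names t_val and df are illustrative placeholders, not variables from the original text), taking the absolute value of t covers both cases above in one step:

* Two-sided p-value for the t statistic of the regression slope, with df = n - 2.
* t_val and df are assumed to already hold the observed t value and degrees of freedom.
COMPUTE p_two_sided = 2 * (1 - CDF.T(ABS(t_val), df)).
EXECUTE.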
1. Bivariate Correlations
The Bivariate Correlations procedure computes Pearson's correlation coefficient, Spearman's rho, and Kendall's tau-b with their significance levels. Correlations measure how variables or rank orders are related. Before calculating a correlation coefficient, screen your data for outliers (which can cause misleading results) and for evidence of a linear relationship. Pearson's correlation coefficient is a measure of linear association. Two variables can be perfectly related, but if the relationship is not linear, Pearson's correlation coefficient is not an appropriate statistic for measuring their association.

Example. Is the number of games won by a basketball team correlated with the average number of points scored per game? A scatterplot indicates that there is a linear relationship. Analyzing data from the 1994-1995 NBA season yields a Pearson correlation coefficient (0.581) that is significant at the 0.01 level. You might also suspect that the more games won per season, the fewer points the opponents scored. These variables are negatively correlated (-0.401), and the correlation is significant at the 0.05 level.

Data. Use symmetric quantitative variables for Pearson's correlation coefficient, and quantitative variables or variables with ordered categories for Spearman's rho and Kendall's tau-b. Choose correlation coefficients based on the characteristics of your data. If you have scale, symmetrically distributed data, the Pearson correlation coefficient is appropriate. If your data are non-symmetrically distributed or are ordinal in nature (such as ranks), Kendall's tau-b or the Spearman coefficient is more appropriate.

Assumptions. Pearson's correlation coefficient assumes that each pair of variables is bivariate normal.

Using Correlations to Study the Association between Motor Vehicle Sales and Fuel Efficiency
The Bivariate Correlations procedure computes the pairwise associations for a set of variables and displays the results in a matrix. (When computing a measure of association between two variables in a larger set, cases are included in the computation whenever the two variables have nonmissing values, irrespective of the values of the other variables in the set; this is pairwise association analysis.) It is useful for determining the strength and direction of the association between two scale or ordinal variables.

In order to increase sales, motor vehicle design engineers want to focus their attention on the aspects of a vehicle that are important to customers; for example, how important is fuel efficiency with respect to sales? One way to measure this is to compute the correlation between past sales and fuel efficiency. Information concerning various makes of motor vehicles is collected in car_sales.sav. This data file contains hypothetical sales estimates, list prices, and physical specifications for various makes and models of vehicles. The list prices and physical specifications were obtained from manufacturer and other sites.

Use Bivariate Correlations to measure the importance of fuel efficiency to the salability of a motor vehicle. To run a correlation analysis, from the menus choose: Analyze > Correlate > Bivariate. Select Sales in thousands and Fuel efficiency as analysis variables. Click OK. These selections produce a correlation matrix for Sales in thousands and Fuel efficiency.

The Pearson correlation coefficient measures the linear association between two scale variables. The correlation reported in the table is negative(!), although not significantly different from 0 because the p-value of 0.837 is greater than 0.10. This suggests that designers should not focus their efforts on making cars more fuel efficient, because there is no appreciable effect on sales. However, the Pearson correlation coefficient works best when the variables are approximately normally distributed and have no outliers. A scatterplot can reveal these possible problems.

To produce a scatterplot of Sales in thousands by Fuel efficiency, from the menus choose: Graphs > Chart Builder. Select the Scatter/Dot gallery and choose Simple Scatter. Select Sales in thousands as the y variable and Fuel efficiency as the x variable. Click the Groups/Point ID tab and select Point ID Label. Select Model as the variable to label cases by. Click OK. The resulting scatterplot shows two potential outliers, one in the lower right of the plot and one in the upper left.
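The same correlation matrix and scatterplot can also be produced from a syntax window. This is only a sketch: the variable names sales (Sales in thousands), mpg (Fuel efficiency), and model are assumed to be the names used in car_sales.sav and should be checked against the file's variable view.

* Pearson correlation between sales and fuel efficiency, pairwise deletion of missing values.
CORRELATIONS
  /VARIABLES=sales mpg
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

* Simple scatterplot of sales against fuel efficiency, with points labeled by model.
GRAPH
  /SCATTERPLOT(BIVAR)=mpg WITH sales BY model (NAME)
  /MISSING=LISTWISE.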
To identify these points, activate the graph by double-clicking it. Click the Data ID Mode tool. Select the point in the lower right; it is identified as the Metro. Select the point in the upper left; it is identified as the F-Series. The F-Series is found to be generally representative of the vehicles your design team is working on, so you decide to keep it in the data set for now. This point may appear to be an outlier because of the skewed distribution of Sales in thousands, so try replacing Sales in thousands with Log-transformed sales in further analyses. The Metro is not representative of the vehicles that your design team is working on, so you can safely remove it from further analyses.

To remove the Metro from the correlation computations, from the menus choose: Data > Select Cases. Select If condition is satisfied and click If. Type model ~= "Metro" in the text box. Click Continue. Click OK in the Select Cases dialog box. A new filter variable is created so that all cases except the Metro are used in further computations.

To analyze the filtered data, recall the Bivariate Correlations dialog box. Deselect Sales in thousands as an analysis variable and select Log-transformed sales as an analysis variable. Click OK. After removing the outlier and looking at the log-transformed sales, the correlation is now positive but still not significantly different from 0.

However, the customer demographics for trucks and automobiles are different, and the reasons for buying a truck or a car may not be the same. It is worthwhile to look at another scatterplot, this time marking trucks and autos separately. To produce a scatterplot of Log-transformed sales by Fuel efficiency, controlling for Vehicle type, recall the Simple Scatterplot dialog box. Deselect Sales in thousands and select Log-transformed sales as the y variable. Select Vehicle type as the variable to set markers by. Click OK.

The scatterplot shows that trucks and automobiles form distinctly different groups. By splitting the data file according to Vehicle type, you might get a more accurate view of the association. Also note that with the log transformation of sales, the potential outlier in the upper left has disappeared.

To split the data file according to Vehicle type, from the menus choose: Data > Split File. Select Compare groups. Select Vehicle type as the variable on which groups should be based. Click OK. To analyze the split file, recall the Bivariate Correlations dialog box and click OK.

Splitting the file on Vehicle type has made the relationship between sales and fuel efficiency much clearer: there is a significant and fairly strong positive correlation between sales and fuel efficiency for automobiles. For trucks, the correlation is positive but not significantly different from 0. Reaching these conclusions has required some work and shows that correlation analysis using the Pearson correlation coefficient is not always straightforward. For comparison, see how you can avoid the difficulty of transforming variables by using nonparametric correlation measures.

Spearman's rho and Kendall's tau-b measure the rank-order association between two scale or ordinal variables. They work regardless of the distributions of the variables. Select all cases again. To obtain an analysis using Spearman's rho, recall the Bivariate Correlations dialog box. Select Sales in thousands as an analysis variable. Deselect Pearson and select Spearman. Click OK.
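In syntax form, the filtering, file splitting, and nonparametric correlations described above look roughly as follows. Again, this is a sketch: the variable names model, type, sales, lnsales, and mpg are assumed to correspond to Model, Vehicle type, Sales in thousands, Log-transformed sales, and Fuel efficiency in car_sales.sav.

* Exclude the Metro by filtering rather than deleting cases.
USE ALL.
COMPUTE filter_$ = (model ~= "Metro").
FILTER BY filter_$.
EXECUTE.

* Report results separately for automobiles and trucks.
SORT CASES BY type.
SPLIT FILE LAYERED BY type.

* Pearson correlation of log-transformed sales with fuel efficiency, per vehicle type.
CORRELATIONS
  /VARIABLES=lnsales mpg
  /PRINT=TWOTAIL NOSIG.

* Restore all cases, then compute Spearman's rho (rank-based, robust to the outlier).
FILTER OFF.
USE ALL.
NONPAR CORR
  /VARIABLES=sales lnsales mpg
  /PRINT=SPEARMAN TWOTAIL.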
Spearman's rho is reported separately for automobiles and trucks. As with Pearson's correlation coefficient, the association between Log-transformed sales and Fuel efficiency is fairly strong. However, Spearman's rho reports the same correlation for the untransformed sales! This is because rho is based on rank orders, which are unchanged by a log transformation. Moreover, outliers have less of an effect on Spearman's rho, so it is possible to save some time and effort by using it as a measure of association.

Using Bivariate Correlations, you produced a correlation matrix for Sales in thousands by Fuel efficiency and, surprisingly, found a negative correlation. Upon removing an outlier and using Log-transformed sales, the correlation became positive, although not significantly different from 0. However, you found that by computing the correlations separately for trucks and autos, there is a positive and statistically significant correlation between sales and fuel efficiency for automobiles. Furthermore, you found similar results without the transformation using Spearman's rho, and you may be wondering why you should go through the effort of transforming variables when Spearman's rho is so convenient. The measures of rank order are handy for discovering whether there is any kind of association between two variables, but when they find an association, it is a good idea to find a transformation that makes the relationship linear. This is because there are more predictive models available for linear relationships, and linear models are generally easier to implement and interpret.

The Bivariate Correlations procedure is useful for studying the pairwise associations for a set of scale or ordinal variables. If you have nominal variables, use the Crosstabs procedure to obtain measures of association. If you want to model the value of a scale variable based on its linear relationship to other variables, try the Linear Regression procedure. If you want to decompose the variation in your data to look for underlying patterns, try the Factor Analysis procedure.

Pearson Correlation. The most widely used type of correlation coefficient is Pearson's r (Pearson, 1896), also called the linear or product-moment correlation (the term correlation was first used by Galton, 1888). In non-technical language, we can say that the correlation coefficient determines the extent to which values of two variables are proportional to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if the relationship can be approximated by a straight line (sloped upwards or downwards). This line is called the regression line or least-squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Pearson correlation assumes that the two variables are measured on at least interval scales.

Spearman R. Spearman R can be thought of as the regular Pearson product-moment correlation coefficient (Pearson r), that is, in terms of the proportion of variability accounted for, except that Spearman R is computed from ranks. As mentioned above, Spearman R assumes that the variables under consideration were measured on at least an ordinal (rank order) scale.
