Applied Multivariate Statistical Analysis 1
Preface to the 1st Edition

Most of the observable phenomena in the empirical sciences are of a multivariate nature. In financial studies, assets in stock markets are observed simultaneously and their joint development is analyzed to better understand general tendencies and to track indices. The underlying theoretical structure of these and many other quantitative studies of applied sciences is multivariate. This book on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate data analysis with a strong focus on applications.

The aim of the book is to present multivariate data analysis in a way that is understandable for non-mathematicians and practitioners who are confronted by statistical data analysis. This is achieved by focusing on the practical relevance and through the e-book character of this text. All practical examples may be recalculated and modified by the reader using a standard web browser and without reference to or application of any specific software.

The book is divided into three main parts. The first part is devoted to graphical techniques describing the distributions of the variables involved. The second part deals with multivariate random variables and presents, from a theoretical point of view, distributions, estimators and tests for various practical situations. The last part is on multivariate techniques and introduces the reader to the wide selection of tools available for multivariate data analysis.

All data sets are given in the appendix and are available for download.
The text contains a wide variety of exercises, the solutions of which are given in a separate textbook. In addition, a full set of transparencies is provided, making it easier for an instructor to present the materials in this book. All transparencies contain hyperlinks to the statistical web service so that students and instructors alike may recompute all examples via a standard web browser.

Weeks 1-2
UNIT-I Descriptive Techniques
1 Comparison of Batches
1.1 Boxplots
1.2 Histograms
1.3 Scatterplots
1.4 Data Set - Boston Housing

1 Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^{n}$ of $n$ observations of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$

and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$ random variables:

$$X = (X_1, X_2, \ldots, X_p),$$

where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable.

How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:
Are there components of X that are more spread out than others?
Are there some elements of X that indicate subgroups of the data?
Are there outliers in the components of X?
How "normal" is the distribution of the data?

1.1 Boxplots

[Figure: boxplots for variables X6 and X1]

Summary
The median and mean bars are measures of location.
The relative location of the median (and the mean) in the box is a measure of skewness.
The length of the box and whiskers is a measure of spread.
The length of the whiskers indicates the tail length of the distribution.
The outlying points are marked with one of two symbols, depending on whether they lie beyond $F_U + 1.5\,d_F$ or $F_L - 1.5\,d_F$, or beyond $F_U + 3\,d_F$ or $F_L - 3\,d_F$, respectively. Here $F_U$ and $F_L$ denote the upper and lower fourths (quartiles) and $d_F = F_U - F_L$ is the F-spread.
The boxplots do not indicate multimodality or clusters.
If we compare the relative size and location of the boxes, we are comparing distributions.

Reading material
21. data capacity
22. data handling
23. data reduction
24. data transformation
25. density function
26. description
27. descriptive
28. deviation from average (deviation from the mean)
29. Df. Fit (difference in fit)
30. df. (degree of freedom)
31. distribution shape
32. double logarithmic
33. eigenvector
34. error of estimate
35. estimation
36. Euclidean distance
37. expected value
38. experimental sampling
39. explanatory variable
40. explore; summarize

1.2 Histograms

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. In contrast to boxplots, density estimates show possible multimodality of the data. The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) with origin $x_0$.

Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a bin grid starting at $x_0$:

$$B_j(x_0, h) = [\,x_0 + (j - 1)h,\; x_0 + jh\,), \qquad j \in \mathbb{Z},$$

where $[\cdot,\cdot)$ denotes a left-closed and right-open interval. If $\{x_i\}_{i=1}^{n}$ is an i.i.d. sample with density $f$, the histogram is defined as follows:

$$\hat f_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \qquad (1.7)$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ counts the number of observations falling into bin $B_j(x_0, h)$. The second indicator function $I\{x \in B_j(x_0, h)\}$ is responsible for "localizing" the counts around $x$.
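The histogram definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the book: the function names `bin_index` and `histogram_density` and the sample values are assumptions made up for this example. Because exactly one bin contains x, the double sum reduces to counting the observations that share a bin with x and dividing by n h.

```python
import math


def bin_index(value, x0, h):
    """Return j such that value lies in B_j(x0, h) = [x0 + (j-1)h, x0 + jh)."""
    return math.floor((value - x0) / h) + 1


def histogram_density(data, x, x0, h):
    """Evaluate the histogram estimate at x for origin x0 and binwidth h.

    Hypothetical helper: counts the observations in the bin containing x
    and divides by n * h.
    """
    if h <= 0:
        raise ValueError("binwidth h must be positive")
    if not data:
        raise ValueError("data must contain at least one observation")
    n = len(data)
    j = bin_index(x, x0, h)  # the single bin that contains x
    count = sum(1 for xi in data if bin_index(xi, x0, h) == j)
    return count / (n * h)


# Made-up illustrative sample (not the bank note data).
sample = [0.1, 0.2, 0.7, 1.5]
print(histogram_density(sample, 0.5, x0=0.0, h=1.0))   # 0.75
print(histogram_density(sample, 0.25, x0=0.0, h=0.5))  # 1.0
```

Since the origin x0 and the binwidth h are plain arguments, the same function can be called with different values of either one to compare the resulting estimates.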
The parameter h is a smoothing or localizing parameter and controls the width of the histogram bins. An h that is too large leads to very big blocks and thus to a very unstructured histogram. On the other hand, an h that is too small gives a very variable estimate with many unimportant peaks.

[Figure 1.6: histograms of the diagonal of the counterfeit bank notes; panels h = 0.1, h = 0.2, h = 0.3, h = 0.4]

The effect of h is given in detail in Figure 1.6. It contains the histogram (upper left) for the diagonal of the counterfeit bank notes for x0 = 137.8 (the minimum of these observations) and h = 0.1. Increasing h to h = 0.2 and using the same origin, x0 = 137.8, results in the histogram shown in the lower left of the figure. This density histogram is somewhat smoother due to the larger h. The binwidth is next set to h = 0.3 (upper right). From this histogram, one has the impression that the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth.

Using methods from smoothing methodology one can find an "optimal" binwidth h for n observations:

$$h_{opt} = \left( \frac{24\sqrt{\pi}}{n} \right)^{1/3}.$$

In Figure 1.7, we show histograms with x0 = 137.65 (upper left), x0 = 137.75 (lower left), x0 = 137.85 (upper right), and x0 = 137.95 (lower right). All the graphs have been scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth h, the interpretation is not facilitated. The shift of the origin x0 (to 4 different locations) created 4 different histograms. This property of histograms strongly contradicts the goal of presenting data features.

Summary
Modes of the density are detected with a histogram.
Modes correspond to strong peaks in the histogram.
Histograms with the same h need not be identical. They also depend on the origin x0 of the grid.
The influence of the origin x0 is drastic. Changing x0 creates different looking histograms.
The consequence of an h that is too large is an unstructured histogram that is too flat.
A binwidth h that is too small results in an unstable histogram.
There is an "optimal" $h = (24\sqrt{\pi}/n)^{1/3}$.
It is recommended to use averaged histograms. They are kernel densities.

1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us understand relationships among the variables of a data set. A downward-sloping scatter indicates that as we increase the variable on the horizontal axis, the variable on the vertical axis decreases. An analogous statement can be made for upward-sloping scatters.
