1. Source: /ucinet/help/3j.x0e.htm

TOOLS > CLUSTER ANALYSIS > HIERARCHICAL

PURPOSE
Perform Johnson's hierarchical clustering on a proximity matrix.

DESCRIPTION
Given a symmetric n-by-n matrix representing similarities or dissimilarities among a set of n items, the algorithm finds a series of nested partitions of the items. The different partitions are ordered according to decreasing (increasing) levels of similarity (dissimilarity). The algorithm begins with the identity partition (in which all items are in different clusters). It then joins the pair of items that are most similar (least different), which are then considered a single entity. The algorithm continues in this manner until all items have been joined into a single cluster (the complete partition).

PARAMETERS

Input dataset
Name of the file containing the proximity matrix to be clustered. Data type: square symmetric matrix.

Similarities or Distances? (Default = Similarities)
Whether items i and j should be clustered together when X(i,j) is large or when it is small. If the data are similarities, items i and j are clustered together if X(i,j) is very large. If the data are dissimilarities, items i and j are clustered together if X(i,j) is very small.

Output Partition matrix: (Default = Part)
Name of the dataset to contain the partition-by-item indicator matrix. Each column of this matrix gives the cluster to which each item was assigned in a given partition. The columns are labeled by the level of the cluster. A value of k in a column labeled x and row j means that actor j was in partition k at level x. Actor k is always a member of partition k and serves as a representative label for the group. The matrix can be used by procedures such as Transform > Block to obtain density matrices at any level of blocking. This file is not displayed in the LOG FILE.

Output Ultrametric matrix (if desired):
Name of the dataset to contain the item-by-item ultrametric proximity matrix, if selected.

Method: (Default = WTD_AVERAGE)
Choices are:
SINGLE_LINK - Also known as the minimum or connectedness method. The distance between two clusters is defined as the smallest dissimilarity (largest similarity) between members.
COMPLETE_LINK - Also known as the maximum or diameter method. The distance between two clusters is defined as the largest dissimilarity (smallest similarity) between members.
WTD_AVERAGE - The distance between clusters is calculated as the average dissimilarity (or similarity) value, weighted by cluster size.
SIMPLE AVERAGE - The distance between clusters is defined as the average dissimilarity (or similarity) between members, where the clusters are treated as single data points.

Graphical Dendrogram: (Default = Dendrogram)
The clustering can be shown as a dendrogram or a tree diagram.

Textual Dendrogram: (Default = Landscape)
Whether the textual output described below is given in portrait or landscape format.

Maximum label length (Default = 15)
Number of label characters above which labels are truncated.

Compute ultrametric proximity matrix? (Default = NO)
Hierarchical clustering can be seen as transforming a dissimilarity matrix into an ultrametric distance matrix. The ultrametric distances correspond monotonically to the number of iterations (partitions) needed to join a given pair of items.
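The method choices and the partition-by-item output described above have close analogues in other tools. The following is a minimal sketch (not UCINET itself) using SciPy's hierarchical clustering on a small, invented dissimilarity matrix; the item labels and distances are hypothetical, and the SciPy method names only roughly correspond to the UCINET options ("single", "complete", "average" and "weighted" map approximately onto SINGLE_LINK, COMPLETE_LINK, WTD_AVERAGE and SIMPLE AVERAGE):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

labels = ["A", "B", "C", "D", "E"]            # hypothetical items
D = np.array([[0.0, 1.0, 4.0, 5.0, 6.0],      # invented symmetric dissimilarity matrix
              [1.0, 0.0, 3.5, 5.5, 6.5],
              [4.0, 3.5, 0.0, 2.0, 2.5],
              [5.0, 5.5, 2.0, 0.0, 1.5],
              [6.0, 6.5, 2.5, 1.5, 0.0]])

condensed = squareform(D)                     # linkage() expects the condensed upper triangle

# "average" is the size-weighted (UPGMA-style) average; "weighted" treats
# clusters as single points (WPGMA-style); "single" and "complete" are as above.
Z = linkage(condensed, method="average")

# Cluster membership at each merge level: a rough analogue of the
# partition-by-item indicator matrix described above.
for level in Z[:, 2]:                         # Z[:, 2] holds the height of every merge
    membership = fcluster(Z, t=level, criterion="distance")
    print(f"level {level:.3f}: {dict(zip(labels, membership))}")

Each printed line corresponds to one nested partition, in the same spirit as the rows of the textual dendrogram described in the LOG FILE section below.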
LOG FILE
The primary output is a set of cluster diagrams. The first diagram (either a tree diagram or a dendrogram) re-orders the actors so that they are located close to other actors in similar clusters. The level at which any pair of actors is aggregated is the point at which both can be reached by tracing from the start to the actors, from right to left. The scale at the top gives the level at which they are clustered. The diagram can be printed or saved. Parts of the diagram can be viewed by moving the mouse to the split point in a tree diagram, or to the beginning of a line in the dendrogram, and clicking. The first click highlights a portion of the diagram and the second click displays just the highlighted portion. To return to the original view, right-click the mouse. There is also a simple zoom facility: change the values and then press enter. If the labels need to be edited (particularly the scale labels), take the partition indicator matrix into the spreadsheet editor, remove or reduce the labels, and then submit the edited data to Tools > Dendrogram > Draw.

The output also produces a standard log file that contains a different cluster diagram, the textual dendrogram, which looks like this:

          A B C D E F G H I J
                            1
Level     1 2 3 4 5 6 7 8 9 0
-----     - - - - - - - - - -
1.000     XXXXX XXX XXX XXXXX
1.422     XXXXX XXX XXXXXXXXX
1.578     XXXXXXXXX XXXXXXXXX
3.287     XXXXXXXXXXXXXXXXXXX

In this example, the data were distances among 10 items, labeled A through J. The results are four nested partitions, corresponding to rows in the diagram. Within a given row, an X between two adjacent columns indicates that the items associated with those columns were assigned to the same cluster in that partition. For example, in the first partition (level 1.000), items D and E belong to the same cluster, but C is a member of a different cluster. In the third partition (level 1.578), items D, E and C all belong to the same cluster.

The levels indicate the degree of association (similarity or dissimilarity) among items within clusters. If, as in the example, the data are distances and the clustering method is single link, then a level of 1.578 means that every item within a cluster is no more than 1.578 units distant from at least one other item in that cluster. If the clustering method is complete link, a level of 1.578 indicates that every item in a cluster is no more than 1.578 units distant from every other item in the cluster. For the average clustering method, a level of 1.578 indicates that the average distance among items within the cluster is 1.578.

For similarity data, the meaning of the levels for the single link and complete link methods is, in a sense, reversed. For the single link method, a level of 1.578 means that every item in a cluster is at least 1.578 units similar to at least one other item in the cluster. For the complete link method, a level of 1.578 means that every item in a cluster is at least 1.578 units similar to every other item in the cluster.

The log file also contains a table of cluster adequacy, with rows giving the method and columns the clusters at various levels. The columns are ordered from the finest to the coarsest partition, so that column 1 corresponds to the first step in the clustering and the last column is the clustering just before everything is placed in one cluster.

Finally, there is a table giving the size of each cluster as a proportion of the total population. The rows are the clusters and the columns are the levels; the (i, j) entry gives the proportion of members in cluster i at level j.
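As a small illustration of the ultrametric idea and the level interpretation just described, the sketch below computes the cophenetic (ultrametric) distances for the same invented five-item example used earlier: the entry for a pair of items is the level at which they are first joined. This is a SciPy sketch for illustration, not UCINET's own ultrametric output.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

labels = ["A", "B", "C", "D", "E"]            # same invented example as before
D = np.array([[0.0, 1.0, 4.0, 5.0, 6.0],
              [1.0, 0.0, 3.5, 5.5, 6.5],
              [4.0, 3.5, 0.0, 2.0, 2.5],
              [5.0, 5.5, 2.0, 0.0, 1.5],
              [6.0, 6.5, 2.5, 1.5, 0.0]])

Z = linkage(squareform(D), method="single")   # single-link clustering of the distances
ultra = squareform(cophenet(Z))               # item-by-item ultrametric (cophenetic) matrix
for lab, row in zip(labels, ultra):
    print(lab, np.round(row, 3))              # row[j] = level at which this item joins item j

With single linkage, each entry is the smallest level at which the two items end up in the same cluster, which is the monotone relationship to the sequence of partitions described above.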
TIMING
O(N^3)

COMMENTS
None.

REFERENCES
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-253.

2. Source: Wikipedia
From Wikipedia, the free encyclopedia

In statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

- Agglomerative: a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n^3), which makes it too slow for large data sets. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n^2)) are known: SLINK[1] for single-linkage and CLINK[2] for complete-linkage clustering.

Contents
1 Cluster dissimilarity
  1.1 Metric
  1.2 Linkage criteria
2 Discussion
3 Example for Agglomerative Clustering
4 Software
  4.1 Free
  4.2 Commercial
5 See also
6 Notes
7 References and further reading

Cluster dissimilarity
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Metric
Further information: metric (mathematics)
The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) can be 2, sqrt(2) or 1 under the Manhattan distance, Euclidean distance or maximum distance, respectively.

Some commonly used metrics for hierarchical clustering are:[3]

Euclidean distance: ||a - b||_2 = sqrt( sum_i (a_i - b_i)^2 )
Squared Euclidean distance: ||a - b||_2^2 = sum_i (a_i - b_i)^2
Manhattan distance: ||a - b||_1 = sum_i |a_i - b_i|
Maximum distance: ||a - b||_inf = max_i |a_i - b_i|
Mahalanobis distance: sqrt( (a - b)^T S^-1 (a - b) ), where S is the covariance matrix
Cosine similarity: (a . b) / (||a|| ||b||)

For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used. A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.[citation needed]

Linkage criteria
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations A and B are:[4][5]

Maximum or complete-linkage clustering: max { d(a, b) : a in A, b in B }
Minimum or single-linkage clustering: min { d(a, b) : a in A, b in B }
Mean or average linkage clustering (UPGMA): (1 / (|A| |B|)) * sum over a in A, b in B of d(a, b)
Minimum energy clustering: (2 / (n m)) * sum_{i,j} ||a_i - b_j||_2 - (1 / n^2) * sum_{i,j} ||a_i - a_j||_2 - (1 / m^2) * sum_{i,j} ||b_i - b_j||_2, with n = |A| and m = |B|

where d is the chosen metric. Other linkage criteria include:
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's criterion).[6]
- The probability that candidate clusters spawn from the same distribution function (V-linkage).
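As a small illustration of the first three linkage criteria above, the sketch below evaluates the complete-, single- and average-linkage distances between two invented point sets A and B with NumPy, using the Euclidean metric as d:

import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])   # invented set A
B = np.array([[4.0, 4.0], [5.0, 3.5]])               # invented set B

# Pairwise Euclidean distances d(a, b) for all a in A, b in B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

print("complete linkage:", pairwise.max())            # maximum over all pairs
print("single linkage:  ", pairwise.min())            # minimum over all pairs
print("average linkage: ", pairwise.mean())           # mean over all pairs (UPGMA)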
Discussion
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.

Example for Agglomerative Clustering
For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

(Figure: raw data)

The hierarchical clustering dendrogram would be as such:

(Figure: traditional representation of the dendrogram)

Cutting the tree at a given height will give a partitioning clustering at a selected precision. In this example, cutting after the second row of the dendrogram will yield the clusters {a} {b c} {d e} {f}. Cutting after the third row will yield the clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a}, {b}, {c}, {d}, {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters (a naive version of this merge loop is sketched after this section). A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore need to define the distance between two clusters A and B. Usually it is one of the following:

- The maximum distance between elements of each cluster (also called complete-linkage clustering): max { d(a, b) : a in A, b in B }
- The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(a, b) : a in A, b in B }
- The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): (1 / (|A| |B|)) * sum over a in A, b in B of d(a, b)
- The sum of all intra-cluster variance.
- The increase in variance for the cluster being merged (Ward's method).[6]
- The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
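The following is a naive sketch of the merge loop just described: repeatedly find the two closest clusters, merge them, and report the merge level. It uses single linkage on invented 2-D coordinates and recomputes distances on every pass instead of maintaining a cached, updated distance matrix, so it illustrates the logic rather than an efficient implementation.

import numpy as np

points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.2, 0.3),
          "d": (5.0, 5.0), "e": (5.5, 5.2), "f": (6.0, 4.8)}   # invented coordinates

def dist(c1, c2):
    # Single-linkage distance: smallest pairwise Euclidean distance between clusters.
    return min(np.hypot(points[p][0] - points[q][0],
                        points[p][1] - points[q][1])
               for p in c1 for q in c2)

clusters = [[k] for k in points]              # start: every element in its own cluster
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance.
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
    level = dist(clusters[i], clusters[j])
    print(f"merge {clusters[i]} + {clusters[j]} at level {level:.3f}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]

Practical implementations, including the SciPy routines used in the earlier sketches, cache and update the inter-cluster distances as rows and columns are merged, which is what the distance-matrix description above refers to.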
Software

Free
- R has several functions for hierarchical clustering: see CRAN Task View: Cluster Analysis & Finite Mixture Models for more information.
- Orange, a free data mining software suite, offers the module orngClustering for scripting in Python, or cluster analysis through visual programming.
- hcluster is Python software, based on NumPy, which supports hierarchical clustering and plotting.
- Cluster 3.0 provides a graphical user interface to access different clustering routines and is available for Windows, Mac OS X, Linux and Unix.
- ELKI includes multiple hierarchical clustering algorithms.
- figue, a JavaScript package that implements some agglomerative clustering functions (single-linkage, complete-linkage, average-linkage) and functions to visualize clustering output (e.g. dendrograms) (online demo).
- MultiDendrograms, an open-source Java application for variable-group agglomerative hierarchical clustering,[7] with a graphical user interface.
- CrimeStat implements two hierarchical clustering routines, a nearest neighbor (Nnh) and a risk-adjusted (Rnnh).

Commercial
- Software for analyzing multivariate data with instant response using hierarchical clustering
- SAS CLUSTER
- MATLAB Hierarchical Clustering (Statistics Toolbox)

See also
- Cluster analysis
- CURE data clustering algorithm
- Dendrogram
- Determining the number of clusters in a data set
- Hierarchical clustering of networks
- Nearest-neighbor chain algorithm
- Numerical taxonomy
- OPTICS algorithm

Notes
1. R. Sibson (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method". The Computer Journal (British Computer Society) 16 (1): 30-34. /wkim/index_files/papers/sibson.pdf
2. D. Defays (1977). "An efficient algorithm for a complete link method". The Computer Journal (British Computer Society) 20 (4): 364-366.
3. "The DISTANCE Procedure: Proximity Measures". SAS/STAT 9.2 User's Guide. S
