CHAMELEON-A-Hierarchical-Clustering-Algorithm-:变色龙的层次聚类算法课件_第1页
CHAMELEON-A-Hierarchical-Clustering-Algorithm-:变色龙的层次聚类算法课件_第2页
CHAMELEON-A-Hierarchical-Clustering-Algorithm-:变色龙的层次聚类算法课件_第3页
CHAMELEON-A-Hierarchical-Clustering-Algorithm-:变色龙的层次聚类算法课件_第4页
CHAMELEON-A-Hierarchical-Clustering-Algorithm-:变色龙的层次聚类算法课件_第5页
已阅读5页,还剩28页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、CHAMELEON:A Hierarchical Clustering Algorithm Using Dynamic ModelingPaper presentation in data mining classPresenter : 許明壽 ; 蘇建仲Data : 2001/12/182001/12/181CHAMELEONAbout this paper Department of Computer Science and Engineering , University of MinnesotaGeorge KarypisEui-Honh (Sam) HanVipin KumarIEE

2、E Computer Journal - Aug. 19992001/12/182CHAMELEONOutlineProblems definitionMain algorithmKeys features of CHAMELEONExperiment and related workedConclusion and discussion2001/12/183CHAMELEONProblems definitionClusteringIntracluster similarity is maximizedIntercluster similarity is minimizedProblems

3、of existing clustering algorithmsStatic model constrainBreakdown when clusters that are of diverse shapes,densities,and sizesSusceptible to noise , outliers , and artifacts 2001/12/184CHAMELEONStatic model constrainData space constrainK means , PAM etcSuitable only for data in metric spacesCluster s

4、hape constrainK means , PAM , CLARANSAssume cluster as ellipsoidal or globular and are similar sizesCluster density constrainDBScanPoints within genuine cluster are density-reachable and point across different clusters are notSimilarity determine constrainCURE , ROCKUse static model to determine the

5、 most similar cluster to merge 2001/12/185CHAMELEONPartition techniques problem(a) Clusters of widely different sizes(b) Clusters with convex shapes2001/12/186CHAMELEONHierarchical technique problem (1/2)The (c) , (d) will be choose to merge when we only consider closeness2001/12/187CHAMELEONHierarc

6、hical technique problem (2/2)The (a) , (c) will be choose to merge when we only consider inter-connectivity2001/12/188CHAMELEONMain algorithmTwo phase algorithmPHASE IUse graph partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters.PHASE IIUses an agglo

7、merative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining together these sub-clusters.2001/12/189CHAMELEONFramework ConstructSparse GraphPartition the GraphMerge PartitionFinal ClustersData Set2001/12/1810CHAMELEONKeys features of CHAMELEONModeling the dataModel

8、ing the cluster similarityPartition algorithmsMerge schemes2001/12/1811CHAMELEONTermsArguments neededKK-nearest neighbor graphMINSIZEThe minima size of initial clusterTRIThreshold of related inter-connectivity TRC Threshold of related intra-connectivityCoefficient for weight of RI and RC2001/12/1812

9、CHAMELEONModeling the dataK-nearest neighbor graph approachAdvantagesData points that are far apart are completely disconnected in the GkGk capture the concept of neighborhood dynamicallyThe edge weights of dense regions in Gk tend to be large and the edge weights of sparse tend to be small2001/12/1

10、813CHAMELEONExample of k-nearest neighbor graph2001/12/1814CHAMELEONModeling the clustering similarity (1/2)Relative interconnectivityRelative closeness2001/12/1815CHAMELEONModeling the clustering similarity (2/2) If related is considered , (c) , (d) will be merged2001/12/1816CHAMELEONPartition algo

11、rithm (PHASE I)WhatFinding the initial sub-clustersWhyRI and RC cant be accurately calculated for clusters containing only a few data pointsHowUtilize multilevel graph partitioning algorithm (hMETIS)Coarsening phasePartitioning phaseUncoarsening phase2001/12/1817CHAMELEONPartition algorithm (cont.)I

12、nitialall points belonging to the same clusterRepeat until (size of all clusters MINSIZE)Select the largest cluster and use hMETIS to bisectPost scriptumBalance constrainSpilt Ci into CiA and CiB and each sub-clusters contains at least 25% of the node of Ci2001/12/1818CHAMELEON2001/12/1819CHAMELEONW

13、hatMerging sub-clusters using a dynamic frameworkHowFinding and merging the pair of sub-clusters that are the most similarScheme 1Scheme 2Merge schemes (Phase II)and2001/12/1820CHAMELEONExperiment and related workedIntroduction of CUREIntroduction of DBSCANResults of experimentPerformance analysis20

14、01/12/1821CHAMELEONIntroduction of CURE (1/n)Clustering Using Representative points1. Properties :Fit for non-spherical shapes.Shrinking can help to dampen the effects of outliers.Multiple representative points chosen for non-sphericalEach iteration , representative points shrunk ratio related to me

15、rge procedure by some scattered points chosen Random sampling in data sets is fit for large databases2001/12/1822CHAMELEONIntroduction of CURE (2/n)2. Drawbacks :Partitioning method can not prove data points chosen are good.Clustering accuracy with respect to the parameters below :(1) Shrink factor

16、s : CURE always find the right clusters by range of s values from 0.2 to 0.7.(2) Number of representative points c : CURE always found right clusters for value of c greater than 10.(3) Number of Partitions p : with as many as 50 partitions , CURE always discovered the desired clusters.(4) Random Sam

17、ple size r : (a) for sample size up to 2000 , clusters found poor quality(b) from 2500 sample points and above , about 2.5% of the data set size , CURE always correctly find the clusters.2001/12/1823CHAMELEON3. Clustering algorithm : Representative points 2001/12/1824CHAMELEONMerge procedure2001/12/

18、1825CHAMELEONIntroduction of DBSCAN (1/n)Density Based Spatial Clustering of Application With Noise1. Properties :Can discovery clusters of arbitrary shape.Each cluster with a typical density of points which is higher than outside of cluster.The density within the areas of noise is lower than the de

19、nsity in any of the clusters.Input the parameters MinPts onlyEasy to implement in C+ language using R*-treeRuntime is linear depending on the number of points.Time complexity is O(n * log n)2001/12/1826CHAMELEONIntroduction of DBSCAN (2/n)2. Drawbacks :Cannot apply to polygons.Cannot apply to high d

20、imensional feature spaces.Cannot process the shape of k-dist graph with multi-features.Cannot fit for large database because no method applied to reduce spatial database.3. DefinitionsEps-neighborhood of a point pNEps(p)=qD | dist(p,q)=MinPts (core point condition)We know directly density-reachable

21、is symmetric when p and q both are core point , otherwise is asymmetric if one core point and one border point.5. p is density-reachable from q if there is a chain of points between p and qDensity-reachable is transitive , but not symmetricDensity-reachable is symmetric for core points.2001/12/1828C

22、HAMELEONIntroduction of DBSCAN (4/n)6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s.Density-connected is symmetric and reflexive relationA cluster is defined to be a set of density-connected points which is maximal density-reac

23、hability.Noise is the set of points not belong to any of clusters.7. How to find cluster C ?Maximality p , q : if p C and q is density-reachable from p , then q CConnectivity p , q C : p is density-connected to q8. How to find noises ? p , if p is not belong to any clusters , then p is noise point20

24、01/12/1829CHAMELEONResults of experiment2001/12/1830CHAMELEONPerformance analysis (1/2)The time of construct the k-nearest neighborLow-dimensional data sets based on k-d trees , overall complexity of O(n log n)High-dimensional data sets based on k-d trees not applicable , overall complexity of O(n2)Finding initial sub-clustersObtains m clusters by repeated partitioning successively smaller graphs , overall computa

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论