版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、CHAMELEON:A Hierarchical Clustering Algorithm Using Dynamic ModelingPaper presentation in data mining classPresenter : 許明壽 ; 蘇建仲Data : 2001/12/182001/12/181CHAMELEONAbout this paper Department of Computer Science and Engineering , University of MinnesotaGeorge KarypisEui-Honh (Sam) HanVipin KumarIEE
2、E Computer Journal - Aug. 19992001/12/182CHAMELEONOutlineProblems definitionMain algorithmKeys features of CHAMELEONExperiment and related workedConclusion and discussion2001/12/183CHAMELEONProblems definitionClusteringIntracluster similarity is maximizedIntercluster similarity is minimizedProblems
3、of existing clustering algorithmsStatic model constrainBreakdown when clusters that are of diverse shapes,densities,and sizesSusceptible to noise , outliers , and artifacts 2001/12/184CHAMELEONStatic model constrainData space constrainK means , PAM etcSuitable only for data in metric spacesCluster s
4、hape constrainK means , PAM , CLARANSAssume cluster as ellipsoidal or globular and are similar sizesCluster density constrainDBScanPoints within genuine cluster are density-reachable and point across different clusters are notSimilarity determine constrainCURE , ROCKUse static model to determine the
5、 most similar cluster to merge 2001/12/185CHAMELEONPartition techniques problem(a) Clusters of widely different sizes(b) Clusters with convex shapes2001/12/186CHAMELEONHierarchical technique problem (1/2)The (c) , (d) will be choose to merge when we only consider closeness2001/12/187CHAMELEONHierarc
6、hical technique problem (2/2)The (a) , (c) will be choose to merge when we only consider inter-connectivity2001/12/188CHAMELEONMain algorithmTwo phase algorithmPHASE IUse graph partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters.PHASE IIUses an agglo
7、merative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining together these sub-clusters.2001/12/189CHAMELEONFramework ConstructSparse GraphPartition the GraphMerge PartitionFinal ClustersData Set2001/12/1810CHAMELEONKeys features of CHAMELEONModeling the dataModel
8、ing the cluster similarityPartition algorithmsMerge schemes2001/12/1811CHAMELEONTermsArguments neededKK-nearest neighbor graphMINSIZEThe minima size of initial clusterTRIThreshold of related inter-connectivity TRC Threshold of related intra-connectivityCoefficient for weight of RI and RC2001/12/1812
9、CHAMELEONModeling the dataK-nearest neighbor graph approachAdvantagesData points that are far apart are completely disconnected in the GkGk capture the concept of neighborhood dynamicallyThe edge weights of dense regions in Gk tend to be large and the edge weights of sparse tend to be small2001/12/1
10、813CHAMELEONExample of k-nearest neighbor graph2001/12/1814CHAMELEONModeling the clustering similarity (1/2)Relative interconnectivityRelative closeness2001/12/1815CHAMELEONModeling the clustering similarity (2/2) If related is considered , (c) , (d) will be merged2001/12/1816CHAMELEONPartition algo
11、rithm (PHASE I)WhatFinding the initial sub-clustersWhyRI and RC cant be accurately calculated for clusters containing only a few data pointsHowUtilize multilevel graph partitioning algorithm (hMETIS)Coarsening phasePartitioning phaseUncoarsening phase2001/12/1817CHAMELEONPartition algorithm (cont.)I
12、nitialall points belonging to the same clusterRepeat until (size of all clusters MINSIZE)Select the largest cluster and use hMETIS to bisectPost scriptumBalance constrainSpilt Ci into CiA and CiB and each sub-clusters contains at least 25% of the node of Ci2001/12/1818CHAMELEON2001/12/1819CHAMELEONW
13、hatMerging sub-clusters using a dynamic frameworkHowFinding and merging the pair of sub-clusters that are the most similarScheme 1Scheme 2Merge schemes (Phase II)and2001/12/1820CHAMELEONExperiment and related workedIntroduction of CUREIntroduction of DBSCANResults of experimentPerformance analysis20
14、01/12/1821CHAMELEONIntroduction of CURE (1/n)Clustering Using Representative points1. Properties :Fit for non-spherical shapes.Shrinking can help to dampen the effects of outliers.Multiple representative points chosen for non-sphericalEach iteration , representative points shrunk ratio related to me
15、rge procedure by some scattered points chosen Random sampling in data sets is fit for large databases2001/12/1822CHAMELEONIntroduction of CURE (2/n)2. Drawbacks :Partitioning method can not prove data points chosen are good.Clustering accuracy with respect to the parameters below :(1) Shrink factor
16、s : CURE always find the right clusters by range of s values from 0.2 to 0.7.(2) Number of representative points c : CURE always found right clusters for value of c greater than 10.(3) Number of Partitions p : with as many as 50 partitions , CURE always discovered the desired clusters.(4) Random Sam
17、ple size r : (a) for sample size up to 2000 , clusters found poor quality(b) from 2500 sample points and above , about 2.5% of the data set size , CURE always correctly find the clusters.2001/12/1823CHAMELEON3. Clustering algorithm : Representative points 2001/12/1824CHAMELEONMerge procedure2001/12/
18、1825CHAMELEONIntroduction of DBSCAN (1/n)Density Based Spatial Clustering of Application With Noise1. Properties :Can discovery clusters of arbitrary shape.Each cluster with a typical density of points which is higher than outside of cluster.The density within the areas of noise is lower than the de
19、nsity in any of the clusters.Input the parameters MinPts onlyEasy to implement in C+ language using R*-treeRuntime is linear depending on the number of points.Time complexity is O(n * log n)2001/12/1826CHAMELEONIntroduction of DBSCAN (2/n)2. Drawbacks :Cannot apply to polygons.Cannot apply to high d
20、imensional feature spaces.Cannot process the shape of k-dist graph with multi-features.Cannot fit for large database because no method applied to reduce spatial database.3. DefinitionsEps-neighborhood of a point pNEps(p)=qD | dist(p,q)=MinPts (core point condition)We know directly density-reachable
21、is symmetric when p and q both are core point , otherwise is asymmetric if one core point and one border point.5. p is density-reachable from q if there is a chain of points between p and qDensity-reachable is transitive , but not symmetricDensity-reachable is symmetric for core points.2001/12/1828C
22、HAMELEONIntroduction of DBSCAN (4/n)6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s.Density-connected is symmetric and reflexive relationA cluster is defined to be a set of density-connected points which is maximal density-reac
23、hability.Noise is the set of points not belong to any of clusters.7. How to find cluster C ?Maximality p , q : if p C and q is density-reachable from p , then q CConnectivity p , q C : p is density-connected to q8. How to find noises ? p , if p is not belong to any clusters , then p is noise point20
24、01/12/1829CHAMELEONResults of experiment2001/12/1830CHAMELEONPerformance analysis (1/2)The time of construct the k-nearest neighborLow-dimensional data sets based on k-d trees , overall complexity of O(n log n)High-dimensional data sets based on k-d trees not applicable , overall complexity of O(n2)Finding initial sub-clustersObtains m clusters by repeated partitioning successively smaller graphs , overall computa
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 微创三叉神经微血管减压术的术后早期活动指导
- 影像学评分系统对职业性哮喘的严重度评估
- 康复辅助技术适配的培养策略创新优化
- 康复机器人与传统康复疗法的联合疗效对比
- 序贯免疫联合治疗在AEGC中的策略优化
- 干细胞治疗长期评估方法
- 小儿丹毒课件
- 干细胞多能性鉴定的质控体系优化策略
- 帕金森病非运动症状改善
- 寝室消防用电安全培训内容课件
- 2026秋招:贵州盐业集团笔试题及答案
- 留学合同补充协议
- 大学计算机教程-计算与人工智能导论(第4版)课件 第10章 云计算与大数据
- 全球创新药临床试验十年趋势洞察
- 2025年超声科工作总结和2026年工作计划
- 2025河南郑州公用事业投资发展集团有限公司招聘10人笔试参考题库附带答案详解(3卷)
- 北师大版初中九年级上册数学期末试卷后附答案
- 枪支管理法考试题及答案
- 张家口市氢能产业安全监督和管理办法
- 2025年自然资源部所属单位工作人员招聘考试试题(含答案)
- 小学四年级数学判断题100道(含答案)
评论
0/150
提交评论