Data Mining
Cluster Analysis: Advanced Concepts and Algorithms
Lecture Notes for Chapter 9 of Introduction to Data Mining, by Tan, Steinbach, Kumar

Hierarchical Clustering: Revisited
- Creates nested clusters
- Agglomerative clustering algorithms vary in terms of how the proximity of two clusters is computed
  - MIN (single link): susceptible to noise and outliers
  - MAX / GROUP AVERAGE: may not work well with non-globular clusters
- The CURE algorithm tries to handle both problems
- Often starts with a proximity matrix; a type of graph-based algorithm

CURE: Another Hierarchical Approach
- Uses a number of points to represent a cluster
- Representative points are found by selecting a constant number of points from a cluster and then "shrinking" them toward the center of the cluster
- Cluster similarity is the similarity of the closest pair of representative points from different clusters
- Shrinking representative points toward the center helps avoid problems with noise and outliers
- CURE is better able to handle clusters of arbitrary shapes and sizes

Experimental Results: CURE
[Figures from Guha, Rastogi, Shim: CURE compared with centroid-based and single-link clustering]

CURE Cannot Handle Differing Densities
[Figure: original points vs. the CURE result]

Graph-Based Clustering
- Graph-based clustering uses the proximity graph
  - Start with the proximity matrix
  - Consider each point as a node in a graph
  - Each edge between two nodes has a weight, which is the proximity between the two points
  - Initially the proximity graph is fully connected
  - MIN (single link) and MAX (complete link) can be viewed as starting with this graph
- In the simplest case, clusters are the connected components of the graph

Graph-Based Clustering: Sparsification
- The amount of data that needs to be processed is drastically reduced
  - Sparsification can eliminate more than 99% of the entries in a proximity matrix
  - The amount of time required to cluster the data is drastically reduced
  - The size of the problems that can be handled is increased
- Clustering may work better
  - Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points
  - The nearest neighbors of a point tend to belong to the same class as the point itself, so this reduces the impact of noise and outliers and sharpens the distinction between clusters
- Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on them), e.g., Chameleon and hypergraph-based clustering

Sparsification in the Clustering Process
[Diagram omitted]

Limitations of Current Merging Schemes
- Existing merging schemes in hierarchical clustering algorithms are static in nature
  - MIN or CURE: merge two clusters based on their closeness (minimum distance)
  - GROUP AVERAGE: merge two clusters based on their average connectivity
- [Figure: closeness schemes will merge (a) and (b); average-connectivity schemes will merge (c) and (d)]

Chameleon: Clustering Using Dynamic Modeling
- Adapts to the characteristics of the data set to find the natural clusters
- Uses a dynamic model to measure the similarity between clusters
  - The main properties are the relative closeness and relative interconnectivity of the clusters
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
  - The merging scheme preserves self-similarity
- One of the areas of application is spatial data

Characteristics of Spatial Data Sets
- Clusters are defined as densely populated regions of the space
- Clusters have arbitrary shapes and orientations, and non-uniform sizes
- There are differences in density across clusters and variation in density within clusters
- Special artifacts (streaks) and noise exist
- The clustering algorithm must address the above characteristics while requiring minimal supervision
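As a concrete illustration of the sparsification step discussed above (which is also how a k-nearest-neighbor graph, the structure Chameleon builds before partitioning, is obtained), here is a minimal Python sketch. The similarity function, the toy point set, and the helper name are illustrative choices, not part of any particular implementation:

```python
import math

def knn_sparsify(points, k):
    """Build a fully connected similarity graph over 2-D points, then
    keep only each point's k most similar (nearest) neighbors.
    Similarity here is 1 / (1 + Euclidean distance), an illustrative choice."""
    n = len(points)

    def sim(i, j):
        return 1.0 / (1.0 + math.dist(points[i], points[j]))

    neighbors = {}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim(i, j), reverse=True)
        neighbors[i] = set(ranked[:k])  # keep only the k strongest links
    return neighbors

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
g = knn_sparsify(pts, k=2)
# Each point keeps only its 2 nearest neighbors, so the weak links
# between the two groups of three points are broken.
```

Note how sparsification here discards the vast majority of the original n x n proximity entries while preserving exactly the local structure that the nearest neighbors of a point carry.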
Chameleon: Steps
- Preprocessing step: represent the data by a graph
  - Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors
  - The concept of neighborhood is captured dynamically (even if a region is sparse)
- Phase 1: use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices
  - Each cluster should contain mostly points from one "true" cluster, i.e., be a sub-cluster of a "real" cluster
- Phase 2: use hierarchical agglomerative clustering to merge sub-clusters
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
  - Two key properties are used to model cluster similarity:
    - Relative Interconnectivity: the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters
    - Relative Closeness: the absolute closeness of two clusters normalized by the internal closeness of the clusters

Experimental Results: CHAMELEON vs. CURE
[Figures: CHAMELEON results on several data sets, compared with CURE forced to 9, 10, and 15 clusters]

Shared Near Neighbor (SNN) Approach
- SNN graph: the weight of an edge is the number of shared neighbors between its two vertices, given that the vertices are connected

Creating the SNN Graph
- Sparse graph: link weights are similarities between neighboring points
- Shared near neighbor graph: link weights are numbers of shared nearest neighbors

ROCK (RObust Clustering using linKs)
- A clustering algorithm for data with categorical and Boolean attributes
  - A pair of points is defined to be neighbors if their similarity is greater than some threshold
  - Uses a hierarchical clustering scheme to cluster the data
- Steps:
  1. Obtain a sample of points from the data set
  2. Compute the link value for each pair of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points
  3. Perform an agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the "shared neighbors" objective function
  4. Assign the remaining points to the clusters that have been found

Jarvis-Patrick Clustering
- First, the k nearest neighbors of all points are found
  - In graph terms, this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph
- A pair of points is put in the same cluster if the two points share more than T neighbors and each is in the other's k-nearest-neighbor list
  - For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors
- Jarvis-Patrick clustering is too brittle: small changes in T can abruptly merge or split clusters

When Jarvis-Patrick Works Reasonably Well
[Figure: original points vs. Jarvis-Patrick clustering with 6 shared neighbors out of 20]

When Jarvis-Patrick Does NOT Work Well
[Figure: the smallest threshold T that does not merge clusters vs. a threshold of T - 1]

SNN Clustering Algorithm
1. Compute the similarity matrix. This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between the data points
2. Sparsify the similarity matrix by keeping only the k most similar neighbors. This corresponds to keeping only the k strongest links of the similarity graph
3. Construct the shared nearest neighbor graph from the sparsified similarity matrix. At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (the Jarvis-Patrick algorithm)
4. Find the SNN density of each point. Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point; this is the SNN density of the point
5. Find the core points. Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts
6. Form clusters from the core points. If two core points are within a radius, Eps, of each other, they are placed in the same cluster
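The Jarvis-Patrick scheme described above can be sketched in a few lines of Python. This is a hedged illustration only, assuming Euclidean distance over 2-D points; the function name, the toy data, and the union-find merging are my own choices rather than the published algorithm's code:

```python
import math

def jarvis_patrick(points, k, T):
    """Jarvis-Patrick sketch: points i and j go in the same cluster if
    each is in the other's k-nearest-neighbor list and the two lists
    share more than T members; clusters are the connected components."""
    n = len(points)
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        knn.append(set(ranked[:k]))

    # Union-find to merge qualifying pairs into connected components.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j in knn[i] and i in knn[j] and len(knn[i] & knn[j]) > T:
                parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ...
    remap = {}
    return [remap.setdefault(find(i), len(remap)) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
labels = jarvis_patrick(pts, k=3, T=1)
```

On this toy data the two well-separated squares come out as two clusters; the brittleness noted above shows up when T is moved even one step past the largest value that keeps distinct clusters apart.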
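Steps 4 to 6 of the SNN clustering algorithm (SNN density, core points, and cluster formation) can likewise be sketched. Again this is an illustrative sketch under stated assumptions, not a reference implementation: the border-point assignment and noise handling of the full algorithm are omitted, and all names are hypothetical:

```python
import math

def snn_density_cluster(points, k, eps, min_pts):
    """SNN density sketch: the SNN similarity of two points is their number
    of shared k-nearest neighbors; a point's SNN density is how many other
    points have SNN similarity >= eps to it; core points (density > min_pts)
    with SNN similarity >= eps to each other are merged into one cluster.
    Non-core points are left unlabeled (-1) in this sketch."""
    n = len(points)
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        knn.append(set(ranked[:k]))

    snn = [[len(knn[i] & knn[j]) for j in range(n)] for i in range(n)]
    density = [sum(1 for j in range(n) if j != i and snn[i][j] >= eps)
               for i in range(n)]
    core = [i for i in range(n) if density[i] > min_pts]

    # Union-find over core points only.
    parent = {i: i for i in core}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in core:
        for b in core:
            if a < b and snn[a][b] >= eps:
                parent[find(a)] = find(b)

    labels = [-1] * n
    remap = {}
    for i in core:
        labels[i] = remap.setdefault(find(i), len(remap))
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
snn_labels = snn_density_cluster(data, k=3, eps=2, min_pts=2)
```

Because similarity is measured by shared neighbors rather than raw distance, this formulation is less sensitive to variations in density than a purely distance-based core-point test.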