Chapter 7. Cluster Analysis

Outline
- Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Characteristics of Data, Clusters, and Clustering Algorithms
- Summary

What Is Cluster Analysis?
- Cluster: a collection of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups
- Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
- Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus, and species)
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
- Economic science: market research

Clustering as a Preprocessing Tool (Utility)
- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing, e.g., vector quantization
- Finding k-nearest neighbors: localizing the search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster
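The outlier-detection use above can be made concrete with a small sketch (illustrative only; it assumes cluster centers have already been computed, and the function names and threshold are hypothetical):

```python
import math

def nearest_center_dist(point, centers):
    """Distance from a point to its closest cluster center."""
    return min(math.dist(point, c) for c in centers)

def find_outliers(points, centers, threshold):
    """Flag points lying farther than `threshold` from every center,
    i.e., "far away" from any cluster."""
    return [p for p in points if nearest_center_dist(p, centers) > threshold]

centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (50.0, 50.0)]
outliers = find_outliers(points, centers, threshold=5.0)
```

Here only the point far from both centers is flagged; the threshold would normally be chosen from the data.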
Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters:
- High intra-class similarity: cohesive within clusters
- Low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on the similarity measure used by the method, its implementation, and its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- Weights should be associated with different variables based on applications and data semantics
- Quality of clustering: there is usually a separate "quality" function that measures the "goodness" of a cluster
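As a concrete illustration of the two ideas above, here is a sketch using a weighted Euclidean distance as one possible metric and within-cluster SSE as one possible "quality" function (both are assumptions of this example, not the only choices):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; w encodes attribute importance."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def cluster_sse(clusters):
    """One possible "goodness" function: total within-cluster sum of
    squared distances to each cluster's mean (lower is better)."""
    total = 0.0
    for pts in clusters:
        mean = [sum(col) / len(pts) for col in zip(*pts)]
        total += sum(math.dist(p, mean) ** 2 for p in pts)
    return total

# A tight grouping scores better (lower SSE) than a mixed-up one.
tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (10, 11)], [(10, 10), (0, 1)]]
```

Note that such a function ranks clusterings but still cannot say what is "good enough" in absolute terms.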
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Considerations for Cluster Analysis
- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges
- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering: the user may give inputs on constraints; use domain knowledge to determine input parameters
- Interpretability and usability
- Others: discovery of clusters with arbitrary shape, ability to deal with noisy data, incremental clustering and insensitivity to input order, high dimensionality

Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
- Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
- Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of that model to the data. Typical methods: EM, SOM, COBWEB
- Frequent-pattern-based: based on the analysis of frequent patterns. Typical methods: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects. Typical methods: SimRank, LinkClus

Outline (recap): Basic Concepts | Partitioning Methods | Hierarchical Methods | Density-Based Methods | Grid-Based Methods | Evaluation of Clustering | Characteristics of Data, Clusters, and Clustering Algorithms | Summary
Partitioning Algorithms: Basic Concept
- Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized: E = sum_{i=1..k} sum_{p in Ci} d(p, ci)^2, where ci is the centroid or medoid of cluster Ci
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
- k-means (MacQueen 1967; Lloyd 1957/1982): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

The k-Means Method
Algorithm: k-means.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
5) until no change.
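The five steps above can be sketched in plain Python (a minimal illustration, not an official implementation; for determinism the "arbitrary" initial centers are simply the first k objects):

```python
import math

def k_means(points, k, max_iter=100):
    """Plain k-means: first k points serve as the arbitrary initial
    centers (step 1), then assign/update until no change (steps 2-5)."""
    centers = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # step 3: (re)assign each object to the cluster with the nearest mean
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:  # step 5: until no change
            break
        assignment = new_assignment
        # step 4: update the cluster means
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

pts = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centers, labels = k_means(pts, k=2)
```

On this toy data the two obvious groups are recovered, with the centers converging to the group means.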
Comments on the k-Means Method
- Strength: efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
- For comparison: PAM is O(k(n-k)^2), CLARA is O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum
- Weaknesses:
- Applicable only to objects in a continuous n-dimensional space; use the k-modes method for categorical data (in comparison, k-medoids can be applied to a wider range of data)
- Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
- Sensitive to noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
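One common heuristic for "automatically determining the best k", hinted at above, is the elbow method: run the algorithm for several values of k and watch where the within-cluster SSE stops dropping sharply. A self-contained sketch (the toy data and the tiny first-k-init k-means helper are assumptions of this example):

```python
import math

def k_means_sse(points, k, iters=50):
    """Run a tiny k-means (first-k initialization) and return the
    final within-cluster sum of squared errors."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: math.dist(p, centers[j]))].append(p)
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return sum(
        math.dist(p, centers[j]) ** 2
        for j, g in enumerate(groups) for p in g
    )

# Three natural groups, ordered so the first-k init hits distinct groups.
pts = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11), (20, 1)]
sse = {k: k_means_sse(pts, k) for k in (1, 2, 3, 4)}
# The "elbow": SSE drops sharply up to k = 3, then barely improves.
```

Plotting sse against k and picking the bend is the usual visual form of this heuristic.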
Variations of the k-Means Method
- Most variants of k-means differ in: selection of the initial k means, dissimilarity calculations, and strategies for calculating cluster means
- Handling categorical data: k-modes, which replaces the means of clusters with modes, uses new dissimilarity measures to deal with categorical objects, and uses a frequency-based method to update the modes of clusters
- For a mixture of categorical and numerical data: the k-prototype method

What Is the Problem of the k-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
- k-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

The k-Medoids Method
- Each cluster is represented by one of the objects near the center of the cluster
- Step 1: randomly select k objects as the medoids of the clusters
- Step 2: for each remaining object, assign it to the cluster whose medoid is nearest to the object
- Step 3: if the quality of the clustering can be improved by swapping a non-medoid with a medoid, swap them and then go to Step 2; otherwise exit
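The three steps above can be sketched as a greedy swap loop (an illustrative simplification of PAM that tries every possible swap deterministically instead of sampling a random one; function names are hypothetical, and for determinism the "random" initial medoids are the first k objects):

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid;
    lower cost means better clustering quality (Step 3's criterion)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    """Step 1: take initial medoids; Steps 2-3: keep swapping a medoid
    with a non-medoid while the swap lowers the total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return medoids

pts = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11), (10, 12)]
result = k_medoids(pts, 2)
```

On this toy data the loop converges to the two centrally located objects, one per group; unlike a mean, each representative is an actual data object.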
Algorithm: k-medoids (PAM), a k-medoids algorithm for partitioning based on medoid (center) objects.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
1) arbitrarily choose k objects in D as the initial representative objects (seeds);
2) repeat
3) assign each remaining object to the cluster with the nearest representative object;
4) randomly select a non-representative object, o_random;
5) compute the total cost, S, of swapping a representative object, o_j, with o_random;
6) if S < 0, swap o_j with o_random to form the new set of k representative objects;
7) until no change.

The k-Medoids Method: Summary
- k-medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987)

Example: Comparing k-Means and DBSCAN (I)
- DBSCAN is a density-based algorithm; it can handle clusters of different sizes and shapes and is not strongly affected by noise or outliers
- k-means has difficulty with non-globular clusters and clusters of different sizes
- Both algorithms can perform poorly when clusters have widely differing densities
- k-means can only be used for data that has a well-defined centroid, such as a mean or median
- DBSCAN requires that its definition of density, which is based on the traditional Euclidean notion of density, be meaningful for the data

Example: Comparing k-Means and DBSCAN (II)
- k-means can be applied to sparse, high-dimensional data, such as document data
- DBSCAN typically performs poorly on such data because the traditional Euclidean definition of density does not work well for high-dimensional data
- The original versions of k-means and DBSCAN were designed for Euclidean data, but both have been extended to handle other types of data
- DBSCAN makes no assumption about the distribution of the data; the basic k-means algorithm assumes that all clusters come from spherical Gaussian distributions
- DBSCAN and k-means both look for clusters using all attributes; that is, they do not look for clusters that may involve only a subset of the attributes
- k-means can find clusters that are not well separated, even if they overlap, but DBSCAN merges clusters that overlap

Example: Comparing k-Means and DBSCAN (III)
- k-means has a time complexity of O(m), while DBSCAN takes O(m^2) time
- DBSCAN produces the same set of clusters from one run to another, while k-means does not, because it is typically used with random initialization of centroids
- DBSCAN automatically determines the number of clusters; for k-means, the number of clusters must be specified as a parameter. However, DBSCAN has two other parameters that must be specified
- k-means can be viewed as an optimization problem (minimize the sum of the squared errors of each point to its closest centroid) and as a special case of a statistical clustering approach; DBSCAN is not based on any formal model

Data Characteristics (I)
- High dimensionality. In a high-dimensional data set, the traditional Euclidean notion of density becomes meaningless
- Size. Many clustering algorithms that work well for small or medium-size data sets are unable to handle larger data sets
- Sparseness. Sparse data often consists of asymmetric attributes, where zero values are not as important as non-zero values; similarity measures appropriate for asymmetric attributes are therefore commonly used, although other related issues also arise
- Noise and outliers. An atypical point (outlier) can often severely degrade the performance of a clustering algorithm, especially partition-based algorithms such as k-means

Data Characteristics (II)
- Type of attributes and data set. When attributes are of differing types, proximity and density are more difficult to define and often more ad hoc; special data structures and algorithms may be needed
- Scale. Different attributes may be measured on different scales; these differences can strongly affect the distance or similarity between two objects and, consequently, the results of a cluster analysis
- Mathematical properties of the data space. Some clustering techniques calculate the mean of a collection of points or use other mathematical operations that only make sense in Euclidean space or in other specific data spaces; other algorithms require that the definition of density be meaningful for the data

Cluster Characteristics (I)
- Data distribution. Some clustering techniques assume a particular type of distribution for the data
- Shape. Some clusters are regularly shaped, but in general clusters can be of arbitrary shape
- Differing sizes. Many clustering methods, such as k-means, don't work well when clusters have different sizes
- Differing densities. Clusters with widely varying densities can cause problems for methods such as DBSCAN and k-means
- Poorly separated clusters. When clusters touch or overlap, some clustering techniques combine clusters that should be kept separate; even techniques that find distinct clusters arbitrarily assign points to one cluster or another

Cluster Characteristics (II)
- Relationships among clusters. In most clustering techniques, there is no explicit consideration of the relationships between clusters, such as their relative position
- Subspace clusters. Clusters may exist only in a subset of the dimensions (attributes), and the clusters determined using one set of dimensions may be quite different from the clusters determined using another set; the problem becomes more acute as dimensionality increases
General Characteristics of Clustering Algorithms (I)
- Order dependence. For some algorithms, the quality and number of clusters produced can vary, perhaps dramatically, depending on the order in which the data is processed
- Nondeterminism. Some clustering algorithms produce different results on each run because they rely on an initialization step that makes a random choice
- Scalability. It is not unusual for a data set to contain millions of objects, and the clustering algorithms used for such data sets should have linear or near-linear time and space complexity; moreover, clustering techniques for large data sets cannot always assume that all the data will fit in main memory or that data elements can be randomly accessed

General Characteristics of Clustering Algorithms (II)
- Parameter selection. Most clustering algorithms have one or more parameters that need to be set by the user; it can be difficult to choose the proper values
- Transforming the clustering problem to another domain. Map the clustering problem to a problem in a different domain
- Treating clustering as an optimization problem. Clustering is often viewed as an optimization problem: divide the points into clusters in a way that maximizes the goodness of the resulting set of clusters, as measured by a user-specified objective function

Outline (recap): Basic Concepts | Partitioning Methods | Hierarchical Methods | Density-Based Methods | Grid-Based Methods | Evaluation of Clustering | Characteristics of Data, Clusters, and Clustering Algorithms | Summary

Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- k-means and k-medoids are popular partitioning-based clustering algorithms
- BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are other hierarchical variations as well