Northeastern University, Data Mining, Chapter 7: Cluster Analysis

Chapter 7. Cluster Analysis

Cluster analysis
- Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Characteristics of Data, Clusters, and Clustering Algorithms
- Summary

What is Cluster Analysis?
- Cluster: a collection of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups.
- Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
- Unsupervised learning: no predefined classes.
- Typical applications: as a stand-alone tool to gain insight into the data distribution, or as a preprocessing step for other algorithms.

Clustering for Data Understanding and Applications
- Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus, species).
- Information retrieval: document clustering.
- Land use: identification of areas of similar land use in an earth-observation database.
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs.
- City planning: identifying groups of houses according to house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should cluster along continental faults.
- Climate: understanding Earth's climate; finding patterns in the atmosphere and ocean.
- Economic science: market research.

Clustering as a Preprocessing Tool (Utility)
- Summarization: preprocessing for regression, PCA, classification, and association analysis.
- Compression: image processing, e.g., vector quantization.
- Finding k-nearest neighbors: localizing the search to one or a small number of clusters.
- Outlier detection: outliers are often viewed as the points "far away" from any cluster.

Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with:
  - high intra-class similarity: cohesive within clusters;
  - low inter-class similarity: distinctive between clusters.
- The quality of a clustering method depends on the similarity measure used by the method, its implementation, and its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).
- The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
- Weights should be associated with different variables based on the application and data semantics.
- Quality of clustering: there is usually a separate "quality" function that measures the "goodness" of a cluster.

- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Considerations for Cluster Analysis
- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable).
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class).
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity).
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering).

Requirements and Challenges
- Scalability: clustering all the data instead of only samples.
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these.
- Constraint-based clustering: the user may give inputs on constraints; use domain knowledge to determine input parameters.
- Interpretability and usability.
- Others: discovery of clusters with arbitrary shape; ability to deal with noisy data; incremental clustering and insensitivity to input order; high dimensionality.

Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS.
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON.
- Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue.
- Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE.

Major Clustering Approaches (II)
- Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB.
- Frequent-pattern-based: based on the analysis of frequent patterns. Typical method: p-Cluster.
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering.
- Link-based clustering: objects are often linked together in various ways, and massive links can be used to cluster objects. Typical methods: SimRank, LinkClus.

Cluster analysis
- Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Characteristics of Data, Clusters, and Clustering Algorithms
- Summary

Partitioning Algorithms: Basic Concept
- Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances, E = sum_{i=1..k} sum_{p in Ci} d(p, ci)^2, is minimized (where ci is the centroid or medoid of cluster Ci).
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion:
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen 1967, Lloyd 1957/1982): each cluster is represented by the center of the cluster.
  - k-medoids, or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

k-Means method
Algorithm: k-means.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
1) arbitrarily choose k objects from D as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster;
4) update the cluster means, that is, calculate the mean value of the objects in each cluster;
5) until no change.

Comments on the K-Means Method
- Strength: efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. For comparison: PAM is O(k(n-k)^2), CLARA is O(ks^2 + k(n-k)).
- Comment: often terminates at a local optimum.
- Weaknesses:
  - Applicable only to objects in a continuous n-dimensional space; use the k-modes method for categorical data (in comparison, k-medoids can be applied to a wide range of data).
  - Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009).
  - Sensitive to noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.
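The k-means procedure above (arbitrary initial centers, then alternate reassignment and mean update until no change) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's own code; the function name `k_means` and its defaults are our own assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: arbitrary initial centers, then alternate
    (re)assignment to the nearest mean and mean update until no change."""
    rng = np.random.default_rng(seed)
    # 1) arbitrarily choose k objects from the data as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 3) (re)assign each object to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # 5) until no change
        labels = new_labels
        # 4) update the cluster means (keep the old center if a cluster empties)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```

Each iteration computes k distances for each of the n objects, which matches the O(tkn) bound quoted above.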

Variations of the K-Means Method
- Most variants of k-means differ in: the selection of the initial k means, the dissimilarity calculation, and the strategy for calculating cluster means.
- Handling categorical data: k-modes, which replaces the means of clusters with modes, uses new dissimilarity measures to deal with categorical objects, and uses a frequency-based method to update the modes of clusters.
- For a mixture of categorical and numerical data: the k-prototype method.

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- k-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

k-Medoids method
- Each cluster is represented by one of the objects near the center of the cluster.
- Step 1: randomly select k objects as the medoids of the clusters.
- Step 2: for each remaining object, assign it to the cluster whose medoid is nearest to the object.
- Step 3: if the quality of the clustering can be improved by swapping a non-medoid with a medoid, swap them and go to Step 2; otherwise exit.


Algorithm: k-medoids (PAM), for partitioning based on medoid or central objects.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
1) arbitrarily choose k objects in D as the initial representative objects (seeds);
2) repeat
3) assign each remaining object to the cluster with the nearest representative object;
4) randomly select a non-representative object, o_random;
5) compute the total cost S of swapping a representative object o_j with o_random;
6) if S < 0, swap o_j with o_random to form the new set of k representative objects;
7) until no change.

K-Medoid Method
- K-medoids clustering: find representative objects (medoids) in clusters.
- PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987).
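The PAM swap test above can be sketched as follows. For clarity this sketch tries every medoid/non-medoid swap exhaustively rather than sampling a random o_random, so it is a simplified illustration with our own function names, not the exact procedure in the pseudocode.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum over all objects of the distance to their nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    """Greedy PAM sketch: keep applying medoid <-> non-medoid swaps while
    some swap lowers the total cost (the 'S < 0' test above)."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                 # each representative object o_j
            for h in range(len(X)):        # each candidate o_random
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                cost = total_cost(X, trial)
                if cost < best:            # S = cost - best < 0: accept swap
                    medoids, best = trial, cost
                    improved = True
    # final assignment: each object goes to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return np.array(medoids), d.argmin(axis=1)
```

Because every swap is evaluated against the full cost, one pass costs O(k(n-k)) cost evaluations, which is why PAM is quoted above as far more expensive than k-means.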

Example: comparing K-Means and DBSCAN (I)
- DBSCAN is a density-based algorithm. It can handle clusters of different sizes and shapes and is not strongly affected by noise or outliers. K-means has difficulty with non-globular clusters and with clusters of different sizes. Both algorithms can perform poorly when clusters have widely differing densities.
- K-means can only be used for data that has a well-defined centroid, such as a mean or median. DBSCAN requires that its definition of density, which is based on the traditional Euclidean notion of density, be meaningful for the data.

Example: comparing K-Means and DBSCAN (II)
- K-means can be applied to sparse, high-dimensional data, such as document data. DBSCAN typically performs poorly on such data because the traditional Euclidean definition of density does not work well in high dimensions.
- The original versions of K-means and DBSCAN were designed for Euclidean data, but both have been extended to handle other types of data.
- DBSCAN makes no assumption about the distribution of the data. The basic K-means algorithm implicitly assumes that all clusters come from spherical Gaussian distributions.
- DBSCAN and K-means both look for clusters using all attributes; that is, they do not look for clusters that may involve only a subset of the attributes.
- K-means can find clusters that are not well separated, even if they overlap, but DBSCAN merges clusters that overlap.

Example: comparing K-Means and DBSCAN (III)
- K-means has a time complexity of O(m), while DBSCAN takes O(m^2) time.
- DBSCAN produces the same set of clusters from one run to another, while K-means does not, because it is usually used with random initialization of the centroids.
- DBSCAN automatically determines the number of clusters; for K-means the number of clusters must be specified as a parameter. However, DBSCAN has two other parameters that must be specified.
- K-means can be viewed as an optimization problem, i.e., minimizing the sum of the squared errors from each point to its closest centroid, and as a special case of a statistical clustering approach. DBSCAN is not based on any formal model.
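The optimization view above (minimize the sum of squared errors from each point to its closest centroid) can be made concrete with a small self-contained example; the data and centroid placements below are illustrative assumptions, not material from the slides.

```python
import numpy as np

def sse(X, centers):
    """k-means objective: sum over points of the squared distance
    to the closest centroid."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Two obvious groups on a line: {0, 2} and {10, 12}.
X = np.array([[0., 0], [2, 0], [10, 0], [12, 0]])
good = np.array([[1., 0], [11, 0]])  # one centroid per group
bad = np.array([[5., 0], [6, 0]])    # both centroids in the gap
```

A smaller SSE corresponds to a better centroid placement under this criterion: `sse(X, good)` is far smaller than `sse(X, bad)`.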

Data Characteristics (I)
- High dimensionality. In high-dimensional data sets, the traditional Euclidean notion of density becomes meaningless.
- Size. Many clustering algorithms that work well for small or medium-size data sets are unable to handle larger data sets.
- Sparseness. Sparse data often consists of asymmetric attributes, where zero values are not as important as non-zero values; therefore, similarity measures appropriate for asymmetric attributes are commonly used. However, other related issues also arise.
- Noise and outliers. An atypical point (outlier) can often severely degrade the performance of a clustering algorithm, especially partition-based algorithms such as K-means.

Data Characteristics (II)
- Type of attributes and data set. When attributes are of differing types, proximity and density are more difficult to define and often more ad hoc; special data structures and algorithms may be needed.
- Scale. Different attributes may be measured on different scales. These differences can strongly affect the distance or similarity between two objects and, consequently, the results of a cluster analysis.
- Mathematical properties of the data space. Some clustering techniques calculate the mean of a collection of points, or use other mathematical operations that only make sense in Euclidean space or in other specific data spaces. Other algorithms require that the definition of density be meaningful for the data.

Cluster characteristics (I)
- Data distribution. Some clustering techniques assume a particular type of distribution for the data.
- Shape. Some clusters are regularly shaped, but in general clusters can be of arbitrary shape.
- Differing sizes. Many clustering methods, such as K-means, don't work well when clusters have different sizes.
- Differing densities. Clusters with widely varying densities can cause problems for methods such as DBSCAN and K-means.
- Poorly separated clusters. When clusters touch or overlap, some clustering techniques combine clusters that should be kept separate; even techniques that find distinct clusters arbitrarily assign points to one cluster or another.

Cluster characteristics (II)
- Relationships among clusters. Most clustering techniques give no explicit consideration to the relationships between clusters, such as their relative position.
- Subspace clusters. Clusters may exist only in a subset of the dimensions (attributes), and the clusters determined using one set of dimensions may be quite different from those determined using another set.

- This problem becomes more acute as dimensionality increases.

General characteristics of clustering algorithms (I)
- Order dependence. For some algorithms, the quality and number of clusters produced can vary, perhaps dramatically, depending on the order in which the data is processed.
- Nondeterminism. Some clustering algorithms produce different results on each run because they rely on an initialization step that requires a random choice.
- Scalability. It is not unusual for a data set to contain millions of objects, and the clustering algorithms used for such data sets should have linear or near-linear time and space complexity. Moreover, clustering techniques for such data sets cannot always assume that all the data will fit in main memory or that data elements can be accessed randomly.

General characteristics of clustering algorithms (II)
- Parameter selection. Most clustering algorithms have one or more parameters that need to be set by the user, and it can be difficult to choose the proper values.
- Transforming the clustering problem to another domain. Map the clustering problem to a problem in a different domain.
- Treating clustering as an optimization problem. Clustering is often viewed as an optimization problem: divide the points into clusters in a way that maximizes the goodness of the resulting set of clusters, as measured by a user-specified objective function.

Cluster analysis
- Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Characteristics of Data, Clusters, and Clustering Algorithms
- Summary

Summary
- Cluster analysis groups objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
- K-means and k-medoids are popular partitioning-based clustering algorithms.
- BIRCH and Chameleon are interesting hierarchical clustering algorithms, and there are other hierarchical methods as well.
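As a closing illustration of the nondeterminism and initialization sensitivity noted under the general characteristics above, the sketch below runs a fixed number of k-means iterations from two different hand-picked starting centers and compares the resulting objective values. The data, function name, and starting centers are our own illustrative assumptions.

```python
import numpy as np

def kmeans_sse(X, init, iters=50):
    """Run plain k-means from the given initial centers and return the
    final sum of squared errors (the k-means objective)."""
    centers = np.asarray(init, dtype=float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Three tight groups near (0,0), (4,4), and (10,10); with k = 2, different
# starting centers settle into different local optima with different SSE.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 4.0, 10.0)])
sse_a = kmeans_sse(X, [[0., 0], [10, 10]])  # merges the 0- and 4-groups
sse_b = kmeans_sse(X, [[0., 0], [4, 4]])    # merges the 4- and 10-groups
```

Both runs converge, but to different local optima of the same objective, which is exactly why k-means results depend on the (often random) initialization while DBSCAN's do not.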
