
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Zhao Li, Spring 2009

Outline
- Introduction to Clustering
- Main Techniques in Clustering
- Hybrid Algorithm: BIRCH
- Example of the BIRCH Algorithm
- Experimental Results
- Conclusions

Clustering Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. The main families of methods are:
- Partitioning: K-Means, ...
- Hierarchical: BIRCH, ROCK, ...
- Density-based: DBSCAN, ...
A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.

Main Techniques (1): Partitioning Clustering (K-Means)
K-Means example:
- Step 1: pick k initial centers. [Figure: data points with three initial centers]
- Step 2: assign each point to its closest center, then recompute each center. [Figure: the new centers after the 1st iteration]
- Step 3: repeat until the centers stop moving. [Figure: the new centers after the 2nd iteration]
A minimal sketch of this loop follows.
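Below is a minimal NumPy sketch of the K-Means loop illustrated by these three steps; the blob data, k = 3, and the random seed are illustrative choices, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, matching steps 1-3 above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: initial centers
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members
        # (keep the old center if a cluster happens to go empty).
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        # Step 3: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative data: three Gaussian blobs.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])
labels, centers = kmeans(X, k=3)
```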

Main Techniques (2): Hierarchical Clustering
Multilevel clustering: level 1 has n clusters and level n has one cluster (or upside down).
- Agglomerative HC: starts with singleton clusters and merges them (bottom-up).
- Divisive HC: starts with one cluster holding all samples and splits it (top-down).
The output is a dendrogram.

Agglomerative HC Example (Nearest Neighbor)
At each level the two nearest clusters are merged: level 2 has k = 7 clusters, level 3 has k = 6, level 4 has k = 5, level 5 has k = 4, level 6 has k = 3, level 7 has k = 2, and level 8 has k = 1 cluster. [Figures: the merge performed at each level] A naive sketch of this merging loop is given below.
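A naive sketch of the nearest-neighbor (single-link) agglomerative loop from this example; the 8 random points stand in for the slide's data, and the O(n^3) scan is for clarity, not efficiency.

```python
import numpy as np

def single_link_hc(X, k_target=1):
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(X))]   # level 1: n singleton clusters
    merges = []                               # the merge log is the dendrogram
    while len(clusters) > k_target:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: closest pair across the two clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

rng = np.random.default_rng(0)
X = rng.random((8, 2))                        # 8 points, as in the example
clusters, merges = single_link_hc(X)          # 7 merges: k = 7 down to k = 1
```

Each of the seven recorded merges corresponds to one dendrogram level (k = 7 down to k = 1) in the slides.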

Remarks

                  Partitioning (K-Means)                Hierarchical Clustering
Time complexity   O(n)                                  O(n^2 log n)
Pros              Easy to use and relatively            Outputs a dendrogram, which is
                  efficient                             desired in many applications
Cons              Sensitive to initialization (a bad    Higher time complexity; needs to
                  start can give bad results); needs    store all data in memory
                  to store all data in memory

Introduction to BIRCH
- Designed for very large data sets, where time and memory are limited.
- Incremental and dynamic clustering of incoming objects.
- Only one scan of the data is necessary.
- Does not need the whole data set in advance.
Two key phases:
1. Scan the database to build an in-memory CF tree.
2. Apply a clustering algorithm to cluster the leaf nodes.
A minimal sketch of this two-phase pipeline follows.
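For orientation, this two-phase design is what off-the-shelf implementations expose; below is a minimal sketch using scikit-learn's Birch, where the threshold, branching factor, and data are illustrative values rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.random((10_000, 2))        # stand-in for a very large data set

# Phase 1: a single scan of X builds the in-memory CF tree.
# Phase 2: because n_clusters is set, a global clustering step
# then runs on the leaf entries rather than on the raw points.
model = Birch(threshold=0.05, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)

# The leaf-entry centroids that phase 2 actually clustered:
print(model.subcluster_centers_.shape)
```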

Similarity Metric (1)
Given a cluster of N instances {X_i}, we define:
- Centroid: X0 = (1/N) * sum_i X_i
- Radius: the average distance from member points to the centroid,
  R = sqrt( (1/N) * sum_i ||X_i - X0||^2 )
- Diameter: the average pair-wise distance within the cluster,
  D = sqrt( sum_i sum_j ||X_i - X_j||^2 / (N(N-1)) )

Similarity Metric (2)
Five alternative distances between two clusters:
- D0: Euclidean distance between the centroids
- D1: Manhattan distance between the centroids
- D2: average inter-cluster distance
- D3: average intra-cluster distance (the diameter of the merged cluster)
- D4: variance increase distance
A direct transcription of the radius and diameter definitions follows.
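A direct NumPy transcription of the centroid, radius, and diameter definitions, useful later for checking the CF-based versions; the random points are illustrative.

```python
import numpy as np

def centroid(X):
    return X.mean(axis=0)

def radius(X):
    """R: root-mean-square distance from member points to the centroid."""
    return np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean())

def diameter(X):
    """D: root-mean-square pairwise distance within the cluster."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq.sum() / (n * (n - 1)))

X = np.random.default_rng(0).random((20, 3))
print(radius(X), diameter(X))
```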

Clustering Feature
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set. Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and

  LS = sum_{i=1..N} P_i      (the linear sum of the points)
  SS = sum_{i=1..N} P_i^2    (the square sum of the points)

Properties of Clustering Feature
- A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster.
- A CF entry has sufficient information to calculate D0-D4.
- The additivity theorem, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2), allows us to merge sub-clusters incrementally and consistently.
The sketch below shows both properties in code.
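A minimal CF-entry sketch, assuming the common convention that SS is kept as the scalar sum of squared norms; it shows that (N, LS, SS) suffices to recover centroid, radius, and diameter, and that merging is just component-wise addition.

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) for one sub-cluster."""
    def __init__(self, point=None):
        self.n = 0 if point is None else 1
        self.ls = None if point is None else np.asarray(point, float)
        self.ss = 0.0 if point is None else float(self.ls @ self.ls)

    def merge(self, other):
        """Additivity theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
        out = CF()
        out.n = self.n + other.n
        out.ls = self.ls + other.ls
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, from expanding the definition of R.
        return np.sqrt(max(self.ss / self.n - (self.ls @ self.ls) / self.n**2, 0.0))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), from expanding the definition of D.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * (self.ls @ self.ls)
        return np.sqrt(max(num / (self.n * (self.n - 1)), 0.0))

# Folding 20 points into one CF reproduces the raw-point values computed above.
pts = np.random.default_rng(0).random((20, 3))
cf = CF(pts[0])
for p in pts[1:]:
    cf = cf.merge(CF(p))
print(cf.radius(), cf.diameter())   # matches radius(pts), diameter(pts)
```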

CF-Tree
- Each non-leaf node has at most B entries.
- Each leaf node has at most L CF entries, each of which has to satisfy the threshold T.
- Node size is determined by the dimensionality of the data space and the input parameter P (page size).

CF-Tree Insertion
- Recurse down from the root to find the appropriate leaf, following the closest-CF path w.r.t. one of D0, ..., D4.
- Modify the leaf: if the closest CF entry cannot absorb the new object without violating T, make a new CF entry; if there is no room for the new entry, split the leaf node (and, when it overflows in turn, the parent node).
- Traverse back up, updating the CFs along the path or splitting nodes.

CF-Tree Rebuilding
- If we run out of memory, increase the threshold T: with a larger T each CF absorbs more data and formerly separate CFs can group together, so re-inserting the old leaf entries yields a smaller tree.
- Reducibility theorem: increasing T results in a CF tree smaller than the original, and rebuilding needs at most h extra pages of memory, where h is the height of the tree.
A simplified, single-level sketch of the leaf logic is given below.
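A deliberately simplified, single-level sketch of the leaf step, reusing the CF class above: absorb into the closest entry if the radius stays within T, else add a new entry, else split. The real algorithm recurses through non-leaf nodes and uses smarter split seeding; leaf selection here uses D0.

```python
import numpy as np

def insert_point(leaves, point, T, L):
    """Insert one point into a flat list of leaves (each a list of CF entries)."""
    p, x = CF(point), np.asarray(point, float)
    if not leaves:
        leaves.append([p])
        return
    dist = lambda cf: np.linalg.norm(cf.centroid() - x)
    # Follow the closest-CF path (here: the leaf holding the closest entry).
    leaf = min(leaves, key=lambda lf: min(dist(cf) for cf in lf))
    closest = min(leaf, key=dist)
    merged = closest.merge(p)
    if merged.radius() <= T:                 # the closest CF can absorb the point
        leaf[leaf.index(closest)] = merged
    elif len(leaf) < L:                      # room for a new CF entry
        leaf.append(p)
    else:                                    # no room: split the leaf in two
        leaf.append(p)
        leaf.sort(key=dist)
        leaves.append(leaf[(L + 1) // 2:])   # move the farthest entries out
        del leaf[(L + 1) // 2:]

leaves = []
for pt in np.random.default_rng(0).random((200, 2)):
    insert_point(leaves, pt, T=0.1, L=4)
```

Rebuilding then amounts to replaying exactly this loop over the old tree's leaf entries with a larger T.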

Example of BIRCH
[Figure: a CF tree whose root points to leaf nodes LN1, LN2, and LN3 holding sub-clusters sc1-sc7; a new sub-cluster sc8 arrives and is routed to LN1.]

Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then inserting sc8 overflows LN1, and LN1 is split into LN1' and LN1''. [Figure: the tree after the leaf split]
If the branching factor of a non-leaf node likewise cannot exceed 3, then the root overflows in turn and is split into NLN1 and NLN2, and the height of the CF tree increases by one. [Figure: the tree after the root split]

BIRCH Overview
[Figure: overview of the BIRCH phases.]

Experimental Results
Input parameters:
- Memory (M): 5% of the data set
- Disk space (R): 20% of M
- Distance equation: D2
- Quality equation: weighted average diameter (D)
- Initial threshold (T): 0.0
- Page size (P): 1024 bytes

Experimental Results

KMEANS clustering:
  DS   Time   D      # Scan      DS   Time   D      # Scan
  1    43.9   2.09   289         1o   33.8   1.97   197
  2    13.2   4.43   51          2o   12.7   4.20   29
  3    32.9   3.66   187         3o   36.0   4.35   241

BIRCH clustering:
  DS   Time   D      # Scan      DS   Time   D      # Scan
  1    11.5   1.87   2           1o   13.6   1.87   2
  2    10.7   1.99   2           2o   12.1   1.99   2
  3    11.4   3.95   2           3o   12.2   3.99   2

Conclusions
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
- Given a limited amount of main memory, BIRCH can minimize the time required for I/O.
- BIRCH is a scalable clustering algorithm with respect to the number of objects, and it produces clusterings of good quality.

Exam Questions
What is the main limitation of BIRCH? Since ...
