两步聚类BIRCH算法_第1页
两步聚类BIRCH算法_第2页
两步聚类BIRCH算法_第3页
两步聚类BIRCH算法_第4页
两步聚类BIRCH算法_第5页
已阅读5页,还剩29页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、BIRCH: Balanced Iterative Reducing and Clustering using HierarchiesTian Zhang, Raghu Ramakrishnan, Miron LivnyPresented by Zhao Li2009, Spring2OutlineIntroduction to ClusteringMain Techniques in ClusteringHybrid Algorithm: BIRCHExample of the BIRCH AlgorithmExperimental resultsConclusions3Clustering

2、 IntroductionData clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space. Main methodsPartitioning : K-MeansHierarchical : BIRCH,ROCK,Density-based: DBSCAN,A good clustering method will produce high quality clusters with h

3、igh intra-class similarity low inter-class similarity4Main Techniques (1) Partitioning Clustering (K-Means)step.1initial centerinitial centerinitial center5K-Means ExampleStep.2xxxnew center after 1st iterationnew center after 1st iterationnew center after 1st iteration6K-Means ExampleStep.3new cent

4、er after 2nd iterationnew center after 2nd iterationnew center after 2nd iteration7Main Techniques (2) Hierarchical ClusteringMultilevel clustering: level 1 has n clusters level n has one cluster, or upside down.Agglomerative HC: starts with singleton and merge clusters (bottom-up).Divisive HC: star

5、ts with one sample and split clusters (top-down).Dendrogram8Agglomerative HC ExampleNearest Neighbor Level 2, k = 7 clusters.9Nearest Neighbor, Level 3, k = 6 clusters.10Nearest Neighbor, Level 4, k = 5 clusters.11Nearest Neighbor, Level 5, k = 4 clusters.12Nearest Neighbor, Level 6, k = 3 clusters.

6、13Nearest Neighbor, Level 7, k = 2 clusters.14Nearest Neighbor, Level 8, k = 1 cluster.15RemarksPartitioning ClusteringHierarchical ClusteringTime ComplexityO(n) O(n2log n)ProsEasy to use and Relatively efficientOutputs a dendrogram that is desired in many applications.ConsSensitive to initializatio

7、n; bad initialization might lead to bad results.Need to store all data in memory.higher time complexity;Need to store all data in memory.16Introduction to BIRCHDesigned for very large data setsTime and memory are limitedIncremental and dynamic clustering of incoming objectsOnly one scan of data is n

8、ecessaryDoes not need the whole data set in advanceTwo key phases:Scans the database to build an in-memory treeApplies clustering algorithm to cluster the leaf nodes17Similarity Metric(1)Given a cluster of instances , we define:Centroid:Radius: average distance from member points to centroidDiameter

9、: average pair-wise distance within a cluster18Similarity Metric(2)centroid Euclidean distance:centroid Manhattan distance:average inter-cluster:average intra-cluster:variance increase:19Clustering FeatureThe Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while scanning

10、 the data set.Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following. NPiNPiiiPSSPLS220Properties of Clustering FeatureCF entry is more compactStores significant

11、ly less than all of the data points in the sub-clusterA CF entry has sufficient information to calculate D0-D4Additivity theorem allows us to merge sub-clusters incrementally & consistently21CF-TreeEach non-leaf node has at most B entriesEach leaf node has at most L CF entries, each of which sat

12、isfies threshold TNode size is determined by dimensionality of data space and input parameter P (page size)22CF-Tree InsertionRecurse down from root, find the appropriate leafFollow the closest-CF path, w.r.t. D0 / / D4Modify the leafIf the closest-CF leaf cannot absorb, make a new CF entry. If ther

13、e is no room for new leaf, split the parent nodeTraverse back Updating CFs on the path or splitting nodes23CF-Tree RebuildingIf we run out of space, increase threshold TBy increasing the threshold, CFs absorb more dataRebuilding pushes CFs overThe larger T allows different CFs to group togetherReduc

14、ibility theoremIncreasing T will result in a CF-tree smaller than the originalRebuilding needs at most h extra pages of memory24Example of BIRCHRootLN1LN2LN3LN1LN2 LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4sc5sc6sc7sc8sc8New subcluster25Insertion Operation in BIRCHIf the branching factor of a leaf node ca

15、n not exceed 3, then LN1 is split.RootLN1”LN2LN3LN1LN2LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4sc5sc6sc7sc8sc8LN1LN1”26If the branching factor of a non-leaf node can notexceed 3, then the root is split and the height ofthe CF Tree increases by one.RootLN1”LN2LN3LN1LN2LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4s

16、c5sc6sc7sc8sc8LN1LN1”NLN1NLN227BIRCH Overview28Experimental ResultsInput parameters:Memory (M): 5% of data setDisk space (R): 20% of MDistance equation: D2Quality equation: weighted average diameter (D)Initial threshold (T): 0.0Page size (P): 1024 bytes29Experimental ResultsKMEANS clusteringDSTimeD#

17、 ScanDSTimeD# Scan143.92.092891o33.81.97197213.24.43512o12.74.2029332.93.661873o36.04.35241DSTimeD# ScanDSTimeD# Scan111.51.8721o13.61.872210.71.9922o12.11.992311.43.9523o12.23.992BIRCH clustering30ConclusionsA CF tree is a height-balanced tree that stores the clustering features for a hierarchical

18、clustering.Given a limited amount of main memory, BIRCH can minimize the time required for I/O.BIRCH is a scalable clustering algorithm with respect to the number of objects, and good quality of clustering of the data.31Exam QuestionsWhat is the main limitation of BIRCH?Since each node in a CF tree can hold only a limited number

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论