版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、BIRCH: Balanced Iterative Reducing and Clustering using HierarchiesTian Zhang, Raghu Ramakrishnan, Miron LivnyPresented by Zhao Li2009, Spring2OutlineIntroduction to ClusteringMain Techniques in ClusteringHybrid Algorithm: BIRCHExample of the BIRCH AlgorithmExperimental resultsConclusions3Clustering
2、 IntroductionData clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space. Main methodsPartitioning : K-MeansHierarchical : BIRCH,ROCK,Density-based: DBSCAN,A good clustering method will produce high quality clusters with h
3、igh intra-class similarity low inter-class similarity4Main Techniques (1) Partitioning Clustering (K-Means)step.1initial centerinitial centerinitial center5K-Means ExampleStep.2xxxnew center after 1st iterationnew center after 1st iterationnew center after 1st iteration6K-Means ExampleStep.3new cent
4、er after 2nd iterationnew center after 2nd iterationnew center after 2nd iteration7Main Techniques (2) Hierarchical ClusteringMultilevel clustering: level 1 has n clusters level n has one cluster, or upside down.Agglomerative HC: starts with singleton and merge clusters (bottom-up).Divisive HC: star
5、ts with one sample and split clusters (top-down).Dendrogram8Agglomerative HC ExampleNearest Neighbor Level 2, k = 7 clusters.9Nearest Neighbor, Level 3, k = 6 clusters.10Nearest Neighbor, Level 4, k = 5 clusters.11Nearest Neighbor, Level 5, k = 4 clusters.12Nearest Neighbor, Level 6, k = 3 clusters.
6、13Nearest Neighbor, Level 7, k = 2 clusters.14Nearest Neighbor, Level 8, k = 1 cluster.15RemarksPartitioning ClusteringHierarchical ClusteringTime ComplexityO(n) O(n2log n)ProsEasy to use and Relatively efficientOutputs a dendrogram that is desired in many applications.ConsSensitive to initializatio
7、n; bad initialization might lead to bad results.Need to store all data in memory.higher time complexity;Need to store all data in memory.16Introduction to BIRCHDesigned for very large data setsTime and memory are limitedIncremental and dynamic clustering of incoming objectsOnly one scan of data is n
8、ecessaryDoes not need the whole data set in advanceTwo key phases:Scans the database to build an in-memory treeApplies clustering algorithm to cluster the leaf nodes17Similarity Metric(1)Given a cluster of instances , we define:Centroid:Radius: average distance from member points to centroidDiameter
9、: average pair-wise distance within a cluster18Similarity Metric(2)centroid Euclidean distance:centroid Manhattan distance:average inter-cluster:average intra-cluster:variance increase:19Clustering FeatureThe Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while scanning
10、 the data set.Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following. NPiNPiiiPSSPLS220Properties of Clustering FeatureCF entry is more compactStores significant
11、ly less than all of the data points in the sub-clusterA CF entry has sufficient information to calculate D0-D4Additivity theorem allows us to merge sub-clusters incrementally & consistently21CF-TreeEach non-leaf node has at most B entriesEach leaf node has at most L CF entries, each of which sat
12、isfies threshold TNode size is determined by dimensionality of data space and input parameter P (page size)22CF-Tree InsertionRecurse down from root, find the appropriate leafFollow the closest-CF path, w.r.t. D0 / / D4Modify the leafIf the closest-CF leaf cannot absorb, make a new CF entry. If ther
13、e is no room for new leaf, split the parent nodeTraverse back Updating CFs on the path or splitting nodes23CF-Tree RebuildingIf we run out of space, increase threshold TBy increasing the threshold, CFs absorb more dataRebuilding pushes CFs overThe larger T allows different CFs to group togetherReduc
14、ibility theoremIncreasing T will result in a CF-tree smaller than the originalRebuilding needs at most h extra pages of memory24Example of BIRCHRootLN1LN2LN3LN1LN2 LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4sc5sc6sc7sc8sc8New subcluster25Insertion Operation in BIRCHIf the branching factor of a leaf node ca
15、n not exceed 3, then LN1 is split.RootLN1”LN2LN3LN1LN2LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4sc5sc6sc7sc8sc8LN1LN1”26If the branching factor of a non-leaf node can notexceed 3, then the root is split and the height ofthe CF Tree increases by one.RootLN1”LN2LN3LN1LN2LN3sc1sc2sc3sc4sc5sc6sc7sc1sc2sc3sc4s
16、c5sc6sc7sc8sc8LN1LN1”NLN1NLN227BIRCH Overview28Experimental ResultsInput parameters:Memory (M): 5% of data setDisk space (R): 20% of MDistance equation: D2Quality equation: weighted average diameter (D)Initial threshold (T): 0.0Page size (P): 1024 bytes29Experimental ResultsKMEANS clusteringDSTimeD#
17、 ScanDSTimeD# Scan143.92.092891o33.81.97197213.24.43512o12.74.2029332.93.661873o36.04.35241DSTimeD# ScanDSTimeD# Scan111.51.8721o13.61.872210.71.9922o12.11.992311.43.9523o12.23.992BIRCH clustering30ConclusionsA CF tree is a height-balanced tree that stores the clustering features for a hierarchical
18、clustering.Given a limited amount of main memory, BIRCH can minimize the time required for I/O.BIRCH is a scalable clustering algorithm with respect to the number of objects, and good quality of clustering of the data.31Exam QuestionsWhat is the main limitation of BIRCH?Since each node in a CF tree can hold only a limited number
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026电网机械测控业务面试题及答案
- 工业机器人维护服务合同2026年制造业
- Unit 8 Making a Difference Section A 3a-3d 课件 2025-2026学年人教版英语八年级下册
- 鞭炮燃放供水供电抢修配合手册
- 教师教学质量监控规范实施手册
- 教师招聘(中学)考试附参考答案7
- 法律服务中心农民工维权服务工作手册(标准版)
- 游乐园游客摔伤骨折应急处理手册
- 银行贷款逾期风险防控手册
- 工厂生产计量器具管理手册
- 甘肃兰州新区贺阳高级中学等校2026届高三下学期考前模拟化学试卷(含答案)
- (2026版)单片机原理及应用期末考试题试卷及答案
- 2026广东东莞市公安局茶山分局警务辅助人员招聘18人(第2批)笔试参考试题及答案解析
- 杨树人工林带下艾草根茎栽培技术规程
- 2026新能源汽车产业链全景分析及发展前景预测报告
- 文物数字化保护技术规范编制说明
- 财产返还协议书合同
- 机加工车间关键尺寸稳定性分析规范
- (2025)昆士兰临床指南:引产术(V10)解读
- 慢阻肺患者呼吸肌训练器械使用
- 宠物食品制作技师试卷及答案
评论
0/150
提交评论