




已阅读5页,还剩46页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Scalable Parallel Data Mining,Eui-Hong (Sam) HanDepartment of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of MinnesotaResearch Supported by NSF, DOE, Army Research Office, AHPCRC/ARL/hanJoint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh,What is Data Mining?,Many DefinitionsSearch for Valuable Information in Large Volumes of Data.Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.A Step in the KDD Process,Why Mine Data? Commercial ViewPoint.,Lots of data is being collected and warehoused.Computing has become affordable.Competitive Pressure is Strong Provide better, customized services for an edge.Information is becoming product in its own right.,Why Mine Data?Scientific Viewpoint.,Data collected and stored at enormous speeds (Gbyte/hour)remote sensor on a satellitetelescope scanning the skiesmicroarrays generating gene expression datascientific simulations generating terabytes of dataTraditional techniques are infeasible for raw dataData mining for data reduction.cataloging, classifying, segmenting dataHelps scientists in Hypothesis Formation,Data Mining Tasks.,Classification PredictiveClustering DescriptiveAssociation Rule Discovery DescriptiveSequential Pattern Discovery DescriptiveRegression PredictiveDeviation Detection Predictive,Classification Example,categorical,categorical,continuous,class,Training Set,Learn Classifier,Classification Application,Direct MarketingFraud DetectionCustomer Attrition/ChurnSky Survey Cataloging,Example Decision Tree,categorical,categorical,continuous,class,Refund,MarSt,TaxInc,YES,NO,NO,NO,Yes,No,Married,Single, Divorced, 80K,Splitting Attributes,The splitting attribute at a node is determined based on the Gini index.,Hunts Method,An Example:Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)Class: Cheat, Dont Cheat,Binary Attributes: Computing GINI Index,Splits into two partitionsEffect of Weighing partitions: Larger and Purer Partitions are sought for.,Continuous Attributes: Computing Gini Index.,For efficient computation: for each attribute,Sort the attribute on valuesLinearly scan these values, each time updating the count matrix and computing gini indexChoose the split position that has the least gini index,Need for Parallel Formulations,Need to handle very large datasets.Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.Need for scalable, efficient (fast) data mining computationsgain competitive advantage.Handle larger data for greater accuracy in shorter times.,Constructing a Decision Tree in Parallel,Partitioning of data onlyglobal reduction per node is requiredlarge number of classification tree nodes gives high communication cost,n records,m categorical attributes,Synchronous Tree Construction Approach,No data movement is requiredLoad imbalancecan be eliminated by breadth-first expansionHigh communication costbecomes too high in lower parts of the tree,Partition Data Across Processors,Constructing a Decision Tree in Parallel,Partitioning of classification tree nodesnatural concurrencyload imbalance as the amount of work associated with each node varieschild nodes use the same data as used by parent nodeloss of localityhigh data movement cost,7,000 records,10,000 training records,3,000 records,Partitioned Tree Construction Approach,Highly concurrentHigh communication cost due to excessive data movementsLoad imbalance,Partition Data and Nodes,Hybrid Parallel Formulation,Load Balancing,Splitting Criterion,Switch to Partitioned Tree Construction whenSplitting criterion ensuresG. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994,Experimental Results,Data set function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT96)2 class labels, 3 categorical and 6 continuous attributesIBM SP2 with 128 processors66.7 MHz CPU with 256 MB real memoryAIX version 4high performance switch,Speedup Comparison of the Three Parallel Algorithms,0.8 million examples,1.6 million examples,Splitting Criterion Verification in the Hybrid Algorithm,0.8 million examples on 8 processors,1.6 million examples on 16 processors,Speedup of the Hybrid Algorithm with Different Size Data Sets,Summary of Algorithms for Categorical Attributes,Synchronous Tree Construction Approachno data movement requiredhigh communication cost as tree becomes bushyPartitioned Tree Construction Approachprocessors work independently once partitioned completelyload imbalance and high cost of data movementHybrid Algorithmcombines good features of two approachesadapts dynamically according to the size and shape of trees,Handling Continuous Attributes,Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence Undesirable!Discretize continuous attributesCLOUDS (Alsabti, Ranka, and Singh, 1998)SPEC (Srivastava, Han, Kumar, and Singh, 1997)Use a pre-sorted list for each continuous attributesSPRINT (Shafer, Agrawal, and Mehta, VLDB96)ScalParC (Joshi, Karypis, and Kumar, IPPS98),Association Rule Discovery: Definition,Given a set of records each of which contain some number of items from a given collection;Produce dependency rules which will predict occurrence of an item based on occurrences of other items.,Rules Discovered: Milk - Coke Diaper, Milk - Beer,Association Rule Discovery Application,Marketing and Sales PromotionSupermarket Shelf ManagementInventory Management,Association Rule Discovery: Support and Confidence,Association Rule:,Support:,Confidence:,Example:,Handling Exponential Complexity,Given n transactions and m different items:number of possible association rules: computation complexity:Systematic search for all patterns, based on support constraint Agarwal & Srikant:If A,B has support at least a, then both A and B have support at least a.If either A or B has support less than a, then A,B has support less than a.Use patterns of size k-1 to find patterns of size k.,Illustrating Apriori Principle,Items (1-itemsets),Minimum Support = 3,If every subset is considered, 6C1 + 6C2 + 6C3 = 41With support-based pruning,6 + 6 + 2 = 14,Counting Candidates,Frequent Itemsets are found by counting candidates.Simple way: Search for each candidate in each transaction.Expensive!,Transactions,Candidates,N,M,Association Rule Discovery: Hash tree for fast access.,1,4,7,2,5,8,3,6,9,Hash Function,Candidate Hash Tree,Parallel Formulation of Association Rules,Need:Huge Transaction Datasets (10s of TB)Large Number of Candidates.Data Distribution:Partition the Transaction Database, orPartition the Candidates, orBoth,Parallel Association Rules: Count Distribution (CD),Each Processor has complete candidate hash tree.Each Processor updates its hash tree with local data.Each Processor participates in global reduction to get global counts of candidates in the hash tree.Multiple database scans are required if the hash tree is too big to fit in the memory.,CD: Illustration,P0,P1,P2,Global Reduction of Counts,N/p,N/p,N/p,Parallel Association Rules: Data Distribution (DD),Candidate set is partitioned among the processors.Once local data has been partitioned, it is broadcast to all other processors.High Communication Cost due to data movement.Redundant work due to multiple traversals of the hash trees.,DD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,DataBroadcast,Parallel Association Rules: Intelligent Data Distribution (IDD),Data Distribution using point-to-point communication.Intelligent partitioning of candidate sets.Partitioning based on the first item of candidates.Bitmap to keep track of local candidate items.Pruning at the root of candidate hash tree using the bitmap.Suitable for single data source such as database server.With smaller candidate set, load balancing is difficult.,IDD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,1,2,3,5,bitmask,DataShift,Count,Count,Count,Filtering Transactions in IDD,bitmask,Parallel Association Rules: Hybrid Distribution (HD),Candidate set is partitioned into G groups to just fit in main memoryEnsures Good load balance with smaller candidate set.Logical processor mesh G x P/G is formed.Perform IDD along the column processorsData movement among processors is minimized.Perform CD along the row processorsSmaller number of processors is global reduction operation.,HD: Illustration,G Groups of Processors,P/G Processors per Group,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,Parallel Association Rules: Experimental Setup,128-processor Cray T3E600 MHz DEC Alpha (EV4)512MB of main memory per processor3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.MPI used for communications.Synthetic data set: avg transaction size 15 and 1000 distinct items.For larger data sets, multiple read of transactions in blocks of 1000.HD switch to CD after 90.7% of the total computation is done.,Scaleup Results (50K, 0.1%),Speedup Results (N=
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025大蒜电商B2C平台供应链服务合同
- 2025年度绿色生态住宅项目认筹协议书
- 2025版石矿开采安全生产责任追究与赔偿承包合同
- 2025版淘宝店铺跨境贸易合作协议范本
- 2025版科技园区前期物业服务管理协议
- 2025版水渠施工合同履约保证金合同
- 2025版酒店员工培训与绩效管理合同范本下载
- 2025版涂料施工劳务分包合同(含施工安全)规范文本
- 2025年印刷企业委托加工印刷品采购合同
- 2025版肉类产品电商平台用户数据保护合同
- 2025内蒙古锡林郭勒盟公安局招聘警务辅助人员95人考试参考题库附答案解析
- 《一年级开学第一课》课件
- 2025 年小升初苏州市初一新生分班考试数学试卷(带答案解析)-(人教版)
- 2025年建筑工程管理与实务一级建造师考试冲刺押题卷
- 2025版建筑垃圾处理废弃物处理设施运营管理合同
- 会展推广的合同范本
- 2024年贵阳市南明区选聘社区工作者考试真题
- 武消院火灾调查B讲义01电气火灾调查
- 起搏器植入患者全程护理要点
- (2025年标准)会议代办协议书
- 2025年招录考试-工会招聘考试历年参考题库含答案解析(5套典型题)
评论
0/150
提交评论