




已阅读5页,还剩46页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Scalable Parallel Data Mining,Eui-Hong (Sam) HanDepartment of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of MinnesotaResearch Supported by NSF, DOE, Army Research Office, AHPCRC/ARL/hanJoint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh,What is Data Mining?,Many DefinitionsSearch for Valuable Information in Large Volumes of Data.Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.A Step in the KDD Process,Why Mine Data? Commercial ViewPoint.,Lots of data is being collected and warehoused.Computing has become affordable.Competitive Pressure is Strong Provide better, customized services for an edge.Information is becoming product in its own right.,Why Mine Data?Scientific Viewpoint.,Data collected and stored at enormous speeds (Gbyte/hour)remote sensor on a satellitetelescope scanning the skiesmicroarrays generating gene expression datascientific simulations generating terabytes of dataTraditional techniques are infeasible for raw dataData mining for data reduction.cataloging, classifying, segmenting dataHelps scientists in Hypothesis Formation,Data Mining Tasks.,Classification PredictiveClustering DescriptiveAssociation Rule Discovery DescriptiveSequential Pattern Discovery DescriptiveRegression PredictiveDeviation Detection Predictive,Classification Example,categorical,categorical,continuous,class,Training Set,Learn Classifier,Classification Application,Direct MarketingFraud DetectionCustomer Attrition/ChurnSky Survey Cataloging,Example Decision Tree,categorical,categorical,continuous,class,Refund,MarSt,TaxInc,YES,NO,NO,NO,Yes,No,Married,Single, Divorced, 80K,Splitting Attributes,The splitting attribute at a node is determined based on the Gini index.,Hunts Method,An Example:Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)Class: Cheat, Dont Cheat,Binary Attributes: Computing GINI Index,Splits into two partitionsEffect of Weighing partitions: Larger and Purer Partitions are sought for.,Continuous Attributes: Computing Gini Index.,For efficient computation: for each attribute,Sort the attribute on valuesLinearly scan these values, each time updating the count matrix and computing gini indexChoose the split position that has the least gini index,Need for Parallel Formulations,Need to handle very large datasets.Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.Need for scalable, efficient (fast) data mining computationsgain competitive advantage.Handle larger data for greater accuracy in shorter times.,Constructing a Decision Tree in Parallel,Partitioning of data onlyglobal reduction per node is requiredlarge number of classification tree nodes gives high communication cost,n records,m categorical attributes,Synchronous Tree Construction Approach,No data movement is requiredLoad imbalancecan be eliminated by breadth-first expansionHigh communication costbecomes too high in lower parts of the tree,Partition Data Across Processors,Constructing a Decision Tree in Parallel,Partitioning of classification tree nodesnatural concurrencyload imbalance as the amount of work associated with each node varieschild nodes use the same data as used by parent nodeloss of localityhigh data movement cost,7,000 records,10,000 training records,3,000 records,Partitioned Tree Construction Approach,Highly concurrentHigh communication cost due to excessive data movementsLoad imbalance,Partition Data and Nodes,Hybrid Parallel Formulation,Load Balancing,Splitting Criterion,Switch to Partitioned Tree Construction whenSplitting criterion ensuresG. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994,Experimental Results,Data set function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT96)2 class labels, 3 categorical and 6 continuous attributesIBM SP2 with 128 processors66.7 MHz CPU with 256 MB real memoryAIX version 4high performance switch,Speedup Comparison of the Three Parallel Algorithms,0.8 million examples,1.6 million examples,Splitting Criterion Verification in the Hybrid Algorithm,0.8 million examples on 8 processors,1.6 million examples on 16 processors,Speedup of the Hybrid Algorithm with Different Size Data Sets,Summary of Algorithms for Categorical Attributes,Synchronous Tree Construction Approachno data movement requiredhigh communication cost as tree becomes bushyPartitioned Tree Construction Approachprocessors work independently once partitioned completelyload imbalance and high cost of data movementHybrid Algorithmcombines good features of two approachesadapts dynamically according to the size and shape of trees,Handling Continuous Attributes,Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence Undesirable!Discretize continuous attributesCLOUDS (Alsabti, Ranka, and Singh, 1998)SPEC (Srivastava, Han, Kumar, and Singh, 1997)Use a pre-sorted list for each continuous attributesSPRINT (Shafer, Agrawal, and Mehta, VLDB96)ScalParC (Joshi, Karypis, and Kumar, IPPS98),Association Rule Discovery: Definition,Given a set of records each of which contain some number of items from a given collection;Produce dependency rules which will predict occurrence of an item based on occurrences of other items.,Rules Discovered: Milk - Coke Diaper, Milk - Beer,Association Rule Discovery Application,Marketing and Sales PromotionSupermarket Shelf ManagementInventory Management,Association Rule Discovery: Support and Confidence,Association Rule:,Support:,Confidence:,Example:,Handling Exponential Complexity,Given n transactions and m different items:number of possible association rules: computation complexity:Systematic search for all patterns, based on support constraint Agarwal & Srikant:If A,B has support at least a, then both A and B have support at least a.If either A or B has support less than a, then A,B has support less than a.Use patterns of size k-1 to find patterns of size k.,Illustrating Apriori Principle,Items (1-itemsets),Minimum Support = 3,If every subset is considered, 6C1 + 6C2 + 6C3 = 41With support-based pruning,6 + 6 + 2 = 14,Counting Candidates,Frequent Itemsets are found by counting candidates.Simple way: Search for each candidate in each transaction.Expensive!,Transactions,Candidates,N,M,Association Rule Discovery: Hash tree for fast access.,1,4,7,2,5,8,3,6,9,Hash Function,Candidate Hash Tree,Parallel Formulation of Association Rules,Need:Huge Transaction Datasets (10s of TB)Large Number of Candidates.Data Distribution:Partition the Transaction Database, orPartition the Candidates, orBoth,Parallel Association Rules: Count Distribution (CD),Each Processor has complete candidate hash tree.Each Processor updates its hash tree with local data.Each Processor participates in global reduction to get global counts of candidates in the hash tree.Multiple database scans are required if the hash tree is too big to fit in the memory.,CD: Illustration,P0,P1,P2,Global Reduction of Counts,N/p,N/p,N/p,Parallel Association Rules: Data Distribution (DD),Candidate set is partitioned among the processors.Once local data has been partitioned, it is broadcast to all other processors.High Communication Cost due to data movement.Redundant work due to multiple traversals of the hash trees.,DD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,DataBroadcast,Parallel Association Rules: Intelligent Data Distribution (IDD),Data Distribution using point-to-point communication.Intelligent partitioning of candidate sets.Partitioning based on the first item of candidates.Bitmap to keep track of local candidate items.Pruning at the root of candidate hash tree using the bitmap.Suitable for single data source such as database server.With smaller candidate set, load balancing is difficult.,IDD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,1,2,3,5,bitmask,DataShift,Count,Count,Count,Filtering Transactions in IDD,bitmask,Parallel Association Rules: Hybrid Distribution (HD),Candidate set is partitioned into G groups to just fit in main memoryEnsures Good load balance with smaller candidate set.Logical processor mesh G x P/G is formed.Perform IDD along the column processorsData movement among processors is minimized.Perform CD along the row processorsSmaller number of processors is global reduction operation.,HD: Illustration,G Groups of Processors,P/G Processors per Group,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,Parallel Association Rules: Experimental Setup,128-processor Cray T3E600 MHz DEC Alpha (EV4)512MB of main memory per processor3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.MPI used for communications.Synthetic data set: avg transaction size 15 and 1000 distinct items.For larger data sets, multiple read of transactions in blocks of 1000.HD switch to CD after 90.7% of the total computation is done.,Scaleup Results (50K, 0.1%),Speedup Results (N=
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025年中国茉莉酮项目商业计划书
- 2025年重庆智能仪表项目可行性研究报告范文模板
- 中学生物学教学常用教学方法
- 中国肥料原料项目创业计划书
- 中国二苯甲醇项目经营分析报告
- 中国球形氧化铝导热粉体项目创业投资方案
- 2025年中国柔性树脂版项目创业计划书
- 2025年山东自行车配件项目投资分析报告模板参考
- 2025年昆山石油钻井工具项目实施方案
- 2025年工作情况汇报及总结
- 2023年吉林省金融控股集团股份有限公司招聘笔试题库及答案解析
- 类风湿关节炎的中医治疗演示文稿
- 食品安全BRCGS包装材料全球标准第六版管理手册及程序文件
- 热工保护联锁投退管理规定
- (中职)旅游概论第四章 旅游业课件
- 齐鲁医学可用于普通食品的新资源食品及药食两用原料名单
- GB∕T 12234-2019 石油、天然气工业用螺柱连接阀盖的钢制闸阀
- 礼堂舞台机械灯光、音响扩声系统工程设计方案
- 政务礼仪-PPT课件
- 三黄丸_千金卷二十一引巴郡太守方_方剂加减变化汇总
- 部编版语文五年级上册--第五单元习作单元教材材解读和教学目标课件
评论
0/150
提交评论