已阅读5页,还剩46页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Scalable Parallel Data Mining,Eui-Hong (Sam) HanDepartment of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of MinnesotaResearch Supported by NSF, DOE, Army Research Office, AHPCRC/ARL/hanJoint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh,What is Data Mining?,Many DefinitionsSearch for Valuable Information in Large Volumes of Data.Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.A Step in the KDD Process,Why Mine Data? Commercial ViewPoint.,Lots of data is being collected and warehoused.Computing has become affordable.Competitive Pressure is Strong Provide better, customized services for an edge.Information is becoming product in its own right.,Why Mine Data?Scientific Viewpoint.,Data collected and stored at enormous speeds (Gbyte/hour)remote sensor on a satellitetelescope scanning the skiesmicroarrays generating gene expression datascientific simulations generating terabytes of dataTraditional techniques are infeasible for raw dataData mining for data reduction.cataloging, classifying, segmenting dataHelps scientists in Hypothesis Formation,Data Mining Tasks.,Classification PredictiveClustering DescriptiveAssociation Rule Discovery DescriptiveSequential Pattern Discovery DescriptiveRegression PredictiveDeviation Detection Predictive,Classification Example,categorical,categorical,continuous,class,Training Set,Learn Classifier,Classification Application,Direct MarketingFraud DetectionCustomer Attrition/ChurnSky Survey Cataloging,Example Decision Tree,categorical,categorical,continuous,class,Refund,MarSt,TaxInc,YES,NO,NO,NO,Yes,No,Married,Single, Divorced, 80K,Splitting Attributes,The splitting attribute at a node is determined based on the Gini index.,Hunts Method,An Example:Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)Class: Cheat, Dont Cheat,Binary Attributes: Computing GINI Index,Splits into two partitionsEffect of Weighing partitions: Larger and Purer Partitions are sought for.,Continuous Attributes: Computing Gini Index.,For efficient computation: for each attribute,Sort the attribute on valuesLinearly scan these values, each time updating the count matrix and computing gini indexChoose the split position that has the least gini index,Need for Parallel Formulations,Need to handle very large datasets.Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.Need for scalable, efficient (fast) data mining computationsgain competitive advantage.Handle larger data for greater accuracy in shorter times.,Constructing a Decision Tree in Parallel,Partitioning of data onlyglobal reduction per node is requiredlarge number of classification tree nodes gives high communication cost,n records,m categorical attributes,Synchronous Tree Construction Approach,No data movement is requiredLoad imbalancecan be eliminated by breadth-first expansionHigh communication costbecomes too high in lower parts of the tree,Partition Data Across Processors,Constructing a Decision Tree in Parallel,Partitioning of classification tree nodesnatural concurrencyload imbalance as the amount of work associated with each node varieschild nodes use the same data as used by parent nodeloss of localityhigh data movement cost,7,000 records,10,000 training records,3,000 records,Partitioned Tree Construction Approach,Highly concurrentHigh communication cost due to excessive data movementsLoad imbalance,Partition Data and Nodes,Hybrid Parallel Formulation,Load Balancing,Splitting Criterion,Switch to Partitioned Tree Construction whenSplitting criterion ensuresG. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994,Experimental Results,Data set function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT96)2 class labels, 3 categorical and 6 continuous attributesIBM SP2 with 128 processors66.7 MHz CPU with 256 MB real memoryAIX version 4high performance switch,Speedup Comparison of the Three Parallel Algorithms,0.8 million examples,1.6 million examples,Splitting Criterion Verification in the Hybrid Algorithm,0.8 million examples on 8 processors,1.6 million examples on 16 processors,Speedup of the Hybrid Algorithm with Different Size Data Sets,Summary of Algorithms for Categorical Attributes,Synchronous Tree Construction Approachno data movement requiredhigh communication cost as tree becomes bushyPartitioned Tree Construction Approachprocessors work independently once partitioned completelyload imbalance and high cost of data movementHybrid Algorithmcombines good features of two approachesadapts dynamically according to the size and shape of trees,Handling Continuous Attributes,Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence Undesirable!Discretize continuous attributesCLOUDS (Alsabti, Ranka, and Singh, 1998)SPEC (Srivastava, Han, Kumar, and Singh, 1997)Use a pre-sorted list for each continuous attributesSPRINT (Shafer, Agrawal, and Mehta, VLDB96)ScalParC (Joshi, Karypis, and Kumar, IPPS98),Association Rule Discovery: Definition,Given a set of records each of which contain some number of items from a given collection;Produce dependency rules which will predict occurrence of an item based on occurrences of other items.,Rules Discovered: Milk - Coke Diaper, Milk - Beer,Association Rule Discovery Application,Marketing and Sales PromotionSupermarket Shelf ManagementInventory Management,Association Rule Discovery: Support and Confidence,Association Rule:,Support:,Confidence:,Example:,Handling Exponential Complexity,Given n transactions and m different items:number of possible association rules: computation complexity:Systematic search for all patterns, based on support constraint Agarwal & Srikant:If A,B has support at least a, then both A and B have support at least a.If either A or B has support less than a, then A,B has support less than a.Use patterns of size k-1 to find patterns of size k.,Illustrating Apriori Principle,Items (1-itemsets),Minimum Support = 3,If every subset is considered, 6C1 + 6C2 + 6C3 = 41With support-based pruning,6 + 6 + 2 = 14,Counting Candidates,Frequent Itemsets are found by counting candidates.Simple way: Search for each candidate in each transaction.Expensive!,Transactions,Candidates,N,M,Association Rule Discovery: Hash tree for fast access.,1,4,7,2,5,8,3,6,9,Hash Function,Candidate Hash Tree,Parallel Formulation of Association Rules,Need:Huge Transaction Datasets (10s of TB)Large Number of Candidates.Data Distribution:Partition the Transaction Database, orPartition the Candidates, orBoth,Parallel Association Rules: Count Distribution (CD),Each Processor has complete candidate hash tree.Each Processor updates its hash tree with local data.Each Processor participates in global reduction to get global counts of candidates in the hash tree.Multiple database scans are required if the hash tree is too big to fit in the memory.,CD: Illustration,P0,P1,P2,Global Reduction of Counts,N/p,N/p,N/p,Parallel Association Rules: Data Distribution (DD),Candidate set is partitioned among the processors.Once local data has been partitioned, it is broadcast to all other processors.High Communication Cost due to data movement.Redundant work due to multiple traversals of the hash trees.,DD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,DataBroadcast,Parallel Association Rules: Intelligent Data Distribution (IDD),Data Distribution using point-to-point communication.Intelligent partitioning of candidate sets.Partitioning based on the first item of candidates.Bitmap to keep track of local candidate items.Pruning at the root of candidate hash tree using the bitmap.Suitable for single data source such as database server.With smaller candidate set, load balancing is difficult.,IDD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,1,2,3,5,bitmask,DataShift,Count,Count,Count,Filtering Transactions in IDD,bitmask,Parallel Association Rules: Hybrid Distribution (HD),Candidate set is partitioned into G groups to just fit in main memoryEnsures Good load balance with smaller candidate set.Logical processor mesh G x P/G is formed.Perform IDD along the column processorsData movement among processors is minimized.Perform CD along the row processorsSmaller number of processors is global reduction operation.,HD: Illustration,G Groups of Processors,P/G Processors per Group,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,Parallel Association Rules: Experimental Setup,128-processor Cray T3E600 MHz DEC Alpha (EV4)512MB of main memory per processor3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.MPI used for communications.Synthetic data set: avg transaction size 15 and 1000 distinct items.For larger data sets, multiple read of transactions in blocks of 1000.HD switch to CD after 90.7% of the total computation is done.,Scaleup Results (50K, 0.1%),Speedup Results (N=
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 小尾寒羊代养协议书
- 工程物资租赁协议书
- 扶贫对口援建协议书
- 扶贫资产使用协议书
- 批量茶叶订购协议书
- 承包出租合同协议书
- 承包方意见合同协议
- 承包矿山工程协议书
- 承包设备大线协议书
- 承包食堂饭店协议书
- 药物外渗的应急预案及处理
- 改性聚苯醚行业发展预测分析
- 大学课件-机电传动控制(完整)
- 中国各民族建筑风格英文介绍
- 六年级上册科学全册知识点(新改版苏教版)
- 大力弘扬新时代斗争精神PPT怎样弘扬新时代斗争精神PPT课件(带内容)
- 超市店长工作计划总结 超市店长年度工作计划
- 2023学年完整公开课版闽菜1
- 设备采购技术服务方案
- 安全监督先进个人主要事迹范文七篇
- GB/T 38661-2020电动汽车用电池管理系统技术条件
评论
0/150
提交评论