




已阅读5页,还剩46页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Scalable Parallel Data Mining,Eui-Hong (Sam) HanDepartment of Computer Science and EngineeringArmy High Performance Computing Research CenterUniversity of MinnesotaResearch Supported by NSF, DOE, Army Research Office, AHPCRC/ARL/hanJoint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh,What is Data Mining?,Many DefinitionsSearch for Valuable Information in Large Volumes of Data.Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.A Step in the KDD Process,Why Mine Data? Commercial ViewPoint.,Lots of data is being collected and warehoused.Computing has become affordable.Competitive Pressure is Strong Provide better, customized services for an edge.Information is becoming product in its own right.,Why Mine Data?Scientific Viewpoint.,Data collected and stored at enormous speeds (Gbyte/hour)remote sensor on a satellitetelescope scanning the skiesmicroarrays generating gene expression datascientific simulations generating terabytes of dataTraditional techniques are infeasible for raw dataData mining for data reduction.cataloging, classifying, segmenting dataHelps scientists in Hypothesis Formation,Data Mining Tasks.,Classification PredictiveClustering DescriptiveAssociation Rule Discovery DescriptiveSequential Pattern Discovery DescriptiveRegression PredictiveDeviation Detection Predictive,Classification Example,categorical,categorical,continuous,class,Training Set,Learn Classifier,Classification Application,Direct MarketingFraud DetectionCustomer Attrition/ChurnSky Survey Cataloging,Example Decision Tree,categorical,categorical,continuous,class,Refund,MarSt,TaxInc,YES,NO,NO,NO,Yes,No,Married,Single, Divorced, 80K,Splitting Attributes,The splitting attribute at a node is determined based on the Gini index.,Hunts Method,An Example:Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)Class: Cheat, Dont Cheat,Binary Attributes: Computing GINI Index,Splits into two partitionsEffect of Weighing partitions: Larger and Purer Partitions are sought for.,Continuous Attributes: Computing Gini Index.,For efficient computation: for each attribute,Sort the attribute on valuesLinearly scan these values, each time updating the count matrix and computing gini indexChoose the split position that has the least gini index,Need for Parallel Formulations,Need to handle very large datasets.Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.Need for scalable, efficient (fast) data mining computationsgain competitive advantage.Handle larger data for greater accuracy in shorter times.,Constructing a Decision Tree in Parallel,Partitioning of data onlyglobal reduction per node is requiredlarge number of classification tree nodes gives high communication cost,n records,m categorical attributes,Synchronous Tree Construction Approach,No data movement is requiredLoad imbalancecan be eliminated by breadth-first expansionHigh communication costbecomes too high in lower parts of the tree,Partition Data Across Processors,Constructing a Decision Tree in Parallel,Partitioning of classification tree nodesnatural concurrencyload imbalance as the amount of work associated with each node varieschild nodes use the same data as used by parent nodeloss of localityhigh data movement cost,7,000 records,10,000 training records,3,000 records,Partitioned Tree Construction Approach,Highly concurrentHigh communication cost due to excessive data movementsLoad imbalance,Partition Data and Nodes,Hybrid Parallel Formulation,Load Balancing,Splitting Criterion,Switch to Partitioned Tree Construction whenSplitting criterion ensuresG. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994,Experimental Results,Data set function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT96)2 class labels, 3 categorical and 6 continuous attributesIBM SP2 with 128 processors66.7 MHz CPU with 256 MB real memoryAIX version 4high performance switch,Speedup Comparison of the Three Parallel Algorithms,0.8 million examples,1.6 million examples,Splitting Criterion Verification in the Hybrid Algorithm,0.8 million examples on 8 processors,1.6 million examples on 16 processors,Speedup of the Hybrid Algorithm with Different Size Data Sets,Summary of Algorithms for Categorical Attributes,Synchronous Tree Construction Approachno data movement requiredhigh communication cost as tree becomes bushyPartitioned Tree Construction Approachprocessors work independently once partitioned completelyload imbalance and high cost of data movementHybrid Algorithmcombines good features of two approachesadapts dynamically according to the size and shape of trees,Handling Continuous Attributes,Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence Undesirable!Discretize continuous attributesCLOUDS (Alsabti, Ranka, and Singh, 1998)SPEC (Srivastava, Han, Kumar, and Singh, 1997)Use a pre-sorted list for each continuous attributesSPRINT (Shafer, Agrawal, and Mehta, VLDB96)ScalParC (Joshi, Karypis, and Kumar, IPPS98),Association Rule Discovery: Definition,Given a set of records each of which contain some number of items from a given collection;Produce dependency rules which will predict occurrence of an item based on occurrences of other items.,Rules Discovered: Milk - Coke Diaper, Milk - Beer,Association Rule Discovery Application,Marketing and Sales PromotionSupermarket Shelf ManagementInventory Management,Association Rule Discovery: Support and Confidence,Association Rule:,Support:,Confidence:,Example:,Handling Exponential Complexity,Given n transactions and m different items:number of possible association rules: computation complexity:Systematic search for all patterns, based on support constraint Agarwal & Srikant:If A,B has support at least a, then both A and B have support at least a.If either A or B has support less than a, then A,B has support less than a.Use patterns of size k-1 to find patterns of size k.,Illustrating Apriori Principle,Items (1-itemsets),Minimum Support = 3,If every subset is considered, 6C1 + 6C2 + 6C3 = 41With support-based pruning,6 + 6 + 2 = 14,Counting Candidates,Frequent Itemsets are found by counting candidates.Simple way: Search for each candidate in each transaction.Expensive!,Transactions,Candidates,N,M,Association Rule Discovery: Hash tree for fast access.,1,4,7,2,5,8,3,6,9,Hash Function,Candidate Hash Tree,Parallel Formulation of Association Rules,Need:Huge Transaction Datasets (10s of TB)Large Number of Candidates.Data Distribution:Partition the Transaction Database, orPartition the Candidates, orBoth,Parallel Association Rules: Count Distribution (CD),Each Processor has complete candidate hash tree.Each Processor updates its hash tree with local data.Each Processor participates in global reduction to get global counts of candidates in the hash tree.Multiple database scans are required if the hash tree is too big to fit in the memory.,CD: Illustration,P0,P1,P2,Global Reduction of Counts,N/p,N/p,N/p,Parallel Association Rules: Data Distribution (DD),Candidate set is partitioned among the processors.Once local data has been partitioned, it is broadcast to all other processors.High Communication Cost due to data movement.Redundant work due to multiple traversals of the hash trees.,DD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,DataBroadcast,Parallel Association Rules: Intelligent Data Distribution (IDD),Data Distribution using point-to-point communication.Intelligent partitioning of candidate sets.Partitioning based on the first item of candidates.Bitmap to keep track of local candidate items.Pruning at the root of candidate hash tree using the bitmap.Suitable for single data source such as database server.With smaller candidate set, load balancing is difficult.,IDD: Illustration,All-to-All Broadcast of Candidates,9,1,3,1,2,10,3,4,2,3,12,10,P0,P1,P2,RemoteData,RemoteData,RemoteData,1,2,3,5,bitmask,DataShift,Count,Count,Count,Filtering Transactions in IDD,bitmask,Parallel Association Rules: Hybrid Distribution (HD),Candidate set is partitioned into G groups to just fit in main memoryEnsures Good load balance with smaller candidate set.Logical processor mesh G x P/G is formed.Perform IDD along the column processorsData movement among processors is minimized.Perform CD along the row processorsSmaller number of processors is global reduction operation.,HD: Illustration,G Groups of Processors,P/G Processors per Group,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,N/P,Parallel Association Rules: Experimental Setup,128-processor Cray T3E600 MHz DEC Alpha (EV4)512MB of main memory per processor3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.MPI used for communications.Synthetic data set: avg transaction size 15 and 1000 distinct items.For larger data sets, multiple read of transactions in blocks of 1000.HD switch to CD after 90.7% of the total computation is done.,Scaleup Results (50K, 0.1%),Speedup Results (N=
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 普工劳务合同
- 第5单元 第1章 第2节 病毒(新说课稿)2025-2026学年八年级上册生物(冀少版)
- 2025地铁项目居间合同
- 科技创新周活动试题及答案
- 文员毕业实践试题带答案
- 弱电工程师试用期试题带答案
- 2025年重庆购房贷款合同
- 挂车企业安全员考试题库及答案解析
- 肌肤护理 专业知识题库及答案解析
- 重庆市道路运输从业考试及答案解析
- 周大生珠宝知识培训课件
- 2025年中国科学院研究所招聘面试模拟题答案及解析版支撑岗
- 辽宁省名校联盟2025年高三10月份联合考试 语文试卷(含答案详解)
- 2025年政府采购评审专家考试试题及答案
- 四川省巴中市2025年下半年事业单位公开考试招聘工作人员(258人)考试参考试题及答案解析
- 2025年及未来5年中国止汗露行业市场运行现状及投资战略研究报告
- 2025年药品安全管理自查报告
- 2025-2030中国光纤传感技术在风电设备状态监测中的应用实践报告
- 带电作业培训课件
- 2025年下半年银行从业资格证考试风险管理复习题库及答案
- 燃气安全使用管理制度范本
评论
0/150
提交评论