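High-Performance Data Mining Techniques and Their Applications
Dr. Ying Liu (刘莹), Associate Professor, yingliu
School of Information Science and Engineering, Graduate University of Chinese Academy of Sciences
2020/5/19

About the Speaker
  • 1999/07: B.S., Computer Science and Technology, Peking University
  • 2001/12: M.S., Computer Engineering, Northwestern University, USA
  • 2005/06: Ph.D., Computer Engineering, Northwestern University, USA
  • 2005/06 to 2005/11: Research Associate, Northwestern University, USA
  • 2006/01 to present: Associate Professor, School of Information Science and Engineering, Graduate University of Chinese Academy of Sciences; Research Center on Fictitious Economy and Data Science

Research Experience
  • NASA: Mass Storage Performance Information System
  • U.S. Department of Energy (DOE): Scientific Data Management Integrated Software Infrastructure Center
  • Intel: Characterizing Scalable Data Mining Kernels/Primitives on SMPs
  • U.S. National Science Foundation (NSF): High-Performance Techniques, Designs and Implementation of Software Infrastructure for Change Detection and Mining (IIS-0536994)

Research Experience (continued)
  • Led a commissioned project for the People's Bank of China: research on a personal credit scoring system
  • Principal investigator of a sub-project of a Natural Science Foundation Innovative Research Group grant: research on mining techniques for massive data
  • Principal investigator of a sub-project of a Natural Science Foundation Key Project: basic attributes and measurement models of trustworthy software processes
  • Principal investigator, Ministry of Education start-up fund for returned overseas scholars: mining traffic data streams based on sensor networks
  • Principal investigator, Graduate University of CAS President's Fund: research on utility-based data mining theory and techniques

Research Results
  • High-performance data mining in large-scale scientific simulation
      • A parallel scheme for HOP, a clustering algorithm used in astrophysical simulation
      • Suitable for very large scientific simulations; achieved very good speedups
      • Used by the San Diego Supercomputer Center (SDSC), USA
  • Performance evaluation of scalable data mining algorithms
      • Released NU-Minebench, the first benchmark suite for data mining algorithms; downloaded 1,666 times (2005/06/15 to present)
      • Used by Intel

Outline
  • Introduction to data mining
  • High-performance (parallel/distributed) data mining
  • Application examples
      • Cosmological simulation
      • Astronomy
      • Space operation
      • Ecosystem
      • Bioinformatics
  • Summary

Data Mining
  • A technique for automatically extracting hidden, latent, and valuable knowledge from massive data
  • The mined results (knowledge) are of interest to the user and feed management decision support systems
  • Characteristics of data mining techniques
      • Massive data
      • Automatic discovery from historical data
      • Efficient
      • Scales well
      • Models are updated quickly
      • Strongly application-oriented

Motivation for Data Mining: The Business Perspective
  • The volume of data collected and stored is too large
      • E-commerce
      • Business transaction data
      • Credit card transactions
      • Insurance
  • CPU processing speed grows by 15% per year, which cannot keep up with the growth in data volume
  • Better personalized services, advanced customer relationship management, and so on
  • Data explosion, knowledge poverty

Motivation for Data Mining: The Scientific Computing Perspective
  • Massive data (GB/hour)
      • Remote sensing data
      • Sky surveys by astronomical telescopes
      • Gene expression microarrays
      • Scientific simulation
  • Helps scientists analyze the data in many ways, such as classification and stratification

Origins of Data Mining
  • An interdisciplinary field
      • Statistical methods
      • Machine learning methods
      • Neural networks
      • Databases
      • Parallel computing
  • Traditional methods are limited when facing
      • Massive data
      • High-dimensional data
      • Heterogeneous data
      • Complex data types

Process
  [Figure: Databases and flat files -> Data Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Pattern Evaluation -> Knowledge]

Main Data Mining Techniques
  • Clustering
  • Anomaly detection
  • Classification
  • Prediction
  • Association rules mining
  • Sequential patterns
  • Time series

Clustering
  • Automatically partitions the data into clusters so that similarity between items in different clusters is minimized and similarity between items in the same cluster is maximized (does not rely on predefined classes and needs no training set)
  • Applications: pattern recognition, geographic information systems, image processing, gene sequence analysis, cosmological simulation, document clustering
  • Common algorithms: K-means, BIRCH, DBSCAN, EM

Anomaly Detection
  • Finds the data items that differ significantly from normal behavior
  • Applications: credit card fraud, medical data analysis, network intrusion detection
  • Common algorithms: clustering; statistical-based, distance-based, and deviation-based methods

Classification
  • Predicts the class of a new data record from the functional relationship, learned from the training data, between the record's fields and the known classes
  • Applications: market forecasting, customer relationship management (CRM), marketing strategy, credit scoring
  • Common algorithms: decision trees, Bayesian classification, neural networks, K-nearest neighbors
  [Figure: classification example with a "class" label]

Prediction
  • Builds a mathematical model from historical data and predicts the value of one attribute of a new record
  • Regression methods: linear and nonlinear curve fitting
  • Common algorithms: linear regression, logistic regression
  [Figure: regression plot with axis y]

Association Rules
  • Finds the frequent itemsets in the data, then finds how the items within each frequent itemset influence one another
  • Applications: market basket analysis, product bundle analysis, catalog design, cross-selling
  • Common algorithms: Apriori, DIC, FP-growth
  • Example: let A be "there is a cold vortex near Beijing" and B be "there is precipitation in the Beijing area". A and B occur together with high probability (support s = 60%) and P(B|A) is high (confidence c = 75%), so A leads to B.

Sequential Patterns
  • Finds the frequent events in time-ordered data, then finds how the items within each frequent set influence one another
  • Applications: telecommunications, marketing, DNA sequence analysis
  • Common algorithms: GSP, SPADE
  • Example (along a time axis): buy a PC, buy a printer, buy ink cartridges, buy a new CPU. Customers who bought a new computer are likely to buy a new CPU 9 months later. Marketing tactic: proactively recommend one to the customer after 9 months in order to retain the customer.

Time Series
  • A sequence of values that changes over time; the analysis covers the period of a sequence, the similarity between sequences, and the prediction of a sequence's trend
  • Applications: stock prices, medical diagnosis, electricity consumption, traffic flow
  [Figure: price versus time]

Why High Performance Data Mining?
  • Lots of data is being collected in the commercial and scientific worlds: massive data sets
  • Strong competitive pressure to extract and use the information in the data, e.g.
      • Climate simulation
      • Astrophysics
      • Molecular biology

Why High Performance Data Mining? (continued)
  • The data and/or the computational resources needed for analysis are often distributed
  • Sometimes the choice is distributed data mining or no data mining
      • Ownership, privacy, and security issues
  • Accelerate the computation
  • Use more memory from multiple machines
  • Solution: parallel computing!

Progress in HPC: The Past 6 Decades
  • ENIAC, 1945: 100 kHz, 5K additions/second, 357 multiplications/second
  • CPU power is increasing by a factor of 30 to 100 every decade
  • Multi-GHz, multi-gigabyte, multi-core CPUs are commodity
  • Teraflops computers are common
  • Petaflops-scale computing is within reach
  [Figures: IBM BlueGene/L; Jaguar, Cray XT4/XT3, Oak Ridge National Laboratory; EKA (HP Cluster Platform 3000 BL), Computational Research Laboratories]

TOP 10 Machines, 7/2008
  [Table image not recoverable]

Supercomputers in China
  • June 2004: Dawning (曙光) super server, peak speed of [figure missing] trillion operations per second, ranked tenth in the world, located at the Shanghai Supercomputer Center
  • June 2008: Dawning 5 (曙光5) super server, peak speed of 160 trillion operations per second, located at the Shanghai Supercomputer Center
  • Lenovo DeepComp 6800 (联想深腾6800) grid supercomputer: 265 four-way nodes, 1,060 processor chips, peak speed of 5 trillion operations per second, ranked 14th in the world TOP500 list of November 2003, located at the Computer Network Information Center of the Chinese Academy of Sciences

Architectures: Shared Address Space
  • All processors share a single global address space
  • A single address space facilitates a simple programming model
  • Examples: SGI Origin 3000, IBM SP2

Architectures: Message Passing Platform
  • Each processor has local memory with a local address space
  • The only way to exchange data is explicit message passing
  • The time taken by a message depends on the relative locations of the source and destination processors
  • The performance of a parallel program is determined by how well the location of the data matches its use
  • Example: clusters; IBM SP and SGI Origin 3000 support it

Architectures: Hybrid
  • Clusters of 4-way SMPs
  • Most popular

Parallel Programming
  • Construct or modify a serial program to solve a given problem on a parallel machine
  • It is the programmer's responsibility to identify ways to decompose the computation and extract concurrency
  • An exact copy of the program runs on each processor
  • Complex programming

Parallel Programming (continued)
  • Data parallelism
      • Partition the data across processors
      • Each processor performs the same operations on its local data partition
  • Task parallelism
      • Assign independent modules to different processors
      • Each processor performs different operations

Outline
  • Introduction to data mining
  • High-performance (parallel/distributed) data mining
  • Application examples
      • Cosmological simulation
      • Space operation
      • Ecosystem
      • Bioinformatics
      • Astronomy
  • Summary

Cosmological Simulation
  • An N-body simulation numerically approximates the evolution of the universe
  • Each body represents a galaxy or a star, and bodies attract each other through the gravitational force
  • Similar applications
      • Protein folding
      • Turbulent fluid flow simulation

HOP Clustering Algorithm
  • It is difficult to discern which particles belong to the same group or cluster, and doing so is computationally intensive
  • HOP: a density-based clustering algorithm by Daniel J. Eisenstein and Piet Hut, 1998
  • Automatically identifies groups of particles in an N-body simulation
  • Particle attributes: mass, three-dimensional coordinates
  • Four processing stages:
      • Constructing a KD tree
      • Generating density
      • Hopping
      • Grouping

HOP Clustering Algorithm: Constructing a KD Tree
  • Find the median particle on the longest axis
  • Recursively bisect the particles along the longest axis
  • Nearby particles end up in the same sub-domain
  • Each internal node contains the boundary of its sub-domain
  [Figure: two-dimensional KD tree]

HOP Clustering Algorithm: Remaining Stages
  • Generating density
      • Traverse the tree to find the N_dens neighbors of every particle
      • Assign an estimated density to every particle
  • Hopping
      • Associate each particle with its densest neighbor
      • Each particle "hops" to its densest neighbor until it reaches a particle that is its own densest neighbor
  • Grouping
      • Define the particles associated with the same densest neighbor as a group
      • Refine and merge groups

HOP Clustering Parallelization
Ying Liu, Wei-keng Liao, Alok Choudhary, Northwestern University, USA
  • Key ideas
      • Load balance: assign an approximately equal number of particles to each processor
      • Minimize communication overheads: requests for potentially required remote particles are packed into a single message, and the required particles are transferred to the requesting processors

HOP Parallelization: Constructing the Parallel KD Tree
  • Assign an approximately equal number of particles to each processor
  • Find the median particle on the longest axis in parallel
  • Bisect the particles along the longest axis
  • Exchange particles between the bisected processors
  • Build the local KD tree
  • Maintain a global tree on each processor, with no real particle transfer

HOP Parallelization: Generating Density
  • Intersection test
  • Send out a single message to request the required remote particles
  • Transfer the required particles
  • Search for neighbors
  • Calculate the density

HOP Parallelization: Hopping and Grouping
  • Hopping
      • Hop to the highest-density neighbor
      • Book the required remote particles and send out requests
      • Transfer the required particles to the requesting processors
  • Grouping
      • Particles linked to the same densest particle are defined as a group
      • Refine groups

Experiment
  • ENZO
      • An adaptive mesh refinement (AMR), grid-based hybrid code (hydro + N-body) that simulates cosmological structure formation
      • Uses the algorithms of Berger & Colella to improve spatial and temporal resolution in regions of large gradients, such as gravitationally collapsing objects
      • The software is flexible and can simulate a wide range of cosmological situations
      • Parallelized using MPI; can run on any shared or distributed memory parallel supercomputer or cluster
      • Simulations on 1,024 processors have been carried out on the San Diego Supercomputer Center's Blue Horizon, an IBM SP

Data Source
  • Each data set contains 491,520 particles
  [Figures: Dataset 1, Dataset 2]

Performance Evaluation: Total Execution Time
  • Density generation is the most time-consuming stage
  • Dataset 2 takes longer to execute
  [Figures: total execution time, Dataset 1 and Dataset 2]

Performance Evaluation: Speedups for Total Execution Time
  • Overall performance scales up on the IBM SP2 and the SGI Origin 2000 as the number of processors increases
  • It scales up to 32 processors on the Linux cluster
  [Figures: speedups for total execution time, Dataset 1 and Dataset 2]

Performance Evaluation: Speedups for Generating Density
  • The density generation stage scales up on the IBM SP2 and the SGI Origin 2000
  • It scales up to 32 processors on the Linux cluster
  [Figures: speedups for generating density, Dataset 1 and Dataset 2]

Performance Evaluation: Communication Costs
  • Communication time does not scale well
  • Communication time increases when the number of processors goes beyond 32
  [Figures: communication costs, Dataset 1 and Dataset 2]

Outline
  • Introduction to data mining
  • Application examples
      • Cosmological simulation
      • Astronomy
      • Space operation
      • Ecosystem
      • Bioinformatics
  • High-performance (parallel/distributed) data mining
  • Summary

Astronomy: University of Baltimore, USA
Predictive Mining of Time Series Data in Astronomy
  • Goal: discover interesting periodic behavior or coincidences within one celestial object or across different objects, and use that periodic behavior to predict or analyze the behavior of celestial objects
  • Algorithm
      • Treat the data collected by each telescope as a time series
      • Process the time series with a sliding window to obtain subsequences
      • Cluster the subsequences to obtain the patterns they contain
      • Represent the time series by these patterns
  • Significance
      • Meaningful association rules are obtained, of the form: if pattern A appears in time series 1, then within the following time T there is a c% chance that pattern B appears in time series 1
      • Patterns from different time series can be compared
  [Figure: framework for analyzing the periodic behavior of celestial objects]

Astronomy: Lawrence Livermore National Laboratory, USA
Mining the FIRST survey for galaxies with a bent-double morphology
  • FIRST: Faint Images of the Radio Sky at Twenty Centimeters
  • The radio equivalent of the Palomar Observatory Sky Survey (POSS)
  • A 10,000 square degree survey of the North Galactic Cap
  • Uses the NRAO Very Large Array (VLA), B configuration

Astronomy: The FIRST Data
  • 1.8-arcsecond pixels, 5-arcsecond resolution, rms 0.15 mJy
  • 90 radio sources per square degree at a 1 mJy threshold
  • The morphological type of a radio source provides clues to its emission mechanism, its source class, and the properties of the surrounding medium
  • The raw data from the telescopes is extensively processed
  • Image maps and catalog are available (link omitted in the source)

Astronomy: Using Data Mining to Find "Bent-Doubles" in FIRST
  • FIRST astronomers are interested in "bent-doubles"
      • They indicate the presence of clusters of galaxies
      • They are first "identified" using a visual technique, followed by optical observations and checks against other surveys
  • Visual identification is no longer feasible
      • Subjective, tedious, and likely to miss cases
      • 900,000 galaxies in the full survey
  • Goal: replace the visual identification of bent-doubles with a semi-automated one
  • Detecting bent-double galaxies in 250 GB of image data and 78 MB of catalog data (as of 7/2000)

Astronomy: Methodology
  • Group the catalog entries into a "galaxy"
  • Separate the sources based on the number of catalog entries
      • 1-entry sources: unlikely to be bent-doubles
      • 3-entry sources: all "interesting"
      • Study the 2- and 3-entry sources separately
      • This splits an already small training set (313 -> 118 + 195)
  • Calculate features for a galaxy (103 features)
  • Use the features to train a decision tree
  • Use the tree to classify the unlabeled galaxies and validate the results
  • Use the validated results to enhance the training set

Astronomy: Results
  • Results using a single tree for 3-entry sources were satisfactory
  • Labeled training set: 167 bents, 28 non-bents
  • Performed several inner iterations using pruned trees (C5.0 decision tree software)
  • Ten 10-fold cross-validation errors, mean (SE)
      • Using all the features: 9.7% (0.3%)
      • Using triple features only: 10.7% (0.3%)
  • Discriminating features include geometrically calculated angles, relative distances, ellipticity, and symmetry measures

New Trends: GPUs + CUDA
  • GPU (Graphics Processing Unit): a special-purpose processor for graphics
  [Figure: floating-point operations per second, CPU versus GPU]

New Trends: GPUs + CUDA (continued)
  • How the GPU architecture differs from the CPU
      • More transistors
      • Many cores driven by high memory bandwidth
  • GPU advantages
      • Low cost (a few hundred US dollars)
      • Multithreaded (hundreds of threads)
      • Far more efficient than the CPU on compute-intensive data
  • GPU disadvantage
      • Hard to program

New Trends: GPUs + CUDA (continued)
  • CUDA
      • A programming environment based on industry-standard C for developing GPU computing applications
      • The GPU executes a very large number of threads in parallel
      • The CPU offloads the compute-intensive, highly parallel parts to the GPU
      • Easy to program
  [Figure: software stack]

New Trends: GPUs + CUDA (continued)
  • Work that previously required a workstation can now be done on a PC
  • A revolutionary advance in supercomputing
  • Success stories
      • Stanford University used CUDA to develop a version of Folding@home that runs on GPUs, with a peak speed up to 140 times that of the CPU. Folding@home simulates protein folding to find the consequences of protein misfolding.
      • Elemental Technologies used CUDA to develop the GPU-based Badaboom software, with which video transcoding is up to 18 times faster than with traditional methods.
      • With the help of CUDA, geographic information system computations that used to take 20 minutes now finish in 30 seconds, and computations that used to take 30 to 40 seconds now run in real time. CUDA is the most revolutionary technology to emerge in the computing industry since the invention of the microprocessor." M
      • The University of Illinois at Urbana-Champaign (UIUC) uses GPUs for parallel molecular dynamics to analyze large biomolecular systems. "Future gains in computing performance will come directly from massively parallel hardware such as many-core GPUs. The biggest challenge at present is parallelizing code so that it makes better use of that hardware, and CUDA has achieved a breakthrough that is moving this field forward." Professor Wen-mei Hwu (胡文美)
      • Developers in financial analysis, astrophysics, seismic imaging, and other fields are benefiting from the CUDA development tools. "With CUDA we can easily exploit the processing power of the GPU and reduce the time and money invested. A host system equipped with two Tesla D870 units costs far less than building a 16-core cluster." Technician
      • "Volera needs only 12 GPUs to analyze the entire U.S. options market in real time, with a latency of no more than 10 microseconds. Reaching that speed would normally require at least 60 traditional 1U servers. By using GPUs, our customers get better returns with lower maintenance costs, lower power consumption, and a smaller footprint." Hanweck Associates

Data Mining on GPU
  • University of Virginia
  • University of Illinois at Urbana-Champaign
  • University of California at Davis
  • Graduate University of Chinese Academy of Sciences

High Performance Scientific Data Mining Projects
  • Hillol Kargupta, University of Maryland, USA
      • Distributed Data Mining for Scalable Analysis of Data from Virtual Observatories, NASA, 2007-2010
      • Astronomers are unable to tap the riches of this collection of gigabyte, terabyte, and (eventually) petabyte catalogs without a computational backbone that includes support for queries and data mining across distributed virtual tables of decentralized, joined, and integrated sky survey catalogs.
      • (1) Design and implement distributed algorithms for computing statistical primitives, principal component analysis, and outlier detection from distributed astronomy catalogs and their partial images stored in users' local data management systems.
      • (2) Develop a prototype system that will offer a rich collection of web services based on various DDM algorithms.
      • (3) Search for unusual correlations, outliers, sub-clusters, and fundamental planes within the multi-dimensional parameter space presented by several large surveys.
  • Vipin Kumar, University of Minnesota, USA
      • Discovery of Patterns in the Global Climate System using Data Mining, NASA, NOAA, and NSF
      • Data Mining for Bio-medical Informatics
  • David Skillicorn, Queen's University, Canada
      • Tre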
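
The density, hopping, and grouping stages of HOP described in the slides above can be illustrated with a small serial sketch. This is not the authors' implementation: it replaces the KD tree with a brute-force neighbor search, uses a crude density estimate (neighbor count divided by the cube of the distance to the farthest of the n_dens neighbors), and skips the refine-and-merge step. The function names and the parameters n_dens and n_hop are assumptions made for illustration.

```python
import math


def nearest_neighbours(points, i, k):
    """Indices of the k points closest to points[i] (brute force, no KD tree)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def hop_groups(points, n_dens=3, n_hop=3):
    """Return, for every particle, the index of the density peak it hops to."""
    n = len(points)

    # Generating density: a crude estimate, assumed here for illustration.
    density = []
    for i in range(n):
        nbrs = nearest_neighbours(points, i, n_dens)
        radius = math.dist(points[i], points[nbrs[-1]])
        density.append(n_dens / (radius ** 3 + 1e-12))

    # Hopping: link each particle to the densest among itself and its neighbours.
    # max() keeps the particle itself on ties, so every hop strictly raises density.
    target = []
    for i in range(n):
        candidates = [i] + nearest_neighbours(points, i, n_hop)
        target.append(max(candidates, key=lambda j: density[j]))

    # Grouping: follow the chain until a particle is its own densest neighbour.
    def peak(i):
        while target[i] != i:
            i = target[i]
        return i

    return [peak(i) for i in range(n)]


# Two well-separated clumps of four particles each (made-up coordinates).
clump = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
particles = clump + [(x + 20, y + 20, z + 20) for x, y, z in clump]
labels = hop_groups(particles)
```

Particles that share a label belong to the same group. In the parallel scheme from the slides, the neighbor search is the step that needs remote particles, which is why the requests are packed into a single message per processor.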
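
The support and confidence figures in the association rule example from the slides (s = 60%, c = 75% for "cold vortex near Beijing" leading to "precipitation in the Beijing area") can be reproduced with a few lines of code. The 20 daily observations below are invented so that the counts match those percentages; they are not data from the talk, and the function names are assumptions.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transaction counts."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Hypothetical daily records: A = cold vortex near Beijing, B = precipitation.
days = [{"A", "B"}] * 12 + [{"A"}] * 4 + [{"B"}] * 1 + [set()] * 3

s = support(days, {"A", "B"})       # 12 of 20 days contain both A and B
c = confidence(days, {"A"}, {"B"})  # 12 of the 16 days with A also have B
```

Algorithms such as Apriori use the support threshold to prune candidate itemsets first and only then compute the confidence of the rules derived from the surviving frequent itemsets.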
