Data Mining: Concepts and Techniques
Slides for Textbook
Chapter 1
© Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
/~hanj
1/12/2023 Data Mining: Concepts and Techniques

Acknowledgements
This set of slides started with Han’s tutorial for UCLA Extension course in February 1998
Other subsequent contributors:
  Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
  Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
  Graduate students from Univ. of Illinois at Urbana-Champaign

CS497JH Schedule (Fall 2002)
Chapter 1. Introduction {W1:L1}
Chapter 2. Data pre-processing {W4:L1-2}
  Homework #1 distribution (SQL Server 2000)
Chapter 3. Data warehousing and OLAP technology for data mining {W2:L1-2, W3:L1-2}
  Homework #2 distribution
Chapter 4. Data mining primitives, languages, and system architectures {W5:L1}
Chapter 5. Concept description: Characterization and comparison {W5:L2, W6:L1}
Chapter 6. Mining association rules in large databases {W6:L2, W7:L1-L21, W8:L1}
  Homework #3 distribution
Chapter 7. Classification and prediction {W8:L2, W9:L2, W10:L1}
  Midterm {W9:L1}
Chapter 8. Clustering analysis {W10:L2, W11:L1-2}
  Homework #4 distribution
Chapter 9. Mining complex types of data {W12:L1-2, W13:L1-2}
Chapter 10. Data mining applications and trends in data mining {W14:L1}
Research/Development project presentation (W14-W15 + final exam period)
Final Project Due

Where to Find the Set of Slides?
Book page (MS PowerPoint files): /~hanj/dmbook
Updated course presentation slides (.ppt): /~cs497jh/
Research papers, DBMiner system, and other related information: /~hanj or dbminer

Chapter 1. Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining

Necessity Is the Mother of Invention
Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
2000s:
  Stream data management and mining
  Data mining with a variety of applications
  Web technology and global information systems

What Is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
  (Deductive) query processing.
  Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications
Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
Other Applications
  Text mining (newsgroup, email, documents) and Web mining
  Stream data mining
  DNA and bio-data analysis

Market Analysis and Management
Where does the data come from?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
Cross-market analysis
  Associations/co-relations between product sales, & prediction based on such association
Customer profiling
  What types of customers buy what products (clustering or classification)
Customer requirement analysis
  identifying the best products for different customers
  predict what factors will attract new customers
Provision of summary information
  multidimensional summary reports
  statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management
Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
  summarize and compare the resources and spending
Competition
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
  Auto insurance: ring of collisions
  Money laundering: suspicious monetary transactions
  Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism

Other Applications
Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process
Data mining: core of knowledge discovery process
[Figure: KDD process diagram. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]

Steps of a KDD Process
Learning the application domain
  relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
  Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
  summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

Data Mining and Business Intelligence
[Figure: pyramid labeled "Increasing potential to support business decisions". Roles: End User, Business Analyst, Data Analyst, DBA. Layer labels as extracted: Making Decisions; Data Presentation, Visualization Techniques; Data Mining, Information Discovery; Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts; Data Sources: Paper, Files, Information Providers, Database Systems, OLTP]

Architecture: Typical Data Mining System
[Figure: system architecture diagram. Labels: Graphical user interface, Pattern evaluation, Data mining engine, Database or data warehouse server, Knowledge-base, Data cleaning & data integration, Filtering, Databases, Data Warehouse]

Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
  Object-relational database
  Spatial and temporal data
  Time-series data
  Stream data
  Multimedia database
  Heterogeneous and legacy database
  Text databases & WWW

Data Mining Functionalities
Concept description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
  Diaper → Beer [0.5%, 75%]
Classification and Prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
    E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision-tree, classification rule, neural network
  Predict some unknown or missing numerical values

Data Mining Functionalities (2)
Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? No! Useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation: regression analysis
  Sequential pattern mining, periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses

Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are interesting
  Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
  Can a data mining system find all the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones.
    Generate only the interesting patterns: mining query optimization

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, Other Disciplines]

Data Mining: Classification Schemes
General functionality
  Descriptive data mining
  Predictive data mining
Different views, different classifications
  Kinds of data to be mined
  Kinds of knowledge to be discovered
  Kinds of techniques utilized
  Kinds of applications adapted

Multi-Dimensional View of Data Mining
Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, Data warehouse systems coupling
  No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining data
  integration of mining and OLAP technologies
Interactive mining multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
  Characterized classification, first clustering and then association

An OLAM Architecture
[Figure: four-layer architecture diagram. Layers: Layer 4 User Interface, Layer 3 OLAP/OLAM, Layer 2 MDDB, Layer 1 Data Repository. Component labels: User GUI API, OLAM Engine, OLAP Engine, Data Cube API, MDDB, Meta Data, Database API, Filtering & Integration, Filtering, Data cleaning, Data integration, Databases, Data Warehouse. Flows: Mining query, Mining result]

Major Issues in Data Mining
Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy

Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining

A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
More conferences on data mining
  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

Where to Find References?
Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journal: Data Mining and Knowledge Dis
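
Worked example: support and confidence
The association example on the Data Mining Functionalities slide, Diaper → Beer [0.5%, 75%], attaches two numbers to a rule. Read as [support, confidence], they correspond to the objective interestingness measures the slides name. The Python sketch below shows one common way to compute them, under these assumptions: support is the fraction of all transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left side that also contain the right side. The function name `rule_measures` and the toy transactions are hypothetical and do not come from the slides.

```python
from typing import Iterable, Set, Tuple


def rule_measures(
    transactions: Iterable[Set[str]],
    antecedent: Set[str],
    consequent: Set[str],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing antecedent and consequent
    confidence = share of antecedent-containing transactions that also
                 contain the consequent
    """
    total = 0
    with_antecedent = 0
    with_both = 0
    for items in transactions:
        total += 1
        if antecedent <= items:          # antecedent is a subset of the basket
            with_antecedent += 1
            if consequent <= items:      # consequent is present as well
                with_both += 1
    # Guard against empty input or an antecedent that never occurs.
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical toy baskets, invented purely for illustration.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

if __name__ == "__main__":
    support, confidence = rule_measures(TRANSACTIONS, {"diaper"}, {"beer"})
    print(f"Diaper -> Beer [{support:.1%}, {confidence:.1%}]")
```

In this toy data two of five baskets contain both items and three contain diapers, so the rule has support 2/5 and confidence 2/3. A "generate, then filter" miner in the sense of the completeness slide would compute such measures for every candidate rule and keep only those above user-chosen thresholds.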