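Part I, Data Mining Fundamentals
Chapter 1: Data Mining: A First View
2020/5/5, BUPT AI&DM

Content
- 1.1 What is Data Mining? Definition
- 1.2 What Can Computers Learn?
- 1.3 Is Data Mining Appropriate for My Problem?
- 1.4 Expert Systems or Data Mining?
- 1.6 Why Not Simple Search?

1.1 What is Data Mining: Motivation
Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- Such amounts of data are beyond human understanding.
- We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
- Data warehousing: for data storage
- Data mining: for extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

1.1 Data mining is a result of the natural evolution of information technology
- 1960s: Data collection and database creation
- 1970s to early 1980s: Database management systems
- Mid-1980s to present: Data warehouse; data analysis and understanding (data mining)

Data Analysis: New Trend
This is a time when one must speak with data.
- The future belongs to the number crunchers (Super Crunchers, Ian Ayres, 2009): everyday decisions will become more and more automated, and the role of human judgment will be limited to supplying data for the computation.
- Predicting the taste and aroma of wine: Orley Ashenfelter, an economist at Princeton University, knows nothing about winemaking, yet he can predict the price of Bordeaux wines from the weather (hot, dry years produce very good wine), more accurately than wine experts.
- The book was originally going to be called "The End of Theory"; the title was later changed using Google rather than through discussion with the publisher's editors, because the new title was found to get a 63% higher click-through rate.
- Loan officers were once well paid and carried the greatest responsibility; now they are merely call-center operators who repeat the questions a computer prompts, for low pay.

Data Analysis: New Trend (cont.)
This is a time when one must speak with data.
- Gene sequencing and new species: Craig Venter used high-speed computers capable of analyzing data, moving from sequencing the genes of single organisms to sequencing the ocean from 2003 and the air in 2005. In the process, thousands of previously unknown bacteria and other life forms were discovered. He has advanced biology more than any of his contemporaries.
- In the past, Shanghai GM's analysis of warranty problems relied mainly on simple, purely manual calculation, and each round produced only a handful of problem reports. Although vehicle output was far lower than it is today, this slow and laborious analysis cycle was a root cause of persistently high warranty costs. Without automation, it took an average of 6 to 12 months from the appearance of a warranty claim to the identification of its cause, often with support from GM's global organization, and the whole problem-solving process rested mainly on experience-based analysis. In addition, inaccurate data made it hard for Shanghai GM to forecast warranty costs accurately and so to budget sensibly for the next warranty period, which tied up large amounts of operating capital and reduced cash flow. After adopting the SAS warranty analysis solution, Shanghai GM shortened its warranty analysis cycle by 70% within the first 6 months, effectively lowering warranty costs and meeting the goals set for the system. These improvements also helped Shanghai GM recover its entire hardware and software investment in the warranty analysis system within just half a year, saving the company a total of RMB 18 million.
- Police geographic information system

Data Mining Definitions
- (1) The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data. (in this textbook)
- (2) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. (generally accepted)

Induction-based Learning
- Data mining methods use induction-based learning.
- Induction-based learning: the process of forming general concept definitions by observing specific examples of concepts to be learned.

What Is Data Mining?
Alternative names:
- Data mining or knowledge mining? Gold mining: poor analogy
- Knowledge discovery in databases (KDD), business intelligence

Why Data Mining? Potential Applications (or p4)
- Database analysis and decision support
  - Market analysis and management: target marketing, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, quality control
  - Fraud detection and management
- Other applications
  - Text mining (newsgroup, email, documents) and Web analysis

Content
- 1.1 What is Data Mining? Definition
- 1.2 What Can Computers Learn?
  - Four Levels of Learning (omitted)
  - Three Concept Views (omitted)
  - Supervised Learning
  - Unsupervised Learning
- 1.3 Is Data Mining Appropriate for My Problem?
- 1.4 Expert Systems or Data Mining?
- 1.6 Why Not Simple Search?

1.2.1 Supervised Learning
- Build a learner model using data instances of known origin.
- Use the model to determine the outcome of new instances of unknown origin.
- Attributes: input attributes, output attributes
- Process: training data, test data
- Learning outcome: tree, production rules

Decision tree: a tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes (leaf nodes) reflect decision outcomes.
Figure label: root node

Production Rules
- IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
- IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
- IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
Antecedent conditions: the preconditions (the IF part of a rule)
Consequent conditions: the conclusion (the THEN part of a rule)

1.2.2 Unsupervised Clustering
A data mining method that builds models from data without predefined classes.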
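The three production rules for the diagnosis example can be written directly as code, which also shows how a decision tree with Swollen Glands at the root and Fever below it would be evaluated. This is only a minimal sketch: the function name `diagnose` and the boolean encoding of Yes/No are assumptions made for illustration, not part of the slides.

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Apply the three production rules from the slide (hypothetical encoding)."""
    # Rule 1: IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
    if swollen_glands:
        return "Strep Throat"
    # Rule 2: IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
    if fever:
        return "Cold"
    # Rule 3: IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
    return "Allergy"


if __name__ == "__main__":
    # Enumerate every combination of the two input attributes.
    for glands in (True, False):
        for fever in (True, False):
            print(glands, fever, diagnose(glands, fever))
```

Each `if` plays the role of a non-terminal node (a test on one attribute) and each `return` plays the role of a leaf node, so the rules and the tree are two views of the same model.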
The Acme Investors Dataset

The Acme Investors Dataset & Supervised Learning
- Can I develop a general profile of an online investor?
- Can I determine if a new customer is likely to open a margin account?
- Can I build a model to accurately predict the average number of trades per month for a new investor?
- What characteristics differentiate female and male investors?

The Acme Investors Dataset & Unsupervised Clustering
- What attribute similarities group customers of Acme Investors together?
- What differences in attribute values segment the customer database?
Example rules describing the clusters (see example clusters on p13):
- IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59k THEN Cluster = 1 (accuracy = 0.80, coverage = 0.50)
- IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90k THEN Cluster = 2 (accuracy = 0.95, coverage = 0.35)
- IF Account Type = Joint & Trades/Month 5 & Transaction Method = Online THEN Cluster = 3 (accuracy = 0.82, coverage = 0.65)

Content
- 1.1 What is Data Mining? Definition
- 1.2 What Can Computers Learn?
- 1.3 Is Data Mining Appropriate for My Problem? (Data Mining vs Data Query)
- 1.4 Expert Systems or Data Mining?
- 1.6 Why Not Simple Search?

Data Mining or Data Query?
- Shallow knowledge: shallow knowledge is factual. It can be easily stored and manipulated in a database.
- Multidimensional knowledge: multidimensional knowledge is also factual. On-line analytical processing (OLAP) tools are used to manipulate multidimensional knowledge.
- Hidden knowledge: hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease (example p15).
- Deep knowledge: deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query: An Example (p16)
- Use data query if you already almost know what you are looking for.
- Use data mining to find regularities in data that are not obvious.

Content
- 1.1 What is Data Mining? Definition
- 1.2 What Can Computers Learn?
- 1.3 Is Data Mining Appropriate for My Problem? (Data Mining vs Data Query)
- 1.4 Expert Systems or Data Mining? (Data Mining vs ES)
- 1.6 Why Not Simple Search?

1.4 Expert Systems or Data Mining?
- Expert system (ES): a computer program that emulates the problem-solving skills of one or more human experts.
- Used when no (quality) data is available, or in fields where humans have good knowledge.
- Experts learn their skills by education and experience.
- Human experts often use rules to describe what they know.
- ES and DM can work together.

Content
- 1.1 What is Data Mining? Definition
- 1.2 What Can Computers Learn?
- 1.3 Is Data Mining Appropriate for My Problem? (Data Mining vs Data Query)
- 1.4 Expert Systems or Data Mining? (Data Mining vs ES)
- 1.6 Why Not Simple Search? (Data Mining vs Nearest Neighbor Approach)

1.6 Why Not Simple Search?
- Stores instances or a generalized model of the data.
- Nearest neighbor classifier: classification is performed by searching the training data for the instance closest in distance to the unknown instance.
- Advantage: suitable for areas where humans have limited knowledge
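The nearest neighbor idea described in 1.6 can be sketched in a few lines of Python. The sketch below assumes numeric attributes and Euclidean distance, neither of which the slides specify; the function name `nearest_neighbor`, the labels, and the tiny training set are made-up toy values used only for illustration.

```python
import math


def nearest_neighbor(training, unknown):
    """Return the class label of the stored instance closest to `unknown`.

    `training` is a list of (features, label) pairs; features are numeric tuples.
    Euclidean distance is an assumption; the slides do not name a distance measure.
    """
    best_label, best_dist = None, math.inf
    for features, label in training:
        dist = math.dist(features, unknown)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


if __name__ == "__main__":
    # Hypothetical toy instances: two numeric attributes and a class label.
    training = [
        ((1.0, 1.0), "A"),
        ((5.0, 5.0), "B"),
        ((9.0, 1.0), "C"),
    ]
    print(nearest_neighbor(training, (2.0, 1.5)))
```

No generalized model is built here: the training instances themselves are stored and searched at classification time, which is the contrast with the tree and rule models of section 1.2.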