




Part I: Data Mining Fundamentals
Chapter 1: Data Mining: A First View
2020/5/5, BUPT AI&DM

Content
- 1.1 What Is Data Mining? Definition
- 1.2 What Can Computers Learn?
  - Four Levels of Learning (omitted)
  - Three Concept Views (omitted)
  - Supervised Learning
  - Unsupervised Learning
- 1.3 Is Data Mining Appropriate for My Problem?
- 1.4 Expert Systems or Data Mining?
- 1.6 Why Not Simple Search?

1.1 What Is Data Mining: Motivation
- Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories. Such amounts of data are beyond human understanding.
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing: for data storage
  - Data mining: for extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

1.1 Data Mining Is a Result of the Natural Evolution of Information Technology
- 1960s: data collection and database creation
- 1970s to early 1980s: database management systems
- Mid-1980s to present: data warehouses; data analysis and understanding (data mining)

Data Analysis: New Trend
This is a time when one must speak with data.
- The future belongs to the number crunchers (Super Crunchers, Ian Ayres, 2009): everyday decisions will become increasingly automated, and the role of human judgment will be limited to supplying data for the computation.
- Predicting the taste and aroma of wine: Orley Ashenfelter, an economist at Princeton University, knows nothing about winemaking, yet he can predict the price of Bordeaux wines from the weather (hot, dry years produce very good wine) more accurately than wine experts.
- The book was originally going to be called "The End of Theory". The title was later changed using Google rather than through discussion with the publisher's editors, because the new title drew a 63% higher click-through rate.
- Loan officers were once well paid and carried the greatest responsibility. Now they are merely call-center operators who repeat the questions prompted by a computer, for low pay.
- Gene sequencing and new species: Craig Venter used high-speed computers capable of analyzing data. Starting from sequencing the genes of individual organisms, he began sequencing the ocean in 2003 and the air in 2005. In the process, thousands of previously unknown bacteria and other life forms were discovered. He advanced biology more than any of his contemporaries.

Example: warranty analysis at Shanghai GM
In the past, Shanghai GM's analysis of warranty problems relied mainly on simple, purely manual calculation and produced only a handful of problem reports each time. Although vehicle production was far lower than it is today, this slow and laborious analysis cycle was a root cause of persistently high warranty costs. Without automation, it took an average of 6 to 12 months from the appearance of a warranty claim to the identification of its cause, the process often required support from GM's global organization, and problem solving rested largely on experience-based analysis. In addition, inaccurate data made it hard for Shanghai GM to forecast warranty costs and to budget sensibly for the next warranty cycle, which tied up large amounts of operating capital and reduced cash flow.
After adopting the SAS warranty analysis solution, Shanghai GM shortened its warranty analysis cycle by 70% within the first six months, effectively lowered warranty costs, and met the goals set for the system. These improvements also let Shanghai GM recoup all hardware and software investment in the warranty analysis system within half a year, saving the company a total of RMB 18 million.

Example: police geographic information system

Data Mining Definitions
(1) The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data. (in this textbook)
(2) Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases. (generally accepted)

Induction-based Learning
- Data mining methods use induction-based learning.
- Induction-based learning is the process of forming general concept definitions by observing specific examples of the concepts to be learned.

What Is Data Mining?
Alternative names:
- Data mining or knowledge mining? (Gold mining is a poor analogy.)
- Knowledge discovery in databases (KDD), business intelligence

Why Data Mining? Potential Applications (or see p. 4)
- Database analysis and decision support
  - Market analysis and management: target marketing, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, quality control
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis

1.2 What Can Computers Learn?

1.2.1 Supervised Learning
- Build a learner model using data instances of known origin.
- Use the model to determine the outcome of new instances of unknown origin.
- Attributes: input attributes, output attributes
- Process: training data, test data
- Learning outcome: tree, production rules

Decision tree: a tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes (leaf nodes) reflect decision outcomes. The topmost node is the root node.

Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
- Antecedent conditions: the preconditions (the IF part)
- Consequent conditions: the conclusions (the THEN part)

1.2.2 Unsupervised Clustering
A data mining method that builds models from data without predefined classes.
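The three production rules from Section 1.2.1 can be read as a very small rule-based classifier. The sketch below is one hypothetical Python rendering: the function name `diagnose` and the boolean parameters are illustrative assumptions, not something the slides define. The nested tests mirror a decision tree whose root node tests Swollen Glands, which is also an assumption, since the tree figure itself is not reproduced here.

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Apply the three production rules from the slides, in order."""
    # IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
    if swollen_glands:
        return "Strep Throat"
    # IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
    if fever:
        return "Cold"
    # IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
    return "Allergy"


# One call per rule: the antecedent values go in, the consequent comes out.
for glands, has_fever in [(True, False), (False, True), (False, False)]:
    print(glands, has_fever, "->", diagnose(glands, has_fever))
```

In supervised learning such rules would be induced from training instances rather than written by hand. The hand-coded version only shows what the learned output looks like once it is applied to a new instance.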
The Acme Investors Dataset

The Acme Investors Dataset & Supervised Learning
- Can I develop a general profile of an online investor?
- Can I determine if a new customer is likely to open a margin account?
- Can I build a model to accurately predict the average number of trades per month for a new investor?
- What characteristics differentiate female and male investors?

The Acme Investors Dataset & Unsupervised Clustering
- What attribute similarities group customers of Acme Investors together?
- What differences in attribute values segment the customer database?

IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59K
THEN Cluster = 1
accuracy = 0.80, coverage = 0.50

IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90K
THEN Cluster = 2
accuracy = 0.95, coverage = 0.35

IF Account Type = Joint & Trades/Month > 5 & Transaction Method = Online
THEN Cluster = 3
accuracy = 0.82, coverage = 0.65

(See example clusters on p. 13.)

1.3 Is Data Mining Appropriate for My Problem? (Data Mining vs. Data Query)

Data Mining or Data Query?
- Shallow knowledge: shallow knowledge is factual. It can be easily stored and manipulated in a database.
- Multidimensional knowledge: multidimensional knowledge is also factual. On-line analytical processing (OLAP) tools are used to manipulate multidimensional knowledge.
- Hidden knowledge: hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease (example on p. 15).
- Deep knowledge: deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query: An Example (p. 16)
- Use data query if you already almost know what you are looking for.
- Use data mining to find regularities in data that are not obvious.

1.4 Expert Systems or Data Mining? (Data Mining vs. ES)
- Expert system (ES): a computer program that emulates the problem-solving skills of one or more human experts.
- Used when no (quality) data is available, or in fields where humans already have good knowledge.
- Experts learn their skills through education and experience.
- Human experts often use rules to describe what they know.
- ES and DM can work together.

1.6 Why Not Simple Search? (Data Mining vs. Nearest Neighbor Approach)
- Stores instances or a generalized model of the data.
- Nearest neighbor classifier: classification is performed by searching the training data for the instance closest in distance to the unknown instance.
- Advantage: suitable for areas where humans have limited knowledge
P
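The nearest neighbor search described in Section 1.6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the slides do not name a distance measure, so Euclidean distance is assumed here, and the function name `nearest_neighbor`, the two numeric attributes, and the labels "A" and "B" are all made up for the example.

```python
import math


def nearest_neighbor(training, unknown):
    """Return the label of the training instance closest to `unknown`.

    `training` is a list of (feature_vector, label) pairs. Euclidean
    distance is an assumption; the slides only say "closest in distance".
    """
    _, closest_label = min(
        training, key=lambda pair: math.dist(pair[0], unknown)
    )
    return closest_label


# Made-up training instances with two numeric attributes each.
training_data = [
    ((25.0, 2.0), "A"),
    ((30.0, 3.0), "A"),
    ((55.0, 12.0), "B"),
    ((60.0, 10.0), "B"),
]

print(nearest_neighbor(training_data, (28.0, 4.0)))    # closest to (30, 3)
print(nearest_neighbor(training_data, (58.0, 11.0)))   # closest to (60, 10)
```

Because no generalized model is built, every classification scans the whole stored training set. A data mining model, by contrast, summarizes the data once and is then applied to new instances.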