




已阅读5页,还剩33页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Chapter20:DataAnalysis,Chapter20:DataAnalysis,DecisionSupportSystemsDataWarehousingDataMiningClassificationAssociationRulesClustering,DecisionSupportSystems,Decision-supportsystemsareusedtomakebusinessdecisions,oftenbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusinessdecisions:Whatitemstostock?Whatinsurancepremiumtochange?Towhomtosendadvertisements?ExamplesofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,gender,etc.),Decision-SupportSystems:Overview,DataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExampletasksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsabove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:S+)canbeinterfacedwithdatabasesStatisticalanalysisisalargefield,butnotcoveredhereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatisticalrulesandpatternsfromlargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,andstoresitunderaunifiedschema,atasinglesite.Importantforlargebusinessesthatgeneratedatafrommultipledivisions,possiblyatmultiplesitesDatamayalsobepurchasedexternally,DataWarehousing,Datasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,includinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderaunifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransactionprocessingsystems,DataWarehousing,DesignIssues,WhenandhowtogatherdataSourcedrivenarchitecture:datasourcestransmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.,atnight)Destinationdrivenarchitecture:warehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.,usingtwo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.WhatschematouseSchemaintegration,MoreWarehouseDesignIssues,DatacleansingE.g.,correctmistakesinaddresses(misspellings,zipcodeerrors)MergeaddresslistsfromdifferentsourcesandpurgeduplicatesHowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasourcesWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQueriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevalues,WarehouseSchemas,DimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastarschemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttables,DataWarehouseSchema,DataMining,Dataminingistheprocessofsemi-automaticallyanalyzinglargedatabasestofindusefulpatternsPredictionbasedonpasthistoryPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,.)andpasthistoryPredictifapatternofphonecallingcardusageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenanewitemwhoseclassisunknown,predicttowhichclassitbelongsRegressionformulaeGivenasetofmappingsforanunknownfunction,predictthefunctionresultforanewparametervalue,DataMining(Cont.),DescriptivePatternsAssociationsFindbooksthatareoftenboughtby“similar”customers.Ifanewsuchcustomerbuysonesuchbook,suggesttheotherstoo.AssociationsmaybeusedasafirststepindetectingcausationE.g.,associationbetweenexposuretochemicalXandcancer,ClustersE.g.,typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemics,ClassificationRules,Classificationruleshelpassignnewobjectstoclasses.E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?Classificationrulesforaboveexamplecoulduseavarietyofdata,suchaseducationallevel,salary,age,etc.personP,P.degree=mastersandP.income75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassificationrulescanbeshowncompactlyasadecisiontree.,DecisionTree,ConstructionofDecisionTrees,Trainingset:adatasampleinwhichtheclassificationisalreadyknown.Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,andapartitioningconditionforthenodeLeafnode:all(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenconsidered,andnofurtherpartitioningispossible.,BestSplits,PickbestattributesandconditionsonwhichtopartitionThepurityofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesinclassi=pi.TheGinimeasureofpurityisdefinedasGini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0Itreachesitsmaximum(of11/k)ifeachclassthesamenumberofinstances.,BestSplits(Cont.),Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=WhenasetSissplitintomultiplesetsSi,I=1,2,r,wecanmeasurethepurityoftheresultantsetofsetsas:purity(S1,S2,.,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,.,rInformation-gain(S,S1,S2,.,Sr)=purity(S)purity(S1,S2,Sr),BestSplits(Cont.),Measureof“cost”ofasplit:Information-content(S,S1,S2,.,Sr)=Information-gainratio=Information-gain(S,S1,S2,Sr)Information-content(S,S1,S2,.,Sr)Thebestsplitistheonethatgivesthemaximuminformationgainratio,FindingBestSplits,Categoricalattributes(withnomeaningfulorder):Multi-waysplit,onechildforeachvalueBinarysplit:tryallpossiblebreakupofvaluesintotwosets,andpickthebestContinuous-valuedattributes(canbesortedinameaningfulorder)Binarysplit:Sortvalues,tryeachasasplitpointE.g.,ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:Aseriesofbinarysplitsonthesameattributehasroughlyequivalenteffect,Decision-TreeConstructionAlgorithm,ProcedureGrowTree(S)Partition(S);ProcedurePartition(S)if(purity(S)por|S|s)thenreturn;foreachattributeAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,.,Sr,fori=1,2,.,rPartition(Si);,OtherTypesofClassifiers,NeuralnetclassifiersarestudiedinartificialintelligenceandarenotcoveredhereBayesianclassifiersuseBayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgivenclasscj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuring,NaveBayesianClassifiers,Bayesianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naveBayesianclassifiersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*.*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandstore,Regression,Regressiondealswiththepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,Xn,wewishtopredictthevalueofavariableY.Onewayistoinfercoefficientsa0,a1,a1,ansuchthatY=a0+a1*X1+a2*X2+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherelationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.,AssociationRules,Retailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.SomeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuythebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.,whenacustomerbuysaparticularbook,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassociationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.,eachtransaction(sale)atashopisaninstance,andthesetofalltransactionsisthepopulation,AssociationRules(Cont.),Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfractionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.,supposeonly0.001percentofallpurchasesincludemilkandscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.Confidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.,therulebreadmilkhasaconfidenceof80percentif80percentofthepurchasesthatincludebreadalsoincludemilk.,FindingAssociationRules,Wearegenerallyonlyinterestedinassociationruleswithreasonablyhighsupport(e.g.,supportof2%orgreater)NavealgorithmConsiderallpossiblesetsofrelevantitems.Foreachsetfinditssupport(i.e.,counthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupportUselargeitemsetstogenerateassociationrules.FromitemsetAgeneratetheruleA-bbforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/support(A-b),FindingSupport,DeterminesupportofitemsetsviaasinglepassonsetoftransactionsLargeitemsets:setswithahighcountattheendofthepassIfmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsineachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.Theaprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everysetofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidates,OtherTypesofAssociations,BasicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilityaremoreinterestingE.g.,ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchasebothWeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigherthanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.,wheneverbondsgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.,deviationfromasteadygrowthE.g.,salesofwinterweargodowninsummerNotsurprising,partofaknownpattern.Lookfordeviationfromvaluepredictedusingpastpatterns,Clustering,Clustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdistancemetricsinseveralwaysGrouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpointsinaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargedatasetsE.g.,theBirchclusteringalgorithm(moreshortly),HierarchicalClustering,Examplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredictionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.,Yahoo,moreonthislater)AgglomerativeclusteringalgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,repeatedlyrefine(break)clustersintosmallerones,ClusteringAlgorithms,ClusteringalgorithmshavebeendesignedtohandleverylargedatasetsE.g.,theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneatatimeintotheR-tree,m
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025-2030中国工业自动化技术应用前景与未来市场预测报告
- 酒店餐饮管理温馨提示
- 烧伤科烧伤科痂皮护理规范
- 2025年中学教师资格考试《综合素质》教育教学反思与题库及真题模拟解析(含答案)试卷
- 2025年中学教师资格考试《综合素质》教育教学能力提升模拟冲刺试题型(含答案)
- 2025年健身教练职业技能考核试卷:健身器材使用与维护试题型
- 基于GIS技术剖析长沙市肾综合征出血热的时空分布与风险因子
- 2025年征信考试模拟试题:征信法规与征信报告
- 上海医用装备有限公司2024-2025年综合报告
- 基于FPGA的三维扫描测头反光标志点中心定位及匹配算法的深度剖析与实践
- 《人与动物的关系》课件
- 注射相关感染预防与控制
- 二年级语文《坐井观天》说课课件
- DL∕T 741-2019 架空输电线路运行规程
- 数学同步练习册基础模块(上册)参考答案
- DZ∕T 0173-2022 大地电磁测深法技术规程(正式版)
- 锅炉专工试题
- 医疗质量与安全教育培训
- 2024年江苏省生态环境监测专业技术人员大比武竞赛备考试题库(含答案)
- 《手足口病》课件
- 山东省地震安全性评价收费项目及标准
评论
0/150
提交评论