数据仓库与数据挖掘(2)_第1页
数据仓库与数据挖掘(2)_第2页
数据仓库与数据挖掘(2)_第3页
数据仓库与数据挖掘(2)_第4页
数据仓库与数据挖掘(2)_第5页
已阅读5页,还剩33页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

Chapter20:DataAnalysis,Chapter20:DataAnalysis,DecisionSupportSystemsDataWarehousingDataMiningClassificationAssociationRulesClustering,DecisionSupportSystems,Decision-supportsystemsareusedtomakebusinessdecisions,oftenbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusinessdecisions:Whatitemstostock?Whatinsurancepremiumtochange?Towhomtosendadvertisements?ExamplesofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,gender,etc.),Decision-SupportSystems:Overview,DataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExampletasksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsabove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:S+)canbeinterfacedwithdatabasesStatisticalanalysisisalargefield,butnotcoveredhereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatisticalrulesandpatternsfromlargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,andstoresitunderaunifiedschema,atasinglesite.Importantforlargebusinessesthatgeneratedatafrommultipledivisions,possiblyatmultiplesitesDatamayalsobepurchasedexternally,DataWarehousing,Datasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,includinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderaunifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransactionprocessingsystems,DataWarehousing,DesignIssues,WhenandhowtogatherdataSourcedrivenarchitecture:datasourcestransmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.,atnight)Destinationdrivenarchitecture:warehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.,usingtwo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.WhatschematouseSchemaintegration,MoreWarehouseDesignIssues,DatacleansingE.g.,correctmistakesinaddresses(misspellings,zipcodeerrors)MergeaddresslistsfromdifferentsourcesandpurgeduplicatesHowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasourcesWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQueriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevalues,WarehouseSchemas,DimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastarschemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttables,DataWarehouseSchema,DataMining,Dataminingistheprocessofsemi-automaticallyanalyzinglargedatabasestofindusefulpatternsPredictionbasedonpasthistoryPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,.)andpasthistoryPredictifapatternofphonecallingcardusageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenanewitemwhoseclassisunknown,predicttowhichclassitbelongsRegressionformulaeGivenasetofmappingsforanunknownfunction,predictthefunctionresultforanewparametervalue,DataMining(Cont.),DescriptivePatternsAssociationsFindbooksthatareoftenboughtby“similar”customers.Ifanewsuchcustomerbuysonesuchbook,suggesttheotherstoo.AssociationsmaybeusedasafirststepindetectingcausationE.g.,associationbetweenexposuretochemicalXandcancer,ClustersE.g.,typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemics,ClassificationRules,Classificationruleshelpassignnewobjectstoclasses.E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?Classificationrulesforaboveexamplecoulduseavarietyofdata,suchaseducationallevel,salary,age,etc.personP,P.degree=mastersandP.income75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassificationrulescanbeshowncompactlyasadecisiontree.,DecisionTree,ConstructionofDecisionTrees,Trainingset:adatasampleinwhichtheclassificationisalreadyknown.Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,andapartitioningconditionforthenodeLeafnode:all(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenconsidered,andnofurtherpartitioningispossible.,BestSplits,PickbestattributesandconditionsonwhichtopartitionThepurityofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesinclassi=pi.TheGinimeasureofpurityisdefinedasGini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0Itreachesitsmaximum(of11/k)ifeachclassthesamenumberofinstances.,BestSplits(Cont.),Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=WhenasetSissplitintomultiplesetsSi,I=1,2,r,wecanmeasurethepurityoftheresultantsetofsetsas:purity(S1,S2,.,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,.,rInformation-gain(S,S1,S2,.,Sr)=purity(S)purity(S1,S2,Sr),BestSplits(Cont.),Measureof“cost”ofasplit:Information-content(S,S1,S2,.,Sr)=Information-gainratio=Information-gain(S,S1,S2,Sr)Information-content(S,S1,S2,.,Sr)Thebestsplitistheonethatgivesthemaximuminformationgainratio,FindingBestSplits,Categoricalattributes(withnomeaningfulorder):Multi-waysplit,onechildforeachvalueBinarysplit:tryallpossiblebreakupofvaluesintotwosets,andpickthebestContinuous-valuedattributes(canbesortedinameaningfulorder)Binarysplit:Sortvalues,tryeachasasplitpointE.g.,ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:Aseriesofbinarysplitsonthesameattributehasroughlyequivalenteffect,Decision-TreeConstructionAlgorithm,ProcedureGrowTree(S)Partition(S);ProcedurePartition(S)if(purity(S)por|S|s)thenreturn;foreachattributeAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,.,Sr,fori=1,2,.,rPartition(Si);,OtherTypesofClassifiers,NeuralnetclassifiersarestudiedinartificialintelligenceandarenotcoveredhereBayesianclassifiersuseBayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgivenclasscj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuring,NaveBayesianClassifiers,Bayesianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naveBayesianclassifiersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*.*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandstore,Regression,Regressiondealswiththepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,Xn,wewishtopredictthevalueofavariableY.Onewayistoinfercoefficientsa0,a1,a1,ansuchthatY=a0+a1*X1+a2*X2+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherelationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.,AssociationRules,Retailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.SomeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuythebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.,whenacustomerbuysaparticularbook,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassociationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.,eachtransaction(sale)atashopisaninstance,andthesetofalltransactionsisthepopulation,AssociationRules(Cont.),Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfractionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.,supposeonly0.001percentofallpurchasesincludemilkandscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.Confidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.,therulebreadmilkhasaconfidenceof80percentif80percentofthepurchasesthatincludebreadalsoincludemilk.,FindingAssociationRules,Wearegenerallyonlyinterestedinassociationruleswithreasonablyhighsupport(e.g.,supportof2%orgreater)NavealgorithmConsiderallpossiblesetsofrelevantitems.Foreachsetfinditssupport(i.e.,counthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupportUselargeitemsetstogenerateassociationrules.FromitemsetAgeneratetheruleA-bbforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/support(A-b),FindingSupport,DeterminesupportofitemsetsviaasinglepassonsetoftransactionsLargeitemsets:setswithahighcountattheendofthepassIfmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsineachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.Theaprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everysetofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidates,OtherTypesofAssociations,BasicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilityaremoreinterestingE.g.,ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchasebothWeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigherthanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.,wheneverbondsgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.,deviationfromasteadygrowthE.g.,salesofwinterweargodowninsummerNotsurprising,partofaknownpattern.Lookfordeviationfromvaluepredictedusingpastpatterns,Clustering,Clustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdistancemetricsinseveralwaysGrouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpointsinaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargedatasetsE.g.,theBirchclusteringalgorithm(moreshortly),HierarchicalClustering,Examplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredictionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.,Yahoo,moreonthislater)AgglomerativeclusteringalgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,repeatedlyrefine(break)clustersintosmallerones,ClusteringAlgorithms,ClusteringalgorithmshavebeendesignedtohandleverylargedatasetsE.g.,theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneatatimeintotheR-tree,m

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论