Lecture 4: Data Classification and Prediction
Xu Congfu (徐从富), Associate Professor, Institute of Artificial Intelligence, Zhejiang University
Courseware for the undergraduate course "Introduction to Data Mining", Zhejiang University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

Classification vs. Prediction
- Classification predicts categorical class labels (discrete or nominal): it constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
- Prediction models continuous-valued functions, i.e., predicts unknown or missing values.
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  - The test set is independent of the training set, otherwise over-fitting will occur.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction
- Training data is fed to a classification algorithm, which outputs a classifier (model), e.g. the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

Classification Process (2): Use the Model in Prediction
- The classifier is applied to testing data, then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?

Supervised vs. Unsupervised Learning
- Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set.
- Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
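The two-step process above can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the tenure rule plays the role of a model learned in step 1, and the tiny test set (all names and values are hypothetical) is used in step 2 to estimate accuracy.

```python
# Step 1 produced a model, e.g. the rule from the slides:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def learned_rule(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

# Step 2: compare the known labels of an independent test set with the
# model's predictions; accuracy = fraction classified correctly.
test_set = [
    ("professor", 2, "yes"),
    ("assistant prof", 7, "yes"),
    ("instructor", 3, "no"),
    ("assistant prof", 4, "no"),
]
correct = sum(learned_rule(r, y) == label for r, y, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0 on this toy test set
```

If the accuracy measured this way is acceptable, the same function is then applied to tuples whose labels are unknown.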
Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess the data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove the irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize the data.

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy.
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time).
- Robustness: handling noise and missing values.
- Scalability: efficiency in disk-resident databases.
- Interpretability: understanding and insight provided by the model.
- Other measures: e.g. goodness of rules, such as decision tree size or compactness of classification rules.

Decision Tree Induction: Training Dataset
- This follows an example of Quinlan's ID3 (the "playing tennis" style of data).
- Output: a decision tree for "buys_computer".

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g. information gain).
- Conditions for stopping partitioning:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Suppose S contains s_i tuples of class C_i for i = 1, ..., m. The information (entropy) required to classify any arbitrary tuple is
  I(s_1, ..., s_m) = -Σ_i (s_i / s) log2(s_i / s).
- The entropy of attribute A with values {a_1, a_2, ..., a_v} is
  E(A) = Σ_j ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj).
- The information gained by branching on attribute A is
  Gain(A) = I(s_1, ..., s_m) - E(A).

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"; class N: buys_computer = "no".
- I(p, n) = I(9, 5) = 0.940.
- Compute the entropy for age: "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's, so it contributes (5/14) · I(2, 3) to E(age). The other age values, and the other attributes, are computed similarly, and Gain(age) = I(9, 5) - E(age).
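The entropy and gain computation above can be reproduced directly. A small sketch, using the class counts of the standard buys_computer example (9 "yes", 5 "no"; the per-age-group counts are taken from that example):

```python
import math

def info(counts):
    """Expected information I(s1,...,sm) = -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Whole training set: 9 "yes" and 5 "no" tuples.
print(round(info([9, 5]), 3))  # 0.94

# (yes, no) counts within each value of the attribute age:
age_partitions = [(2, 3),   # age <= 30
                  (4, 0),   # 31 <= age <= 40
                  (3, 2)]   # age > 40
total = 14
e_age = sum((y + n) / total * info([y, n]) for y, n in age_partitions)
gain_age = info([9, 5]) - e_age
print(round(gain_age, 3))  # 0.247
```

The attribute with the largest such gain (here, age) is chosen as the test attribute at the current node.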
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A: sort the values of A in increasing order; typically, the midpoint between each pair of adjacent values is considered as a possible split point, i.e. (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split_point, and D2 is the set of tuples in D satisfying A > split_point.

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf; each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
- Rules are easier for humans to understand. Examples:
  IF age > 40 AND credit_rating = 'excellent' THEN buys_computer = 'yes'
  IF age <= 30 AND credit_rating = 'fair' THEN buys_computer = 'no'

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data. Too many branches, some of which may reflect anomalies due to noise or outliers, give poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree".

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation.
- Use all the data for training, but apply a statistical test (e.g. chi-square) to estimate whether expanding or pruning a node may improve the entire distribution.

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.
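The split-point search described under "Computing Information Gain for Continuous-Valued Attributes" can be sketched as follows; the attribute values and labels are hypothetical:

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def best_split_point(values, labels):
    """Pick the midpoint (a_i + a_{i+1}) / 2 that minimises the expected
    information requirement of the binary split A <= split_point."""
    data = sorted(zip(values, labels))
    n = len(data)
    classes = sorted(set(labels))
    best = (float("inf"), None)
    for i in range(n - 1):
        if data[i][0] == data[i + 1][0]:
            continue  # no midpoint between equal adjacent values
        mid = (data[i][0] + data[i + 1][0]) / 2
        d1 = [lab for v, lab in data if v <= mid]   # tuples with A <= mid
        d2 = [lab for v, lab in data if v > mid]    # tuples with A > mid
        e = sum(len(d) / n * info([d.count(c) for c in classes])
                for d in (d1, d2))
        best = min(best, (e, mid))
    return best[1]

print(best_split_point([21, 25, 30, 38, 45, 50],
                       ["n", "n", "n", "y", "y", "y"]))  # 34.0
```

Here the midpoint 34.0 separates the classes perfectly, so its expected information requirement is zero and it wins the search.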
Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  - Relatively faster learning speed than other classification methods.
  - Convertible to simple and easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Presentation of Classification Results
- Visualization of a decision tree in SGI MineSet 3.0; interactive visual mining by Perception-Based Classification (PBC).

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to class C.
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): the prior probability of hypothesis H, i.e. the initial probability before we observe any data; it reflects the background knowledge.
- P(X): the probability that the sample data is observed.
- P(X|H): the probability of observing the sample X, given that the hypothesis holds.
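The four probabilities just defined combine via Bayes' theorem. A toy numeric illustration (all numbers are hypothetical, chosen only to show the mechanics):

```python
# P(H|X) = P(X|H) * P(H) / P(X): posterior = likelihood * prior / evidence.
p_h = 0.1              # prior P(H), e.g. fraction of customers in class C
p_x_given_h = 0.8      # likelihood P(X|H): chance of observing X in class C
p_x_given_not_h = 0.2  # P(X|not H)

# Evidence via the law of total probability:
# P(X) = P(X|H) P(H) + P(X|not H) P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))  # 0.308
```

Note how a strong likelihood (0.8) is tempered by a small prior (0.1): the posterior rises to about 0.31, not 0.8, which is the "prior knowledge combined with observed data" point made above.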
Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:
  P(H|X) = P(X|H) P(H) / P(X).
- Informally, this can be written as: posterior = likelihood × prior / evidence.
- MAP (maximum a posteriori) hypothesis: choose the hypothesis h that maximizes P(h|X), equivalently P(X|h) P(h).
- Practical difficulty: it requires initial knowledge of many probabilities, at significant computational cost.

Naive Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class. The probability of the joint occurrence of, say, two elements y1 and y2, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class:
  P(y1, y2 | C) = P(y1 | C) P(y2 | C).
- No dependence relation between attributes; this greatly reduces the computation cost (only count the class distribution).
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) P(Ci).

Training Dataset
- Classes: C1 = (buys_computer = "yes"), C2 = (buys_computer = "no").
- Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair).

Naive Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  P(age <= 30 | buys_computer = "yes") = 2/9 = 0.222
  P(age <= 30 | buys_computer = "no") = 3/5 = 0.6
  P(income = medium | buys_computer = "yes") = 4/9 = 0.444
  P(income = medium | buys_computer = "no") = 2/5 = 0.4
  P(student = yes | buys_computer = "yes") = 6/9 = 0.667
  P(student = yes | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = fair | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = "no") = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X | buys_computer = "yes") P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") P(buys_computer = "no") = 0.007
- Therefore, X belongs to class buys_computer = "yes".

Naive Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most of the cases.
- Disadvantages: the class-conditional independence assumption causes loss of accuracy, because in practice dependencies exist among variables. E.g. hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the naive Bayesian classifier.
- How to deal with these dependencies? Bayesian belief networks.
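The worked example above can be written out directly: multiply the per-attribute conditional probabilities by the class prior (9/14 for "yes", 5/14 for "no") and pick the class with the larger product. A minimal sketch using exactly the counts from the example:

```python
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {  # P(attribute value | class), counted from the training set
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}
x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

# Naive Bayes: P(X|Ci) P(Ci) = prior * product of conditionals.
scores = {}
for c in priors:
    p = priors[c]
    for attr in x:
        p *= cond[c][attr]
    scores[c] = p

print(round(scores["yes"], 3), round(scores["no"], 3))  # 0.028 0.007
print(max(scores, key=scores.get))  # yes
```

In practice one usually sums logarithms instead of multiplying raw probabilities, to avoid underflow when there are many attributes.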
Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships: it represents dependency among the variables and gives a specification of the joint probability distribution.
  - Nodes: random variables. Links: dependency.
  - Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.
  - The graph has no loops or cycles.

Bayesian Belief Network: An Example
- Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea.
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

         (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC       0.8      0.5       0.7       0.1
  ~LC      0.2      0.5       0.3       0.9

Learning Bayesian Networks
- Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs.
  - Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning.
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
  - Unknown structure, all hidden variables: no good algorithms are known for this purpose.
- Reference: D. Heckerman, "Bayesian networks for data mining".

What Is Prediction?
- (Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input.
- Prediction is different from classification: classification refers to predicting a categorical class label, while prediction models continuous-valued functions.
- The major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable.
- Regression analysis: linear and multiple regression; non-linear regression; other regression methods (generalized linear model, Poisson regression, log-linear models, regression trees).

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x: y = w0 + w1 x, where w0 (y-intercept) and w1 (slope) are the regression coefficients.
- The method of least squares estimates the best-fitting straight line.
- Multiple linear regression involves more than one predictor variable. The training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|). For example, for 2-D data we may have y = w0 + w1 x1 + w2 x2. It is solvable by an extension of the least squares method or by using software such as SAS or S-Plus.
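The least-squares fit for the single-predictor case has a closed form: w1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², w0 = ȳ - w1 x̄. A minimal sketch on hypothetical data:

```python
# Fit y = w0 + w1 * x by the method of least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]  # roughly y = 1 + 2x with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
w0 = mean_y - w1 * mean_x
print(round(w0, 2), round(w1, 2))  # 1.09 1.97
```

The fitted line y = 1.09 + 1.97x can then be used to predict the response for a new x, which is exactly the "use model to predict a continuous value" step above.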
Nonlinear Regression
- Many nonlinear functions can be transformed into the above. Some nonlinear models can be modeled by a polynomial function, and a polynomial regression model can be transformed into a linear regression model. For example, y = w0 + w1 x + w2 x^2 + w3 x^3 is convertible to linear with the new variables x2 = x^2 and x3 = x^3: y = w0 + w1 x + w2 x2 + w3 x3.
- Other functions, such as the power function, can also be transformed to a linear model.
- Some models are intractably nonlinear (e.g. a sum of exponential terms); it is still possible to obtain least squares estimates through extensive calculation on more complex formulae.

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables; the variance of y is a function of the mean value of y, not a constant.
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables.
  - Poisson regression: models data that exhibit a Poisson distribution.
- Log-linear models (for categorical data): approximate discrete multidimensional probability distributions; also useful for data compression and smoothing.
- Regression trees and model trees: trees to predict continuous values rather than class labels.

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984; CART = Classification And Regression Trees). Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf.
- Model tree: proposed by Quinlan (1992). Each leaf holds a regression model, a multivariate linear equation for the predicted attribute; this is a more general case than the regression tree.
- Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model.

Predictive Modeling
- Predict data values or construct generalized line
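The power-function transform mentioned above works as follows: y = a x^b becomes log y = log a + b log x, which is a straight line in (log x, log y) and can be fitted with the same least-squares formulas as simple linear regression. A sketch on synthetic data (a = 3, b = 2 are chosen for the illustration):

```python
import math

xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 2 for x in xs]  # exact power-law data y = 3 * x^2

# Transform to the log-log domain and fit a straight line by least squares.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(xs)
mx, my = sum(lx) / n, sum(ly) / n
b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
     / sum((u - mx) ** 2 for u in lx))
a = math.exp(my - b * mx)  # intercept log a, mapped back with exp
print(round(a, 3), round(b, 3))  # 3.0 2.0
```

On noise-free data the fit recovers a and b exactly; with noisy data the same transform gives least-squares estimates in the log domain, which is the standard way to linearize a power model.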
