




Chapter 6. Classification and Prediction
(Data Mining: Concepts and Techniques, September 15, 2008)

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by backpropagation
- Support vector machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Prediction
- Accuracy and error measures
- Ensemble methods
- Model selection
- Summary

Classification

- Predicts categorical class labels (discrete or nominal).
- Classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it to classify new data.

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples correctly classified by the model.
  - The test set must be independent of the training set, otherwise over-fitting will occur.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Process (1): Model Construction

- (figure: training data is fed to a classification algorithm, which produces the classifier (model), e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes')

Process (2): Using the Model in Prediction

- (figure: the classifier is applied to testing data to estimate accuracy, then to unseen data, e.g. the tuple (Jeff, Professor, 4): tenured?)

Supervised vs. Unsupervised Learning

- Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set.
- Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Decision Tree Induction: Training Dataset

- This follows an example of Quinlan's ID3 (Playing Tennis).

Output: A Decision Tree for "buys_computer"

- (figure: the induced decision tree)

Algorithm for Decision Tree Induction

- Basic algorithm:
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Conditions for stopping partitioning:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf).
  - There are no samples left.

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
- Expected information (entropy) needed to classify a tuple in $D$: $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$
- Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$: $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$
- Information gained by branching on attribute $A$: $Gain(A) = Info(D) - Info_A(D)$

Attribute Selection: Information Gain

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"; hence $Info(D) = I(9,5) = 0.940$.
- $I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's; hence $Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$ and $Gain(age) = Info(D) - Info_{age}(D) = 0.246$.
- Similarly, $Gain(income) = 0.029$, $Gain(student) = 0.151$, and $Gain(credit\_rating) = 0.048$.
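The numbers above can be reproduced with a short Python sketch. The class counts (9 yes / 5 no overall; 2 yes / 3 no for age <= 30) come from the slides; the counts for the other two age partitions (4/0 and 3/2) are from the same well-known AllElectronics training set and should be read as illustrative.

```python
import math

def info(counts):
    """Expected information (entropy): I(c1,...,cm) = -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class distribution in D: 9 "yes" and 5 "no" tuples.
info_D = info([9, 5])                        # ~0.940 bits

# Partitions induced by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2).
partitions = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(p) for p in partitions)          # 14 tuples in total
info_age = sum(sum(p) / n * info(p) for p in partitions)   # ~0.694 bits

gain_age = info_D - info_age                 # ~0.246 bits
print(f"Info(D)={info_D:.3f}  Info_age(D)={info_age:.3f}  Gain(age)={gain_age:.3f}")
```

Running the same computation for income, student, and credit_rating shows why age, with the largest gain, is selected as the first splitting attribute.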
Chapter 6. Classification and Prediction (outline repeated)

Bayesian Classification: Why?

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
- Foundation: based on Bayes' theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics

- Let X be a data sample (the training data) and H a hypothesis that X belongs to class C.
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
- P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, …
- P(X): the probability that the sample data is observed.
- P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income.

Bayes' Theorem

- Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem: $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
- Informally: posteriori = likelihood × prior / evidence.
- Predict that X belongs to $C_i$ iff $P(C_i|X)$ is the highest among all the $P(C_k|X)$ over the k classes.
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.

Towards the Naïve Bayesian Classifier

- Let D be a training set of tuples and their associated class labels, each tuple represented by an n-attribute vector $X = (x_1, x_2, \ldots, x_n)$.
- Suppose there are m classes $C_1, C_2, \ldots, C_m$. Classification derives the maximum posteriori, i.e., the maximal $P(C_i|X)$.
- By Bayes' theorem, $P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$; since P(X) is constant for all classes, only $P(X|C_i)\,P(C_i)$ needs to be maximized.

Derivation of the Naïve Bayes Classifier

- A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes): $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$

Naïve Bayesian Classifier: Training Dataset

- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no".
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).

Naïve Bayesian Classifier: An Example

- P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357.
- Compute P(X|Ci) for each class: P(age <= 30 | buys_computer = "yes") = 2/9 = 0.222, …
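A minimal sketch of the full computation, assuming the naïve (conditional-independence) model above. Only the priors and the first conditional (2/9 = 0.222) survive in the recovered text; the remaining per-attribute conditionals are filled in from the same classic 14-tuple example and should be read as illustrative.

```python
# Priors and conditionals estimated by counting tuples per class.
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

# Naive assumption: P(X|Ci) is the product of the per-attribute conditionals.
scores = {}
for c in priors:
    p_x_given_c = 1.0
    for attr in x:
        p_x_given_c *= cond[c][attr]
    scores[c] = p_x_given_c * priors[c]        # P(X|Ci) * P(Ci)

print(scores)                                  # {'yes': ~0.028, 'no': ~0.007}
print("predicted class:", max(scores, key=scores.get))   # 'yes'
```

Since 0.028 > 0.007, X is assigned to buys_computer = "yes".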
[…]

Defining a Network Topology

- Decide the network topology: the # of units in the input layer, the # of hidden layers (if > 1), the # of units in each hidden layer, and the # of units in the output layer.
- Normalize the input values for each attribute measured in the training tuples to 0.0-1.0.
- One input unit per domain value, each initialized to 0.
- Output: for classification with more than two classes, one output unit per class is used.
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation

- Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
- For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer; hence "backpropagation".
- Steps (a runnable sketch follows the next slide):
  - Initialize weights (to small random numbers) and biases in the network.
  - Propagate the inputs forward (by applying the activation function).
  - Backpropagate the error (by updating weights and biases).
  - Terminating condition (when the error is very small, etc.).

Backpropagation and Interpretability

- Efficiency of backpropagation: each epoch (one iteration through the training set) takes $O(|D| \times w)$ time, with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case.
- Rule extraction from networks: network pruning.
  - Simplify the network structure by removing the weighted links that have the least effect on the trained network.
  - Then perform link, unit, or activation-value clustering.
  - The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden-unit layers.
- Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules.
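Here is a minimal NumPy sketch of the four steps above on a toy network. The task (XOR), the layer sizes, the learning rate, and the random seed are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 2 inputs already normalized to 0.0-1.0; targets follow XOR.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])

# Step 1: initialize weights and biases to small random numbers.
W1, b1 = rng.normal(0.0, 0.5, (2, 3)), np.zeros(3)   # 3 hidden units
W2, b2 = rng.normal(0.0, 0.5, (3, 1)), np.zeros(1)   # 1 output unit
lr = 0.5

for epoch in range(20000):            # one epoch = one pass through the training set
    # Step 2: propagate the inputs forward through the activation function.
    h = sigmoid(X @ W1 + b1)          # hidden-layer outputs
    y = sigmoid(h @ W2 + b2)          # network prediction

    # Step 3: backpropagate the error from the output layer to the hidden
    # layer, updating weights and biases to reduce the mean squared error.
    d_out = (y - t) * y * (1.0 - y)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

    # Step 4: terminating condition, here "the error is very small".
    if np.mean((y - t) ** 2) < 1e-3:
        break

print(np.round(y, 2))   # predictions; the MSE typically shrinks toward the targets
```

Each weight update moves against the gradient of the squared error, which is exactly the "backwards" modification the slide describes.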
Chapter 6. Classification and Prediction (outline repeated)

SVM: Support Vector Machines

- A new classification method for both linear and nonlinear data.
- It uses a nonlinear mapping to transform the original training data into a higher dimension.
- Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary").
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

SVM: History and Applications

- Vapnik and colleagues (1992); groundwork from Vapnik and Chervonenkis' statistical learning theory in the 1960s.
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests.

SVM: General Philosophy

- (figure)

SVM: Margins and Support Vectors

- (figure)

SVM: When Data Is Linearly Separable

- Let the data D be $(X_1, y_1), \ldots, (X_{|D|}, y_{|D|})$, where each $X_i$ is a training tuple with associated class label $y_i$.
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SVM: Linearly Separable

- A separating hyperplane can be written as $W \cdot X + b = 0$, where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and $b$ a scalar (bias).
- For 2-D it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$.
- The hyperplanes defining the sides of the margin: $H_1: w_0 + w_1 x_1 + w_2 x_2 \ge 1$ for $y_i = +1$, and $H_2: w_0 + w_1 x_1 + w_2 x_2 \le -1$ for $y_i = -1$.
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.
- This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by quadratic programming (QP) with Lagrangian multipliers.

Why Is SVM Effective on High-Dimensional Data?

- The complexity of the trained classifier is characterized by the # of support vectors rather than by the dimensionality of the data.
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH).
- If all other training examples were removed and training repeated, the same separating hyperplane would be found.
- The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.

SVM: Linearly Inseparable

- Transform the original input data into a higher-dimensional space.
- Search for a linear separating hyperplane in the new space.

SVM: Kernel Functions

- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function $K(X_i, X_j)$ to the original data, i.e., $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$.
- Typical kernel functions: polynomial of degree h, $K(X_i, X_j) = (X_i \cdot X_j + 1)^h$; Gaussian radial basis function, $K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2}$; sigmoid, $K(X_i, X_j) = \tanh(\kappa X_i \cdot X_j - \delta)$.
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters).
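The kernel identity $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$ can be checked numerically. The sketch below defines two of the typical kernels and verifies the homogeneous degree-2 polynomial kernel against its explicit 2-D feature mapping; the test vectors are arbitrary.

```python
import numpy as np

def poly_kernel(x, z, h=2):
    """Polynomial kernel of degree h."""
    return (np.dot(x, z) + 1) ** h

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian radial basis function kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# For K(x, z) = (x . z)^2 in 2-D, the explicit feature mapping is
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(x, z) ** 2)        # kernel on the original tuples -> 16.0
print(np.dot(phi(x), phi(z)))   # dot product on the transformed tuples -> 16.0
print(poly_kernel(x, z), rbf_kernel(x, z))   # the two typical kernels above
```

Because the kernel is evaluated on the original tuples, the transformed (possibly very high-dimensional) space never has to be materialized; this is what makes the nonlinear mapping computationally practical.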
Scaling SVM by Hierarchical Micro-Clustering

- SVM is not scalable in the number of data objects, in terms of both training time and memory usage.
- "Classifying Large Datasets Using SVMs with Hierarchical Clusters" by Hwanjo Yu, Jiong Yang, and Jiawei Han, KDD'03.
- CB-SVM (Clustering-Based SVM): given a limited amount of system resources (e.g., memory), maximize SVM performance in terms of accuracy and training speed.
  - Use micro-clustering to effectively reduce the number of points considered.
  - When deriving support vectors, de-cluster the micro-clusters near a "candidate vector" to ensure high classification accuracy.

CB-SVM: Clustering-Based SVM

- Training data sets may not even fit in memory; read the data set once (minimizing disk access).
- Construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory; the statistical summary maximizes the benefit of learning the SVM.
- The summary plays a role in indexing SVMs.
- Essence of micro-clustering (a hierarchical indexing structure):
  - Use the micro-cluster hierarchical indexing structure to provide finer samples closer to the boundary and coarser samples farther from the boundary.
  - Selective de-clustering ensures high accuracy.

CF-Tree: Hierarchical Micro-Cluster

- (figure)

CB-SVM Algorithm: Outline

- Construct two CF-trees from the positive and negative data sets independently (needs one scan of the data set).
- Train an SVM from the centroids of the root entries.
- De-cluster the entries near the boundary into the next level: the children entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries.
- Train an SVM again from the centroids of the entries in the training set.
- Repeat until nothing is accumulated.

Selective Declustering

- The CF-tree is a suitable base structure for selective declustering.
- De-cluster only the cluster $E_i$ such that $D_i - R_i$ …

[…]

Genetic Algorithms (GA)

- Genetic algorithm: based on an analogy to biological evolution.
- An initial population is created, consisting of randomly generated rules; each rule is represented by a string of bits. E.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100; if an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring; the fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation (a sketch closes this section).
- The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold.
- Slow, but easily parallelizable.

Rough Set Approach

- Rough sets are used to approximately, or "roughly", define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computational intensity.

Fuzzy Set Approaches

- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph).
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high}, with a fuzzy value calculated for each.
- For a given new sample, more than one fuzzy value may apply.
- Each applicable rule contributes a vote for membership in the categories; typically, the truth values for each predicted category are summed, and these sums are combined (see the sketch below).
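A minimal sketch of the fuzzy income example, assuming trapezoidal membership functions; the breakpoints and the rule-vote categories are illustrative, not from the slides.

```python
# Trapezoidal membership functions for income (breakpoints are assumptions).
def mu_low(income):
    if income <= 30_000: return 1.0
    if income >= 50_000: return 0.0
    return (50_000 - income) / 20_000

def mu_medium(income):
    if income <= 30_000 or income >= 90_000: return 0.0
    if 50_000 <= income <= 70_000: return 1.0
    return ((income - 30_000) / 20_000 if income < 50_000
            else (90_000 - income) / 20_000)

def mu_high(income):
    if income <= 70_000: return 0.0
    if income >= 90_000: return 1.0
    return (income - 70_000) / 20_000

income = 46_000   # borderline: more than one fuzzy value applies
m = {"low": mu_low(income), "medium": mu_medium(income), "high": mu_high(income)}
print(m)          # {'low': 0.2, 'medium': 0.8, 'high': 0.0}

# Each applicable rule votes with its truth value; votes per category are summed.
# Hypothetical rules: medium or high income votes "approve", low votes "reject".
votes = {"approve": m["medium"] + m["high"], "reject": m["low"]}
print(max(votes, key=votes.get))   # 'approve'
```

Note how an income of 46,000 is simultaneously somewhat "low" (0.2) and mostly "medium" (0.8), which is exactly the graded membership the slide describes.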
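Finally, a minimal sketch of the genetic-algorithm loop from the GA slide: bit-string rules, fitness as classification accuracy, one-point crossover, bit-flip mutation, and survival of the fittest. The toy data set, the single-rule fitness definition, the population size, and the rates are all illustrative assumptions.

```python
import random
random.seed(7)

# Toy tuples (A1, A2, class); the target concept "IF A1 AND NOT A2 THEN class 0"
# corresponds to the bit string 100 in the encoding described above.
DATA = [(1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (1, 0, 0), (0, 0, 1)]

def fitness(rule):
    """Accuracy of the 3-bit rule [a1, a2, c]: predict c when the antecedent
    (A1 == a1 AND A2 == a2) matches, otherwise predict the other class."""
    a1, a2, c = rule
    return sum((c if (x1 == a1 and x2 == a2) else 1 - c) == y
               for x1, x2, y in DATA) / len(DATA)

def crossover(p, q):
    cut = random.randrange(1, len(p))                  # one-point crossover
    return p[:cut] + q[cut:]

def mutate(rule, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in rule]   # bit flips

population = [[random.randint(0, 1) for _ in range(3)] for _ in range(8)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)         # survival of the fittest
    if fitness(population[0]) == 1.0:                  # prespecified threshold
        break
    parents = population[:4]                           # fittest rules survive...
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(4)]         # ...plus their offspring

best = max(population, key=fitness)
print(best, fitness(best))   # e.g. [1, 0, 0] -> "IF A1 AND NOT A2 THEN class 0"
```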