1 Model Building Training
Max Kuhn, Kjell Johnson
Global Nonclinical Statistics

2 Overview
- Typical data scenarios
- Examples we'll be using
- General approaches to model building
- Data pre-processing
- Regression-type models
- Classification-type models
- Other considerations

3 Typical data
- The response may be continuous or categorical.
- Predictors may be continuous, count, and/or binary; dense or sparse; observed and/or calculated.

4 Predictive models
What is a "predictive model"? A model whose primary purpose is prediction (as opposed to inference). We would like to know why the model works, as well as the relationship between the predictors and the outcome, but these are secondary.
Examples: blood-glucose monitoring, spam detection, computational chemistry, etc.

5 What are they not good for?
They are not a substitute for subject-specific knowledge.
- Science: hard (yikes!)
- Models: easy (let's do these instead)
To make a good model that predicts well on future samples, you need to know a lot about:
- your predictors and how they relate to each other
- the mechanism that generated the data (sampling, technology, etc.)

6 What are they not good for? An example
An oncologist collects some data from a small clinical trial and wants a model that would use gene expression data to predict therapeutic response (beneficial or not) in 4 types of cancer. There were about 54K predictors, and data was collected on 20 subjects. If there is a lot of knowledge of how the therapy works (pathways, etc.), some effort must be put into using that information to help build the model.

7 The big picture
"In the end, [predictive modeling] is not a substitute for intuition, but a complement." (Ian Ayres, in Super Crunchers)

8 References
- "Statistical Modeling: The Two Cultures" by Leo Breiman, Statistical Science, Vol. 16, No. 3 (2001), 199-231
- The Elements of Statistical Learning by Hastie, Tibshirani and Friedman
- Regression Modeling Strategies by Harrell
- Super Crunchers by Ayres

9 Regression methods
- Multiple linear regression
- Partial least squares
- Neural networks
- Multivariate adaptive regression splines
- Support vector machines
- Regression trees
- Ensembles of trees: bagging, boosting, and random forests

10 Classification methods
Discriminant analysis framework:
- Linear, quadratic, regularized, flexible, and partial least squares discriminant analysis
Modern classification methods:
- Classification trees
- Ensembles of trees: boosting and random forests
- Neural networks
- Support vector machines
- k-nearest neighbors
- Naive Bayes

11 Interesting models we don't have time for
- L1 penalty methods: the lasso, the elastic net, nearest shrunken centroids
- Other boosted models: linear models, generalized additive models, etc.
- Other models: conditional inference trees, C4.5, C5, Cubist, other tree models
- Learned vector quantization
- Self-organizing maps
- Active learning techniques

12 Example data sets

13 Boston housing data
This is a classic benchmark data set for regression. It includes housing data for 506 census tracts of Boston from the 1970 census.
- crim: per capita crime rate
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitric oxides concentration
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centers
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate
- ptratio: pupil-teacher ratio by town
- b: proportion of minorities
- medv: median value of homes (outcome)

14 Toy classification example
A simulated data set will be used to demonstrate classification models.
- Two predictors with a correlation coefficient of 0.5 were simulated.
- Two classes were simulated: "active" and "inactive".
- A probability model was used to assign a probability of being active to each sample.
- The 25%, 50% and 75% probability lines are shown on the right.

15 Toy classification example
- The classes were randomly assigned based on the probability.
- The training data had 250 compounds (plot on right).
- The test set also contained 250 compounds.
With two predictors, the class boundaries can be shown for each model. This can be a significant aid in understanding how the models work, but we acknowledge how unrealistic this situation is.

16 Model Building Training: General Strategies

17 Objective
To construct a model of predictors that can be used to predict a response.

18 Model building steps
Common steps during model building are:
- estimating model parameters (i.e., training models)
- determining the values of tuning parameters that cannot be directly calculated from the data
- calculating the performance of the final model that will generalize to new data
The modeler has a finite amount of data, which they must "spend" to accomplish these steps. How do we "spend" the data to find an optimal model?

19 "Spending" data
We typically "spend" data on training and test data sets.
- Training set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
- Test set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.
The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data:
- Too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well but is not generalizable (over-fitting).
- Too much spent in testing won't allow us to get good estimates of the model parameters.

20 Methods for creating a test set
How should we split the data into a training and test set? Often there will be a scientific rationale for the split; in other cases, the splits can be made empirically. Several empirical splitting options:
- completely random
- stratified random
- maximum dissimilarity in predictor space

21 Creating a test set: completely random splits
A completely random (CR) split randomly partitions the data into a training and test set.
- For large data sets, a CR split has very low bias towards any characteristic (predictor or response).
- For classification problems, a CR split is appropriate for data that is balanced in the response.
- However, a CR split is not appropriate for unbalanced data. A CR split may select too few observations (and perhaps none) of the less frequent class into one of the splits.

22 Creating a test set: stratified random splits
A stratified random (SR) split makes a random split within stratification groups.
- In classification, the classes are used as strata.
- In regression, groups based on the quantiles of the response are used as strata.
Stratification attempts to preserve the distribution of the outcome between the training and test sets. An SR split is more appropriate for unbalanced data.

23 Over-fitting
Over-fitting occurs when a model has extremely good prediction for the training data but predicts poorly when:
- the data are slightly perturbed
- new data (i.e., test data) are used
Complex regression and classification models assume that there are patterns in the data. Without some control, many models can find very intricate relationships between the predictors and the response. These patterns may not be valid for the entire population.

24 Over-fitting example
The plots below show classification boundaries for two models built on the same data. One of them is over-fit.
[Figure: two panels, Predictor A (horizontal axis) versus Predictor B (vertical axis)]

25 Over-fitting in regression
Historically, we evaluate the quality of a regression model by its mean squared error (MSE). Suppose that a prediction function is parameterized by some vector of parameters.

26 Over-fitting in regression
MSE can be decomposed into three terms:
- irreducible noise
- the squared bias of the estimator from its expected value
- the variance of the estimator
The bias and variance are inversely related: as one increases, the other decreases, although at different rates of change.

27 Over-fitting in regression
When the model under-fits, the bias is generally high and the variance is low. Over-fitting is typically characterized by high-variance, low-bias estimators. In many cases, small increases in bias result in large decreases in variance.

28 Over-fitting in regression
Generally, controlling the MSE yields a good trade-off between over- and under-fitting. A similar statement can be made about classification models, although the metrics are different (i.e., not MSE).
How can we accurately estimate the MSE from the training data? The naive MSE from the training data can be a very poor estimate. Resampling can help estimate these metrics.

29 How do we estimate over-fitting?
Some models have specific "knobs" to control over-fitting:
- neighborhood size in nearest neighbor models
- the number of splits in a tree model
Often, poor choices for these parameters can result in over-fitting. Resampling the training compounds allows us to know when we are making poor choices for the values of these parameters.

30 How do we estimate over-fitting?
Resampling only affects the training data; the test set is not used in this procedure. Resampling methods try to "embed variation" in the data to approximate the model's performance on future compounds. Common resampling methods:
- k-fold cross-validation
- leave-group-out cross-validation
- bootstrapping

31 k-fold cross-validation
Here, we randomly split the data into k blocks of roughly equal size. We leave out the first block of data and fit a model. This model is used to predict the held-out block. We continue this process until we've predicted all k hold-out blocks. The final performance is based on the hold-out predictions.

32 k-fold cross-validation
The schematic below shows the process for k = 3 groups. k is usually taken to be 5 or 10. Leave-one-out cross-validation has each sample as a block.

33 Leave-group-out cross-validation
A random proportion of data (say 80%) is used to train a model. The remainder is used to predict performance. This process is repeated many times and the average performance is used.

34 Bootstrapping
Bootstrapping takes a random sample with replacement. The random sample is the same size as the original data set.
- Compounds may be selected more than once.
- Each compound has a 63.2% chance of showing up at least once.
- Some samples won't be selected. These samples will be used to predict performance.
The process is repeated multiple times (say 30).

35 The bootstrap
With bootstrapping, the number of held-out samples is random. Some models, such as random forest, use bootstrapping within the modeling process to reduce over-fitting.

36 Training models with tuning parameters
A single training/test split is often not enough for models with tuning parameters. We must use resampling techniques to get good estimates of model performance over multiple values of these parameters. We pick the complexity parameter(s) with the best performance and re-fit the model using all of the data.

37 Simulated data example
Let's fit a nearest neighbors model to the simulated classification data. The optimal number of neighbors must be chosen.
- If we use leave-group-out cross-validation and set aside 20%, we will fit models to a random 200 samples and predict 50 samples.
- 30 iterations were used.
- We'll train over 11 odd values for the number of neighbors.
- We also have a 250-point test set.

38 Toy data example
The plot on the right shows the classification accuracy for each value of the tuning parameter.
- The grey points are the 30 resampled estimates.
- The black line shows the average accuracy.
- The blue line is the 250-sample test set.
It looks like 7 or more neighbors is optimal, with an estimated accuracy of 86%.

39 Toy data example
What if we didn't resample and used the whole data set? The plot on the right shows the accuracy across the tuning parameters. This would pick a model that over-fits and has optimistic performance.

40 Model Building Training: Data Pre-Processing

41 Why pre-process?
In order to get effective and stable results, many models require certain assumptions about the data. This is model dependent. We will list each model's pre-processing requirements at the end. In general, pre-processing rarely hurts model performance, but it could make model interpretation more difficult.

42 Common pre-processing steps
For most models, we apply three pre-processing procedures:
- removal of predictors with variance close to zero
- elimination of highly correlated predictors
- centering and scaling of each predictor

43 Zero variance predictors
Most models require that each predictor have at least two unique values. Why? A predictor with only one unique value has a variance of zero and contains no information about the response. It is generally a good idea to remove them.

44 "Near zero variance" predictors
Additionally, if the distributions of the predictors are very sparse, this can have a drastic effect on the stability of the model solution. Zero variance descriptors could be induced during resampling. But what does a "near zero variance" predictor look like?

45 "Near zero variance" predictor
There are two conditions for an "NZV" predictor:
- a low number of possible values, and
- a high imbalance in the frequency of the values
For example, a low number of possible values could occur by using fingerprints as predictors: only two possible values can occur (0 or 1). But what if there are 999 zero values in the data and a single value of 1? This is a highly unbalanced case and could be trouble.

46 NZV example
In computational chemistry we created predictors based on structural characteristics of compounds. As an example, the descriptor "nR11" is the number of 11-member rings. The table to the right is the distribution of nR11 from a training set.
- The distinct value percentage is 5/535 = 0.0093.
- The frequency ratio is 501/23 = 21.8.

47 Detecting NZVs
Two criteria for detecting NZVs are:
- The discrete value percentage, defined as the number of unique values divided by the number of observations. Rule of thumb: a discrete value percentage below a small cutoff could indicate a problem.
- The frequency ratio, defined as the frequency of the most common value divided by the frequency of the second most common value. Rule of thumb: a frequency ratio > 19 could indicate a problem.
If both criteria are violated, then eliminate the predictor.

48 Highly correlated predictors
Some models can be negatively affected by highly correlated predictors: certain calculations (e.g., matrix inversion) can become severely unstable. How can we detect these predictors?
- Use the variance inflation factor (VIF) in linear regression, or, alternatively:
- Compute the correlation matrix of the predictors. Predictors with (absolute) pair-wise correlations above a threshold can be flagged for removal. Rule-of-thumb threshold: 0.85.

49 Highly correlated predictors and resampling
Recall that resampling slightly perturbs the training data set to increase variation. If a model is adversely affected by high correlations between predictors, the resampling performance estimates can be poor in comparison to the test set. In this case, resampling does a better job at predicting how the model works on future samples.

50 Centering and scaling
Standardizing the predictors can greatly improve the stability of model calculations. More importantly, there are several models (e.g., partial least squares) that implicitly assume that all of the predictors are on the same scale. Apart from the loss of the original units, there is no real downside to centering and scaling.

51 Model Building Training: Regression-Type Models

52 Setting
The response is continuous.

53 Objective
To construct a model of predictors that can be used to predict a response.

54 Regression methods
- Multiple linear regression
- Partial least squares
- Neural networks
- Multivariate adaptive regression splines
- Support vector machines
- Regression trees
- Ensembles of trees: bagging, boosting, and random forests
Each of these methods seeks to find a relationship between the predictors and the response that minimizes the error between the observed and predicted response.

55 Additive models
In the beginning there were linear models. And Nelder and Wedderburn (1972) said "let there be generalized linear models", and link functions appeared. And Hastie and Tibshirani (1990) said "let there be generalized additive models", and scatterplot smoothers and backfitting algorithms appeared.

56 Families of additive models
[Figure: model families arranged by flexibility: GLM, GAM, recursive partitioning (trees), boosting, random forests, bagging, multivariate adaptive regression splines, neural nets, support vector machines, PLS. For some models, additivity depends on the model parameters.]

57 Assessing model performance

58 Assessing model performance
How well does a regression model perform? Answering this question depends on how we want to use the model. Possible goals are:
- to understand the relationship between the predictors and the response
- to use the model to predict future observations' response
In either case, we can use several different measures to evaluate model performance. We will focus on two:
- coefficient of determination (R²)
- root mean square error (RMSE)
However, the set of data that we use to evaluate performance will change depending on our purpose.

59 Which set of data to use to evaluate performance?
If we are only interested in understanding the underlying relationship between the predictors and the response, then we can compute R² and RMSE on the data on which the model was built (i.e., the training data). However, these values will be overly optimistic about the model's ability to predict future observations.
If we are interested in understanding the model's ability to predict future observations, then we need to compute R² and RMSE on data on which the model was not built (i.e., a test set or cross-validation set).
For a held-out set of data, R² is commonly referred to as Q², and RMSE is commonly referred to as root mean squared prediction error (RMSPE).

60 Root mean squared error (RMSE) and root mean squared prediction error (RMSPE)
- RMSE measures the average deviation of an observation to the best-fit plane.
- RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set:
$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
where $n$ = the number of observations in the test or cross-validation set.

61 Computing Q²
Process:
- Partition the data into a training and testing set (or blocks to be used for training and testing).
- Build the model on the training data and predict the testing data.
- Q² = R² of the relationship between the observed and predicted values for the testing data.

62 Multiple linear regression: a quick review

63 Multiple linear regression
Objective: find the plane through the data that minimizes the sum-of-squares error (SSE).

64 The best plane
To find the best plane, we solve
$\min_{\beta} \, (y - X\beta)^T (y - X\beta)$
where $y$ is $n \times 1$, $X$ is $n \times (p+1)$, and $\beta$ is $(p+1) \times 1$. The best $\beta$ is
$\hat{\beta} = (X^T X)^{-1} X^T y$

65 Aside: a bit more about $X^T X$
$X^T X$ is a critical matrix for many statistical modeling techniques. A few fun facts:
- $X^T X$ is proportional to the covariance matrix, $S$.
- $S$ contains the variances and covariances of all predictors.
- Techniques that depend on $X^T X$ also require that it is invertible.

66 Assumptions: diagnostic plots

67 When does regression fail?
- when a plane does not capture the structure in the data
- when the variance-covariance matrix is overdetermined (that is, not invertible)
Recall that the plane that minimizes SSE is $\hat{\beta} = (X^T X)^{-1} X^T y$. To find the best plane, we must compute the inverse of the variance-covariance matrix. The variance-covariance matrix is not always invertible. Two common conditions that cause it to be uninvertible are:
- two or more of the predictors are correlated (multicollinearity)
- there are more predictors than observations

68 A (trivial) example of multicollinearity
Suppose that we have one observation, (3, 5), and we wish to find the "best" line for the data. In this example, the number of observations (1) is less than the number of parameters (2: slope and intercept). When the number of parameters is greater than the number of observations, we can find an infinite number of "best" solutions. In the presence of multicollinearity, the best solution will be unstable.

69 Boston housing data
Let's use a linear regression model to predict the median house price in Boston. Process:
- Split the data into a training set (n = 337) and a testing set (n = 169).
- For the training set, use the bootstrap to determine the RMSPE and Q².
- For the test data, determine RMSPE and Q².
If the underlying model is stable, the values of RMSPE and Q² should be similar between the bootstrap and the testing data.

70 Results
The results are fairly similar, at least within the variation of resampling. One reason you may see differences: multicollinearity.
- Multicollinearity in the predictors can produce somewhat unstable solutions.
- For each resample, when the data are slightly changed, the model can drastically change.
- The test set is a single, static set of data for verification.
- The bootstrap estimate of performance may be better with collinearity.

71 Partial least squares regression

72 Solutions for overdetermined covariance matrices
- Variable reduction: try to accomplish this through the pre-processing steps.
- Partial least squares (PLS)
- Other methods:
  - Apply a generalized inverse.
  - Ridge regression: adjusts the variance-covariance matrix so that we can find a unique inverse.
  - Principal component regression (PCR): not recommended, but it's a good way to understand PLS.

73 Understanding partial least squares: principal components analysis
PCA seeks to find linear combinations of th
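
Slide 22 describes stratified random splits for classification. A minimal sketch in Python, using only the standard library, might look like the following. The function name `stratified_split`, the 20% test fraction, and the 10/90 class counts are assumptions made for illustration and do not come from the slides.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        # Assumption: always hold out at least one sample per class, so a
        # class with a single sample would end up only in the test set.
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical unbalanced data: 10 "active" and 90 "inactive" samples.
labels = ["active"] * 10 + ["inactive"] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in set(labels)}
```

Because the split is made inside each class, the rare class keeps the same share in both sets, which a completely random split does not guarantee.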
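
The k-fold cross-validation procedure on slide 31 can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: `kfold_blocks` and `cross_validate` are hypothetical helper names, and the "model" is a deliberately trivial mean predictor so that the block stays self-contained.

```python
import random


def kfold_blocks(n, k, seed=0):
    """Randomly split indices 0..n-1 into k blocks of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return one held-out prediction per sample from k-fold CV."""
    preds = [None] * len(xs)
    for block in kfold_blocks(len(xs), k, seed):
        held_out = set(block)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on everything except the held-out block ...
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # ... and predict only the held-out block.
        for i in block:
            preds[i] = predict(model, xs[i])
    return preds


# Toy model (an assumption for illustration): always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x for x in xs]
preds = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
# Performance is computed from the held-out predictions only.
cv_rmse = (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5
```

Any `fit`/`predict` pair with the same shape could be passed in; the resampling loop itself does not depend on the model.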
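
Slide 34 states that each compound has a 63.2% chance of appearing in a bootstrap sample. That figure follows from 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows. The sketch below checks this and draws one bootstrap sample with its held-out ("out-of-bag") indices; the function names are hypothetical.

```python
import math
import random


def inclusion_probability(n):
    """Chance that a given sample appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return them plus the never-selected indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


limit = 1.0 - math.exp(-1.0)          # large-n limit, about 0.632
p_250 = inclusion_probability(250)    # 250 matches the toy training set size

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(250, rng)
# The number of held-out samples varies from one bootstrap sample to the next.
n_held_out = len(out_of_bag)
```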
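
The two near-zero-variance criteria from slides 45 to 47 can be computed directly. In this sketch the frequency-ratio cutoff of 19 comes from the slides, while the 10% cutoff for the unique-value percentage is an assumption (the slide's own cutoff did not survive extraction). The function names are hypothetical.

```python
from collections import Counter


def nzv_metrics(values):
    """Return (percent of unique values, frequency ratio) for one predictor."""
    counts = Counter(values).most_common()
    percent_unique = 100.0 * len(counts) / len(values)
    if len(counts) == 1:
        freq_ratio = float("inf")  # a zero-variance predictor
    else:
        # Most common value count divided by second most common value count.
        freq_ratio = counts[0][1] / counts[1][1]
    return percent_unique, freq_ratio


def is_near_zero_variance(values, max_percent_unique=10.0, max_freq_ratio=19.0):
    """Flag a predictor only when BOTH criteria are violated.

    max_percent_unique=10.0 is an assumed cutoff, not taken from the slides.
    """
    percent_unique, freq_ratio = nzv_metrics(values)
    return percent_unique < max_percent_unique and freq_ratio > max_freq_ratio


# The fingerprint case from slide 45: 999 zeros and a single 1.
fingerprint = [0] * 999 + [1]
fp_percent, fp_ratio = nzv_metrics(fingerprint)
fp_flag = is_near_zero_variance(fingerprint)

# A balanced binary predictor has few unique values but is not flagged.
balanced_flag = is_near_zero_variance([0, 1] * 500)
```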
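
Slide 48 suggests flagging predictors whose absolute pair-wise correlation exceeds a threshold such as 0.85. A minimal sketch follows. The greedy rule used here (keep the earlier column, drop the later one) is an assumption chosen for simplicity; the slides do not say which member of a correlated pair to remove, and the data are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated(columns, threshold=0.85):
    """Return names of predictors to drop.

    Assumption: for each pair above the threshold, the later column is dropped.
    """
    names = list(columns)
    dropped = []
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > threshold:
                dropped.append(b)
    return dropped


# Hypothetical predictors: x2 is nearly 2 * x1, x3 is unrelated.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "x3": [5, 1, 4, 2, 6, 3],
}
to_drop = flag_correlated(columns, threshold=0.85)
```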
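
Slides 64 and 67 rely on inverting $X^T X$ and note that the inverse fails under multicollinearity. The sketch below restricts itself to two predictors without an intercept, an assumption made so that the 2x2 inverse can be written out by hand. It shows the solution being recovered when the predictors differ and the determinant collapsing to zero when one predictor is an exact multiple of the other.

```python
def solve_normal_equations(x1, x2, y, tol=1e-12):
    """Solve (X^T X) beta = X^T y for two predictors and no intercept."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    det = a * d - b * b
    if abs(det) <= tol * max(1.0, a * d):
        raise ValueError("X^T X is singular: the predictors are collinear")
    # Entries of X^T y.
    c1 = sum(u * v for u, v in zip(x1, y))
    c2 = sum(u * v for u, v in zip(x2, y))
    # Apply the explicit inverse (1/det) * [[d, -b], [-b, a]].
    beta1 = (d * c1 - b * c2) / det
    beta2 = (a * c2 - b * c1) / det
    return beta1, beta2


# Hypothetical data generated as y = 2*x1 + 3*x2 with no noise.
x1 = [-1.0, 0.0, 1.0, 2.0, -2.0]
x2 = [1.0, -1.0, 0.0, 2.0, -2.0]
y = [2 * u + 3 * v for u, v in zip(x1, x2)]
beta = solve_normal_equations(x1, x2, y)

# Perfect multicollinearity: the second predictor is exactly twice the first.
collinear = [2 * u for u in x1]
try:
    solve_normal_equations(x1, collinear, y)
    singular_detected = False
except ValueError:
    singular_detected = True
```

With nearly (rather than exactly) collinear predictors the determinant is small but non-zero, so the solve succeeds while the coefficients become very sensitive to small changes in the data, which is the instability the slides describe.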
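
Slides 59 to 61 define RMSPE and Q² on held-out data. A small sketch follows, assuming Q² is computed as the squared Pearson correlation between observed and predicted test values, which is one reading of "R² of the relationship between the observed and predicted values"; other definitions of R² exist. The numbers are made up.

```python
import math


def rmspe(observed, predicted):
    """Root mean squared prediction error over a held-out set."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def q2(observed, predicted):
    """Squared correlation between observed and predicted held-out values.

    Assumption: this is the reading of Q^2 used here, not the only definition.
    """
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return (sop * sop) / (soo * spp)


# Hypothetical held-out data: every prediction is too high by exactly 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
test_rmspe = rmspe(observed, predicted)
test_q2 = q2(observed, predicted)
```

In this example the predictions are perfectly correlated with the observations yet consistently off by 0.5, so the two measures disagree. Under the squared-correlation reading, Q² does not penalize a constant bias, which is one reason to report RMSPE alongside it.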