




1. Model Building Training
Max Kuhn, Kjell Johnson
Global Nonclinical Statistics

2. Overview
- Typical data scenarios
- Examples we'll be using
- General approaches to model building
- Data pre-processing
- Regression-type models
- Classification-type models
- Other considerations

3. Typical Data
- The response may be continuous or categorical.
- Predictors may be continuous, count, and/or binary; dense or sparse; observed and/or calculated.

4. Predictive Models
What is a "predictive model"? A model whose primary purpose is prediction (as opposed to inference). We would like to know why the model works, as well as the relationship between the predictors and the outcome, but these are secondary.
Examples: blood-glucose monitoring, spam detection, computational chemistry, etc.

5. What Are They Not Good For?
They are not a substitute for subject-specific knowledge.
- Science: hard (yikes!)
- Models: easy (let's do these instead!)
To make a good model that predicts well on future samples, you need to know a lot about:
- your predictors and how they relate to each other
- the mechanism that generated the data (sampling, technology, etc.)

6. What Are They Not Good For? An Example
An oncologist collects some data from a small clinical trial and wants a model that would use gene expression data to predict therapeutic response (beneficial or not) in 4 types of cancer.
- There were about 54K predictors, and data was collected on 20 subjects.
- If there is a lot of knowledge of how the therapy works (pathways, etc.), some effort must be put into using that information to help build the model.

7. The Big Picture
"In the end, [predictive modeling] is not a substitute for intuition, but a complement."
Ian Ayres, in Super Crunchers

8. References
- "Statistical Modeling: The Two Cultures" by Leo Breiman, Statistical Science, Vol. 16, No. 3 (2001), 199-231
- The Elements of Statistical Learning by Hastie, Tibshirani and Friedman
- Regression Modeling Strategies by Harrell
- Super Crunchers by Ayres

9. Regression Methods
- Multiple linear regression
- Partial least squares
- Neural networks
- Multivariate adaptive regression splines
- Support vector machines
- Regression trees
- Ensembles of trees: bagging, boosting, and random forests

10. Classification Methods
- Discriminant analysis framework: linear, quadratic, regularized, flexible, and partial least squares discriminant analysis
- Modern classification methods: classification trees; ensembles of trees (boosting and random forests); neural networks; support vector machines; k-nearest neighbors; naive Bayes

11. Interesting Models We Don't Have Time For
- L1 penalty methods: the lasso, the elastic net, nearest shrunken centroids
- Other boosted models: linear models, generalized additive models, etc.
- Other models: conditional inference trees; C4.5, C5, Cubist; other tree models; learned vector quantization; self-organizing maps; active learning techniques

12. Example Data Sets

13. Boston Housing Data
This is a classic benchmark data set for regression. It includes housing data for 506 census tracts of Boston from the 1970 census.
- crim: per capita crime rate
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if tract bounds river, 0 otherwise)
- nox: nitric oxides concentration
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centers
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate
- ptratio: pupil-teacher ratio by town
- b: proportion of minorities
- medv: median value of homes (the outcome)

14. Toy Classification Example
A simulated data set will be used to demonstrate classification models.
- Two predictors with a correlation coefficient of 0.5 were simulated.
- Two classes were simulated: "active" and "inactive".
- A probability model was used to assign a probability of being active to each sample.
- The 25%, 50% and 75% probability lines are shown on the right.

15. Toy Classification Example
- The classes were randomly assigned based on the probability.
- The training data had 250 compounds (plot on right).
- The test set also contained 250 compounds.
With two predictors, the class boundaries can be shown for each model. This can be a significant aid in understanding how the models work, but we acknowledge how unrealistic this situation is.

16. Model Building Training: General Strategies

17. Objective
To construct a model of predictors that can be used to predict a response.

18. Model Building Steps
Common steps during model building are:
- estimating model parameters (i.e. training models)
- determining the values of tuning parameters that cannot be directly calculated from the data
- calculating the performance of the final model that will generalize to new data
The modeler has a finite amount of data, which they must "spend" to accomplish these steps. How do we "spend" the data to find an optimal model?

19. "Spending" Data
We typically "spend" data on training and test data sets.
- Training set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
- Test set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.
The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data:
- Too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well but is not generalizable (over-fitting).
- Too much spent in testing won't allow us to get a good assessment of model parameters.

20. Methods for Creating a Test Set
How should we split the data into a training and test set? Often there will be a scientific rationale for the split; in other cases, the splits can be made empirically. Several empirical splitting options:
- completely random
- stratified random
- maximum dissimilarity in predictor space

21. Creating a Test Set: Completely Random Splits
A completely random (CR) split randomly partitions the data into a training and test set.
- For large data sets, a CR split has very low bias towards any characteristic (predictor or response).
- For classification problems, a CR split is appropriate for data that is balanced in the response.
- However, a CR split is not appropriate for unbalanced data: it may select too few observations (and perhaps none) of the less frequent class into one of the splits.

22. Creating a Test Set: Stratified Random Splits
A stratified random (SR) split makes a random split within stratification groups.
- In classification, the classes are used as strata.
- In regression, groups based on the quantiles of the response are used as strata.
Stratification attempts to preserve the distribution of the outcome between the training and test sets. An SR split is more appropriate for unbalanced data.

23. Over-Fitting
Over-fitting occurs when a model has extremely good prediction for the training data but predicts poorly when:
- the data are slightly perturbed
- new data (i.e. test data) are used
Complex regression and classification models assume that there are patterns in the data. Without some control, many models can find very intricate relationships between the predictors and the response. These patterns may not be valid for the entire population.

24. Over-Fitting Example
The plots below show classification boundaries for two models built on the same data. One of them is over-fit.
[Figure: two panels, Predictor B versus Predictor A]

25. Over-Fitting in Regression
Historically, we evaluate the quality of a regression model by its mean squared error (MSE). Suppose that a prediction function is parameterized by some parameter vector.

26. Over-Fitting in Regression
MSE can be decomposed into three terms:
- irreducible noise
- the squared bias of the estimator from its expected value
- the variance of the estimator
The bias and variance are inversely related: as one increases, the other decreases, though at different rates of change.

27. Over-Fitting in Regression
- When the model under-fits, the bias is generally high and the variance is low.
- Over-fitting is typically characterized by high-variance, low-bias estimators.
- In many cases, small increases in bias result in large decreases in variance.

28. Over-Fitting in Regression
Generally, controlling the MSE yields a good trade-off between over- and under-fitting. A similar statement can be made about classification models, although the metrics are different (i.e. not MSE).
How can we accurately estimate the MSE from the training data? The naive MSE from the training data can be a very poor estimate. Resampling can help estimate these metrics.

29. How Do We Estimate Over-Fitting?
Some models have specific "knobs" to control over-fitting:
- the neighborhood size in nearest neighbor models
- the number of splits in a tree model
Often, poor choices for these parameters can result in over-fitting. Resampling the training compounds allows us to know when we are making poor choices for the values of these parameters.

30. How Do We Estimate Over-Fitting?
Resampling only affects the training data; the test set is not used in this procedure. Resampling methods try to "embed variation" in the data to approximate the model's performance on future compounds. Common resampling methods:
- k-fold cross-validation
- leave-group-out cross-validation
- bootstrapping

31. K-Fold Cross-Validation
- Here, we randomly split the data into K blocks of roughly equal size.
- We leave out the first block of data and fit a model.
- This model is used to predict the held-out block.
- We continue this process until we've predicted all K hold-out blocks.
- The final performance is based on the hold-out predictions.

32. K-Fold Cross-Validation
The schematic below shows the process for K = 3 groups.
- K is usually taken to be 5 or 10.
- Leave-one-out cross-validation has each sample as a block.

33. Leave-Group-Out Cross-Validation
- A random proportion of the data (say 80%) is used to train a model.
- The remainder is used to predict performance.
- This process is repeated many times and the average performance is used.

34. Bootstrapping
- Bootstrapping takes a random sample with replacement.
- The random sample is the same size as the original data set.
- Compounds may be selected more than once; each compound has a 63.2% chance of showing up at least once.
- Some samples won't be selected; these samples will be used to predict performance.
- The process is repeated multiple times (say 30).

35. The Bootstrap
- With bootstrapping, the number of held-out samples is random.
- Some models, such as random forest, use bootstrapping within the modeling process to reduce over-fitting.

36. Training Models with Tuning Parameters
- A single training/test split is often not enough for models with tuning parameters.
- We must use resampling techniques to get good estimates of model performance over multiple values of these parameters.
- We pick the complexity parameter(s) with the best performance and re-fit the model using all of the data.

37. Simulated Data Example
Let's fit a nearest neighbors model to the simulated classification data. The optimal number of neighbors must be chosen.
- If we use leave-group-out cross-validation and set aside 20%, we will fit models to a random 200 samples and predict 50 samples.
- 30 iterations were used.
- We'll train over 11 odd values for the number of neighbors.
- We also have a 250-point test set.

38. Toy Data Example
The plot on the right shows the classification accuracy for each value of the tuning parameter.
- The grey points are the 30 resampled estimates.
- The black line shows the average accuracy.
- The blue line is the 250-sample test set.
It looks like 7 or more neighbors is optimal, with an estimated accuracy of 86%.

39. Toy Data Example
What if we didn't resample and used the whole data set? The plot on the right shows the accuracy across the tuning parameters. This would pick a model that over-fits and has optimistic performance.

40. Model Building Training: Data Pre-Processing

41. Why Pre-Process?
In order to get effective and stable results, many models require certain assumptions about the data.
- This is model dependent.
- We will list each model's pre-processing requirements at the end.
In general, pre-processing rarely hurts model performance, but could make model interpretation more difficult.

42. Common Pre-Processing Steps
For most models, we apply three pre-processing procedures:
- removal of predictors with variance close to zero
- elimination of highly correlated predictors
- centering and scaling of each predictor

43. Zero Variance Predictors
Most models require that each predictor have at least two unique values. Why? A predictor with only one unique value has a variance of zero and contains no information about the response. It is generally a good idea to remove such predictors.

44. "Near Zero Variance" Predictors
Additionally, if the distributions of the predictors are very sparse, this can have a drastic effect on the stability of the model solution, and zero variance descriptors could be induced during resampling. But what does a "near zero variance" predictor look like?

45. "Near Zero Variance" Predictor
There are two conditions for an "NZV" predictor:
- a low number of possible values, and
- a high imbalance in the frequency of the values
For example, a low number of possible values could occur by using fingerprints as predictors: only two possible values can occur (0 or 1). But what if there are 999 zero values in the data and a single value of 1? This is a highly unbalanced case and could be trouble.

46. NZV Example
In computational chemistry we created predictors based on structural characteristics of compounds. As an example, the descriptor "nR11" is the number of 11-member rings. The table to the right is the distribution of nR11 from a training set.
- The distinct value percentage is 5/535 = 0.0093.
- The frequency ratio is 501/23 = 21.8.

47. Detecting NZVs
Two criteria for detecting NZVs are:
- the discrete value percentage, defined as the number of unique values divided by the number of observations (rule of thumb: a low percentage could indicate a problem)
- the frequency ratio, the frequency of the most common value divided by the frequency of the second most common value (rule of thumb: a ratio above 19 could indicate a problem)
If both criteria are violated, then eliminate the predictor.

48. Highly Correlated Predictors
Some models can be negatively affected by highly correlated predictors; certain calculations (e.g. matrix inversion) can become severely unstable. How can we detect these predictors?
- Use the variance inflation factor (VIF) in linear regression, or, alternatively,
- compute the correlation matrix of the predictors. Predictors with (absolute) pair-wise correlations above a threshold can be flagged for removal. Rule-of-thumb threshold: 0.85.

49. Highly Correlated Predictors and Resampling
Recall that resampling slightly perturbs the training data set to increase variation. If a model is adversely affected by high correlations between predictors, the resampling performance estimates can be poor in comparison to the test set. In this case, resampling does a better job at predicting how the model works on future samples.

50. Centering and Scaling
Standardizing the predictors can greatly improve the stability of model calculations. More importantly, there are several models (e.g. partial least squares) that implicitly assume that all of the predictors are on the same scale. Apart from the loss of the original units, there is no real downside to centering and scaling.

51. Model Building Training: Regression-Type Models

52. Setting
The response is continuous.

53. Objective
To construct a model of predictors that can be used to predict a response.

54. Regression Methods
- Multiple linear regression
- Partial least squares
- Neural networks
- Multivariate adaptive regression splines
- Support vector machines
- Regression trees
- Ensembles of trees: bagging, boosting, and random forests
Each of these methods seeks to find a relationship between the predictors and the response that minimizes the error between the observed and predicted response.

55. Additive Models
In the beginning there were linear models. And Hastie and Tibshirani (1990) said, "Let there be generalized additive models", and Nelder and Wedderburn (1972) said, "Let there be generalized linear models". And link functions appeared, and scatterplot smoothers and backfitting algorithms appeared.

56. Families of Additive Models
[Figure: model families arranged by flexibility: GLM, GAM, recursive partitioning (trees), boosting, random forests, bagging, multivariate adaptive regression splines, neural nets, support vector machines, PLS. Additivity depends on model parameters.]

57. Assessing Model Performance

58. Assessing Model Performance
How well does a regression model perform? Answering this question depends on how we want to use the model. Possible goals are:
- to understand the relationship between the predictors and the response
- to use the model to predict future observations' response
In either case, we can use several different measures to evaluate model performance. We will focus on two:
- the coefficient of determination (R²)
- the root mean square error (RMSE)
However, the set of data that we use to evaluate performance will change depending on our purpose.

59. Which Set of Data to Use to Evaluate Performance
If we are only interested in understanding the underlying relationship between the predictors and the response, then we can compute R² and RMSE on the data for which the model was built (i.e. the training data). However, these values will be overly optimistic about the model's ability to predict future observations.
If we are interested in understanding the model's ability to predict future observations, then we need to compute R² and RMSE on data for which the model was not built (i.e. a test set or cross-validation set).
For a held-out set of data, R² is commonly referred to as Q², and RMSE is commonly referred to as root mean squared prediction error (RMSPE).

60. Root Mean Squared Error (RMSE) and Root Mean Squared Prediction Error (RMSPE)
- RMSE measures the average deviation of an observation from the best-fit plane.
- RMSPE measures the average deviation of an observation from its predicted value for the test or cross-validation set:
  $\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
  where $n$ is the number of observations in the test or cross-validation set.

61. Computing Q²
Process:
- Partition the data into a training and testing set (or blocks to be used for training and testing).
- Build the model on the training data and predict the testing data.
- Q² = R² of the relationship between the observed and predicted values for the testing data.

62. Multiple Linear Regression: A Quick Review

63. Multiple Linear Regression
Objective: find the plane through the data that minimizes the sum-of-squares error.

64. The Best Plane
To find the best plane, we solve $\min_{\beta}\,(y - X\beta)^T (y - X\beta)$, where $y$ is $n \times 1$, $X$ is $n \times (p+1)$, and $\beta$ is $(p+1) \times 1$. The best $\beta$ is $\hat{\beta} = (X^T X)^{-1} X^T y$.

65. Aside: A Bit More About $X^T X$
$X^T X$ is a critical matrix for many statistical modeling techniques. A few fun facts:
- $X^T X$ is proportional to the covariance matrix, $S$.
- $S$ contains the variances and covariances of all predictors.
- Techniques that depend on $X^T X$ also require that it is invertible.

66. Assumptions and Diagnostic Plots

67. When Does Regression Fail?
- When a plane does not capture the structure in the data.
- When the variance/covariance matrix is overdetermined.
Recall that the plane that minimizes SSE is $\hat{\beta} = (X^T X)^{-1} X^T y$. To find the best plane, we must compute the inverse of the variance/covariance matrix. The variance/covariance matrix is not always invertible. Two common conditions that cause it to be uninvertible are:
- two or more of the predictors are correlated (multicollinearity)
- there are more predictors than observations

68. A "Trivial" Example of Multicollinearity
Suppose that we have one observation, (3, 5), and we wish to find the "best" line for the data. In this example, the number of observations (1) is less than the number of parameters (2: slope and intercept). When the number of parameters is greater than the number of observations, we can find an infinite number of "best" solutions. In the presence of multicollinearity, the best solution will be unstable.

69. Boston Housing Data
Let's use a linear regression model to predict the median house price in Boston. Process:
- Split the data into a training set (n = 337) and a testing set (n = 169).
- For the training set, use the bootstrap to determine the RMSPE and Q².
- For the test data, determine RMSPE and Q².
If the underlying model is stable, the values of RMSPE and Q² should be similar between the bootstrap and testing data.

70. Results
The results are fairly similar, at least within the variation of resampling. One reason you may see differences: multicollinearity.
- Multicollinearity in the predictors can produce somewhat unstable solutions for each resample. When the data are slightly changed, the model can drastically change.
- The test set is a single, static set of data for verification.
- The bootstrap estimate of performance may be better with collinearity.

71. Partial Least Squares Regression

72. Solutions for Overdetermined Covariance Matrices
- Variable reduction: try to accomplish this through the pre-processing steps.
- Partial least squares (PLS).
- Other methods:
  - Apply a generalized inverse.
  - Ridge regression: adjusts the variance/covariance matrix so that we can find a unique inverse.
  - Principal component regression (PCR): not recommended, but it's a good way to understand PLS.

73. Understanding Partial Least Squares: Principal Components Analysis
PCA seeks to find linear combinations of th...
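As a rough illustration of the k-fold procedure described on slide 31, the steps could be sketched as follows. This is a minimal sketch using only the Python standard library; the function names (`k_fold_indices`, `cross_validate`) and the mean-only "model" are hypothetical stand-ins for illustration, not anything taken from the slides.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly split range(n_samples) into k blocks of roughly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Block i takes every k-th shuffled index, so block sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(x, y, k, fit, predict, seed=0):
    """Return one held-out prediction per sample, in the original sample order."""
    held_out = [None] * len(x)
    for block in k_fold_indices(len(x), k, seed):
        block_set = set(block)
        train = [i for i in range(len(x)) if i not in block_set]
        # Fit on everything except the current block, then predict the block.
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in block:
            held_out[i] = predict(model, x[i])
    return held_out

# Hypothetical "model" for illustration: always predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x_new):
    return model

x = list(range(10))
y = [2.0 * v for v in x]
preds = cross_validate(x, y, k=5, fit=fit_mean, predict=predict_mean)
```

Any performance metric would then be computed from `preds` against `y`, which corresponds to basing the final performance on the hold-out predictions.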
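The 63.2% figure quoted for bootstrapping (slide 34) can be checked empirically. The sketch below assumes a sample size of 250 (the toy training set size) and 30 resamples (the "say 30" in the slides); the helper name `bootstrap_split` is hypothetical.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; unselected indices are held out."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    selected = set(in_bag)
    held_out = [i for i in range(n_samples) if i not in selected]
    return in_bag, held_out

rng = random.Random(42)
n = 250           # assumed: size of the toy training set
n_resamples = 30  # assumed: "say 30" resamples

fractions = []
for _ in range(n_resamples):
    in_bag, held_out = bootstrap_split(n, rng)
    # Fraction of distinct samples that showed up at least once.
    fractions.append(len(set(in_bag)) / n)

mean_fraction = sum(fractions) / len(fractions)
# Chance that a given sample appears at least once: 1 - (1 - 1/n)^n
theoretical = 1 - (1 - 1 / n) ** n
```

Because the held-out set is whatever was not drawn, its size varies from resample to resample, which is the point made on slide 35.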
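The two near-zero-variance criteria (slides 45 to 47) could be computed along these lines. This is a sketch under stated assumptions: only the total of 535 observations, the 5 distinct values, and the counts 501 and 23 come from the nR11 example; the remaining counts are made up so the totals match. The cutoff of 19 for the frequency ratio is the rule of thumb from the slides, while the 0.10 cutoff for the discrete value percentage is an assumed value chosen for illustration.

```python
from collections import Counter

def nzv_metrics(values):
    """Return (discrete value percentage, frequency ratio) for one predictor."""
    counts = Counter(values).most_common()  # sorted by descending frequency
    pct_unique = len(counts) / len(values)
    if len(counts) < 2:
        freq_ratio = float("inf")  # a single unique value: zero variance
    else:
        # Most common value's count over the second most common value's count.
        freq_ratio = counts[0][1] / counts[1][1]
    return pct_unique, freq_ratio

def is_near_zero_variance(values, max_pct_unique, min_freq_ratio=19):
    """Flag a predictor only when BOTH criteria are violated."""
    pct_unique, freq_ratio = nzv_metrics(values)
    return pct_unique < max_pct_unique and freq_ratio > min_freq_ratio

# Hypothetical reconstruction of the nR11 distribution: the counts 6, 3 and 2
# are invented; only 501, 23, the total (535) and 5 distinct values are given.
nr11 = [0] * 501 + [1] * 23 + [2] * 6 + [3] * 3 + [4] * 2

pct_unique, freq_ratio = nzv_metrics(nr11)
flagged = is_near_zero_variance(nr11, max_pct_unique=0.10)  # 0.10 is assumed
```

With these inputs the two metrics reproduce the ratios quoted in the example (5/535 and 501/23).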
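Flagging highly correlated predictors with the 0.85 rule of thumb (slide 48) might look like the sketch below. The slides do not say which member of a correlated pair to drop, so the greedy "keep the first one seen" rule here is an assumption, as are the function names and the three made-up predictors.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def flag_correlated(columns, threshold=0.85):
    """Greedy filter (an assumed rule): keep a predictor only if its absolute
    correlation with every predictor kept so far is at or below the threshold."""
    kept, flagged = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            flagged.append(name)
        else:
            kept.append(name)
    return kept, flagged

# Made-up predictors: "b" is nearly a multiple of "a"; "c" is unrelated.
columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "c": [5, 1, 4, 2, 6, 3],
}
kept, flagged = flag_correlated(columns)
```

Other removal rules are possible (for example, dropping the predictor with the larger average correlation to the rest); the greedy version is used here only because it is short.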
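The held-out metrics from slides 60 and 61 can be sketched as below, taking Q² to be the squared correlation between observed and predicted test-set values, as the "Computing Q²" slide describes. The observed and predicted values are invented for illustration.

```python
import math

def rmspe(observed, predicted):
    """Root mean squared prediction error over n held-out observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def q2(observed, predicted):
    """R^2 of the observed-versus-predicted relationship (squared correlation)."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Invented test-set values: every prediction is off by exactly 0.5.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

test_rmspe = rmspe(observed, predicted)
test_q2 = q2(observed, predicted)
```

Computing the same two functions on bootstrap hold-outs and on the test set gives the comparison described for the Boston housing example.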