FAFU机器学习05-1 Model Evaluation课件

Foundations of Machine Learning
Model Evaluation
2023/11/4

Q: How do we estimate the performance of models built with different machine learning algorithms?
Q: How do we estimate the performance of these models with different parameters?

[Figure: polynomial curve fitting. Blue: observed data; green: true distribution; red: predicted curve]

Q: How do we estimate the performance of a machine learning model?
Answer:
① We want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.
② We want to increase the predictive performance by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space.
③ We want to compare different algorithms, selecting the best-performing one as well as the best-performing model from the algorithm's hypothesis space.

Basic Concepts

i.i.d.: independent and identically distributed, meaning that all samples have been drawn from the same probability distribution and are statistically independent of each other.

Accuracy: the number of correct predictions a divided by the number of samples m.
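These two counting definitions can be sketched in a few lines of Python (a minimal illustration; the function name and the example labels are my own):

```python
# Accuracy: correct predictions a divided by the number of samples m.
# Error rate: wrong predictions b divided by m, i.e. 1 - accuracy.
def accuracy(y_true, y_pred):
    a = sum(t == p for t, p in zip(y_true, y_pred))
    return a / len(y_true)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]  # one wrong prediction out of six
acc = accuracy(y_true, y_pred)
err = 1 - acc
```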

Error Rate: the number of wrong predictions b divided by the number of samples m.

Basic Concepts
Error (误差): generally speaking, the difference between the output value predicted by the model and the real sample value.
Training error (训练误差): also called empirical error (经验误差), the error we get when applying the model to the same data it was trained on.
Test error (测试误差): the error we incur on new data.
Generalization error (泛化误差): also called out-of-sample error, a measure of how accurately an algorithm can predict outcome values for unseen data.
Practically, the test error is used to estimate the generalization error; theoretically, a generalization error bound is employed.

Basic Concepts
Overfitting (过拟合): low error on training data and high error on test data. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
Underfitting (欠拟合): high error on training data. Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data, e.g. when fitting a linear model to non-linear data.

Evaluation Methods
Holdout Method (留出法)
K-fold Cross-validation (K折交叉验证法)
Bootstrapping (自助法)

Evaluation Methods
Holdout Method (留出法): arguably the simplest model evaluation technique. Split the dataset into two disjoint parts: a training set and a test set.
Keep in mind: there are many ways to split the dataset, and different splits give different performance estimates; the change in the underlying sample statistics along the feature axes is still a problem, which becomes more pronounced when we work with small datasets. Stratified sampling (分层采样) helps here. Alternatively, repeat the holdout method k times with different random seeds and compute the average performance over these k repetitions.
Keep in mind: the size of the training set will affect the performance. Take about 2/3 to 4/5 of the dataset as training data.
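The holdout split described above can be sketched with only the standard library (the function name is my own, not from any particular package):

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Shuffle, then split the dataset into disjoint training and test parts."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))
train, test = holdout_split(data)

# Repeated holdout: evaluate over k different random seeds and average
# the resulting performance estimates. Here we only collect the splits.
splits = [holdout_split(data, seed=s) for s in range(5)]
```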

Evaluation Methods
K-fold Cross-validation (K折交叉验证法): probably the most common, but more computationally intensive, approach. Split the dataset into k disjoint parts, called folds. Typical choices for k are 5, 10, or 20. K-fold cross-validation is a special case of cross-validation where we iterate over a dataset k times: in each round, one part is used for validation, and the remaining k-1 parts are merged into a training subset for model evaluation.

[Figure: 5-fold cross-validation]

Keep in mind: the larger the number of folds used in k-fold CV, the better the error estimates will be, but the longer your program will take to run. Solution: use at least 10 folds (or more) when you can.
Leave-One-Out (留一法): LOO is the special case where k = the number of data points. LOOCV can be useful for small datasets.
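The fold construction can be sketched as a hand-rolled index generator (for illustration only; in practice a library routine would be used):

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists over k disjoint folds."""
    # Fold sizes differ by at most one when k does not divide n.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        # Merge the remaining k-1 folds into the training subset.
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, val

rounds = list(kfold_indices(10, 5))
```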

Evaluation Methods
Bootstrapping (自助法): a bootstrap sampling technique for estimating a sampling distribution. The idea of the bootstrap method is to generate new data from a population by repeatedly sampling from the original dataset with replacement. In each iteration, approximately 0.632×n samples are selected as the bootstrap training set, and the remaining roughly 0.368×n out-of-bag samples are reserved for testing.
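The with-replacement draw and the resulting out-of-bag set can be sketched as follows (function name my own; the ≈0.632 fraction arises because each sample is missed with probability (1 − 1/n)ⁿ ≈ 1/e ≈ 0.368):

```python
import random

def bootstrap_split(data, seed=0):
    """Draw n samples with replacement; samples never drawn form the OOB set."""
    rng = random.Random(seed)
    n = len(data)
    train = [data[rng.randrange(n)] for _ in range(n)]
    chosen = set(train)
    oob = [x for x in data if x not in chosen]
    return train, oob

data = list(range(1000))
train, oob = bootstrap_split(data)
unique_frac = len(set(train)) / len(data)  # typically near 0.632
```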

Evaluation Metrics

Metrics for Binary Classification
Measuring model performance with accuracy: the fraction of correctly classified samples. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important, which is often not the case.
Accuracy is not always a useful metric and may be misleading. Example: email spam classification. Suppose 99% of emails are real and 1% are spam. We could build a model that predicts all emails are real: accuracy = 99%, yet it is horrible at actually classifying spam and fails at its original purpose.

Metrics for Binary Classification
Confusion matrix: one of the most comprehensive ways to represent the result of evaluating a binary classifier.

Error rate & Accuracy: the error rate is the sum of all false predictions divided by the total number of predictions, and the accuracy is the sum of correct predictions divided by the total number of predictions, respectively:
ERR = (FP + FN) / (TP + TN + FP + FN)
ACC = (TP + TN) / (TP + TN + FP + FN)

Metrics from the confusion matrix
Precision (查准率): measures how many of the samples predicted as positive are actually positive, P = TP / (TP + FP). Precision is used as a performance metric when the goal is to limit the number of false positives. Example: predicting whether a new drug will be effective in treating a disease in clinical trials.
Recall (查全率, 召回率): measures how many of the positive samples are captured by the positive predictions, R = TP / (TP + FN). Recall is used as a performance metric when we need to identify all positive samples. Example: finding people who are sick.

Tradeoff between Precision and Recall: to get higher precision, increase the decision threshold; to get higher recall, reduce the threshold.
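The four confusion-matrix counts and the two metrics above can be computed directly (a sketch with made-up labels, where 1 is the positive class):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels 1 (positive) / 0 (negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # how many predicted positives are truly positive
recall = tp / (tp + fn)     # how many true positives are captured
```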

Metrics from the confusion matrix
F1: F-score or F-measure, the harmonic mean (调和平均数) of precision and recall.

Algorithm   P      R     Average   F1
A1          0.5    0.4   0.45      0.444
A2          0.7    0.1   0.4       0.175
A3          0.02   1     0.51      0.0392
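The F1 column of the table can be reproduced from precision and recall (written here directly in the general Fβ form; β = 1 gives F1):

```python
def f_beta(p, r, beta=1.0):
    """General F-measure; beta = 1 is the harmonic mean of p and r (F1)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

f1_a1 = f_beta(0.5, 0.4)   # ~0.444
f1_a2 = f_beta(0.7, 0.1)   # 0.175
f1_a3 = f_beta(0.02, 1.0)  # ~0.0392
```

Note how A2 and A3 have decent arithmetic averages but tiny F1 scores: the harmonic mean punishes a large gap between precision and recall.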

Metrics from the confusion matrix
General F-measure Fβ: Fβ = (1 + β²)·P·R / (β²·P + R).
When β = 1, Fβ becomes F1.
When β > 1, it places more emphasis on false negatives, weighing recall higher than precision.
When β < 1, it attenuates the influence of false negatives, weighing recall lower than precision.

Metrics for Binary Classification
Receiver operating characteristic (ROC, 受试者工作特征): considers all possible thresholds for a given classifier and plots the false positive rate (FPR) against the true positive rate (TPR).
Area Under the ROC Curve (AUC).

Model Selection
With evaluation methods and performance metrics in hand, it seems we can now evaluate and compare learners: use some evaluation method to measure a performance metric for each learner, then compare the results. However: first, what we want to compare is generalization performance, while an evaluation method gives us performance on a test set, and the two may not agree. Second, performance on a test set depends heavily on the choice of the test set itself: different test set sizes give different results, and even test sets of the same size give different results if they contain different test examples. Third, many machine learning algorithms have inherent randomness, so even running with the same parameter settings on the same test set multiple times can give different results.

Model Selection
Statistical hypothesis testing (hypothesis test) provides an important basis for comparing learner performance. Based on hypothesis-test results we can: test hypotheses about the generalization performance of a single learner; compare the performance of multiple learners. If learner A is observed to outperform learner B on a test set, is A's generalization performance better than B's in a statistical sense, and how much confidence can we place in that conclusion?

A hypothesis testing problem

Consider a model evaluated with the holdout method. Suppose the evaluation was performed 5 times, and the accuracies are [0.99, 0.98, 0.99, 0.94, 0.95]. Can we say that the mean accuracy is different from 0.97?
Consider the grades of two models: A had {15, 10, 12, 19, 5, 7} and B had {14, 11, 11, 12, 6, 7}. Can we say A had better grades than B?
A statistical test aims to answer such questions.

Confidence interval (置信区间)
Point estimation vs. interval estimation:
Point estimation (点估计): uses a sample statistic to estimate a population parameter; since the sample statistic is a single point on the number line, the estimate is expressed as a single value. A point estimate gives a value for the unknown parameter but says nothing about its reliability, i.e. how far the estimate may deviate from the true value.
Interval estimation (区间估计): given a confidence level, determines the range in which the true value is likely to lie based on the estimate; the range is usually centered on the estimate and is called the confidence interval.

Confidence interval (置信区间)
Standard deviation (标准差) vs. standard error (标准误差); the 95% confidence interval.
Suppose X follows a normal distribution: X ∼ N(μ, σ²). Sample repeatedly with sample size n; the sample mean is M = (X₁ + X₂ + ⋯ + Xₙ)/n.

By the law of large numbers and the central limit theorem, M follows M ∼ N(μ, σ²/n).
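Using the accuracy example from above ([0.99, 0.98, 0.99, 0.94, 0.95]), a 95% interval under the normal approximation M ∼ N(μ, σ²/n) can be sketched as follows (with only 5 samples a t critical value would really be more appropriate than 1.96; the simpler normal factor keeps the sketch short):

```python
from math import sqrt
from statistics import mean, stdev

def ci95(samples):
    """95% confidence interval: mean ± 1.96 × standard error (normal approx.)."""
    n = len(samples)
    m = mean(samples)
    se = stdev(samples) / sqrt(n)  # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se

accs = [0.99, 0.98, 0.99, 0.94, 0.95]
lo, hi = ci95(accs)  # interval centered on the sample mean 0.97
```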

2023/11/4ModelEvaluationLesson4-39HypothesisTestingandStatisticalSignificanceTheprocessofhypothesistestingNullhypothesis:Thenullhypothesisisamodelofthesystembasedontheassumptionthattheapparenteffectwasactuallyduetochance.p-value:Thep-valueistheprobabilityoftheapparenteffectunderthenullhypothesis.Interpretation:Basedonthep-value,w
