Big Data Data Mining Training Lecture Notes: Deviation Detection

Outline
  • Summarization
  • KEFIR - Key Findings Reporter
  • WSARE - What is Strange About Recent Events

What is New?
[Figure: old data vs. new data]

Summarization
Concisely summarize what is new and different, unexpected
  • with respect to previous values
  • with respect to expected values
  • …
Focus on what is actionable!

Problem: Healthcare Costs
  • Healthcare costs in US: 1 out of 7 GDP $ and rising
  • Potential problems: fraud, misuse, …
  • Understanding where the problems are is first step to fixing them
  • GTE: self insured for medical costs
  • GTE healthcare costs: $X00,000,000
Task: Analyze employee healthcare data and generate a report that describes the major problems

GTE Key Findings Reporter: KEFIR
KEFIR Approach:
  • Analyze all possible deviations
  • Select interesting findings
  • Augment key findings with:
      • Explanations of plausible causes
      • Recommendations of appropriate actions
  • Convert findings to a user-friendly report with text and graphics

KEFIR Search Space
[slide content not preserved]

Drill-Down Example
[slide content not preserved]

What Change Is Important?
[slide content not preserved]

Deviation Detection
  • Drill down through the search space
  • Generate a finding for each measure
      • deviation from previous period
      • deviation from norm
      • deviation projected for next period, if no action

Interestingness of Deviations
  • Impact: how much the deviation affects the bottom line
  • Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action

Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
  If    measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
  Then  Utilization review is needed in the area of admission certification.
        Expected Savings: 20%

Explanation
A measure is explained by finding the path of related measures with the highest impact.
  "The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13."

Report Generation
  • Automatic generation of business-user-oriented reports
  • Natural language generation with template matching
  • Graphics delivered via browser

Sample KEFIR pages
  • Overview
  • Inpatient admissions

Status
  • Prototype implemented in GTE in 1995
  • KEFIR received GTE’s highest award for technical achievement in 1995
  • Key business user left GTE in 1996 and system was no longer used
Publication: Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

What’s Strange About Recent Events (WSARE)
  Weng-Keen Wong (Carnegie Mellon University)
  Andrew Moore (Carnegie Mellon University)
  Gregory Cooper (University of Pittsburgh)
  Michael Wagner (University of Pittsburgh)
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream

Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)

  Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
  100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
  101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
  102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
  103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
  104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
  105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
  106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
  107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
  108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
  :            :       :      :         :     :            :       :    :              :              :

Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
  • Diarrhea cases among children
  • Respiratory syndrome cases among females
  • Viral syndrome cases involving senior citizens from eastern part of city
  • Number of children from downtown hospital
  • Number of cases involving people working in southern part of the city
  • Number of cases involving teenage girls living in the western part of the city
  • Botulinic syndrome cases
  • And so on…
You’ll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.

WSARE Approach
Rule-Based Anomaly Pattern Detection
Association rules used to characterize anomalous patterns. For example, a two-component rule would be:
  Gender = Male AND 40 <= Age < 50

WSARE v2.0 Overview
  1. Obtain Recent and Baseline datasets (from All Data)
  2. Search for rule with best score
  3. Determine p-value of best scoring rule through randomization test
  4. If p-value is less than threshold, signal alert

Step 1: Obtain Recent and Baseline Data
  • Recent Data: data from last 24 hours
  • Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day

Example
Sat 12-23-2001
  35.8% (48/134) of today's cases have 30 <= age < 40
  17.0% (45/265) of other (baseline) cases have 30 <= age < 40

Step 2: Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.

                    Count Recent  Count Baseline
  Age Decile = 3    48            45
  Age Decile ≠ 3    86            220

  • Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data 0.00005)
  • Find rule R_BEST with the lowest score.
  • Caution: This score is not the true p-value of R_BEST because of multiple tests

Step 3: Randomization Test
  • Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
  • Find the rule with the best score on DBRand.

  Record  Date (original)  Date (shuffled)
  C2      June 4, 2002     June 4, 2002
  C3      June 5, 2002     June 12, 2002
  C4      June 12, 2002    July 31, 2002
  C5      June 19, 2002    June 26, 2002
  C6      June 26, 2002    July 31, 2002
  C7      June 26, 2002    June 5, 2002
  C8      July 2, 2002     July 2, 2002
  C9      July 3, 2002     July 3, 2002
  C10     July 10, 2002    July 10, 2002
  C11     July 17, 2002    July 17, 2002
  C12     July 24, 2002    July 24, 2002
  C13     July 30, 2002    July 30, 2002
  C14     July 31, 2002    June 19, 2002
  C15     July 31, 2002    June 26, 2002

Step 3: Randomization Test (continued)
  • Repeat the procedure on the previous slide for 1000 iterations.
  • Determine how many scores from the 1000 iterations are better than the original score.
  • If the original score were here, it would place in the top 1% of the 1000 scores from the randomization test. We would be impressed and an alert should be raised.
  • Estimated p-value of the rule is: # better scores / # iterations

Results on Actual ED Data from 2001
  1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
     14.80% (74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
     7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
  2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
     12.42% (58/467) of today's cases have Respiratory Syndrome = True
     6.53% (653/10000) of baseline have Respiratory Syndrome = True
  3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
     1.44% (9/625) of today's cases have 100 <= Age < 110
     0.08% (8/10000) of baseline have 100 <= Age < 110
  4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
     83.80% (481/574) of today's cases have Unknown Syndrome = False
     74.29% (7430/10001) of baseline have Unknown Syndrome = False
  5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
     14.71% (70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
     7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

WSARE 3.0: Improving the Baseline
Recall that the baseline was assumed to be captured by data that was from 35, 42, 49, and 56 days prior to the current day.
  • What if this assumption isn’t true? What if data from 7, 14, 21 and 28 days prior is better?
  • We would like to determine the baseline automatically!

Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)

WSARE v3.0
Generate the baseline…
  • “Taking into account recent flu levels…”
  • “Taking into account that today is a public holiday…”
  • “Taking into account that this is Spring…”
  • “Taking into account recent heatwave…”
  • “Taking into account that there’s a known natural Food-borne outbreak in progress…”
Bonus: More efficient use of historical data

Idea: Bayesian Networks
Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables
  • “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”
  • “Patients from West Park Hospital are less likely to be young”
  • “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
  • “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”

Obtaining Baseline Data
  1. Learn Bayesian Network (from All Historical Data)
  2. Generate baseline given today’s environment (Today’s Environment → Baseline)
What should be happening t
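
The Step 2 scoring described in the WSARE v2.0 slides (a 2x2 contingency table scored with Fisher's Exact Test) can be sketched as below. This is a minimal illustration and not the WSARE authors' code: the function name `fisher_exact_one_sided` is hypothetical, and the sketch assumes a one-sided test (is the rule over-represented among recent cases?). The slides do not say which variant was used, so the value it returns need not equal the 0.00005 quoted for the Age Decile example.

```python
from math import comb


def fisher_exact_one_sided(recent_match, base_match, recent_other, base_other):
    """One-sided Fisher's exact test for a 2x2 table (assumed variant).

    Returns P(X >= recent_match) under the hypergeometric null hypothesis that
    matching the rule is independent of being a recent or a baseline case.
    """
    matching = recent_match + base_match        # row total: cases matching the rule
    recent = recent_match + recent_other        # column total: recent cases
    total = matching + recent_other + base_other
    largest = min(matching, recent)             # largest possible recent_match
    # Sum the hypergeometric probabilities of tables at least as extreme.
    numerator = sum(
        comb(matching, k) * comb(total - matching, recent - k)
        for k in range(recent_match, largest + 1)
    )
    return numerator / comb(total, recent)


# Table from the slides: Age Decile = 3 vs. Age Decile != 3.
score = fisher_exact_one_sided(48, 45, 86, 220)
print(f"score for the Age Decile rule: {score:.2e}")
```

Exact integer arithmetic via `math.comb` avoids a dependency on a statistics library; for larger tables a library routine would likely be the more practical choice.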

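The Step 3 randomization test from the WSARE v2.0 slides can be sketched in the same spirit. The code below rests on stated assumptions: cases are plain dictionaries, only single-component rules of the form attribute = value are searched, the score is a one-sided Fisher's exact test, and shuffling the recent/baseline labels stands in for shuffling the date field. All function names and the example data are hypothetical.

```python
import random
from math import comb


def fisher_p(a, b, c, d):
    """One-sided Fisher's exact test P(X >= a) for the table [[a, b], [c, d]]."""
    row, col, n = a + b, a + c, a + b + c + d
    top = sum(comb(row, k) * comb(n - row, col - k)
              for k in range(a, min(row, col) + 1))
    return top / comb(n, col)


def best_rule_score(cases, is_recent):
    """Lowest score over all single-component rules attribute == value."""
    n_recent = sum(is_recent)
    n_base = len(cases) - n_recent
    rules = {(attr, val) for case in cases for attr, val in case.items()}
    best = 1.0
    for attr, val in rules:
        a = sum(1 for c, r in zip(cases, is_recent) if r and c.get(attr) == val)
        b = sum(1 for c, r in zip(cases, is_recent) if not r and c.get(attr) == val)
        best = min(best, fisher_p(a, b, n_recent - a, n_base - b))
    return best


def randomization_p_value(cases, is_recent, iterations=1000, seed=0):
    """Estimate the p-value as (# better scores) / (# iterations)."""
    rng = random.Random(seed)
    original = best_rule_score(cases, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)                     # plays the role of DBRand
        if best_rule_score(cases, labels) < original:
            better += 1                         # lower score means better
    return better / iterations


# Made-up data: 20 recent cases, all respiratory, against a mixed baseline.
cases = [{"syndrome": "Respiratory"}] * 30 + [{"syndrome": "Viral"}] * 70
is_recent = [True] * 20 + [False] * 80
print(randomization_p_value(cases, is_recent))
```

Because the search for the best rule is repeated on every shuffled dataset, the estimate accounts for the multiple-testing caution raised in Step 2. Counting only strictly better scores follows the formula on the slide; counting ties as well would be a more conservative choice.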