




Big Data and Data Mining Training Notes: Deviation Detection

Outline
- Summarization
- KEFIR: Key Findings Reporter
- WSARE: What is Strange About Recent Events

What is New?
(Figure: old data compared with new data)

Summarization
- Concisely summarize what is new and different, or unexpected:
  - with respect to previous values
  - with respect to expected values
  - …
- Focus on what is actionable!

Problem: Healthcare Costs
- Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising
  - potential problems: fraud, misuse, …
  - understanding where the problems are is the first step to fixing them
- GTE is self-insured for medical costs
  - GTE healthcare costs: $X00,000,000
- Task: analyze employee healthcare data and generate a report that describes the major problems

GTE Key Findings Reporter: KEFIR
KEFIR approach:
- Analyze all possible deviations
- Select interesting findings
- Augment key findings with:
  - explanations of plausible causes
  - recommendations of appropriate actions
- Convert findings to a user-friendly report with text and graphics

KEFIR Search Space
(Figure)

Drill-Down Example
(Figure)

What Change Is Important?
(Figure)

Deviation Detection
- Drill down through the search space
- Generate a finding for each measure:
  - deviation from the previous period
  - deviation from the norm
  - deviation projected for the next period, if no action is taken

Interestingness of Deviations
- Impact: how much the deviation affects the bottom line
- Savings percentage: how much of the deviation from the norm can be expected to be saved by the action

Recommendations
- Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
- Example:
  If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
  Then utilization review is needed in the area of admission certification.
  Expected savings: 20%

Explanation
- A measure is explained by finding the path of related measures with the highest impact.
- Example: "The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13."

Report Generation
- Automatic generation of business-user-oriented reports
- Natural language generation with template matching
- Graphics delivered via browser

Sample KEFIR pages
- Overview
- Inpatient admissions

Status
- Prototype implemented at GTE in 1995
- KEFIR received GTE's highest award for technical achievement in 1995
- The key business user left GTE in 1996 and the system was no longer used
- Publication: C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, "Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data", in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

What's Strange About Recent Events (WSARE)
- Weng-Keen Wong (Carnegie Mellon University)
- Andrew Moore (Carnegie Mellon University)
- Gregory Cooper (University of Pittsburgh)
- Michael Wagner (University of Pittsburgh)
Designed to be easily applicable to any date/time-indexed, biosurveillance-relevant data stream.

Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

| Primary Key | Date | Time | Hospital | ICD9 | Prodrome | Gender | Age | Home Location | Work Location | Many more… |
| 100 | 6/1/03 | 9:12 | 1 | 781 | Fever | M | 20s | NE | ? | … |
| 101 | 6/1/03 | 10:45 | 1 | 787 | Diarrhea | F | 40s | NE | NE | … |
| 102 | 6/1/03 | 11:03 | 1 | 786 | Respiratory | F | 60s | NE | N | … |
| 103 | 6/1/03 | 11:07 | 2 | 787 | Diarrhea | M | 60s | E | ? | … |
| 104 | 6/1/03 | 12:15 | 1 | 717 | Respiratory | M | 60s | E | NE | … |
| 105 | 6/1/03 | 13:01 | 3 | 780 | Viral | F | 50s | ? | NW | … |
| 106 | 6/1/03 | 13:05 | 3 | 487 | Respiratory | F | 40s | SW | SW | … |
| 107 | 6/1/03 | 13:57 | 2 | 786 | Unmapped | M | 50s | SE | SW | … |
| 108 | 6/1/03 | 14:22 | 1 | 780 | Viral | M | 40s | ? | ? | … |
| : | : | : | : | : | : | : | : | : | : | : |

Traditional Approaches
We would need to build a univariate detector to monitor each interesting combination of attributes:
- Diarrhea cases among children
- Respiratory syndrome cases among females
- Viral syndrome cases involving senior citizens from the eastern part of the city
- Number of children from the downtown hospital
- Number of cases involving people working in the southern part of the city
- Number of cases involving teenage girls living in the western part of the city
- Botulinic syndrome cases
- And so on…
You'll need hundreds of univariate detectors! Instead, we would like to identify the groups with the strangest behavior in recent events.

WSARE Approach
- Rule-based anomaly pattern detection
- Association rules are used to characterize anomalous patterns. For example, a two-component rule would be:
  Gender = Male AND 40 <= Age < 50

WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets from all data
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert

Step 1: Obtain Recent and Baseline Data
- Recent data: data from the last 24 hours
- Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day.

Example
Sat 12-23-2001
- 35.8% (48/134) of today's cases have 30 <= age < 40
- 17.0% (45/265) of other (baseline) cases have 30 <= age < 40

Step 2: Search for Best Rule
- For each rule, form a 2x2 contingency table, e.g.:

| | Count Recent | Count Baseline |
| Age Decile = 3 | 48 | 45 |
| Age Decile ≠ 3 | 86 | 220 |

- Perform Fisher's Exact Test to get a p-value (score) for each rule (for this data, 0.00005)
- Find the rule R_BEST with the lowest score
- Caution: this score is not the true p-value of R_BEST because of multiple tests

Step 3: Randomization Test
- Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand.
- Find the rule with the best score on DBRand.

| Case | Original date | Shuffled date |
| C2 | June 4, 2002 | June 4, 2002 |
| C3 | June 5, 2002 | June 12, 2002 |
| C4 | June 12, 2002 | July 31, 2002 |
| C5 | June 19, 2002 | June 26, 2002 |
| C6 | June 26, 2002 | July 31, 2002 |
| C7 | June 26, 2002 | June 5, 2002 |
| C8 | July 2, 2002 | July 2, 2002 |
| C9 | July 3, 2002 | July 3, 2002 |
| C10 | July 10, 2002 | July 10, 2002 |
| C11 | July 17, 2002 | July 17, 2002 |
| C12 | July 24, 2002 | July 24, 2002 |
| C13 | July 30, 2002 | July 30, 2002 |
| C14 | July 31, 2002 | June 19, 2002 |
| C15 | July 31, 2002 | June 26, 2002 |

Step 3: Randomization Test (continued)
- Repeat the shuffling procedure for 1000 iterations.
- Determine how many scores from the 1000 iterations are better than the original score.
- If the original score places in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
- The estimated p-value of the rule is: (# better scores) / (# iterations)

Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004, PVALUE = 0.00000000
   - 14.80% (74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   - 7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464, PVALUE = 0.00000000
   - 12.42% (58/467) of today's cases have Respiratory Syndrome = True
   - 6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013, PVALUE = 0.00000000
   - 1.44% (9/625) of today's cases have 100 <= Age < 110
   - 0.08% (8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007, PVALUE = 0.00000000
   - 83.80% (481/574) of today's cases have Unknown Syndrome = False
   - 74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087, PVALUE = 0.00000000
   - 14.71% (70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   - 7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

WSARE 3.0: Improving the Baseline
- Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
- What if this assumption isn't true? What if data from 7, 14, 21 and 28 days prior is better?
- We would like to determine the baseline automatically!

Temporal Trends
(Figure) From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)

WSARE v3.0
Generate the baseline…
- "Taking into account recent flu levels…"
- "Taking into account that today is a public holiday…"
- "Taking into account that this is Spring…"
- "Taking into account recent heatwave…"
- "Taking into account that there's a known natural food-borne outbreak in progress…"
Bonus: more efficient use of historical data

Idea: Bayesian Networks
Bayesian network: a graphical model representing the joint probability distribution of a set of random variables. It can capture relationships such as:
- "On cold Tuesday mornings the folks coming in from the north part of the city are more likely to have respiratory problems"
- "Patients from West Park Hospital are less likely to be young"
- "On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon"
- "The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic"

Obtaining Baseline Data
Inputs: all historical data and today's environment. Output: the baseline.
1. Learn Bayesian network
2. Generate baseline given today's environment
What should be happening t
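
Steps 2 and 3 of WSARE v2.0 (scoring each rule with Fisher's Exact Test, then correcting the best score with a randomization test) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes one-component rules only, a two-sided test, and cases stored as plain dictionaries; the function names `fisher_exact_two_sided`, `best_rule`, and `randomization_p_value` are hypothetical. Shuffling the recent/baseline labels stands in for shuffling the date field.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def best_rule(cases, is_recent):
    """Return (score, rule) for the one-component rule with the lowest p-value."""
    rules = {(attr, value) for case in cases for attr, value in case.items()}
    best = (1.0, None)
    for attr, value in sorted(rules):
        a = b = c = d = 0
        for case, recent in zip(cases, is_recent):
            match = case.get(attr) == value
            if match and recent:
                a += 1      # matches rule, recent
            elif match:
                b += 1      # matches rule, baseline
            elif recent:
                c += 1      # does not match, recent
            else:
                d += 1      # does not match, baseline
        score = fisher_exact_two_sided(a, b, c, d)
        if score < best[0]:
            best = (score, (attr, value))
    return best


def randomization_p_value(cases, is_recent, iterations=1000, seed=0):
    """Estimate the p-value of the best rule as (# better scores) / (# iterations)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    original_score, rule = best_rule(cases, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # plays the role of shuffling the date field (DBRand)
        if best_rule(cases, labels)[0] < original_score:
            better += 1
    return rule, original_score, better / iterations


if __name__ == "__main__":
    # Contingency table from the Step 2 example: Age Decile = 3 vs. the rest.
    print(fisher_exact_two_sided(48, 45, 86, 220))
```

The slides report a score of about 0.00005 for the Age Decile = 3 table, which is the magnitude to compare the printed value against. A real system would also search multi-component rules and would typically rely on a tested statistics library rather than hand-rolled hypergeometric sums.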