2024 Harvard University: Logistic Regression in Rare Events Data


We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We pro[...]

Authors' note: We thank James Fowler, Ethan Katz, and Mike Tomz for research assistance; Jim Alt, John Freeman, Kristian Gleditsch, Guido Imbens, Chuck Manski, Peter McCullagh, Walter Mebane, Jonathan Nagler, Bruce Russett, Ken Scheve, Phil Schrodt, Martin Tanner, and Richard Tucker for helpful suggestions; Scott Bennett, Kristian Gleditsch, Paul Huth, and Richard Tucker for data; and the National Science Foundation (SBR-9729884 and SBR-9753126), the Centers for Disease Control and Prevention (Division of Diabetes Translation), the National Institute on Aging (P01 AG17625-01), the World Health Organization, and the Center for Basic Research in the Social Sciences for research support. Software we wrote to implement the methods in this paper, called “ReLogit: Rare Events Logistic Regression,” is available for Stata and for Gauss from http://GKing.Harvard.Edu. We have written a companion piece to this article that overlaps this one: it excludes the mathematical proofs and other technical material, and has less general notation, but it includes empirical examples and more pedagogically oriented material (see King and Zeng 2000b; copy available at http://GKing.Harvard.Edu). Copyright 2001 by the Society for Political [...]

Gary King and Langche Zeng

We address problems in the statistical analysis of rare events data: binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, coups, presidential vetoes, decisions of citizens to run for political office, or infections by uncommon diseases) than zeros (“nonevents”). (Of course, by trivial recoding, this definition [...] and related social sciences and perhaps most prevalent in international conflict (and other [...] to explain and predict, a problem we believe has a multiplicity of sources, including the two we address here: most popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events, and commonly used data collection strategies are grossly inefficient.

First, although the statistical properties of linear regression models are invariant to the (unconditional) mean of the dependent variable, the same is not true for binary dependent variable models. The mean of a binary variable is the relative frequency of events in the data, which, in addition to the number of observations, constitutes the information content of the data set. We show that this often overlooked property of binary variable models has [...]. That logit coefficients are biased in small samples (under about 200) is well documented in the statistical literature, but not as widely understood is that in rare events data the biases in probabilities can be substantively meaningful with sample sizes in the thousands and are in a predictable direction: estimated event probabilities are too small. A separate, and also overlooked, problem is that the almost universally used method of computing probabilities of events in logit analysis is suboptimal in finite samples of rare events data, leading to errors in the same direction as biases in the coefficients. Applied researchers virtually never correct for the underestimation of event probabilities. These problems will be innocuous in some applications, but we offer simple Monte Carlo examples where the biases are as large as some estimated effects reported in the literature. We demonstrate how to correct for these problems and provide software to make the computation straightforward.

A second source of the difficulties in analyzing rare events lies in data collection. Given [...] better or additional variables. In rare events data, fear of collecting data sets with no events (and thus without variation on Y) has led researchers to choose very large numbers of observations with few, and in most cases poorly measured, explanatory variables. This is a reasonable choice, given the perceived constraints, but it turns out that far more efficient [...] ones and a small random sample of zeros and not lose consistency or even much efficiency relative to the full sample. This result drastically changes the optimal trade-off between more observations and better variables, enabling scholars to focus data collection efforts where they matter most.

As an example, we use all dyads (pairs of countries) for each year since World War II to generate a data set below with 303,814 observations, of which only 0.34%, or 1042 dyads, were at war. Data sets of this size are not uncommon in international relations, but they make data management difficult, statistical analyses time-consuming, and data collection expensive.[1] (Even the more common 5000–10,000 observation data sets are inconvenient to deal with if one has to collect variables for all the cases.) Moreover, most dyads involve countries with little relationship at all (say Burkina Faso and St. Lucia), much less with some realistic probability of going to war, and so there is a well-founded perception that many of the data are “nearly irrelevant” (Maoz and Russett 1993, p. 627). Indeed, many of the data have very little information content, which is why we can avoid collecting the vast [...] in political science designed to cope with this problem, such as selecting dyads that are “politically relevant” (Maoz and Russett 1993), are reasonable and practical approaches to a difficult problem, but they necessarily change the question asked, alter the population to which we are inferring, or require conditional analysis (such as only contiguous dyads or only those involving a major power). Less careful uses of these types of data selection [...] are biased. With appropriate easy-to-apply corrections, nearly 300,000 observations with zeros need not be collected or could even be deleted with only a minor impact on substantive conclusions.

[1] Bennett and Stam (1998b) analyze a data set with 684,000 dyad-years and (1998a) have even developed sophisticated software for managing the larger, 1.2 million-dyad data set they distribute.

With these procedures, scholars who wish to add new variables to an existing collection can save approximately 99% of the nonfixed costs in their data collection budget or can reallocate data collection efforts to generate a larger number of more informative and meaningful variables than would otherwise be possible.[2] Relative to some other fields in [...] of measurement over many years and have generated a large quantity of data. Selecting on the dependent variable in the way we suggest has the potential to build on these efforts, [...]

[2] The fixed costs involved in gearing up to collect data would be borne with either data collection strategy, and so selecting on the dependent variable as we suggest saves something less in research dollars than the fraction of observations not collected.

This procedure of selection on Y also addresses a long-standing controversy in the international conflict literature whereby qualitative scholars devote their efforts where the [...] In contrast, quantitative scholars are criticized for spending time analyzing very crude [...] (Bueno de Mesquita 1981; Geller and Singer 1998; Levy 1989; Rosenau 1976; Vasquez 1993). It turns out that both sides have some [...] much more with the ones than the zeros, but researchers must be careful to avoid bias. Fortunately, the corrections are easy, and so the goals of both camps can be met.

The main intended contribution of this paper is to integrate these two types of corrections, which have been studied mostly in isolation, and to clarify the largely unnoticed consequences of rare events data in this context. We also try to forge a critical link between [...] events bias, and standard error inconsistency, in a popular method of correcting selection on Y. This is useful when selecting on Y leads to smaller samples. We also provide an improved method of computing probability estimates, proofs of the equivalence of some leading econometric methods, and software to implement the methods developed. We offer evidence in the form of [...] experiments. Empirical examples appear in our companion paper (King and Zeng 2000b).[3]

[3] We have found no discussion in political science of the effects of finite samples and rare events on logistic regression or of most of the methods we discuss that allow selection on Y. There is a brief discussion of one method of correcting selection on Y in asymptotic samples in Bueno de Mesquita and Lalman (1992, Appendix) and in an unpublished paper they cite that has recently become available (Achen 1999).

Logistic Regression: Model and Notation

In logistic regression, a single outcome variable $Y_i$ ($i = 1, \ldots, n$) follows a Bernoulli probability function that takes on the value 1 with probability $\pi_i$ and 0 with probability $1 - \pi_i$. Then $\pi_i$ varies over the observations as an inverse logistic function of a vector $x_i$, which includes a constant and $k - 1$ explanatory variables:

\[ Y_i \sim \text{Bernoulli}(Y_i \mid \pi_i), \qquad \pi_i = \frac{1}{1 + e^{-x_i\beta}} \tag{1} \]

The Bernoulli has probability function $P(Y_i \mid \pi_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}$. The unknown parameter $\beta = (\beta_0, \beta_1')'$ is a $k \times 1$ vector, where $\beta_0$ is a scalar constant term and $\beta_1$ is a vector with elements corresponding to the explanatory variables.
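As a minimal sketch of Eq. (1), the following Python snippet computes $\pi_i$ from a covariate vector and draws Bernoulli outcomes. The coefficient values, the helper names (`inv_logit`, `event_probability`, `simulate_outcomes`), and the tiny design matrix are illustrative assumptions, not part of the original article.

```python
import math
import random


def inv_logit(z):
    """Inverse logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def event_probability(x_i, beta):
    """pi_i from Eq. (1); x_i carries a leading 1.0 for the constant term."""
    return inv_logit(sum(x * b for x, b in zip(x_i, beta)))


def simulate_outcomes(X, beta, seed=0):
    """Draw Y_i ~ Bernoulli(pi_i) for every row of X."""
    rng = random.Random(seed)
    return [1 if rng.random() < event_probability(x_i, beta) else 0 for x_i in X]


# Hypothetical values: beta_0 = -5 makes events rare, beta_1 = 1 is one slope.
beta = [-5.0, 1.0]
X = [[1.0, x] for x in (0.0, 1.0, 2.0)]
probs = [event_probability(x_i, beta) for x_i in X]
print(probs)
print(simulate_outcomes(X, beta))
```

With a strongly negative constant term, every $\pi_i$ in this toy example stays below 0.05, which is the rare events setting the article is concerned with.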

An alternative way to define the same model is by imagining an unobserved continuous variable $Y_i^*$ (e.g., the health of an individual or the propensity of a country to go to war) that follows a logistic distribution with mean $\mu_i$, which varies over the observations as a linear function of $x_i$. The model would be very close to a linear regression if $Y_i^*$ were observed:

\[ Y_i^* \sim \text{Logistic}(Y_i^* \mid \mu_i), \qquad \mu_i = x_i\beta \tag{2} \]

where $\text{Logistic}(Y_i^* \mid \mu_i)$ is the one-parameter logistic probability density,

\[ P(Y_i^*) = \frac{e^{-(Y_i^* - \mu_i)}}{\left(1 + e^{-(Y_i^* - \mu_i)}\right)^2} \]

Unfortunately, instead of observing $Y_i^*$, we see only its dichotomous realization, $Y_i$, where $Y_i = 1$ if $Y_i^* > 0$ and $Y_i = 0$ if $Y_i^* \le 0$. For example, if $Y_i^*$ measures health, $Y_i$ might be dead (1) or alive (0). If $Y_i^*$ were the propensity to go to war, $Y_i$ could be at war (1) or at peace (0). The model remains the same because

\[ \Pr(Y_i = 1 \mid \beta) = \pi_i = \Pr(Y_i^* > 0 \mid \beta) = \int_0^\infty \text{Logistic}(Y_i^* \mid \mu_i)\, dY_i^* = \frac{1}{1 + e^{-x_i\beta}} \]

which is exactly as in Eq. (1). We also know that the observation mechanism, which turns the continuous $Y_i^*$ into the dichotomous $Y_i$, generates most of the mischief. That is, we ran simulations trying to estimate $\beta$ from an observed $Y^*$ and model 2 and found that maximum-likelihood estimation of $\beta$ is approximately unbiased in small samples.

The parameters are estimated by maximum likelihood, with the likelihood function

\[ L(\beta \mid y) = \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} \]
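The equivalence between the latent-variable formulation and Eq. (1) can be checked numerically. The sketch below integrates the logistic density over $Y^* > 0$ with a simple midpoint rule and compares the result with the inverse logit of $\mu$. The function names, the integration bound, and the step count are assumptions chosen for illustration only.

```python
import math


def logistic_density(y_star, mu):
    """One-parameter logistic density with mean mu."""
    e = math.exp(-(y_star - mu))
    return e / (1.0 + e) ** 2


def prob_positive(mu, upper=60.0, steps=20000):
    """Midpoint-rule approximation of Pr(Y* > 0), integrating from 0 to `upper`."""
    h = upper / steps
    return h * sum(logistic_density((j + 0.5) * h, mu) for j in range(steps))


def inv_logit(z):
    """Closed form 1 / (1 + exp(-z)) from Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-z))


for mu in (-3.0, 0.0, 2.0):
    # The two columns should agree up to numerical integration error.
    print(mu, prob_positive(mu), inv_logit(mu))
```

The upper bound of 60 is far enough into the tail that the truncated mass is negligible for the values of $\mu$ used here.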

By taking logs and using Eq. (1), the log-likelihood simplifies to

\[ \ln L(\beta \mid y) = \sum_{\{Y_i = 1\}} \ln(\pi_i) + \sum_{\{Y_i = 0\}} \ln(1 - \pi_i) = -\sum_{i=1}^{n} \ln\left(1 + e^{(1 - 2Y_i)x_i\beta}\right) \tag{5} \]

(e.g., Greene 1993, p. 643). Maximum-likelihood logit analysis then works by finding the value of $\beta$ that gives the maximum value of this function, which we label $\hat\beta$. The asymptotic variance matrix, $V(\hat\beta)$, is also retained to compute standard errors. When observations are selected randomly, or randomly within strata defined by some or all of the explanatory variables, $\hat\beta$ is consistent and asymptotically efficient (except in degenerate cases of perfect collinearity among the columns in X or perfect discrimination between zeros and ones).

That in rare events data ones are more statistically informative than zeros can be seen by studying the variance matrix,

\[ V(\hat\beta) = \left[\sum_{i=1}^{n} \pi_i(1 - \pi_i)\, x_i' x_i\right]^{-1} \]
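The compact form of the log-likelihood in Eq. (5) and the information matrix whose inverse gives $V(\hat\beta)$ can be written out in a few lines. This is a sketch under stated assumptions: the data and coefficient values are made up, and the helper names (`log_likelihood`, `information_matrix`, `information_weight`) are hypothetical rather than taken from the authors' ReLogit software.

```python
import math


def inv_logit(z):
    """Inverse logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta, X, y):
    """Eq. (5): ln L = -sum_i ln(1 + exp((1 - 2*y_i) * x_i beta))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        total -= math.log1p(math.exp((1 - 2 * y_i) * xb))
    return total


def information_weight(p):
    """The factor pi_i * (1 - pi_i) that each observation contributes."""
    return p * (1.0 - p)


def information_matrix(beta, X):
    """sum_i pi_i (1 - pi_i) x_i' x_i; inverting it gives the variance matrix."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for x_i in X:
        p = inv_logit(sum(b * x for b, x in zip(beta, x_i)))
        w = information_weight(p)
        for r in range(k):
            for c in range(k):
                info[r][c] += w * x_i[r] * x_i[c]
    return info


# Hypothetical toy data: a constant plus one explanatory variable.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1]
beta = [-1.0, 0.5]
print(log_likelihood(beta, X, y))
print(information_matrix(beta, X))
# An observation with pi near 0.5 adds more information than one with pi near 0.
print(information_weight(0.5), information_weight(0.01))
```

The last line illustrates the point made in the text: the weight $\pi_i(1 - \pi_i)$ is largest at 0.5 and shrinks toward zero as the estimated probability becomes very small.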

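The part of this matrix affected by rare events is the factor $\pi_i(1 - \pi_i)$. Most rare events applications yield small estimates of $\Pr(Y_i = 1 \mid x_i) = \pi_i$ for all observations. However, if the logit model has some explanatory power, the estimate of $\pi_i$ among observations for which rare events are observed (i.e., for which $Y_i = 1$) will usually be larger [and closer to 0.5, because probabilities in rare event studies are normally very small (Beck 2000)] than among observations for which $Y_i = 0$. The result is that $\pi_i(1 - \pi_i)$ will usually be larger for ones than zeros, and so the variance (its inverse) will be smaller. In this situation, additional ones will cause the variance to drop more and hence are more informative than additional zeros (see Imbens 1992, pp. 1207, 1209; Cosslett 1981a; Lancaster and Imbens 1996b).

Finally, we note that the quantity of interest in logistic regression is rarely the raw $\hat\beta$ output by most computer programs. Instead, scholars are normally interested in more direct functions of the probabilities. For example, absolute risk is the probability that an event occurs given chosen values of the explanatory variables, $\Pr(Y = 1 \mid X = x)$. The relative risk is the same probability relative to the probability of an event given some baseline values of X, e.g., $\Pr(Y = 1 \mid X = 1)/\Pr(Y = 1 \mid X = 0)$, the fractional increase in the risk. This quantity is frequently reported in the popular media (e.g., the probability of some forms of cancer increases by 50% if one stops exercising) and is common in many scholarly literatures. In political science, the term is not often used, but the measure is usually computed directly or studied implicitly. Also of considerable interest is the first difference (or attributable risk), the change in probability as a function of a change in [...], e.g., $\Pr(Y = 1 \mid X = 1) - \Pr(Y = 1 \mid X = 0)$. The first difference is usually most informative when measuring effects, whereas relative risk is dimensionless and so tends to be easier to compare across applications or time periods. Although scholars often argue about their relative merits (see Breslow and Day 1980, Chap. 2; and Manski 1999), reporting the two probabilities that make up each relative risk and each first difference is best when convenient.

How to Select on the Dependent Variable

We first distinguish among alternative data collection strategies and show how to adapt the logit model for each. Then, in Section 5, we build on these models to also allow rare event and finite sample corrections. This section discusses research design issues, and Section 4 considers the specific statistical corrections necessary.

Data Collection Strategies

The strategy usually used in econometrics is either random sampling, where all observations (X, Y) are selected at random, or exogenous stratified sampling, which allows Y to be randomly selected within categories defined by X. Optimal statistical models are identical under these two sampling schemes. Indeed, in epidemiology, both are known under one name, cohort (or cross-sectional, to distinguish it from a panel) study.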
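The three quantities of interest defined earlier in this section (absolute risk, relative risk, and first difference) are simple functions of the fitted probabilities. The sketch below shows one way to compute them for a logit model; the coefficient values and function names are hypothetical, and the code ignores estimation uncertainty.

```python
import math


def absolute_risk(x, beta):
    """Pr(Y = 1 | X = x) under the logit model; x carries a leading 1.0."""
    xb = sum(v * b for v, b in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-xb))


def relative_risk(x1, x0, beta):
    """Pr(Y = 1 | X = x1) / Pr(Y = 1 | X = x0)."""
    return absolute_risk(x1, beta) / absolute_risk(x0, beta)


def first_difference(x1, x0, beta):
    """Pr(Y = 1 | X = x1) - Pr(Y = 1 | X = x0)."""
    return absolute_risk(x1, beta) - absolute_risk(x0, beta)


# Hypothetical coefficients: constant of -5 and one binary covariate.
beta = [-5.0, 1.0]
x0 = [1.0, 0.0]  # baseline, X = 0
x1 = [1.0, 1.0]  # comparison, X = 1

print(absolute_risk(x0, beta), absolute_risk(x1, beta))
print(relative_risk(x1, x0, beta))
print(first_difference(x1, x0, beta))
```

In this toy setting the relative risk is large while the first difference is small, which is why reporting the two underlying probabilities alongside either summary, as the text recommends, gives a fuller picture.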

When one of the values of Y is rare in the population, considerable resources in data collection can be saved by randomly selecting within categories of Y. This is known in econometrics as choice-based or endogenous stratified sampling and in epidemiology as a case-control design (Breslow 1996); it is also useful for choosing qualitative case studies (King et al. 1994, Sect. 4.4.2). The strategy is to select on Y by collecting observations (randomly or all those available) for which Y = 1 (the “cases”) and a random selection of observations for which Y = 0 (the “controls”). This sampling method is often supplemented with known or estimated prior knowledge of the population fraction of ones [...]. Finally, case-cohort studies begin with some variables collected on a large cohort, and then subsample using all the ones and a random selection of zeros. [...] variable to an existing collection, such as the dyadic data discussed above and analyzed [...], or the detailed study of activists in Verba (1995), each of whom was drawn from a larger random sample, with very few variables, of the entire U.S. population. In this paper, we use information on the population fraction of ones when it is available, and so the same models we describe apply to both case-control and case-cohort studies.

Many other hybrid data collection strategies have also been tried. For example, Bueno de Mesquita and Lalman’s (1992) design is fairly close to a case-control study with “contaminated controls,” meaning that the “control” sample was from the whole population rather than only those observations for which Y = 0 (see Lancaster and Imbens 1996a). Although we do not analyze hybrid designs in this paper, our view is not that pure case-control sampling is appropriate for all political science studies of rare events. (For example, additional efficiencies might be gained by modifying a data collection strategy to fit variables that are easier to collect within regional or language clusters.) Rather, our argument is that scholars should consider a much wider range of potential sampling strategies, and associated statistical methods, than is now common. This paper focuses only on the leading alternative design which we believe has the potential to see widespread use in political science.

Problems to Avoid

Selecting on the dependent variable in the way we suggest has several pitfalls that should be carefully avoided. First, the sampling design for which the prior correction and weighting methods are appropriate requires independent random (or complete) selection of observations for which Y = 1 and Y = 0. This encompasses the case-control and case-cohort [...]. Other designs ([...] stage sampling, nonrandom selection, or hybrid approaches) require different statistical methods.

Second, when selecting on Y, we must be careful not to select on X differently for the two samples. The classic example is selecting all people in the local hospital with liver cancer (Y = 1) and a random selection of the U.S. population without liver cancer (Y = 0). The problem is that the sample of cancer patients selects on Y = 1 and implicitly on the inclination to seek health care, find the right medical specialist, have the right tests, etc. Not recognizing the implicit selection on X is the problem here. Since the Y = 0 sample does not similarly select on the same explanatory variables, these data would induce selection bias. One solution in this example might be to select the Y = 0 sample from those who received the same liver cancer test but turned out not to have the disease. This design would yield valid inferences, albeit only for the health-conscious population with liver cancer-like symptoms. Another solution would be to measure and control for the omitted variables.

This type of inadvertent selection on X can be a serious problem in endogenous designs, just as selection on Y can bias inferences in exogenous designs. Moreover, although in the social sciences random (or experimenter control over) assignment of the values of the explanatory variables for each unit is occasionally possible in exogenous or random sampling (and with a large n is generally desirable since it rules out omitted variable bias), random assignment on X is impossible in endogenous sampling. Fortunately, bias due to selection on X is much easier to avoid in applications such as international conflict and related fields, since a clearly designated census of cases is normally available from which to draw a sample. Instead of relying on the decisions of subjects about whether to come to a hospital and take a test, the selection into the data set in our field can often be entirely determined by the investigator. See Holland and Rubin (1988).

Third, another problem with intentional selection on Y is that valid exploratory data analysis can be more hazardous. In particular, one cannot use an explanatory variable as a dependent variable in an auxiliary analysis without special precautions (see Nagelkerke et al. 1995).

Finally, the optimal trade-off between collecting more observations versus better or [...] judgment calls and qualitative assessments. Fortunately, to help guide these decisions in fields like international relations we have large bodies of work on methods of quantitative measurement and, also, many qualitative studies that measure hard-to-collect variables for a small number of cases (such as leaders’ perceptions).

[...] on the optimal trade-off between more observations and better variables. First, when zeros and ones are equally easy to collect, and an unlimited number of each are available, an “equal shares sampling design” (i.e., $\bar{y} = 0.5$) is optimal in a limited number of situations and close to optimal in a large number (Cosslett 1981b; Imbens 1992). This is a useful fact, but in fields like international relations, the number of observable ones (such as wars) is strictly limited, and so in most of our applications collecting all available or a large sample of ones is best. The only real decision, then, is how many zeros to collect in addition. If collecting zeros were costless, we should collect as many as we can get, since more data are always better. If collecting zeros is not costless, but not (much) more expensive than collecting ones, then one should collect more zeros than ones. However, since the marginal [...] to drop as the number of zeros passes the number of ones, we will not often want to collect more than (roughly) two to five times more zeros than ones. In general, the optimal number of zeros depends on how much more valuable the explanatory variables become with the resources saved by collecting fewer observations. Finally, a useful practice is sequential, involving first the collection of all ones and (say) an equal number of zeros. Then, if the standard errors and confidence intervals are narrow enough, stop. Otherwise, continue to [...] explanatory variables sequentially as well, but this is not often the case.

Correcting Estimates for Selection on Y

Designs that select on Y can be consistent and efficient but only with the appropriate statistical corrections. Sections 4.1 and 4.2 introduce the prior correction and weighting [...]. [...] et al. (1985) have shown that two of these econometric methods are equivalent to prior correction for the logit model. In Appendix A, we explicate this result and then prove that the best econometric estimator in this tradition also reduces to the method of prior correction when the model is logit and the sampling probability, $E(\bar{y})$, is unknown. To our knowledge, this result has not appeared previously in the literature.

Prior Correction

Prior correction involves computing the usual logistic regression MLE and correcting the estimates based on prior information about the fraction of ones in the population, $\tau$, and the observed fraction of ones in the sample (or sampling probability), $\bar{y}$. Knowledge of $\tau$ can come from census data, a random sample from the population measuring Y only, a case-cohort sample, or other sources.

In Appendix B, we try to elucidate this method by presenting a derivation of the method of prior correction for logit and most other statistical [...]. For the logit model, in any of the above sampling designs, the MLE $\hat\beta_1$ is a statistically consistent estimate of $\beta_1$ and the following corrected estimate is consistent for $\beta_0$:

\[ \hat\beta_0 - \ln\left[\left(\frac{1 - \tau}{\tau}\right)\left(\frac{\bar{y}}{1 - \bar{y}}\right)\right] \tag{7} \]

which equals $\hat\beta_0$ only in randomly selected cross-sectional data. Of course, scholars are not normally interested in $\beta$ but rather in the probability that an event occurs, $\Pr(Y_i = 1 \mid \beta) = \pi_i = (1 + e^{-x_i\beta})^{-1}$, which requires good estimates of both $\beta_1$ and $\beta_0$. Epidemiologists and biostatisticians usually attribute prior correction to Prentice and Pyke (1979); econometricians attribute the result to Manski and Lerman (1977), who in turn credit an unpublished comment by Daniel McFadden. The result was well known previously in the special case of all discrete covariates (e.g., Bishop et al. 1975, p. 63) and has been shown to apply to other multiplicative intercept models (Hsieh et al. 1985, p. 659).

Prior correction requires knowledge of the fraction of ones in the population, $\tau$. Fortunately, $\tau$ is straightforward to determine in international conflict data since the number of conflicts is the subject of the study and the denominator, the population of countries or dyads, is easy to count even if not entirely in the analysis.[4]

[4] King and Zeng (2000a), building on results of Manski (1999), modify the methods in this paper for the situation when $\tau$ is unknown or partially known. King and Zeng use “robust Bayesian analysis” to specify classes of prior distributions on $\tau$, representing full or partial ignorance. For example, the user can specify that $\tau$ is completely unknown or known to lie, with some probability, only in a given interval. The result is classes of posterior distributions (instead of a single posterior) that, in many cases, provide informative estimates of quantities of interest.

A key advantage of prior correction is ease of use. Any statistical software that can estimate logit coefficients can be used, and Eq. (7) is easy to apply to the intercept. If the functional form and explanatory variables are correct, estimates are consistent and asymptotically efficient. The chief disadvantage of prior correction is that if the model is misspecified, estimates of both $\beta_0$ and $\beta_1$ are slightly less robust than weighting (Xie and Manski 1989), a method to which we now turn.

An alternative procedure is to weight the data to compensate for differences in the sample ($\bar{y}$) and population ($\tau$) fractions of ones induced by choice-based sampling. The resulting weighted exogenous sampling maximum-likelihood estimator (due to Manski and Lerman 1977) is relatively simple. Instead of maximizing the log-likelihood in Eq. (5), we maximize the weighted log-likelihood:

\[ \ln L_w(\beta \mid y) = w_1 \sum_{\{Y_i = 1\}} \ln(\pi_i) + \cdots \]
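The prior correction in Eq. (7) and the weighting idea can both be sketched in a few lines of Python. Two things here are assumptions rather than content of the text above: the weights $w_1 = \tau/\bar{y}$ for ones and $w_0 = (1 - \tau)/(1 - \bar{y})$ for zeros (the source is cut off before defining them), and all function names. The population fraction uses the dyad counts quoted in the introduction; the raw intercept is a made-up number.

```python
import math


def prior_corrected_intercept(beta0_hat, tau, ybar):
    """Eq. (7): beta0_hat - ln[((1 - tau) / tau) * (ybar / (1 - ybar))]."""
    return beta0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def choice_based_weights(tau, ybar):
    """Assumed weights: w1 = tau / ybar for ones, w0 = (1 - tau) / (1 - ybar) for zeros."""
    return tau / ybar, (1.0 - tau) / (1.0 - ybar)


def weighted_log_likelihood(beta, X, y, tau, ybar):
    """Weighted analogue of Eq. (5), with each term scaled by w1 or w0."""
    w1, w0 = choice_based_weights(tau, ybar)
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        w = w1 if y_i == 1 else w0
        total -= w * math.log1p(math.exp((1 - 2 * y_i) * xb))
    return total


# Population fraction of ones from the dyad data described in the introduction.
tau = 1042 / 303814
# Hypothetical design: all ones plus an equal number of randomly sampled zeros.
ybar = 0.5
beta0_hat = -0.8  # made-up intercept from a logit fit on the subsample

print(prior_corrected_intercept(beta0_hat, tau, ybar))
print(choice_based_weights(tau, ybar))
```

With $\bar{y} = 0.5$ the correction reduces to subtracting $\ln[(1 - \tau)/\tau]$, roughly 5.67 for these counts, so the corrected intercept is far more negative than the raw one. When $\bar{y} = \tau$, as in a random sample, the correction is zero and both weights equal one.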
