版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Kernel Stein Tests for Multiple Model Comparison Jen Ning Lim Max Planck Institute for Intelligent Systems jlimtuebingen.mpg.de Makoto Yamada Kyoto University, RIKEN AIP makoto.yamadariken.jp Bernhard Schlkopf Max Planck Institute for Intelligent Systems bstuebingen.mpg.de Wittawat Jitkrittum Max Pl
2、anck Institute for Intelligent Systems wittawattuebingen.mpg.de Abstract We address the problem of non-parametric multiple model comparison: givenl candidate models, decide whether each candidate is as good as the best one(s) or worse than it. We propose two statistical tests, each controlling a dif
3、ferent notion of decision errors. The fi rst test, building on the post selection inference framework, provably controls the number of best models that are wrongly declared worse (false positive rate). The second test is based on multiple correction, and controls the proportion of the models declare
4、d worse but are in fact as good as the best (false discovery rate). We prove that under appropriate conditions the fi rst test can yield a higher true positive rate than the second. Experimental results on toy and real (CelebA, Chicago Crime data) problems show that the two tests have high true posi
5、tive rates with well-controlled error rates. By contrast, the naive approach of choosing the model with the lowest score without correction leads to more false positives. 1Introduction Given a sample (a set of i.i.d. observations), and a set oflcandidate modelsM, we address the problem of non-parame
6、tric comparison of the relative fi t of these candidate models. The comparison is non-parametric in the sense that the class of allowed candidate models is broad (mild assumptions on the models). All the given candidate models may be wrong; that is, the true data generating distribution may not be p
7、resent in the candidate list. A widely used approach is to pre-select a divergence measure which computes a distance between a model and the sample (e.g., Frchet Inception Distance (FID, 16), Kernel Inception Distance 3 or others), and choose the model which gives the lowest estimate of the divergen
8、ce. An issue with this approach is that multiple equally good models may give roughly the same estimate of the divergence, giving a wrong conclusion of the best model due to noise from the sample (see Table 1 in 17 for an example of a misleading conclusion resulted from direct comparison of two FID
9、estimates). It was this issue that motivates the development of a non-parametric hypothesis test of relative fi t (RelMMD) between two candidate models 4. The test uses as its test statistic the difference of two estimates of Maximum Mean Discrepancy (MMD, 14), each measuring the distance between th
10、e generated sample from each model and the observed sample. It is known that if the kernel function used is characteristic 27,11 , the population MMD defi nes a metric on a large class of distributions. As a result, the magnitude of the relative test statistic provides a measure of relative fi t, al
11、lowing one to decide a (signifi cantly) better model when the statistic is suffi ciently large. The key to avoiding the previously mentioned issue of false detection is to appropriately choose the threshold based on the null distribution, i.e., the distribution of the statistic when the two models a
12、re equally good. An extension of RelMMD to a linear-time relative test was considered by Jitkrittum et al. 17. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. A limitation of the relative tests of RelMMD and others 4,17 is that they are limited to the comp
13、arison of onlyl = 2candidate models. Indeed, taking the difference is inherently a function of two quantities, and it is unclear how the previous relative tests can be applied when there arel 2 candidate models. We note that relative fi t testing is different from goodness-of-fi t testing, which aim
14、s to decide whether a given model is the true distribution of a set of observations. The latter task may be achieved with the Kernel Stein Discrepancy (KSD) test 6,23,13 where, in the continuous case, the model is specifi ed as a probability density function and needs only be known up to the normali
15、zer. A discrete analogue of the KSD test is studied in 32. When the model is represented by its sample, goodness-of-fi t testing reduces to two-sample testing, and may be carried out with the MMD test 14, its incomplete U-statistic variants 33,31, the ME and SCF tests 7,18, and related kernel-based
16、tests 8,10, among others. To reiterate, we stress that in general multiple model comparison differs from multiple goodness-of-fi t tests. While the latter may be addressed withl individual goodness-of-fi t tests (one for each candidate), the former requires comparinglcorrelated estimates of the dist
17、ances between each model and the observed sample. The use of the observed sample in the l estimates is what creates the correlation which must be accounted for. In the present work, we generalize the relative comparison tests of RelMMD and others 4,17 to the case ofl 2models. The key idea is to sele
18、ct the “best” model (reference model) that is the closest match to the observed sample, and considerl hypotheses. Each hypothesis tests the relative fi t of each candidate model with the reference model, where the reference is chosen to be the model giving the lowest estimate of the pre-chosen diver
19、gence measure (MMD or KSD). The total output thus consists ofl binary values where 1 (assign positive) indicates that the corresponding model is signifi cantly worse (higher divergence to the sample) than the reference, and 0 indicates no evidence for such claim (indecisive). We assume that the outp
20、ut is always 0 when the reference model is compared to itself. The need for a reference model greatly complicates the formulation of the null hypothesis (i.e., the null hypothesis is random due to the noisy selection of the reference), an issue that is not present in the multiple goodness-of-fi t te
21、sting. We propose two non-parametric multiple model comparison tests (Section 3.3) following the previ- ously described scheme. Each test controls a different notion of decision errors. The fi rst test RelPSI builds on the post selection inference framework and provably (Lemma 4.2) controls the numb
22、er of best models that are wrongly declared worse (FPR, false positive rate). The second test RelMulti is based on multiple correction, and controls the proportion of the models declared worse but are in fact as good as the best (FDR, false discovery rate). In both tests, the underlying divergence m
23、easure can be chosen to be either the Maximum Mean Discrepancy (MMD) allowing each model to be represented by its sample, or the Kernel Stein Discrepancy (KSD) allowing the comparison of any models taking the form of unnormalized, differentiable density functions. As theoretical contribution, the as
24、ymptotic null distribution of RelMulti-KSD (RelMulti when the divergence measure is KSD) is provided (Theorem C.1), giving rise to a relative KSD test in the case ofl = 2 models, as a special case. To our knowledge, this is the fi rst time that a KSD-based relative test for two models is studied. Fu
25、rther, we show (in Theorem 4.1) that the RelPSI test can yield a higher true positive rate (TPR) than the RelMulti test, under appropriate conditions. Experiments (Section 5) on toy and real (CelebA, Chicago Crime data) problems show that the two proposed tests have high true positive rates with wel
26、l-controlled respective error rates FPR for RelPSI and FDR for RelMulti. By contrast, the naive approach of choosing the model with the lowest divergence without correction leads to more false positives. 2Background Hypothesis testing of relative fi t between l = 2 candidate models, P1and P2, to the
27、 data generating distributionR(unknown) can be performed by comparing the relative magnitudes of a pre-chosen discrepancy measure which computes the distance from each of the two models to the observed sample drawn fromR. Our proposed methods RelPSI and RelMulti (described in Section 3.3) generalize
28、 this formulation based upon selective testing 20, and multiple correction 1, respectively. Underlying these new tests is a base discrepancy measureDfor measuring the distance between each candidate model to the observed sample. In this section, we review Maximum Mean Discrepancy (MMD, 14) and Kerne
29、l Stein Discrepancy (KSD, 6,23), which will be used as a base discrepancy measure in our proposed tests in Section 3.3. 2 Reproducing kernel Hilbert space Given a positive defi nite kernelk : X X R, it is known that there exists a feature map: X Hand a reproducing kernel Hilbert Space (RKHS)Hk= H as
30、sociated with the kernelk2. The kernelkis symmetric and is a reproducing kernel onHin the sense that k(x,y) = h(x),(y)iHfor all x,y X where h,iH= h,i denotes the inner product. It follows from this reproducing property that for anyf H,hf,(x)i = f(x)for allx X. We interchangeably write k(x,) and (x).
31、 Maximum Mean DiscrepancyGiven a distributionP and a positive defi nite kernelk, the mean em- bedding ofP, denoted byP , is defi ned asP= ExPk(x,)26 (exists ifExPpk(x,x) sp(x0)k(x,x0) + sp(x)x0k(x,x0) + xk(x,x0)sp(x0) + trx,x0k(x,x0). The KSD has an unbiased estimatorKSD V2 u(P,R) = 1 n(n1) P i6=jup
32、(xi,xj)wherexi n i=1 i.i.d. R, which is also a second-order U-statistic. Like the MMD, a linear-time estimator ofKSD2is given by KSD 2 l = 2 bnc Pbnc/2 i=1 up(x2i,x2i1). It is known that n KSD 2 l is asymptotically normally dis- tributed 23. In contrast to the MMD estimator, the KSD estimator requir
33、es only samples from R, andPis represented by its score functionxlogp(x)which is independent of the normalizing constant. As shown in the previous work, an explicit probability density function is far more repre- sentative of the distribution than its sample counterpart 19,17. KSD is suitable when t
34、he candidate models are given explicitly (i.e., known density functions), whereas MMD is more suitable when the candidate models are implicit and better represented by their samples. 3Proposal: non-parametric multiple model comparison In this section, we propose two new tests: RelMulti (Section 3.2)
35、 and RelPSI (Section 3.3), each controlling a different notion of decision errors. Problem(Multiple Model Comparison).Suppose we havelmodels denoted asM = Pil i=1, which we can either: draw a sample (a collection ofni.i.d. realizations) from or have access to their unnormalized log densitylogp(x). T
36、he goal is to decide whether each candidatePiis worse than the best one(s) in the candidate list (assign positive), or indecisive (assign zero). The best model is defi ned to bePJsuch thatJ argminj1,.,lD(Pj,R)whereDis a base discrepancy measure (see Section 2), and R is the data generating distribut
37、ion (unknown). Throughout this work, we assume that all candidate modelsP1,.,Pland the unknown data generating distributionRhave a common supportX Rd, and are all distinct. The task can be 3 seen as a multiple binary decision making task, where a modelP Mis considered negative if it is as good as th
38、e best one, i.e.,D(P,R) = D(PJ,R)whereJ argminjD(Pj,R). The index set of all models which are as good as the best one is denoted byI:= i | D(Pi,R) = minj=1,.,lD(Pj,R). When|I| 1,Jis an arbitrary index inI. Likewise, a model is considered positive if it is worse than the best model. Formally, the ind
39、ex set of all positive models is denoted byI+:= i | D(Pi,R) D(PJ,R). It follows thatI I+= and I I+= I := 1,.,l. The problem can be equivalently stated as the task of deciding whether the index for each model belongs toI+(assign positive). The total output thus consists ofl binary values where 1 (ass
40、ign positive) indicates that the corresponding model is signifi cantly worse (higher divergence to the sample) than the best, and 0 indicates no evidence for such claim (indecisive). In practice, there are two diffi culties: fi rstly,Rcan only be observed through a sample Xn:= xin i=1 i.i.d. Rso tha
41、tD(Pi,R)has to be estimated by D(Pi,R)computed on the sample; secondly, the indexJof the reference model (the best model) is unknown. In our work, we consider the complete, and linear-time U-statistic estimators of MMD or KSD as the discrepancy D(see Section 2). We note that the main assumption on t
42、he discrepancy Dis that n( D(Pi,R) D(Pj,R) d N(,2)for anyPi,Pj Mandi 6= j. If this holds, our proposal can be easily amended to accommodate a new discrepancy measureDbeyond MMD or KSD. Examples include (but not limited to) the Unnormalized Mean Embedding 7,17, Finite Set Stein Discrepancy 19,17, or
43、other estimators such as the block 33 and incomplete estimator 31. 3.1Selecting a reference candidate model In both proposed tests, the algorithms start by fi rst choosing a modelP J Mas the reference model where J argminjI D(Pj,R)is a random variable. The algorithms then proceed to test the relativ
44、e fi t of each modelPifori 6= Jand determine if it is statistically worse than the selected reference P J. The null and the alternative hypotheses for the i th candidate model can be written as H J 0,i: D(Pi,R) D(PJ,R) 0 | PJ is selected as the reference, H J 1,i: D(Pi,R) D(PJ,R) 0 | PJ is selected
45、as the reference. These hypotheses are conditional on the selection event (i.e., selecting Jas the reference index). For each of thelnull hypotheses, the test uses as its statisticsz := n D(Pi,R) D(P J,R)where = 0, , 1 |z J , ,1 |z i ,andz = n D(P1,R), , D(Pl,R). The distribution of the test statist
46、iczdepends on the choice of estimator for the discrepancy measure Dwhich can beMMD V2 uorKSD V2 u . Defi ne := D(P1,R),.,D(Pl,R), then the hypotheses above can be equivalently expressed asH J 0,i : 0 | Az 0vs.H J 1,i : 0 | Az 0, where we note thatdepends oni,A 1,0,1(l1)l,As,:= 0,.,1 |z J , , 1 |z s
47、, ,0for all s 1,.,lJandAs,:denote thesthrow ofA. This equivalence was exploited in the multiple goodness-of-fi t testing by Yamada et al. 31. The conditionAz 0represents the fact thatP J is selected as the reference model, and expresses D(P J,R) D(Ps,R) for all s 1,.,lJ. 3.2RelMulti: for controlling
48、 false discovery rate (FDR) Unlike traditional hypothesis testing, the null hypotheses here are conditional on the selection event, making the null distribution non-trivial to derive 21,22 . Specifi cally, the sample used to form the selection event (i.e., establishing the reference model) is the sa
49、me sample used for testing the hypothesis, creating a dependency. Our fi rst approach of RelMulti is to divide the sample into two independent sets, where the fi rst is used to chooseP J and the latter for performing the test(s). This approach simplifi es the null distribution since the sample used
50、to form the selection event and the test sample are now independent. That is,H J 0,i: 0 | Az 0 simplifi es toH J 0,i: 0 due to independence. In this case, the distribution of the test statistic (forMMD 2 uand KSD 2 u) after 4 selection is the same as its unconditional null distribution. Under our as
51、sumption that all distributions are distinct, the test statistic is asymptotically normally distributed 14, 23, 6. For the complete U-statistic estimator of Maximum Mean Discrepancy (MMD 2 u), Bounliphone et al. 4 showed that, under mild assumptions,zis jointly asymptotically normal, where the covar
52、iance matrix is known in closed form. However, forKSD V2 u, only the marginal variance is known 6,23 and not its cross covariances, which are required for characterizing the null distributions of our test (see Algorithm 2 in the appendix for the full algorithm of RelMulti). We present the asymptotic
53、 multivariate characterization ofKSD 2 uin Theorem C.1. Given a desired signifi cance level (0,1), the rejection threshold is chosen to be the(1 )- quantile of the distributionN(0, 2)where 2is the plug-in estimator of the asymptotic variance2 of our test statistic (see 4, Section 3 for MMD and Secti
54、on D for KSD). With this choice, the false rejection rate for each of thel 1hypotheses is upper bounded by(asymptotically). However, to control the false discovery rate for thel 1tests it is necessary to further correct with multiple testing adjustments. We use the BenjaminiYekutieli procedure 1 to
55、adjust. We note that when testingH J 0,J, the result is always 0 (fail to reject) by default. When l 2, following the result of 1 the asymptotic false discovery rate (FDR) of RelMulti is provably no larger than. The FDR in our case is the fraction of the models declared worse that are in fact as goo
56、d as the (true) reference model. For l = 2, no correction is required as only one test is performed. 3.3RelPSI: for controlling false positive rate (FPR) A caveat of the data splitting used in RelMulti is the loss of true positive rate since a portion of sample for testing is used for forming the se
57、lection. When the selection is wrong, i.e., J I+, the test will yield a lower true positive rate. It is possible to alleviate this issue by using the full sample for selection and testing, which is the approach taken by our second proposed test RelPSI. This approach requires us to know the null dist
58、ribution of the conditional null hypotheses (see Section 3.1), which can be derived based on Theorem 3.1. Theorem 3.1(Polyhedral Lemma 20).Suppose thatz N(,) and the selection event is affi ne, i.e., Az b for some matrix A and b, then for any , we have z | Az b T N(, , V(z), V+(z), whereT N(,2,a,b)is a truncated normal distribution with meanand va
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 八年级英语上册第五单元“影视娱乐”Section A综合复习与能力拓展教学设计
- XX建材建筑施工企业安全月活动总结
- 2025江苏二级建造师考试真题及答案《建设工程施工管理》
- 住宅楼构造柱施工方案
- 施工现场防护急救措施措施
- 工艺管道及阀门安装工程施工方案
- 食品药品安全知识竞赛试题及答案
- 冠心病护理查房(含并发症护理)
- 某工程管理文物保护制度
- 2026年苏教版高二第二学期化学期末核心考点精讲试卷(附答案可下载)
- 不得诋毁对方的协议书
- 2024年河南省鹿邑县人民医院公开招聘护理工作人员试题带答案详解
- 行星架铸造工艺设计【版本2】
- 第13课-他们都说我包的饺子好吃(口语)
- 无碳小车测试题及答案大全
- 2024年消防考试真题解析试题及答案
- 2025陕西烟草专卖局招聘42人易考易错模拟试题(共500题)试卷后附参考答案
- 车祸伤的救治与护理
- 离婚协议书模板标准电子版分享
- 2023年江苏省无锡市中考政治真题含解析
- 新理性主义完整版本
评论
0/150
提交评论