Calibration tests in multi-class classification: A unifying framework

David Widmann, Department of Information Technology, Uppsala University, Sweden, david.widmann@it.uu.se
Fredrik Lindsten, Division of Statistics and Machine Learning, Linköping University, Sweden, fredrik.lindsten@liu.se
Dave Zachariah, Department of Information Technology, Uppsala University, Sweden, dave.zachariah@it.uu.se

Abstract

In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient. We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. We propose and evaluate empirically different consistent and unbiased estimators for a specific class of measures based on matrix-valued kernels. Importantly, these estimators can be interpreted as test statistics associated with well-defined bounds and approximations of the p-value under the null hypothesis that the model is calibrated, significantly improving the interpretability of calibration measures, which otherwise lack any meaningful unit or scale.

1 Introduction

Consider the problem of analyzing microscopic images of tissue samples and reporting a tumour grade, i.e., a score that indicates whether cancer
cells are well-differentiated or not, affecting both prognosis and treatment of patients. Since for some pathological images not even experienced pathologists might all agree on one classification, this task contains an inherent component of uncertainty. This type of uncertainty, which cannot be removed by increasing the size of the training data set, is typically called aleatoric uncertainty (Kiureghian and Ditlevsen, 2009). Unfortunately, even if the ideal model is among the class of models we consider, with a finite training data set we will never obtain the ideal model; we can only hope to learn a model that is, in some sense, close to it. Worse still, our model might not even be close to the ideal model if the model class is too restrictive or the number of training data points is small, which is not unlikely given that annotating pathological images is expensive. Thus, ideally, our model should be able to express not only aleatoric uncertainty but also the uncertainty about the model itself. In contrast to aleatoric uncertainty, this so-called epistemic uncertainty can be reduced with additional training data. Dealing with these different types of uncertainty is one of the major problems in machine learning.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The application of our model in clinical practice demands "meaningful" uncertainties to avoid doing harm to patients. Being too certain about high tumour grades might cause harm due to unneeded aggressive therapies and overly pessimistic prognoses, whereas being too certain about low tumour grades might result in insufficient therapies. "Proper" uncertainty estimates are also crucial if the model is supervised by a pathologist who takes over if the uncertainty reported
by the model is too high. False but highly certain gradings might incorrectly keep the pathologist out of the loop, whereas too uncertain gradings might demand unneeded and costly human intervention.

Probability theory provides a solid framework for dealing with uncertainties. Instead of assigning exactly one grade to each pathological image, so-called probabilistic models report subjective probabilities, sometimes also called confidence scores, of the tumour grades for each image. The model can be evaluated by comparing these subjective probabilities to the ground truth. One desired property of such a probabilistic model is sharpness (or high accuracy), i.e., if possible, the model should assign the highest probability to the true tumour grade (which perhaps cannot be inferred from the image at hand but only by other means such as an additional immunohistochemical staining). However, to be able to trust the predictions, the probabilities should be calibrated (or reliable) as well (DeGroot and Fienberg, 1983; Murphy and Winkler, 1977). This property requires the subjective probabilities to match the relative empirical frequencies: intuitively, if we could observe a long run of predictions (0.5, 0.1, 0.1, 0.3) for tumour grades 1, 2, 3, and 4, the empirical frequencies of the true tumour grades should be (0.5, 0.1, 0.1, 0.3). Note that accuracy and calibration are two complementary properties: a model with over-confident predictions can be highly accurate but miscalibrated,
whereas a model that always reports the overall proportion of patients of each tumour grade in the considered population is calibrated but highly inaccurate.

Research on calibration in the statistics and machine learning literature has focused mainly on binary classification problems or on the most confident predictions: common calibration measures such as the expected calibration error (ECE) (Naeini et al., 2015), the maximum calibration error (MCE) (Naeini et al., 2015), the kernel-based maximum mean calibration error (MMCE) (Kumar et al., 2018), and reliability diagrams (Murphy and Winkler, 1977) have been developed for binary classification. This is insufficient since many recent applications of machine learning involve multiple classes. Furthermore, the crucial finding of Guo et al. (2017) that many modern deep neural networks are miscalibrated is also based only on the most
confident prediction. Recently, Vaicenavicius et al. (2019) suggested that this analysis might be too reduced for many realistic scenarios. In our example, a prediction of (0.5, 0.3, 0.1, 0.1) is fundamentally different from a prediction of (0.5, 0.1, 0.1, 0.3), since according to the model in the first case it is only half as likely that a tumour is of grade 3 or 4, and hence the subjective probability of missing out on a more aggressive therapy is smaller. However, commonly in the study of calibration all predictions with a highest reported confidence score of 0.5 are grouped together, and a calibrated model has only to be correct about the most confident tumour grade in 50% of the cases, regardless of the other predictions. Although the ECE can be generalized to multi-class classification, its applicability seems to be limited since its histogram-regression based estimator requires partitioning of the potentially high-dimensional probability simplex and is asymptotically inconsistent in many cases (Vaicenavicius et al., 2019). Sample complexity bounds for a bias-reduced estimator of the ECE introduced in the meteorological literature (Bröcker, 2011; Ferro and Fricker, 2012) were derived in concurrent work (Kumar et al., 2019).

2 Our contribution

In this work, we propose and study a general framework of calibration measures for multi-class classification. We show that this framework encompasses common calibration measures for binary classification such as the expected calibration error
(ECE), the maximum calibration error (MCE), and the maximum mean calibration error (MMCE) by Kumar et al. (2018). In more detail, we study a class of measures based on vector-valued reproducing kernel Hilbert spaces, for which we derive consistent and unbiased estimators. The statistical properties
of the proposed estimators are not only theoretically appealing, but also of high practical value, since they allow us to address two main problems in calibration evaluation. As discussed by Vaicenavicius et al. (2019), all calibration error estimates are inherently random, and comparing competing models based on these estimates without taking the randomness into account can be very misleading, in particular when the estimators are biased (which, for instance, is the case for the commonly used histogram-regression based estimator of the ECE). Even more fundamentally, all commonly used calibration measures lack a meaningful unit or scale and are therefore not interpretable as such (regardless of any finite sample issues).

The consistency and unbiasedness of the proposed estimators facilitate comparisons between competing models, and allow us to derive multiple statistical tests for calibration that exploit these properties. Moreover, by viewing the estimators as calibration test statistics, with well-defined bounds and approximations of the corresponding p-value, we give them an interpretable meaning. We evaluate the proposed estimators and statistical tests empirically and compare them with existing methods. To facilitate multi-class calibration evaluation we provide the Julia packages ConsistencyResampling.jl (Widmann, 2019c), CalibrationErrors.jl (Widmann, 2019a), and CalibrationTests.jl (Widmann, 2019b) for consistency resampling, calibration error estimation, and calibration tests, respectively.

3 Background

We start by briefly summarizing the most relevant definitions and concepts. Due to space constraints and to improve the readability of our paper, we do not provide any proofs in the main text but only refer to the results in the supplementary material, which
is intended as a reference for mathematically precise statements and proofs.

3.1 Probabilistic setting

Let (X, Y) be a pair of random variables with X and Y representing inputs (features) and outputs, respectively. We focus on classification problems and hence without loss of generality we may assume that the outputs consist of the m classes 1, ..., m. Let Δ^m denote the (m-1)-dimensional probability simplex Δ^m := { z ∈ R^m_{≥0} : ||z||_1 = 1 }. Then a probabilistic model g is a function that for every input x outputs a prediction g(x) ∈ Δ^m that models the distribution (P(Y = 1 | X = x), ..., P(Y = m | X = x)) of class Y given input X = x.
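To make this setting concrete, the sketch below shows a probabilistic model g whose predictions lie on the probability simplex Δ^m. The linear-softmax model and its weights are hypothetical illustrative choices, not taken from the paper:

```python
import math

def softmax(z):
    # subtract the maximum for numerical stability before exponentiating
    zmax = max(z)
    exps = [math.exp(v - zmax) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical linear-softmax model g mapping 2-dimensional features to
# predictions on the probability simplex over m = 3 classes; the weights
# are arbitrary illustrative values
W = [[0.4, -1.2], [0.9, 0.3], [-0.5, 0.8]]

def g(x):
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return softmax(logits)

p = g([1.0, -0.5])
# every prediction lies in the simplex: non-negative entries that sum to one
assert all(v >= 0 for v in p) and abs(sum(p) - 1.0) < 1e-12
```

Any model whose outputs satisfy these two simplex constraints fits the definition above, regardless of its internal structure.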
3.2 Calibration

3.2.1 Common notion

The common notion of calibration, as, e.g., used by Guo et al. (2017), considers only the most confident predictions max_y g_y(x) of a model g. According to this definition, a model is calibrated if

    P(Y = argmax_y g_y(X) | max_y g_y(X)) = max_y g_y(X)    (1)

holds almost always. Thus a model that is calibrated according to Eq. (1) ensures that we can partly trust the uncertainties reported by its predictions. As an example, for a prediction of (0.4, 0.3, 0.3) the model would only guarantee that in the long run inputs that yield a most confident prediction of 40% are in the corresponding class 40% of the time.¹

3.2.2 Strong notion

According to the more general calibration definition of Bröcker (2009) and Vaicenavicius et al. (2019), a probabilistic model g is calibrated if for almost all inputs x the prediction g(x) is equal to the distribution of class Y given prediction g(X) = g(x). More formally, a calibrated model satisfies

    P(Y = y | g(X)) = g_y(X)    (2)

almost always for all classes y ∈ {1, ..., m}. As Vaicenavicius et al. (2019) showed, for multi-class classification this formulation is stronger than the definition of Zadrozny and Elkan (2002), which only demands calibrated marginal probabilities. Thus we can fully trust the uncertainties reported by the predictions of a model that is calibrated according to Eq. (2). The prediction (0.4, 0.3, 0.3) would actually imply that the class distribution of the inputs that yield this prediction is (0.4, 0.3, 0.3). To emphasize the difference
to the definition in Eq. (1), we call calibration according to Eq. (2) calibration in the strong sense, or strong calibration.

¹ This notion of calibration does not consider for which class the most confident prediction was obtained.

To simplify our notation, we rewrite Eq. (2) in vectorized form. Equivalently to the definition above, a model g is calibrated in the strong sense if

    r(g(X)) - g(X) = 0    (3)

holds almost always, where r(ξ) := (P(Y = 1 | g(X) = ξ), ..., P(Y = m | g(X) = ξ)) is the distribution of class Y given prediction g(X) = ξ. The calibration of certain aspects of a model, such as the five largest predictions or groups of classes, can be investigated by studying the strong calibration of models induced by so-called calibration lenses. For more details about evaluation and visualization of strong calibration we refer to Vaicenavicius et al. (2019).

3.3 Matrix-valued kernels

The miscalibration measure that we propose in this work is based on matrix-valued kernels k : Δ^m × Δ^m → R^{m×m}. Matrix-valued kernels can be defined in a similar way as the more common real-valued kernels, which can be characterized as symmetric positive definite functions (Berlinet and Thomas-Agnan, 2004, Lemma 4).

Definition 1 (Micchelli and Pontil (2005, Definition 2); Caponnetto et al. (2008, Definition 1)). We call a function k : Δ^m × Δ^m → R^{m×m} a matrix-valued kernel if for all s, t ∈ Δ^m, k(s,t) = k(t,s)^T, and it is positive semi-definite, i.e., if for all n ∈ N, t_1, ..., t_n ∈ Δ^m, and u_1, ..., u_n ∈ R^m,

    Σ_{i,j=1}^n u_i^T k(t_i, t_j) u_j ≥ 0.

There exists
a one-to-one mapping between such kernels and reproducing kernel Hilbert spaces (RKHSs) of vector-valued functions f : Δ^m → R^m. We provide a short summary of RKHSs of vector-valued functions on the probability simplex in Appendix D. A more detailed general introduction to RKHSs of vector-valued functions can be found in the publications by Caponnetto et al. (2008), Carmeli et al. (2010), and Micchelli and Pontil (2005).

Similar to the scalar case, matrix-valued kernels can be constructed from other matrix-valued kernels and even from real-valued kernels. Very simple matrix-valued kernels are kernels of the form k(s,t) = k̃(s,t) I_m, where k̃ is a scalar-valued kernel, such as the Gaussian or Laplacian kernel, and I_m is the identity matrix. As Example D.1 shows, this construction can be generalized by, e.g., replacing the identity matrix with an arbitrary positive semi-definite matrix. An important class of
kernels are so-called universal kernels. Loosely speaking, a kernel is called universal if its RKHS is a dense subset of the space of continuous functions, i.e., if in the neighbourhood of every continuous function we can find a function in the RKHS. Prominent real-valued kernels on the probability simplex such as the Gaussian and the Laplacian kernel are universal, and can be used to construct universal matrix-valued kernels of the form in Example D.1, as Lemma D.3 shows.

4 Unification of calibration measures

In this section we introduce a general measure of strong calibration and show its relation to other existing measures.

4.1 Calibration error

In the analysis of strong calibration, the discrepancy on the left-hand side of Eq. (3) lends itself naturally to the following calibration measure.

Definition 2. Let F be a non-empty space of functions f : Δ^m → R^m. Then the calibration error (CE)
of model g with respect to the class F is

    CE_{F,g} := sup_{f ∈ F} E[ (r(g(X)) - g(X))^T f(g(X)) ].

A trivial consequence of the design of the CE is that the measure is zero for every model that is calibrated in the strong sense, regardless of the choice of F. The converse statement is not true in general. As we show in Theorem C.2, the class of continuous functions is a space for which the CE is zero if and only if model g is strongly calibrated, and hence allows us to characterize calibrated models. However, since this space is extremely large, for every model the CE is either 0 or ∞.² Thus a measure based on this space does
not allow us to compare miscalibrated models and hence is rather impractical.

4.2 Kernel calibration error

Due to the correspondence between kernels and RKHSs, we can define the following kernel measure.

Definition 3. Let k be a matrix-valued kernel as in Definition 1. Then we define the kernel calibration error (KCE) with respect to the kernel k as KCE_{k,g} := CE_{F,g}, where F is the unit ball in the RKHS corresponding to the kernel k.

As mentioned above, an RKHS with a universal kernel is a dense subset of the space of continuous functions. Hence these kernels yield a function space that is still large enough
for identifying strongly calibrated models.

Theorem 1 (cf. Theorem C.1). Let k be a matrix-valued kernel as in Definition 1, and assume that k is universal. Then KCE_{k,g} = 0 if and only if model g is calibrated in the strong sense.

From the supremum-based Definition 2 it might not be obvious how the
KCE can be evaluated. Fortunately, there exists an equivalent kernel-based formulation.

Lemma 1 (cf. Lemma E.2). Let k be a matrix-valued kernel as in Definition 1. If E[ ||k(g(X), g(X))|| ] < ∞, then

    KCE_{k,g} = ( E[ (e_Y - g(X))^T k(g(X), g(X')) (e_{Y'} - g(X')) ] )^{1/2},    (4)

where (X', Y') is an independent copy of (X, Y) and e_i denotes the i-th unit vector.

4.3 Expected calibration error

The most common measure of calibration error is the expected calibration error (ECE). Typically it is used for quantifying calibration in a binary classification setting, but it generalizes to strong calibration in a straightforward way. Let d : Δ^m × Δ^m → R_{≥0} be a distance measure on the probability simplex. Then the expected calibration error of a model g with respect to d is defined as

    ECE_{d,g} = E[ d(r(g(X)), g(X)) ].    (5)

If d(p,q) = 0 ⇔ p = q, as is the case for standard choices of d such as the total variation distance or the (squared) Euclidean distance, then ECE_{d,g} is zero if and only if g is strongly calibrated as per Eq. (3). The ECE with respect to the cityblock distance, the total variation distance, or the squared Euclidean distance is a special case of the calibration error CE, as we show in Lemma I.1.

4.4 Maximum mean calibration error

Kumar et al. (2018) proposed a kernel-based calibration measure, the so-called maximum mean calibration error (MMCE), for training (better) calibrated neural networks. In contrast to their work, in our publication we do not discuss how to obtain calibrated models but focus on the evaluation of calibration and on calibration tests. Moreover, the MMCE applies only to a binary classification setting, whereas
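Lemma 1 suggests a simple way to estimate the squared KCE: replacing the expectation over the independent copy (X', Y') with an average over distinct pairs of validation points yields an unbiased U-statistic. The sketch below is our own illustration of this idea, not code from the paper or its Julia packages; it assumes the simple matrix-valued kernel k(s,t) = k̃(s,t) I_m with a scalar Gaussian kernel k̃, under which the quadratic form in Eq. (4) reduces to k̃(g(x_i), g(x_j)) times an inner product of the residuals e_y - g(x):

```python
import math

def gauss(s, t, nu=0.1):
    # scalar Gaussian kernel on the simplex; the matrix-valued kernel is
    # k(s, t) = gauss(s, t) * I_m
    d2 = sum((a - b) ** 2 for a, b in zip(s, t))
    return math.exp(-d2 / (2 * nu ** 2))

def skce_u(predictions, labels, kernel=gauss):
    """Unbiased U-statistic estimate of the squared KCE for the kernel
    k(s, t) = kernel(s, t) * I_m.

    predictions: probability vectors g(x_i); labels: classes in 0, ..., m-1.
    """
    n = len(predictions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # residuals e_y - g(x) for both elements of the pair
            ri = [(1.0 if c == labels[i] else 0.0) - p
                  for c, p in enumerate(predictions[i])]
            rj = [(1.0 if c == labels[j] else 0.0) - p
                  for c, p in enumerate(predictions[j])]
            # with k = kernel * I_m the quadratic form in Eq. (4) reduces
            # to the scalar kernel times the residual inner product
            total += kernel(predictions[i], predictions[j]) * sum(
                a * b for a, b in zip(ri, rj))
    return 2.0 * total / (n * (n - 1))

# toy usage with hypothetical validation predictions and labels
preds = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
labels = [0, 1, 2]
print(skce_u(preds, labels))
```

Averaging over all ordered pairs i ≠ j makes the estimate unbiased for the population expectation in Eq. (4); a degenerate model that puts probability one on the observed class has zero residuals and hence an estimate of exactly zero.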