iccv2019论文全集8604-the-fairness-of-risk-scores-beyond-classification-bipartite-ranking-and-the-xauc-metric

The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the xAUC Metric

Nathan Kallus, Cornell University, New York, NY
Angela Zhou, Cornell University, New York, NY

Abstract

Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the
disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive
performance.

1 Introduction

Predictive risk scores support decision-making in high-stakes settings such as bail and sentencing in the criminal justice system, triage and preventive care in healthcare, and lending decisions in the credit industry [2, 38]. In these areas where predictive errors can significantly impact individuals involved, studies of fairness in machine learning have analyzed the possible disparate impact introduced by predictive risk scores primarily in a binary classification setting: if predictions determine whether or not someone is detained pre-trial, is admitted into critical
care, or is extended a loan. But the “human in the loop” with risk assessment tools often has recourse to make decisions about extent, intensity, or prioritization of resources. That is, in practice, predictive risk scores are used to provide informative rank-orderings of individuals with binary outcomes in the following settings:

(1) In criminal justice, the “risk-needs-responsivity” model emphasizes matching the level of social service interventions to the specific individual's risk of re-offending [3, 6].
(2) In healthcare and other clinical decision-making settings, risk scores are used as decision aids for prevention of chronic disease or triage of health resources, where a variety of interventional resource intensities are available; however, the prediction quality of individual conditional probability estimates can be poor [9, 28, 38, 39].
(3) In credit, predictions of default risk affect not only loan acceptance/rejection decisions, but also risk-based setting of interest rates. Fuster et al. [22] embed machine-learned credit scores in an economic pricing model which suggests negative economic welfare impacts on Black and Hispanic borrowers.
(4) In municipal services, predictive analytics tools have been used to direct resources for maintenance, repair, or inspection by prioritizing or ranking by risk of failure or contamination [12, 40]. Proposals to use new data sources such as 311 data, which incur the self-selection bias of citizen complaints, may introduce inequities in resource allocation [32].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 1: Analysis of xAUC disparities for the COMPAS Violent Recidivism Prediction dataset. (a) ROC curves and xROC curves for COMPAS, with regions annotated “White innocents ranked above black recidivators” and “Black innocents ranked above white recidivators”. (b) Score distributions $\Pr[R = r \mid A = a, Y = y]$.]

We describe how the problem of bipartite ranking, that of finding a good ranking function that ranks positively labeled examples above negative examples, better encapsulates how predictive risk scores are used in
practice to rank individual units, and how a new metric we propose, xAUC, can assess ranking disparities.

Most previous work on fairness in machine learning has emphasized disparate impact in terms of confusion matrix metrics such as true positive rates and false positive rates and other desiderata,
such as probability calibration of risk scores. Due in part to inherent trade-offs between these performance criteria, some have recommended to retain unadjusted risk scores that achieve good calibration, rather than adjusting for parity across groups, in order to retain as much information as possible and allow human experts to make the final decision [10, 14, 15, 27]. At the same time, group-level discrepancies in the prediction loss of risk scores, relative to the true Bayes-optimal score, are not observable, since only binary outcomes are observed.

In particular, our bipartite ranking-based perspective reconciles a gap between the differing arguments made by ProPublica and Equivant (then Northpointe) regarding the potential bias or disparate impact of the COMPAS recidivism tool. Equivant levies within-group AUC parity (“accuracy equity”) (among other desiderata such as calibration and predictive parity) to claim fairness of the risk scores in response to ProPublica's allegations of bias due to true positive rate/false positive rate disparities for the Low/Not Low risk labels [2, 19]. Our xAUC metric, which measures the probability of positive-instance members of one group being misranked below negative-instance members of another group, and vice-versa, highlights that within-group comparison of AUC discrepancies does not summarize accuracy inequity. We illustrate this in Fig. 1 for a risk score learned from COMPAS data: xAUC disparities reflect disparate misranking risk faced by positive-label individuals of either class.

In this paper, we propose and study the cross-ROC curve and the corresponding xAUC metric for auditing disparities induced by a predictive risk score, as they are used in broader contexts to inform resource allocation. We relate the xAUC metric to different group- and outcome-based decompositions of a bipartite ranking loss, and assess the resulting metrics on datasets where fairness has been of concern.

2 Related Work

Our analysis of fairness properties of risk scores in this work is most closely related to the study of “disparate impact” in machine learning, which focuses on disparities in the outcomes of a process across protected classes, without racial animus [4]. Many previous approaches have considered formalizations via error rate metrics of the confusion matrix in a binary classification setting [5, 25, 29, 35, 44]. By now, a panoply of fairness metrics has been studied for binary classification in order to assess group-level disparities in confusion matrix-based metrics. Proposals for error rate balance assess or try to equalize true positive rates and/or false positive rates, error rates measured conditional on the true outcome, emphasizing the equitable treatment of those who actually are of the outcome type of interest [25, 44]. Alternatively, one might assess the negative/positive predictive value (NPV/PPV) error rates conditional on the thresholded model prediction [13]. In missing-data settings, these metrics can be partially identified to support fairness assessments [11, 30].

The predominant criterion used for assessing fairness of risk scores, outside of a binary classification setting, is that of calibration. Group-wise calibration requires that $\Pr[Y = 1 \mid R = r, A = a] = \Pr[Y = 1 \mid R = r, A = b] = r$, as in [13]. The impossibilities of satisfying notions of error rate balance and calibration simultaneously have been discussed in [13, 31]. Liu et al. [33] show that group calibration is a byproduct of unconstrained empirical risk minimization, and therefore is not a restrictive notion of fairness. Hebert-Johnson et al. [26] note the
critique that group calibration does not restrict the variance of a risk score as an unbiased estimator of the Bayes-optimal score.

Other work has considered fairness in ranking settings specifically, with particular attention to applications in information retrieval, such as questions of fair representation in search engine results. Yang and Stoyanovich [43] assess statistical parity at discrete cut-points of a ranking, incorporating position bias inspired by normalized discounted cumulative gain (nDCG) metrics. Celis et al. [8] consider the question of fairness in rankings, where fairness is considered as constraints on diversity of group membership in the top $k$ rankings, for any choice of $k$. Singh and Joachims [41] consider fairness of exposure in rankings under known relevance scores and propose an algorithmic framework that produces probabilistic rankings satisfying fairness constraints in expectation on exposure, under a position bias model. We focus instead on the bipartite ranking setting, where the area under the curve (AUC) loss emphasizes ranking quality on the entire distribution, whereas other ranking metrics such as nDCG or top-k metrics emphasize only a portion of the distribution.

The problem of bipartite ranking is related to, but distinct from, binary classification [1, 20, 36]; see [16, 34] for more information. While the bipartite ranking induced by the Bayes-optimal score is analogously Bayes-risk optimal for bipartite ranking (e.g., [34]), in general, a probability-calibrated classifier is not optimizing for the bipartite ranking loss. Cortes and Mohri [16] observe that AUC may vary widely for the same error rate, and that algorithms designed to globally optimize the AUC perform better than optimizing surrogates of the AUC or error rate. Narasimhan and Agarwal [37] study transfer regret bounds between the related problems of binary classification, bipartite ranking, and outcome-probability estimation.

3 Problem Setup and Notation

We suppose we have data $(X, A, Y)$ on features $X \in \mathcal{X}$, sensitive attribute $A \in \mathcal{A}$, and binary labeled outcome $Y \in \{0, 1\}$. We are interested in assessing
the downstream impacts of a predictive risk score $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, which may or may not access the sensitive attribute. When these risk scores represent an estimated conditional probability of positive label, $R : \mathcal{X} \times \mathcal{A} \to [0, 1]$. For brevity, we also let $R = R(X, A)$ be the random variable corresponding to an individual's risk score. We generally use the conventions that $Y = 1$ is associated with opportunity or benefit for the individual (e.g., freedom from suspicion of recidivism, creditworthiness) and that when discussing two groups, $A = a$ and $A = b$, the group $A = a$ might be a historically disadvantaged group.

Let the conditional cumulative distribution function of the learned score $R$ evaluated at a threshold $\theta$ given label and attribute be denoted by $F^a_y(\theta) = \Pr[R \le \theta \mid Y = y, A = a]$. We let $G^a_y = 1 - F^a_y$ denote the complement of $F^a_y$. We drop the $a$ superscript to refer to the whole population: $F_y(\theta) = \Pr[R \le \theta \mid Y = y]$. Thresholding the score yields a binary classifier, $\hat{Y}_\theta = \mathbb{I}[R \ge \theta]$. The classifier's true negative rate (TNR) is $F_0(\theta)$, its false positive rate (FPR) is $G_0(\theta)$, its false negative rate (FNR) is $F_1(\theta)$, and its true positive rate (TPR) is $G_1(\theta)$. Given a risk score, the choice of optimal threshold for a binary classifier depends on the differing costs of false positives and false negatives. We might expect cost ratios of false positives and false negatives to differ if we consider the use of risk scores to direct punitive measures or to direct interventional resources.

In the setting of bipartite ranking, the data comprises a pool of positive labeled examples, $S_+ = \{X_i\}_{i \in [m]}$, drawn i.i.d. according to a distribution $X_+ \sim D_+$, and negative labeled examples $S_- = \{X'_i\}_{i \in [n]}$ drawn according to a distribution $X_- \sim D_-$ [36]. The rank order may be determined by a score function $s(X)$, which achieves empirical bipartite ranking error $\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbb{I}[s(X_i) < s(X'_j)]$. [...] $\mathrm{AUC} = \Pr[R_1 > R_0]$.

4 The Cross-ROC (xROC) and Cross-Area Under the Curve (xAUC)

We introduce the cross-ROC curve and the cross-area under the curve metric xAUC that summarize group-level disparities in misranking errors induced by a score function $R(X, A)$.

Definition 1 (Cross-Receiver Operating
Characteristic curve (xROC)).
$$\mathrm{xROC}(\theta; R, a, b) = \big(\Pr[R > \theta \mid A = b, Y = 0],\ \Pr[R > \theta \mid A = a, Y = 1]\big)$$
The $\mathrm{xROC}_{a,b}$ curve parametrically plots $\mathrm{xROC}(\theta; R, a, b)$ over the space of thresholds $\theta \in \mathbb{R}$, generating the curve of TPR of group $a$ on the y-axis vs. the FPR of group $b$ on the x-axis. We define the $\mathrm{xAUC}(a, b)$ metric as the area under
the $\mathrm{xROC}_{a,b}$ curve. Analogous to the usual AUC, we provide a probabilistic interpretation of the xAUC metric as the probability of correctly ranking a positive instance of group $a$ above a negative instance of group $b$ under the corresponding outcome- and class-conditional distributions of the score.

Definition 2 (xAUC).
$$\mathrm{xAUC}(a, b) = \int_0^1 G^a_1\big((G^b_0)^{-1}(v)\big)\, dv = \Pr[R^a_1 > R^b_0]$$
where $R^a_1$ is drawn from $R \mid Y = 1, A = a$ and $R^b_0$ is drawn from $R \mid Y = 0, A = b$ independently.

For brevity, henceforth, $R^a_y$ is taken to be drawn from $R \mid Y = y, A = a$ and independently of any other such variable. We also drop the superscript to denote omitting the conditioning on sensitive attribute (e.g., $R_y$).

The xAUC accuracy metrics for a binary sensitive attribute measure the probability that a randomly chosen unit from the “positive” group $Y = 1$ in group $a$ is ranked higher than a randomly chosen unit from the “negative” group, $Y = 0$ in group $b$,
under the corresponding group- and outcome-conditional distributions of scores $R^a_y$. We let $\mathrm{AUC}^a$ denote the within-group AUC for group $A = a$, $\Pr[R^a_1 > R^a_0]$. If the difference between these metrics, the xAUC disparity
$$\Delta\mathrm{xAUC} = \Pr[R^a_1 > R^b_0] - \Pr[R^b_1 > R^a_0] = \Pr[R^b_1 \le R^a_0] - \Pr[R^a_1 \le R^b_0]$$
is substantial and positive,
then we might consider group $b$ to be systematically “disadvantaged” and $a$ to be “advantaged” when $Y = 0$ is a negative or harmful label or is associated with punitive measures, as in the recidivism prediction case. Conversely, we have the opposite interpretation if $Y = 0$ is a positive label associated with greater beneficial resources. Similarly, since $\Delta\mathrm{xAUC}$ is anti-symmetric in $a, b$, negative values are also interpreted in the converse.

When higher scores are associated with opportunity or additional benefits and resources, as in the recidivism prediction case, a positive $\Delta\mathrm{xAUC}$ means group $a$ either gains by having its deserving members correctly ranked above the non-deserving members of group $b$ and/or by having its non-deserving members incorrectly ranked above the deserving members of group $b$; and symmetrically, group $b$ loses in the same way. The magnitude of the disparity $\Delta\mathrm{xAUC}$ describes the
misranking disparities incurred under this predictive score, while the magnitude of the xAUC measures the particular across-subgroup rank-accuracies.

Computing the xAUC is simple: one simply computes the sample statistic
$$\frac{1}{n^b_0\, n^a_1} \sum_{i:\, A_i = a,\, Y_i = 1}\ \sum_{j:\, A_j = b,\, Y_j = 0} \mathbb{I}[R(X_i) > R(X_j)].$$
Algorithmic routines for computing the AUC quickly by a sorting routine can be directly used to compute the xAUCs. Asymptotically exact confidence intervals are available, as shown in DeLong et al. [17], using the generalized U-statistic property of this estimator.

[Figure 2: Balanced xROC curves for COMPAS and Adult datasets. (a) COMPAS: a = Black, b = White, 0 = Recidivate. (b) Adult: a = Black, b = White, 0 = Low income.]

Table 1: Ranking error metrics (AUC, xAUC, Brier scores for calibration) for different datasets. We include standard errors in Table 2 of the appendix.
[Table body not recoverable from the extraction. Surviving column headers: COMPAS, Framingham, German, Adult; A = Black, White, Non-F., Female, ...]

[...]
$$\mathrm{xAUC}_0(a) = \Pr[R_1 > R^a_0], \qquad \mathrm{xAUC}_0(b) = \Pr[R_1 > R^b_0]$$
$$\mathrm{xAUC}_1(a) = \Pr[R^a_1 > R_0], \qquad \mathrm{xAUC}_1(b) = \Pr[R^b_1 > R_0]$$

These xAUC disparities compare misranking error faced by individuals from either group, conditional on a specific outcome: $\mathrm{xAUC}_0(a) - \mathrm{xAUC}_0(b)$ compares the ranking accuracy faced by those of the negative class $Y = 0$ across groups, and $\mathrm{xAUC}_1(a) - \mathrm{xAUC}_1(b)$ analogously compares those of the positive class $Y = 1$.

The following proposition shows how the population AUC decomposes as weighted combinations of the xAUC and within-class AUCs, or the balanced decompositions $\mathrm{xAUC}_1$ or $\mathrm{xAUC}_0$, weighted by the outcome-conditional class probabilities.

Proposition 1 (xAUC metrics as decompositions of AUC).
$$\mathrm{AUC} = \Pr[R_1 > R_0] = \sum_{b' \in \mathcal{A}} \Pr[A = b' \mid Y = 0] \sum_{a' \in \mathcal{A}} \Pr[A = a' \mid Y = 1]\, \Pr[R^{a'}_1 > R^{b'}_0]$$
$$= \sum_{a' \in \mathcal{A}} \Pr[A = a' \mid Y = 1]\, \Pr[R^{a'}_1 > R_0] = \sum_{a' \in \mathcal{A}} \Pr[A = a' \mid Y = 0]\, \Pr[R_1 > R^{a'}_0]$$

5 Assessing xAUC

5.1 COMPAS Example

In Fig. 1, we revisit the COMPAS data and assess our xROC curves and xAUC metrics
to illustrate ranking disparities that may be induced by risk scores learned from this data. The COMPAS dataset is of size $n = 6167$, $p = 402$, where the sensitive attribute is race, with $A = a, b$ for black and white, respectively. We define the outcome $Y = 1$ for non-recidivism within 2 years and $Y = 0$ for violent recidivism. Covariates include information on number of prior arrests and age; we follow the pre-processing of Friedler et al. [21].

We first train a logistic regression model on the original covariate data (we do not use the decile scores directly in order to do a more fine-grained analysis),
using a 70%/30% train-test split and evaluating metrics on the out-of-sample test set. In Table 1, we report the group-level AUC and the Brier [7] scores (summarizing calibration), and our xAUC metrics. The xAUC for column $A = a$ is $\mathrm{xAUC}(a, b)$, for column $A = b$ it is $\mathrm{xAUC}(b, a)$, and for column $A = a$, $\mathrm{xAUC}_y$ is $\mathrm{xAUC}_y(a)$. The Brier score for a probabilistic prediction of a binary outcome is $\frac{1}{n} \sum_{i=1}^{n} (R(X_i) - Y_i)^2$. The score is overall well-calibrated (as well as calibrated by group), consistent with analyses elsewhere [13, 19].

We also report the metrics from using a bipartite ranking algorithm, Bipartite Rankboost of Freund et al. [20], and calibrating the resulting ranking score by Platt Scaling, displaying the results as
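
To make the per-column quantities concrete, the following is a minimal sketch of how the within-group AUC, the xAUC, the xAUC disparity, and the Brier score could be computed from scores, labels, and group membership. It is not the authors' code: the function names (`pair_rank_prob`, `xauc`, `brier`) and the toy data are assumptions made for illustration, and it uses a direct pairwise comparison rather than the faster sorting-based routine the text mentions.

```python
import numpy as np


def pair_rank_prob(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive score is strictly higher."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Broadcasting builds the full len(pos) x len(neg) comparison matrix.
    return float(np.mean(pos[:, None] > neg[None, :]))


def xauc(scores, labels, groups, a, b):
    """Sample xAUC(a, b): positives (Y=1) of group a ranked above negatives (Y=0) of group b.

    Calling it with a == b gives the within-group AUC.
    """
    scores, labels, groups = map(np.asarray, (scores, labels, groups))
    pos_a = scores[(groups == a) & (labels == 1)]
    neg_b = scores[(groups == b) & (labels == 0)]
    return pair_rank_prob(pos_a, neg_b)


def brier(scores, labels):
    """Brier score: mean squared difference between score and binary outcome."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((scores - labels) ** 2))


# Hypothetical toy data (not from the paper): four units per group.
scores = np.array([0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5])
labels = np.array([1, 1, 0, 0, 1, 1, 0, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

xauc_ab = xauc(scores, labels, groups, "a", "b")
xauc_ba = xauc(scores, labels, groups, "b", "a")
delta_xauc = xauc_ab - xauc_ba  # xAUC disparity
auc_a = xauc(scores, labels, groups, "a", "a")
auc_b = xauc(scores, labels, groups, "b", "b")

print("xAUC(a,b):", xauc_ab, "xAUC(b,a):", xauc_ba, "disparity:", delta_xauc)
print("AUC_a:", auc_a, "AUC_b:", auc_b, "Brier:", brier(scores, labels))
```

The pairwise comparison costs time and memory proportional to the number of positive-negative pairs, which is acceptable for small audits; for large datasets a rank-based AUC routine would be the more practical choice. Ties are counted as misrankings here because the sample statistic uses a strict inequality.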
