\[
\newcommand{\Yhat}{\hat{Y}}
\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)}
\newcommand{\indep}{\perp}
\]
My presentation of this topic largely (but not entirely) follows the excellent review paper by Corbett-Davies and Goel (2018).
When we talk about a classification problem, we always mean that we’re trying to predict a categorical, usually binary, label, outcome, or class \(Y\) from features \(X\) (which may or may not themselves be categorical).
We’ll be a little more telegraphic today.
“Protected attributes”
- Legally mandated in some contexts
- US law generally prohibits discrimination on the basis of race, ethnicity, sex, religion, national origin, or age
- Obvious exceptions are in fact part of the law: don’t appeal to laws against age discrimination to try to get into a bar, and laws against religious discrimination don’t force a Catholic diocese to hire a Muslim as a priest (but might mean the diocese couldn’t discriminate against a Muslim janitor)
- OTOH there’s nothing in US law against (for example) discrimination by caste
- Arguably ethically mandated everywhere
- I am not going to try to tell you what to believe about ethics, but I do want you to think about whether what you are doing with these powers is ethical
Some notions of “fairness” for classification
- Don’t use protected features directly
- Sometimes called “anti-classification”
- What about strongly-associated unprotected features?
- Have equal measures of error across groups
- Sometimes called “classification parity”
- Which error measures, exactly?
- Calibration: everyone with the same score should have the same actual probability of \(Y=1\), regardless of group
- Conditional independence of \(Y\) from protected attribute given score
- Weak!
Concrete use-case: Pretrial detention, recidivism prediction
- You don’t get arrested, your screw-up cousin Archie gets arrested
- Court decides whether to keep Archie in jail pending trial or to let him go (perhaps on bail)
- Court wants Archie to show up for trial and not commit any more crimes
- \(Y=1\): Archie will be arrested for another crime if released
- \(Y=0\): Archie will not be arrested
- Similarly for failure to appear on trial date, arrest for violence, etc.
Notation
- Archie’s features \(=X = (X_p, X_u)\) where \(X_p\) are the protected features and \(X_u\) are the unprotected ones
- \(Y=\) whether or not Archie will be arrested for another crime before trial
- Or: will show up for trial, will be re-arrested after being released from prison, will default on the loan, …
- Generally, today, \(Y=1\) is the bad case
- \(\Yhat(x) =\) prediction we make about someone with features \(x\)
- Here \(\Yhat=1\) means “we predict re-arrest” (or recidivism), and so someone paying attention to us would presumably not release this person
- \(\Yhat(x)\) can ignore some features in \(x\)!
- \(p(x) = \Prob{Y=1|X=x}\) is the true risk function
- The cost-minimizing \(\Yhat(x)\) thresholds \(p(x)\), with the cut-off set by the relative costs of the two kinds of error (the threshold rule is spelled out right after this list)
- Note that the true risk function isn’t known
- \(s(x) =\) risk score we calculate based on \(x\) (which may or may not be an estimate of \(p(x)\))
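To spell out that threshold rule (a standard decision-theory fact, not specific to fairness; the cost symbols \(c_{FP}\) and \(c_{FN}\) are notation I’m introducing here): if a false positive (needless detention) costs \(c_{FP}\) and a false negative (a crime by someone released) costs \(c_{FN}\), then predicting 1 has expected cost \(c_{FP}(1-p(x))\) and predicting 0 has expected cost \(c_{FN} p(x)\), so the expected-cost-minimizing rule is \[
\Yhat(x) = \begin{cases} 1 & \text{if } p(x) \geq \frac{c_{FP}}{c_{FP}+c_{FN}} \\ 0 & \text{otherwise} \end{cases}
\]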
“Anti-classification”
- Anti-classification means: don’t use protected categories to make these decisions/predictions
- Formalization I: Prediction must be the same for any two inputs with the same unprotected features
- Does not guard against inference/proxies for protected attributes
- Zipcode is a powerful proxy for race and education
- Zipcode plus websites visited would be an even more powerful proxy
- Formalization II: Decision must be independent of protected features (DeDeo 2016)
- Can be achieved by deliberately distorting the distribution of features
- Specifically, instead of the actual joint distribution \(\Prob{Y=y, X_u = x_u, X_p = x_p}\), use the distorted distribution \[
\tilde{P}(y, x_u, x_p) = \Prob{Y=y, X_u=x_u, X_p=x_p}\frac{\Prob{Y=y}}{\Prob{Y=y|X_p=x_p}}
\]
- One way to do this in practice is to weight the data points, with the weight of data point \(i\) being exactly \[
\frac{\Prob{Y=y_i}}{\Prob{Y=y_i|X_p={x_p}_{i}}}
\] so we give more weight to data points with labels which their protected attributes make relatively unlikely.
- This is an elegant solution but I don’t think anyone except DeDeo uses it.
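To make the reweighting idea concrete, here is a minimal sketch in Python (the function name and the toy data are mine, not DeDeo’s): it estimates \(\Prob{Y=y}\) and \(\Prob{Y=y|X_p=x_p}\) by sample frequencies and forms the weight for each data point.

```python
import numpy as np

def dedeo_weights(y, x_p):
    """Weights P(Y = y_i) / P(Y = y_i | X_p = x_p_i), estimated by sample frequencies."""
    y, x_p = np.asarray(y), np.asarray(x_p)
    w = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        p_marginal = np.mean(y == y[i])               # estimate of P(Y = y_i)
        in_group = (x_p == x_p[i])
        p_conditional = np.mean(y[in_group] == y[i])  # estimate of P(Y = y_i | X_p = x_p_i)
        w[i] = p_marginal / p_conditional
    return w

# Toy example: two protected groups with different base rates of Y=1, so the
# weights up-weight label/group combinations the protected attribute makes
# relatively unlikely.
rng = np.random.default_rng(0)
x_p = rng.integers(0, 2, size=1000)
y = rng.binomial(1, np.where(x_p == 1, 0.6, 0.3))
w = dedeo_weights(y, x_p)
```

A model fit with these as case weights (most fitting routines accept something like a `sample_weight` argument) sees, approximately, the distorted distribution above, under which \(Y\) is independent of \(X_p\).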
“Classification parity”
- Demographic parity: \(\Prob{\Yhat(X) = 1| X_p} = \Prob{\Yhat(X)=1}\)
- E.g., equal detention rates across groups
- Thought exercise for you: Does this imply independence between \(\Yhat\) and \(X_p\)?
- Implementing this typically implies different thresholds on \(p(x)\) for each group!
- FPR parity: equal false positive rates across groups
- \(\Prob{\Yhat(X)=1|Y=0, X_p} = \Prob{\Yhat(X)=1|Y=0}\)
- Concretely: Equal detention rates among those who would not have committed a crime if released
- May (as in that example) be hard to know what those rates are…
- FNR parity: equal false negative rates across groups
- \(\Prob{\Yhat(X)=0|Y=1, X_p} = \Prob{\Yhat(X)=0|Y=1}\)
- Concretely: Equal probability of detention among those who would have gone on to commit a crime had they been released
- PPV/NPV parity: equal positive and negative predictive values across groups \[
\Prob{Y=1|\Yhat(X), X_p} = \Prob{Y=1|\Yhat(X)}
\] so outcome is independent of protected attributes given prediction
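Checking any of these parity conditions on held-out data is mechanical: compute each rate within each group and compare. A minimal sketch (my own function name; assumes binary labels and predictions as NumPy arrays):

```python
import numpy as np

def group_rates(y, yhat, group):
    """Per-group detention rate, FPR, FNR, and PPV for binary y and yhat."""
    out = {}
    for g in np.unique(group):
        yg, yhg = y[group == g], yhat[group == g]
        out[g] = {
            "detention_rate": np.mean(yhg == 1),  # P(Yhat=1 | group)
            "FPR": np.mean(yhg[yg == 0] == 1),    # P(Yhat=1 | Y=0, group)
            "FNR": np.mean(yhg[yg == 1] == 0),    # P(Yhat=0 | Y=1, group)
            "PPV": np.mean(yg[yhg == 1] == 1),    # P(Y=1  | Yhat=1, group)
        }
    return out
```

Demographic parity amounts to the detention rates being (approximately) equal across groups, FPR/FNR parity to the corresponding error rates being equal, and PPV parity to the last entry being equal; as noted above, enforcing demographic parity typically means picking different score thresholds for different groups.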
“Calibration”
- Risk score \(s(X)\) is calibrated across groups (sometimes “equally calibrated”) when \[
\Prob{Y=1|s(X), X_p} = \Prob{Y=1|s(X)}
\]
- Equivalently, \[
Y \indep X_p | s(X)
\]
- (can you show this is equivalent?)
- Note that this has to be true if \(s(X) = p(X)\), where \(p(x) \equiv \Prob{Y=1|X=x}\)
- Also note: if the decision is a function of the score, calibration does not imply equal positive predictive value, because the distribution of scores (above the decision threshold) can differ across groups
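A crude empirical check of calibration: bin the scores, and compare the observed rate of \(Y=1\) across groups within each bin. A sketch under the same assumptions as the code above (the binning scheme and names are mine):

```python
import numpy as np

def calibration_by_group(y, s, group, n_bins=10):
    """Observed P(Y=1) within equal-width score bins, separately for each group."""
    bins = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(s, bins) - 1, 0, n_bins - 1)  # bin index of each score
    table = {}
    for g in np.unique(group):
        rates = []
        for b in range(n_bins):
            in_cell = (group == g) & (idx == b)
            rates.append(np.mean(y[in_cell]) if in_cell.any() else np.nan)
        table[g] = rates
    return bins, table  # roughly matching rows across groups = (approximately) calibrated
```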
Tensions
- If the true rates of recidivism differ by group, then you cannot have both calibration and equal error rates (Chouldechova 2017)
- Specifically, she claims “it is straightforward to show that” \[
FPR = \frac{R}{1-R}\frac{1-PPV}{PPV}(1-FNR)
\] where \(R = \Prob{Y=1}\), both for the over-all population, and separately for each group (see backup)
- So if \(R\) is different for each group, but PPV is the same (calibration), then FPR and FNR must be different by group (violation of error rate parity)
- OTOH, if FPR and FNR are the same (error rate parity), and \(R\) is different, then PPV must be different across groups, and the predictor can’t be calibrated
- Finally, if PPV and error rates are the same, then prevalence must be equal
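If you don’t want to take the algebra on faith (it’s spelled out in the backup section below), here is a quick numeric check of the identity, using made-up confusion-matrix counts:

```python
# Made-up confusion-matrix counts: true/false positives and negatives
TP, FP, FN, TN = 30, 10, 20, 140
n = TP + FP + FN + TN

R   = (TP + FN) / n    # prevalence P(Y=1)
FPR = FP / (FP + TN)   # P(Yhat=1 | Y=0)
FNR = FN / (TP + FN)   # P(Yhat=0 | Y=1)
PPV = TP / (TP + FP)   # P(Y=1 | Yhat=1)

lhs = FPR
rhs = (R / (1 - R)) * ((1 - PPV) / PPV) * (1 - FNR)
assert abs(lhs - rhs) < 1e-9   # the identity holds for any counts, not just these
```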
The lurking problem: designedly missing data
- Suppose we, the legal system, hold everyone with \(\Yhat=1\) until trial, and only release those with \(\Yhat=0\)
- Then we have no data about \(\Prob{Y=0|\Yhat=1}\): by detaining those people, we never get to observe what they would have done if released
- We do get to see \(\Prob{Y=0|\Yhat=0}\) (i.e., the negative predictive value)
- Every case contributing to \(\Prob{Y=1|\Yhat=0}\), i.e., every released person who goes on to be re-arrested, will (very likely) make our jobs uncomfortable…
- Similarly, it’s hard for lenders to know how many borrowers they rejected would have paid back their loans, or for colleges to know how many rejected applicants would have done well at their school
- Might be able to proxy this if the applicants got loans from other lenders / went to other schools, but issues of proxy quality (was that school really similar?)
- Not usually an option for the courts
- Historical data is only available for those released by the courts before, which introduces all sorts of weird biases
- E.g., suppose having a history of being a gang member usually rules out pre-trial release, and all the exceptions who were released are really unusual people who can (say) prove to the courts that they’ve totally turned around their lives
- In the data, then, gang membership could well be associated with lower risk of recidivism
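A toy simulation of that last point (all numbers invented for illustration): gang membership genuinely raises the re-offense probability, but because gang members are released only when an unobserved “turned their life around” variable is extremely favorable, the association can reverse in the released sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
gang = rng.binomial(1, 0.2, n)    # history of gang membership
reform = rng.uniform(0, 1, n)     # unobserved "turned their life around" score

# True re-offense probability: gang membership is genuinely riskier,
# and being more reformed helps everyone.
p_reoffend = 0.05 + 0.10 * gang + 0.60 * (1 - reform)
y = rng.binomial(1, p_reoffend)

# Release decision: gang members are released only if extremely reformed.
released = np.where(gang == 1, reform > 0.95, reform > 0.30)

print(y[gang == 1].mean(), y[gang == 0].mean())
# whole population: gang members are riskier
print(y[released & (gang == 1)].mean(), y[released & (gang == 0)].mean())
# released only: the association reverses
```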
Summing up
- Basic anti-classification (don’t use protected attributes) is easy, but leaves open proxies
- Classification parity is a solvable technical problem
- Calibration is also a solvable technical problem
- We cannot possibly achieve all three of anti-classification, classification parity, and calibration.
- We can’t even really achieve both classification parity and calibration.
- For many applications, actually following our predictions would remove the data needed to see whether we were right or not
Backup: Filling in Chouldechova’s “it is straightforward to show that”
(maybe it’s straightforward for Alex…)
I’ll write out the algebra for the population as a whole; doing it for each group just means sprinkling in conditioning signs.
\(R = \Prob{Y=1}\) is the true prevalence or base rate.
Chouldechova’s claim is that \[
FPR = \frac{R}{1-R} \frac{1-PPV}{PPV} (1-FNR)
\]
Substituting in from the definitions, \[
\Prob{\Yhat=1|Y=0} = \frac{\Prob{Y=1}}{\Prob{Y=0}} \frac{1-\Prob{Y=1|\Yhat=1}}{\Prob{Y=1|\Yhat=1}} (1-\Prob{\Yhat=0|Y=1})
\] Since \(Y\) and \(\Yhat\) are both binary, \[
\Prob{\Yhat=1|Y=0} = \frac{\Prob{Y=1}}{\Prob{Y=0}} \frac{\Prob{Y=0|\Yhat=1}}{\Prob{Y=1|\Yhat=1}}\Prob{\Yhat=1|Y=1}
\] but \[\begin{eqnarray}
\Prob{Y=0|\Yhat=1} & = & \Prob{Y=0, \Yhat=1}/\Prob{\Yhat=1}\\
&= & \Prob{\Yhat=1|Y=0}\Prob{Y=0}/\Prob{\Yhat=1}\\
\Prob{Y=1|\Yhat=1} & = & \Prob{\Yhat=1|Y=1}\Prob{Y=1}/\Prob{\Yhat=1}
\end{eqnarray}\] so \[\begin{eqnarray}
\Prob{Y=0|\Yhat=1} / \Prob{Y=1|\Yhat=1} & = & \Prob{\Yhat=1|Y=0}\Prob{Y=0} / \Prob{\Yhat=1|Y=1} \Prob{Y=1}\\
\frac{\Prob{Y=1}}{\Prob{Y=0}} \frac{\Prob{Y=0|\Yhat=1}}{ \Prob{Y=1|\Yhat=1}} & = &\Prob{\Yhat=1|Y=0} / \Prob{\Yhat=1|Y=1}
\end{eqnarray}\] and so, substituting in, we get \[
\Prob{\Yhat=1|Y=0} = \Prob{\Yhat=1|Y=0}
\] which is certainly true. Since every step was a reversible substitution, the chain of equalities can be run backwards, and the original identity follows.
References
Corbett-Davies, Sam, and Sharad Goel. 2018. “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning.” E-print, arxiv:1808.00023. https://arxiv.org/abs/1808.00023.
DeDeo, Simon. 2016. “Wrong Side of the Tracks: Big Data and Protected Categories.” In Big Data Is Not a Monolith, edited by Cassidy R. Sugimoto, Hamid R. Ekbia, and Michael Mattioli, 31–42. Cambridge, Massachusetts: MIT Press. http://arxiv.org/abs/1412.4643.