diff --git a/paper/paper.pdf b/paper/paper.pdf index 3e0cacb552ec374539b99516e1e18658e9a16e13..ab2daf18f2f01715c4626ae2b86b5fa1ca05a614 100644 Binary files a/paper/paper.pdf and b/paper/paper.pdf differ diff --git a/paper/paper.tex b/paper/paper.tex index 663cdc917ffae986c5b23e7fbb1105af2edebfbf..2211b3d1e4aff96d3b4a73c9dcdfad676fc0d548 100644 --- a/paper/paper.tex +++ b/paper/paper.tex @@ -30,9 +30,10 @@ \usepackage{nicefrac} % compact symbols for 1/2, etc. \usepackage{microtype} % microtypography \usepackage{xcolor} % colors +\usepackage{amsmath} -\title{High-Fidelity Counterfactual Explanations through Conformal Prediction} +\title{Conformal Counterfactual Explanations} % The \author macro works with any number of authors. There are two commands @@ -48,10 +49,10 @@ Patrick Altmeyer\thanks{Use footnote for providing further information about author (webpage, alternative address)---\emph{not} for acknowledging funding agencies.} \\ - Department of Computer Science\\ - Cranberry-Lemon University\\ - Pittsburgh, PA 15213 \\ - \texttt{hippo@cs.cranberry-lemon.edu} \\ + Faculty of Electrical Engineering, Mathematics and Computer Science\\ + Delft University of Technology\\ + 2628 XE Delft, The Netherlands \\ + \texttt{p.altmeyer@tudelft.nl} \\ % examples of more authors % \And % Coauthor \\ @@ -86,22 +87,42 @@ \section{Introduction}\label{intro} -Counterfactual Explanations are a powerful, flexible and intuitive way to not only explain Black Box Models but also enable affected individuals to challenge them through the means of Algorithmic Recourse. Instead of opening the black box, Counterfactual Explanations work under the premise of strategically perturbing model inputs to understand model behaviour \cite{wachter2017counterfactual}. Intuitively speaking, we generate explanations in this context by asking simple what-if questions of the following nature: `Our credit risk model currently predicts that this individual's credit profile is too risky to offer them a loan. What if they reduced their monthly expenditures by 10\%? Will our model then predict that the individual is credit-worthy'? +Counterfactual Explanations are a powerful, flexible and intuitive way to not only explain Black Box Models but also enable affected individuals to challenge them through the means of Algorithmic Recourse. Instead of opening the black box, Counterfactual Explanations work under the premise of strategically perturbing model inputs to understand model behaviour \citep{wachter2017counterfactual}. Intuitively speaking, we generate explanations in this context by asking simple what-if questions of the following nature: `Our credit risk model currently predicts that this individual's credit profile is too risky to offer them a loan. What if they reduced their monthly expenditures by 10\%? Will our model then predict that the individual is credit-worthy'? -This is typically implemented by defining a target outcome $t \in \mathcal{Y}$ for some individual $x \in \mathcal{X}$, for which the model $f: \mathcal{X} \mapsto \mathcal{Y}$ initially predicts a different outcome: $f(x)\ne t$. Counterfactuals are then searched by minimizing a loss function that compares the predicted model output to the target outcome: $\ell(f(x),t)$. Since Counterfactual Explanations (CE) work directly with the Black Box Model, they always have full local fidelity by construction. Fidelity is defined as the degree to which explanations approximate the predictions of the Black Box Model. 
This arguably one of the most important evaluation metrics for model explanations, since any explanation that explains a prediction not actually made by the model is useless \cite{molnar2020interpretable}. +This is typically implemented by defining a target outcome $t \in \mathcal{Y}$ for some individual $x \in \mathcal{X}$, for which the model $M:\mathcal{X}\mapsto\mathcal{Y}$ initially predicts a different outcome: $M(x)\ne t$. Counterfactuals are then searched by minimizing a loss function that compares the predicted model output to the target outcome: $\text{yloss}(M(x),t)$. Since Counterfactual Explanations (CE) work directly with the Black Box Model, they always have full local fidelity by construction. Fidelity is defined as the degree to which explanations approximate the predictions of the Black Box Model. This is arguably one of the most important evaluation metrics for model explanations, since any explanation that explains a prediction not actually made by the model is useless \citep{molnar2020interpretable}. -In situations where full fidelity is a requirement, CE therefore offers a more appropriate solution to Explainable Artificial Intelligence (XAI) than other popular approaches like LIME \cite{ribeiro2016why} and SHAP \cite{lundberg2017unified}, which involve local surrogate models. But even full fidelity is not a sufficient condition for ensuring that an explanation adequately describes the behaviour of a model. That is because two very distinct explanations can both lead to the same model prediction, especially when dealing with heavily parameterized models: +In situations where full fidelity is a requirement, CE therefore offers a more appropriate solution to Explainable Artificial Intelligence (XAI) than other popular approaches like LIME \citep{ribeiro2016why} and SHAP \citep{lundberg2017unified}, which involve local surrogate models. But even full fidelity is not a sufficient condition for ensuring that an explanation adequately describes the behaviour of a model. That is because two very distinct explanations can both lead to the same model prediction, especially when dealing with heavily parameterized models: \begin{quotation} […] deep neural networks are typically very underspecified by the available data, and […] parameters [therefore] correspond to a diverse variety of compelling explanations for the data. - --- \cite[Wilson 2020]{wilson2020case} + --- \citet{wilson2020case} \end{quotation} When people talk about Black Box Models, this is usually the type of model they have in mind. -In the context of CE, the idea that no two explanations are the same arises almost naturally. Even the baseline approach proposed by \cite[Wachter et al.]{wachter2017counterfactual} can yield a diverse set of explanations if counterfactuals are intialised randomly. This multiplicity of explanations has not only been acknowledged in the literature but positively embraced: since individuals seeking Algorithmic Recourse (AR) have unique preferences,~\cite[Mothilal et al.]{mothilal2020explaining}, for example, have prescribed \textit{diversity} as an explicit goal for counterfactuals. More generally, the literature on CE and AR has brought forward a myriad of desiderata for explanations, which we will discuss in more detail in the following section. +In the context of CE, the idea that no two explanations are the same arises almost naturally.
Even the baseline approach proposed by \citet{wachter2017counterfactual} can yield a diverse set of explanations if counterfactuals are initialised randomly. This multiplicity of explanations has not only been acknowledged in the literature but positively embraced: since individuals seeking Algorithmic Recourse (AR) have unique preferences,~\citet{mothilal2020explaining}, for example, have prescribed \textit{diversity} as an explicit goal for counterfactuals. More generally, the literature on CE and AR has brought forward a myriad of desiderata for explanations, which we will discuss in more detail in the following section. -\section{Adversarial Example or Plausible Explanation?}\label{background} +\section{Adversarial Example or Valid Explanation?}\label{background} + +Most state-of-the-art approaches to generating Counterfactual Explanations (CE) rely on gradient descent to optimize different flavours of the same counterfactual search objective, + +\begin{equation} \label{eq:general} +\begin{aligned} +\mathbf{s}^\prime &= \arg \min_{\mathbf{s}^\prime \in \mathcal{S}} \left\{ {\text{yloss}(M(f(\mathbf{s}^\prime)),y^*)} + \lambda {\text{cost}(f(\mathbf{s}^\prime))} \right\} +\end{aligned} +\end{equation} + +where $\text{yloss}$ denotes the primary loss function already introduced above and $\text{cost}$ is either a single penalty or a collection of penalties that are used to impose constraints through regularization. Following the convention in \citet{altmeyer2023endogenous}, we use $\mathbf{s}^\prime=\{ s_k\}_K$ to denote the $K$-dimensional array of counterfactual states. This is to explicitly account for the fact that we can generate multiple counterfactuals, as with DiCE \citep{mothilal2020explaining}, and may choose to traverse a latent representation $\mathcal{Z}$ of the feature space $\mathcal{X}$, as we will discuss further below. + +Solutions to Equation~\ref{eq:general} are considered valid as soon as the predicted label matches the target label. A stripped-down counterfactual explanation is therefore little different from an adversarial example. In Figure~\ref{fig:mnist}, for example, generic counterfactual search as proposed in \citet{wachter2017counterfactual} has been applied to MNIST data. + +\begin{figure} + \centering + \fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}} + \caption{Counterfactual explanations for MNIST data generated through generic counterfactual search \citep{wachter2017counterfactual}.} + \label{fig:mnist} +\end{figure} + +To properly serve both AI practitioners and individuals affected by AI decision-making systems, Counterfactual Explanations should have certain desirable properties, sometimes referred to as \textit{desiderata}. Besides diversity, which we already introduced above, some of the most prominent desiderata include sparsity, proximity \citep{wachter2017counterfactual}, actionability~\citep{ustun2019actionable}, plausibility \citep{joshi2019realistic,poyiadzi2020face,schut2021generating}, robustness \citep{upadhyay2021robust,pawelczyk2022probabilistically,altmeyer2023endogenous} and causality~\citep{karimi2021algorithmic}. Researchers have come up with various ways to meet these desiderata, which have been surveyed in~\citet{verma2020counterfactual} and~\citet{karimi2020survey}. @@ -473,9 +494,10 @@ Note that the Reference section does not count towards the page limit.
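To make the counterfactual search objective in Equation~\ref{eq:general} concrete, the sketch below spells out a bare-bones gradient-based search loop in Python. This is only an illustration of the objective, not the implementation behind the paper: the PyTorch dependency, the cross-entropy choice for $\text{yloss}$, the $\ell_1$ proximity penalty standing in for $\text{cost}$ and the helper name \texttt{counterfactual\_search} are all assumptions made for the example.

\begin{verbatim}
import torch

def counterfactual_search(M, x, y_star, f=lambda s: s,
                          lam=0.1, steps=500, lr=0.05):
    """Gradient-based search for a counterfactual state s' (cf. eq:general).

    M      -- differentiable classifier returning logits
    x      -- factual input of shape (1, d), also used to initialise s'
    y_star -- index of the target class y*
    f      -- maps counterfactual states to feature space (identity for
              feature-space search; a decoder when traversing a latent space)
    lam    -- trade-off weight (lambda) on the cost penalty
    """
    s = x.clone().detach().requires_grad_(True)   # counterfactual state s'
    target = torch.tensor([y_star])
    opt = torch.optim.Adam([s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        yloss = torch.nn.functional.cross_entropy(M(f(s)), target)
        cost = torch.norm(f(s) - x, p=1)          # illustrative proximity penalty
        (yloss + lam * cost).backward()
        opt.step()
        # valid as soon as the predicted label matches the target label
        if M(f(s)).argmax(dim=-1).item() == y_star:
            break
    return f(s).detach()
\end{verbatim}

With $f$ set to the identity, the loop searches directly in the feature space $\mathcal{X}$; swapping in a decoder (and initialising $\mathbf{s}^\prime$ from an encoding of $x$ rather than from $x$ itself) would correspond to traversing a latent representation $\mathcal{Z}$, as discussed above.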
\medskip -\bibliographystyle{plain} +\bibliographystyle{plainnat} \bibliography{../bib} + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Checklist} diff --git a/paper/www/mnist_9to4_latent.png b/paper/www/mnist_9to4_latent.png new file mode 100644 index 0000000000000000000000000000000000000000..8f6b3a53c7295ef3678d841e79dc5e0bd38f6e3c Binary files /dev/null and b/paper/www/mnist_9to4_latent.png differ