diff --git a/paper/aaai/author_response.md b/paper/aaai/author_response.md index 3b5bf3a4a098c2b48ddd20597b3e4495609dae7a..be03a38640a8737c3bd537c05984537470e9666c 100644 --- a/paper/aaai/author_response.md +++ b/paper/aaai/author_response.md @@ -9,4 +9,9 @@ 2. We have run more extensive experiments including fine-tuning hyperparameter choices 4. We have revisited the mathematical notation. 5. We have moved the introduction of conformal prediction forward and added more detail in line with reviewer feedback. -6. We have extended the limitations section. \ No newline at end of file +6. We have extended the limitations section. +7. Distance metric + 1. We have revisited the distance metrics and decided to use the L2 norm for plausibility and faithfulness. + 2. Originally, we used the L1 norm, in line with how the closeness criterion is commonly evaluated. In this context, however, the L1 norm implicitly addresses the desire for sparsity. + 3. In the case of image data, we also used cosine distance. + \ No newline at end of file diff --git a/paper/aaai/paper.pdf b/paper/aaai/paper.pdf index 885121276d88428db8f4a2bfc19d47cbe58dbb08..c7b1d817cfc62c1a20a07ac0a42ab623b052c343 100644 Binary files a/paper/aaai/paper.pdf and b/paper/aaai/paper.pdf differ diff --git a/paper/appendix.tex b/paper/appendix.tex index 623aa6c3271450a03c6d2df42a27c44b92f75e62..5ef3a48b07b2b5217b53aaf5ceacefffa9bc0a07 100644 --- a/paper/appendix.tex +++ b/paper/appendix.tex @@ -22,13 +22,13 @@ As mentioned in the body of the paper, we rely on a biased sampler involving sep \end{aligned} \end{equation} -Consistent with~\citet{grathwohl2020your}, we have specified $\epsilon=2$ and $\sigma=0.01$ as the default values for all of our experiments. The number of total SGLD steps $J$ varies by dataset (Table~\ref{tab:ebmparams}). Following best practices, we initialize $\mathbf{x}_0$ randomly in 5\% of all cases and sample from a buffer in all other cases. 
The buffer itself is randomly initialised and gradually grows to a maximum of 10,000 samples during training as $\hat{\mathbf{x}}_{J}$ is stored in each epoch~\citep{du2020implicit,grathwohl2020your}. +Consistent with~\citet{grathwohl2020your}, we have specified $\epsilon=2$ and $\sigma=0.01$ as the default values for all of our experiments. The number of total SGLD steps $J$ varies by dataset (Table~\ref{tab:ebmparams}). Following best practices, we initialize $\mathbf{x}_0$ randomly in 5\% of all cases and sample from a buffer in all other cases. The buffer itself is randomly initialised and gradually grows to a maximum of 10,000 samples during training as $\hat{\mathbf{x}}_{J}$ is stored in each epoch~\citep{du2019implicit,grathwohl2020your}. It is important to realise that sampling is done during each training epoch, which makes training Joint Energy Models significantly harder than conventional neural classifiers. In each epoch the generated (batch of) sample(s) $\hat{\mathbf{x}}_{J}$ is used as part of the generative loss component, which compares its energy to that of observed samples $\mathbf{x}$: $L_{\text{gen}}(\theta)=\mu_{\theta}(\mathbf{x})[\mathbf{y}]-\mu_{\theta}(\hat{\mathbf{x}}_{J})[\mathbf{y}]$. 
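The SGLD sampling loop with a persistent replay buffer described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's actual Julia implementation: `grad_energy` stands in for the model's energy gradient, and all names (`sgld_sample`, `ReplayBuffer`) are hypothetical. The defaults `eps=2` and `sigma=0.01`, the 5\% fresh-initialization rate, and the 10,000-sample buffer cap follow the values stated in the text.

```python
import numpy as np

def sgld_sample(grad_energy, x0, n_steps=50, eps=2.0, sigma=0.01, rng=None):
    """Run J steps of Stochastic Gradient Langevin Dynamics (sketch).

    grad_energy: callable returning the gradient of the energy w.r.t. x.
    eps, sigma:  step size and noise scale (defaults from the appendix).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Gradient step on the energy plus Gaussian noise.
        x = x - (eps / 2) * grad_energy(x) + sigma * rng.standard_normal(x.shape)
    return x

class ReplayBuffer:
    """Persistent chain buffer: 5% fresh random inits, otherwise replay."""

    def __init__(self, dim, max_size=10_000, p_fresh=0.05, rng=None):
        self.dim, self.max_size, self.p_fresh = dim, max_size, p_fresh
        self.rng = np.random.default_rng() if rng is None else rng
        self.samples = []

    def init_chain(self):
        # Replay a stored sample with probability 1 - p_fresh.
        if self.samples and self.rng.random() > self.p_fresh:
            return self.samples[self.rng.integers(len(self.samples))].copy()
        return self.rng.standard_normal(self.dim)  # fresh random init

    def store(self, x):
        # Store the final SGLD state; evict oldest once the cap is reached.
        self.samples.append(np.array(x, dtype=float))
        if len(self.samples) > self.max_size:
            self.samples.pop(0)
```

In each epoch the chain would be initialized via `init_chain()`, run through `sgld_sample`, and the final state stored back with `store()` before computing the generative loss.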
Our full training objective can be summarized as follows, \begin{equation}\label{eq:jem-loss} \begin{aligned} - L(\theta) &= L_{\text{clf}}(\theta) + L_{\text{gen}}(\theta) + \lambda L_{\text{reg}}(\theta) + L_{\text{JEM}}(\theta) &= L_{\text{clf}}(\theta) + L_{\text{gen}}(\theta) + \lambda L_{\text{reg}}(\theta) \end{aligned} \end{equation} @@ -87,9 +87,43 @@ In addition to the smooth set size penalty,~\citet{stutz2022learning} also propo \caption{Prediction set size (left), smooth set size loss (centre) and configurable classification loss (right) for a JEM trained on our \textit{Linearly Separable} data.}\label{fig:cp-diff} \end{figure} -\subsection{ECCCo}\label{app:eccco} +\subsection{\textit{ECCCo}}\label{app:eccco} + +In this section, we explain \textit{ECCCo} in some more detail, briefly discuss convergence conditions for counterfactual explanations and provide details concerning the actual implementation of our framework in \texttt{Julia}. + +\subsubsection{More detail on our generator} + +The counterfactual search objective for \textit{ECCCo} was introduced in Equation~\ref{eq:eccco} in the body of the paper. We restate this equation here for reference: + +\begin{equation} \label{eq:eccco-app} + \begin{aligned} + \mathbf{Z}^\prime &= \arg \min_{\mathbf{Z}^\prime \in \mathcal{Z}^L} \{ {\text{yloss}(M_{\theta}(f(\mathbf{Z}^\prime)),\mathbf{y}^+)}+ \lambda_{1} {\text{dist}(f(\mathbf{Z}^\prime),\mathbf{x}) } \\ + &+ \lambda_2 \Delta\mathcal{E}(\mathbf{Z}^\prime,\widehat{\mathbf{X}}_{\theta,\mathbf{y}^+}) + \lambda_3 \Omega(C_{\theta}(f(\mathbf{Z}^\prime);\alpha)) \} + \end{aligned} +\end{equation} + +We can make the connection to energy-based modeling more explicit by restating this equation in terms of $L_{\text{JEM}}(\theta)$, which we defined in Equation~\ref{eq:jem-loss}. 
In particular, note that for $\lambda_2=1$ and $\lambda L_{\text{reg}}(\theta)=0$ we have + +\begin{equation} \label{eq:eccco-jem} + \begin{aligned} + \mathbf{Z}^\prime &= \arg \min_{\mathbf{Z}^\prime \in \mathcal{Z}^L} \{ {L_{\text{JEM}}(\theta;M_{\theta}(f(\mathbf{Z}^\prime)),\mathbf{y}^+)}+ \lambda_{1} {\text{dist}(f(\mathbf{Z}^\prime),\mathbf{x}) } + \lambda_3 \Omega(C_{\theta}(f(\mathbf{Z}^\prime);\alpha)) \} + \end{aligned} +\end{equation} + +since $\Delta\mathcal{E}(\cdot)$ is equivalent to the generative loss function $L_{\text{gen}}(\cdot)$. In fact, this is also true for $\lambda L_{\text{reg}}(\theta)\ne0$ since we use the Ridge penalty $L_{\text{reg}}(\theta)$ in the counterfactual search just like we do in joint-energy training. This detail was omitted from the body of the paper for the sake of simplicity. + +Aside from the additional penalties in Equation~\ref{eq:eccco-app}, the key difference between our counterfactual search objective and the joint-energy training objective is the parameter that is being optimized. In joint-energy training we optimize the objective with respect to the network weights $\theta$: + +\begin{equation}\label{eq:jem-grad} + \begin{aligned} + \nabla_{\theta} L_{\text{JEM}}(\theta) &=\nabla_{\theta} L_{\text{clf}}(\theta) + \nabla_{\theta}L_{\text{gen}}(\theta) + \lambda \nabla_{\theta} L_{\text{reg}}(\theta) + \end{aligned} +\end{equation} + +In the counterfactual search, by contrast, we take these parameters as fixed and optimize with respect to $\mathbf{Z}^\prime$. + + + -In this section, we briefly discuss convergence conditions for CE and provide details concerning the actual implementation of our framework in \texttt{Julia}. \subsubsection{A Note on Convergence} Convergence is not typically discussed much in the context of CE, even though it has important implications for outcomes. 
One intuitive way to specify convergence is in terms of threshold probabilities: once the predicted probability $p(\mathbf{y}^+|\mathbf{x}^{\prime})$ exceeds some user-defined threshold $\gamma$ such that the counterfactual is valid, we could consider the search to have converged. In the binary case, for example, convergence could be defined as $p(\mathbf{y}^+|\mathbf{x}^{\prime})>0.5$ in this sense. Note, however, that this can be expected to yield counterfactuals in the proximity of the decision boundary, a region characterized by high aleatoric uncertainty. In other words, counterfactuals generated in this way would generally not be plausible. To prevent this from happening, we specify convergence in terms of gradients approaching zero for all our experiments and all of our generators. This allows us to get a cleaner read on how the different counterfactual search objectives affect counterfactual outcomes. diff --git a/paper/neurips/paper.pdf b/paper/neurips/paper.pdf index f5a0bfab487c92aab43d7b7e5a66706a18be922b..dfee8692887d231ea444e982fadc950d6882cc30 100644 Binary files a/paper/neurips/paper.pdf and b/paper/neurips/paper.pdf differ
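The two convergence criteria contrasted in the appendix's note on convergence can be sketched as simple stopping checks. This is a hypothetical Python illustration (the framework itself is implemented in Julia); the function names and the tolerance `tol` are placeholders, not part of the paper:

```python
import numpy as np

def converged_threshold(p_target, gamma=0.5):
    """Threshold-probability convergence: stop as soon as p(y+|x') > gamma.

    Tends to halt near the decision boundary, a region of high
    aleatoric uncertainty, so the resulting counterfactuals are
    generally not plausible.
    """
    return p_target > gamma

def converged_gradient(grad, tol=1e-3):
    """Gradient-based convergence (used for all experiments/generators):
    stop once the gradient of the search objective w.r.t. the
    counterfactual is approximately zero.
    """
    return np.linalg.norm(grad) < tol
```

In a search loop, the gradient-based check would simply replace the threshold check as the termination condition, letting the search continue past the decision boundary until the full objective is stationary.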