Commit e2b25d2d authored by Pat Alt
now some progress on fmnist as well

Thank you! In this individual response, we will refer back to the main points discussed in the global response where relevant and discuss any other specific points the reviewer has raised.
> "The major weakness of this work is that plausibility for non-JEM-based classifiers is very low on 'real-world' datasets (Table 2)."
Please refer to **Point 3** (and to some extent also **Point 2**) of the global rebuttal.
#### Visual quality (MNIST)
We will discuss this more thoroughly in the paper.
- Concerning plausibility, larger perturbations are typically necessary to move counterfactuals not simply across the decision boundary, but into dense areas of the target domain. Thus, REVISE, for example, is also often associated with larger perturbations.
- This tradeoff can be governed through penalty strengths: if closeness is a high priority, simply increase the relative size of $\lambda_1$ in Equation (5).
We will highlight this tradeoff in section 7.
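To illustrate the tradeoff, the weighted objective could be sketched as follows. This is a minimal sketch, not our Julia implementation: the function names and the exact form of the objective are simplified stand-ins for Equation (5).

```python
import numpy as np

# Hypothetical sketch of a penalised counterfactual objective in the spirit
# of Equation (5). lambda_1 weights the closeness (distance) penalty; the
# validity term stands in for the model-based part of the loss.

def l1_dist(a, b):
    """L1 distance between counterfactual and factual input."""
    return np.abs(a - b).sum()

def objective(x_cf, x, validity_loss, lam_1):
    """Total loss: validity term plus lambda_1-weighted closeness penalty."""
    return validity_loss(x_cf) + lam_1 * l1_dist(x_cf, x)
```

Increasing `lam_1` penalises the same perturbation more heavily, so the search settles on counterfactuals that stay closer to the factual input at the possible expense of plausibility.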
#### Datasets
Please refer to **Point 4** in the global rebuttal.
- This is true and we are transparent about this in the paper (lines 320 to 322).
- ECCCo is intentionally biased towards faithfulness in the same way that Wachter is intentionally biased towards minimal perturbations.
We will make this point more explicit in section 7.
#### Other questions
Finally, let us try to answer the specific questions that were raised:
- In line 178 we (belatedly) mention that the L1 Norm is our default choice for dist$(\cdot)$. We realise now that it's not obvious that this also applies to Equations 3 and 4 and will fix that. Note that we also experimented with other distance/similarity metrics, but found the differences in outcomes to be small enough to consistently rely on L1 for its sparsity-inducing properties.
- $f$ by default just rescales the input data. GMSC data is standardized and MNIST images are rescaled to $[-1,1]$ (mentioned in Appendix D, lines 572-576, but maybe this indeed belongs in the body). $f^{-1}$ is simply the inverse transformation. Synthetic data is not rescaled. We still explicitly mention $f$ here to stay consistent with the generalised notation in Equation (1). For example, $f$/$f^{-1}$ could just as well be a compression/decompression or an encoder/decoder as in REVISE.
- In all of our experiments we set $\alpha=0.05$ (90\% target coverage) and $\kappa=1$ to avoid penalising sets of size one. We should add this to Appendix D, thanks for flagging. Note that we did experiment with these parameter choices, but as we point out in the paper, more work is needed to better understand the role of Conformal Prediction in this context.
- We have just run ECCCo and Wachter for a single MNIST digit on our machine (no GPU) using default parameters from the experiment:
- ECCCo: `4.065607 seconds (4.34 M allocations: 1.011 GiB, 7.62% gc time)`.
- Wachter: `1.899047 seconds (2.16 M allocations: 343.889 MiB, 4.59% gc time, 74.80% compilation time)`.
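To make the defaults above concrete (the L1 norm for dist$(\cdot)$ and the MNIST rescaling to $[-1,1]$ with its inverse), here is a minimal sketch. Helper names are hypothetical and our actual implementation is in Julia.

```python
import numpy as np

# Minimal sketch of the defaults described above (names are hypothetical).
# dist(.) is the L1 norm; for MNIST, f rescales pixel intensities from
# [0, 1] to [-1, 1] and f_inv is the inverse transformation.

def dist(a, b):
    return np.abs(a - b).sum()   # L1 norm, sparsity-inducing

def f(x):
    return 2.0 * x - 1.0         # [0, 1] -> [-1, 1]

def f_inv(z):
    return (z + 1.0) / 2.0       # [-1, 1] -> [0, 1]
```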
Regarding the specific question/suggestion raised by the reviewer, we do actually address this in the paper:
> "Need for gradient access, e.g. through autodiff, for black-box model under investigation."
This is indeed a limitation of our approach, although it is worth pointing out that many of the existing state-of-the-art approaches to CE rely on gradient access, which we briefly touch upon in Section 7.
We would like to thank all of the reviewers for their detailed and thoughtful reviews — your feedback is truly much appreciated. Below, we will respond to points that have been raised by at least two reviewers. Individual responses to each reviewer contain additional points.
Based on the reviewers' helpful suggestions, we plan to extend section 7 to deepen the interpretation of the results presented in this work and to discuss its limitations.
### Point 1: Add more datasets
*Summary:*
> Some reviewers have noted that "experiments with real-world data a bit limited" and "only conducted on small-scale datasets".
*Response*:
We agree that further work could benefit from including additional datasets and will make this point clear in section 7. That being said, we have relied on datasets commonly used in similar studies. Due to the size and scope of this work, we have decided to focus on conveying our motivation, methodology and conclusions through illustrative datasets.
### Point 2: Add more models
*Summary:*
> Some reviewers have noted that "focus of the models being tested seems narrow". The work could benefit from including additional models like "MLPs, CNNs, or transformer".
*Response*:
We agree that further work could benefit from including additional models and will make this point clear in section 7. In line with similar studies, we have chosen simple neural network architectures as our starting point. Moving on from there, our goal has been to understand if we can improve these simple models through joint-energy training, in order to yield more plausible counterfactuals that faithfully convey the improved quality of the underlying model.
To this end, we think that our experiments provide sufficient evidence. The size and scope of this work ultimately led us to prioritise this main point. To get this point across we focused on JEMs, because they are known to have properties that are naturally aligned with the idea of plausible counterfactuals. The question about which other kinds of models yield plausible and faithful counterfactuals (e.g. "MLPs, CNNs, or transformer" but also Bayesian NNs, adversarially trained NNs) is interesting in itself, but something we have delegated to future studies. We will be more clear about this in section 7.
Nonetheless, to immediately address the reviewers' concerns here, we provide additional qualitative examples for MNIST in the companion PDF. These also include a larger deep ensemble and a simple CNN (LeNet-5), both of which tend to yield more plausible and less noisy counterfactual images than a simple MLP. For comparison, we have also added the corresponding counterfactuals generated by Wachter. In the context of the large ensemble, improved plausibility appears to be driven by better predictive uncertainty quantification. LeNet-5 seems to benefit to some extent from a network architecture that is more appropriate for image data. Wachter fails to uncover any of this. A more detailed study of different models would indeed be very interesting and we believe that ECCCo facilitates such work.
### Point 3: Are ECCCos plausible enough in practice?
*Summary:*
> Some reviewers have expressed concern around whether "ECCCo generates plausible counterfactuals beyond synthetic datasets for non-JEM-based classifiers" and asked for qualitative examples for non-JEM-based counterfactuals. Failure to produce plausible counterfactuals "could significantly limit ECCCos’ applicability and utility for researchers as well as practitioners alike".
*Response*:
We agree that additional qualitative examples for MNIST can help to demonstrate that ECCCo does indeed uncover plausible patterns learned by non-JEM-based classifiers. Based on the reviewers' suggestions, we will therefore move the qualitative examples provided in the companion PDF into the supplementary material. With respect to our other real-world dataset, the results in Table 2 indicate that ECCCo consistently achieves substantially higher plausibility than Wachter.
It is important to note here that ECCCo aims to generate faithful counterfactuals first and foremost. Plausibility is achieved only to the extent that the underlying model learns plausible explanations for the data. Thus, we disagree that failure to produce plausible counterfactuals would limit ECCCo's usefulness in practice. We argue that this should not be seen as a weakness, but rather as a strength of ECCCo, for the following reasons:
- For practitioners/researchers it is valuable information indicating that despite good predictive performance, the learned posterior density $p_{\theta}(\mathbf{x}|\mathbf{y^{+}})$ is high in regions of the input domain that are implausible (in the sense of Def 2.1, i.e. the corresponding true density $p(\mathbf{x}|\mathbf{y^{+}})$ is low in those same regions).
- Instead of using surrogate-aided counterfactual search engines to sample those counterfactuals from $p_{\theta}(\mathbf{x}|\mathbf{y^{+}})$ that are indeed plausible, we would argue that the next point of action in such cases should generally be to improve the model.
- We agree that this places an additional burden on researchers/practitioners, but that does not render ECCCo impractical. In situations where providing actionable recourse is an absolute priority, practitioners can always resort to REVISE and related tools in the short term. Major discrepancies between ECCCo and surrogate-aided tools should then at the very least signal to researchers/practitioners that the underlying model needs to be improved in the medium term.
Based on the reviewers' observations in this context, we will clarify this tension between faithfulness and plausibility further by sharpening the relevant paragraphs in our paper.
### Point 4: Add ablation studies
*Summary:*
> Some reviewers have pointed at the need for additional ablation studies to assess "if conformal prediction is actually required for ECCCos".
*Response:*
We already do this to some extent: the experiments involving our synthetic datasets are set up to explicitly address this question, as we study *ECCCo (no CP)* and *ECCCo (no EBM)*, respectively. We find that dropping these components generally leads to worse results, but also point out that Conformal Prediction (CP) appears to play less of a role than Energy-Based Modelling (EBM) (lines 278 to 281). We also note in section 7 that future work is needed to understand the role of CP better (lines 330 to 332). We will expand on this to the extent possible: one possible explanation for the limited impact of CP could be that CP relies on exchangeability. In other words, the smooth set size penalty may not be as effective as intended when we move out of domain during counterfactual search, because it fails to adequately address epistemic uncertainty. Due to the limited size and scope of this work, we have reserved these types of questions for future work.
Thank you! In this individual response, we will refer back to the main points discussed in the global response where relevant and discuss any other specific points the reviewer has raised.
> "Some notions are lacking descriptions and explanations"
We will make a full pass over all notation, and improve where needed.
#### Conditional distribution
> "[...] the class-condition distribution $p(\mathbf{x}|\mathbf{y^{+}})$ is existed but unknown and learning this distribution is very challenging especially for structural data"
We do not see this as a weakness of our paper. Instead:
- While we agree that learning this distribution is not always trivial, we note that this task is at the very core of Generative Modelling and AI—a field that has recently enjoyed success, especially in the context of large unstructured data like images and language.
- Learning the generative task is also at the core of related approaches mentioned in the paper like REVISE: as we mention in line 89, the authors of REVISE "propose using a generative model such as a Variational Autoencoder (VAE)" to learn $p(\mathbf{x})$. We also point to other related approaches towards plausibility that all centre around learning the data-generating process of the inputs $X$ (lines 85 to 104).
- Learning $p(\mathbf{x}|\mathbf{y^{+}})$ should generally be easier than learning the unconditional distribution $p(\mathbf{x})$, because the information contained in labels can be leveraged in the former case.
We will revisit section 2 to clarify this.
#### Implausibility metric
> "Additionally, the implausibility metric seems not general and rigorous [...]"
- We agree it is not perfect and speak to this in the paper (e.g. lines 297 to 299). But we think that it is an improved, more robust version of the metric that was previously proposed and used in the literature (lines 159 to 166). Nonetheless, we will make this limitation clearer also in section 7.
- The rule-based unary constraint metric proposed in Vo et al. (2023) looks interesting, but the paper will be presented for the first time at KDD in August 2023 and we were not aware of it at the time of writing. Thanks for bringing it to our attention, we will mention it in the same context in section 7.
#### Definition of "faithfulness"
We wish to highlight a possible reviewer misunderstanding with regard to a fundamental aspect of our approach:
- Specifically, we want to understand if counterfactuals are consistent with what the model has learned about the data, which is best expressed as $p_{\theta}(\mathbf{x}|\mathbf{y^{+}})$ (Def. 4.2).
- The role of SGLD in this context is described in some detail in Section 4.1 (lines 138 to 155) and additional explanations are provided in Appendix A.
We will revisit sections 3 and 4 of the paper to better explain this.
#### Conformal Prediction (CP)
We reiterate our motivation here:
- Since CP is model-agnostic, we propose relying on it to relax restrictions that were previously placed on the class of classifiers (lines 183 to 189).
- CP does indeed produce prediction sets in the context of classification. That is why we work with a smooth version of the set size that is compatible with gradient-based counterfactual search, as we explain in some detail in lines 194 to 205 and also in Appendix B.
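A hedged sketch of how a prediction-set size can be made smooth enough for gradient-based search: the exact form in Appendix B may differ, and all names below are assumptions rather than our Julia implementation.

```python
import numpy as np

# Hedged sketch of a differentiable ("smooth") prediction-set size. Each
# class contributes ~1 if its score clears the conformal threshold q_hat
# and ~0 otherwise -- but smoothly, so gradients can flow during
# counterfactual search. kappa = 1 means singleton sets incur no penalty.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth_set_size(probs, q_hat, temperature=0.1):
    """Soft count of classes whose score exceeds the threshold."""
    return sigmoid((probs - q_hat) / temperature).sum()

def set_size_penalty(probs, q_hat, kappa=1.0):
    """Penalise (soft) set sizes larger than kappa."""
    return max(smooth_set_size(probs, q_hat) - kappa, 0.0)
```

An uncertain prediction, where several class scores sit near the threshold, yields a larger soft set size and hence a larger penalty than a confident one.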
Following the suggestion from reviewer 6zGr, we will smooth the introduction of Conformal Prediction and better motivate it beforehand.

#### Experiments

> "The experiments are humble and not really solid to me. [...] the authors need to conduct ablation studies regarding the involving terms in (5)."
Please see **Points 1** and **4** of our global rebuttal.
Thank you! In this individual response, we will refer back to the main points discussed in the global response where relevant and discuss any other specific points the reviewer has raised. Below, we go through the individual points; quotations trace back to reviewer remarks.
#### Q1 and Q3: Data and models
> "[...] I still find the experiments with real-world data a bit limited. [...] The focus of the models being tested seems narrow."
Firstly, concerning the limited set of real-world datasets and models (Questions 1 and 3), please refer to **Point 1** and **Point 2** in the global response, respectively.
#### Q2: Generalisability
> "Is the ECCCos approach adaptable to a broad range of black-box models beyond those discussed?"
Our approach should generalise to any classifier that is differentiable with respect to inputs, consistent with other gradient-based counterfactual generators (Equation 1). Our actual implementation is currently compatible with neural networks trained in Julia and has experimental support for `torch` trained in either Python or R. Even though it is possible to generate counterfactuals for non-differentiable models, it is not immediately obvious to us how SGLD can be applied in this context. An interesting question for future research would be if other scalable and gradient-free methods can be used to sample from the conditional distribution learned by the model.
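As a minimal illustration of that requirement, gradient-based counterfactual search only needs the gradient of the loss with respect to the inputs. The sketch below uses a logistic-regression stand-in with an analytic input gradient in place of autodiff; all names are hypothetical and this is not our Julia implementation.

```python
import numpy as np

# Minimal sketch of gradient-based counterfactual search: the only
# requirement on the classifier is differentiability with respect to its
# inputs. Here, a logistic-regression model with an analytic input
# gradient stands in for an autodiff-backed black box.

def predict(w, b, x):
    """Probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def counterfactual(w, b, x, target=1.0, lr=0.5, steps=200):
    """Descend on the input until it is classified as the target."""
    x_cf = x.astype(float).copy()
    for _ in range(steps):
        p = predict(w, b, x_cf)
        grad = (p - target) * w   # d(binary cross-entropy)/dx for this model
        x_cf -= lr * grad
    return x_cf
```

Swapping in any differentiable model only changes how `grad` is obtained (e.g. via autodiff), not the search loop itself.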
#### Q4: Link to causality
> "There’s a broad literature on causal abstractions and causal model explanations that seems related."
We agree that there is a connection with causal abstractions and causal model explanations. The two papers by Karimi et al. we are citing in the paper point in this direction, too, addressing counterfactuals through interventions as opposed to perturbations. An area of future research could be to use the abstracted causal graph as our sampler for ECCCo (instead of SGLD). Combining the approach proposed by Karimi et al. with ideas underlying ECCCo, one could then generate counterfactuals that faithfully describe the causal graph learned by the model, instead of generating counterfactuals that comply with prior causal knowledge. We will extend section 7 with a short discussion of this connection.