Commit d91575ef authored by Pat Alt

updated response

A common concern across reviewers was limited evaluation on real-world datasets.
### A note on image datasets
Related work on the plausibility of counterfactuals has largely relied on small image datasets like *MNIST* [@dhurandhar2018explanations;@schut2021generating]. This may be because generating counterfactuals for high-dimensional input data is computationally very challenging. An exception to this rule is the work on *REVISE* [@joshi2019realistic], which uses a larger image dataset. *REVISE* is suitable for this task because it maps counterfactuals to a lower-dimensional latent space. Similarly, our proposed *ECCCo+* should also be applicable to high-dimensional input data. In our benchmarks, however, we include other generators that search directly in the input space. Since our benchmarks required us to generate a very large number of counterfactuals, it was not feasible at this time to include larger image datasets, despite our best efforts to optimize the code and parallelize the computations through multi-threading and multi-processing on a high-performance computing cluster.
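To illustrate why latent-space search in the style of *REVISE* scales better than input-space search, the sketch below optimizes a counterfactual over a two-dimensional latent space of a toy decoder and classifier. All names and numbers here (`W_dec`, `w_clf`, the loss weights) are hypothetical stand-ins, not the actual *REVISE* implementation:

```python
import numpy as np

# Hypothetical stand-ins: in REVISE-style search, `decode` would be a
# pre-trained VAE decoder and `prob_target` the classifier under explanation.
W_dec = np.array([[1.0, 0.2], [0.1, 1.0], [0.5, -0.3]])  # decoder: R^2 -> R^3
w_clf = np.array([1.5, -1.0, 0.8])                       # logistic classifier on R^3

def decode(z):
    return W_dec @ z

def prob_target(x):
    return 1.0 / (1.0 + np.exp(-(w_clf @ x)))

def loss(z, x0, lam=0.1):
    # cross-entropy towards the target class plus proximity to the factual x0
    x = decode(z)
    return -np.log(prob_target(x) + 1e-12) + lam * np.sum((x - x0) ** 2)

def grad(f, z, eps=1e-5):
    # finite-difference gradient, to keep the sketch free of autodiff libraries
    g = np.zeros_like(z)
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return g

x0 = np.array([-1.0, 1.0, -0.5])  # factual input, classified away from the target
z = np.zeros(2)                   # search runs in the low-dimensional latent space
for _ in range(200):
    z -= 0.1 * grad(lambda zz: loss(zz, x0), z)

x_cf = decode(z)  # the counterfactual is confined to the decoder's manifold
```

Because the search variables are the latent coordinates rather than the (potentially very high-dimensional) input features, the per-step cost is governed by the latent dimension, and the resulting counterfactual stays on the decoder's learned manifold.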
## Constraining Energy Directly
Another common concern was that *ECCCo* primarily achieved good results for JEMs.
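Schematically, constraining energy directly amounts to adding a JEM-style conditional energy, $E(\mathbf{x}|\mathbf{y}^{+}) = -f(\mathbf{x})[\mathbf{y}^{+}]$ (the negative logit of the target class), as a penalty in the counterfactual objective. The toy classifier and weights below are hypothetical, chosen only to make the sketch self-contained; they are not our exact objective:

```python
import numpy as np

W = np.array([[1.0, -0.5], [-1.0, 0.8]])  # hypothetical linear classifier: logits = W @ x

def energy(x, y):
    # JEM-style conditional energy: the negative logit of class y
    return -(W @ x)[y]

def prob(x, y):
    logits = W @ x
    p = np.exp(logits - logits.max())
    return (p / p.sum())[y]

def objective(x, x0, y_target, lam_dist=0.1, lam_energy=0.5):
    task = -np.log(prob(x, y_target) + 1e-12)  # push towards the target class
    dist = np.sum((x - x0) ** 2)               # stay close to the factual input
    return task + lam_dist * dist + lam_energy * energy(x, y_target)

def grad(f, x, eps=1e-5):
    # finite-difference gradient, to keep the sketch dependency-free
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

x0 = np.array([1.0, 0.0])  # factual input, confidently in class 0
x_cf = x0.copy()
for _ in range(300):
    x_cf -= 0.05 * grad(lambda xx: objective(xx, x0, y_target=1), x_cf)
```

The energy penalty rewards counterfactuals that the model itself assigns low conditional energy, which is what ties the search to the model's generative properties rather than to a separately trained surrogate.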
## Mathematical notation and concepts
One reviewer took issue with our mathematical notation, a concern that was not shared by any of the other reviewers. Nonetheless, we have revisited the notation and hope that it is now clearer. That same reviewer also raised concerns about our definitions of plausibility and faithfulness that rely on distributional properties. In particular, the reviewer argued that "[...] the class-condition distribution $p(\mathbf{x}|\mathbf{y^{+}})$ is existed but unknown and learning this distribution is very challenging especially for structural data". We have extensively argued our case during the rebuttal and pointed to a potential reviewer misunderstanding in this context. In particular, we argued:
> We do not see this as a weakness of our paper. While we agree that learning this distribution is not always trivial, we note that this task is at the very core of Generative Modelling and AI—a field that has recently enjoyed success, especially in the context of large unstructured data like images and language. Learning the generative task is also at the core of related approaches mentioned in the paper like REVISE: as we mention in line 89, the authors of REVISE "propose using a generative model such as a Variational Autoencoder (VAE)" to learn $p(\mathbf{x})$. We also point to other related approaches towards plausibility that all centre around learning the data-generating process of the inputs $X$ (lines 85 to 104). Learning $p(\mathbf{x}|\mathbf{y^{+}})$ should generally be easier than learning the unconditional distribution $p(\mathbf{x})$, because the information contained in labels can be leveraged in the former case.
None of the other reviewers found any issue with our definitions, and we have made no changes in this regard. We did, however, make a minor change to the related evaluation metrics: we are now more careful about our choice of distance function. In particular, we investigated various distance metrics for image data and decided to rely on structural dissimilarity. For all other data we use the L2 norm, where we previously used the L1 norm. This has no impact on the results, but there was no obvious reason to use the L1 norm in the first place, other than that it is typically used to assess closeness.
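The revised distance choice can be sketched as follows. For brevity the structural dissimilarity below uses a single global window rather than the usual sliding window; a proper implementation such as `skimage.metrics.structural_similarity` would be used in practice, and the constants and function names here are illustrative:

```python
import numpy as np

def l2_distance(x, y):
    return np.linalg.norm(x - y)

def dssim(x, y, c1=1e-4, c2=9e-4):
    # Structural dissimilarity, (1 - SSIM) / 2, computed over one global
    # window for brevity; library implementations slide a local window.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    )
    return (1.0 - ssim) / 2.0

def closeness(x, y, image=False):
    # structural dissimilarity for image data, L2 norm for everything else
    return dssim(x, y) if image else l2_distance(x, y)
```

Both metrics are zero for identical inputs; the practical difference is that structural dissimilarity compares local luminance, contrast, and structure rather than raw pixel differences, which aligns better with perceived similarity of images.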
## Conformal prediction was introduced too suddenly
We have extended the limitations section to address reviewer concerns.
## Other improvements
As discussed above, counterfactual explanations do not scale very well to high-dimensional input data. The NeurIPS feedback has motivated us to work on this issue by adding intuitive support for multi-threading and multi-processing to our code. This has not only allowed us to include additional datasets but also to run extensive experiments to fine-tune hyperparameter choices. All of our code will be open-sourced as a package, and we hope that it will be as useful to the community as it was to us during our research.
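The parallelization pattern is conceptually simple: benchmark jobs (one per factual instance, generator, and model combination) are independent, so they can be mapped over a worker pool. A language-agnostic sketch in Python, where `generate_counterfactual` is a hypothetical stand-in for one benchmark job (our actual implementation and its API differ):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_counterfactual(task):
    # Hypothetical stand-in for one benchmark job; the real workload would
    # run a full counterfactual search for the given instance here.
    factual, target = task
    return {"factual": factual, "target": target, "valid": target in (0, 1)}

# one job per (factual instance, target class) pair
tasks = [(i, i % 2) for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate_counterfactual, tasks))  # order-preserving
```

Because the jobs share no state, the same pattern extends from threads to processes or to distributed workers on a cluster without changing the per-job code.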