CARD Regression Toy Example Notebook #18

nilsleh opened this issue Aug 16, 2023 · 14 comments

nilsleh commented Aug 16, 2023

Thank you for the interesting paper and for publishing your code. I am trying to create a small notebook that demonstrates the CARD training and evaluation procedure on a toy dataset, in order to better understand the mechanics and details of the CARD method. I am aware that you provide runnable scripts for some toy examples; however, they all expect config files and go through the large regression/main.py script, which I found a bit difficult to follow because it bundles a lot of additional functionality. I was therefore hoping to create a small notebook that demonstrates the method on a toy example, with comments and descriptions added to better explain what is happening.

Here is a Google Colab notebook in which I tried to extract the relevant code pieces into a small reproducible example. The central question I have is how to properly extract a form of predictive uncertainty (with or without the Normal distribution assumption used for the NLL) and show it for the toy example, so that its qualities can be compared to other methods.

Thanks in advance. If you think such a notebook would be useful, I would also be happy to contribute it in a PR.


XzwHan commented Sep 2, 2023

Hi @nilsleh, thank you for checking out our paper and repo. Our work focuses solely on modeling (recovering) the aleatoric uncertainty of the true $p(\boldsymbol{y}\ |\ \boldsymbol{x})$ distribution, and we proposed an empirical metric, Quantile Interval Coverage Error (QICE), to evaluate how well the learned conditional distribution matches the true data distribution. You may check out Section 4.1.1 of the paper for the details of this metric, and the function compute_true_coverage_by_gen_QI here for our implementation. This metric does not assume a Gaussian distribution for $p(\boldsymbol{y}\ |\ \boldsymbol{x})$, and works well empirically for multimodal distributions (see Figure 1 and Table 22).
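As a rough sketch of the idea (a minimal NumPy version with hypothetical names, not the repo's compute_true_coverage_by_gen_QI itself): draw many $\boldsymbol{y}$ samples per test input, bin each true target by the empirical quantiles of its own samples, and measure how far the per-bin proportions deviate from the ideal $1/M$.

```python
import numpy as np

def qice(y_true, y_samples, n_bins=10):
    """Rough sketch of the Quantile Interval Coverage Error.

    y_true:    shape (N,)   observed targets
    y_samples: shape (N, S) samples from the learned p(y|x) at each test input
    """
    # Per-instance empirical quantile boundaries of the generated samples
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    bounds = np.quantile(y_samples, qs, axis=1).T          # (N, n_bins + 1)

    # Fraction of true targets that falls into each quantile interval
    ratios = np.empty(n_bins)
    for k in range(n_bins):
        lo, hi = bounds[:, k], bounds[:, k + 1]
        if k == n_bins - 1:                                 # last bin closed on the right
            in_bin = (y_true >= lo) & (y_true <= hi)
        else:
            in_bin = (y_true >= lo) & (y_true < hi)
        ratios[k] = in_bin.mean()

    # Perfect matching puts 1 / n_bins of the targets in every interval
    return np.abs(ratios - 1.0 / n_bins).mean()
```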

We have written scripts that are pre-configured for the toy examples. For instance, you may run

bash training_scripts/run_toy_sinusoidal_regression_mdn.sh

after changing into the regression directory, for the sinusoidal regression task.


nilsleh commented Sep 4, 2023

Hi @XzwHan, thank you for your reply! I understand the benefit of your approach and like the proposed metric. I have run the pre-configured scripts for the toy examples, but have the following question.

In Table 2 of your paper you report the NLL on the UCI regression tasks, and computing the NLL of a single data point requires a predictive uncertainty estimate. It appears that in the code the predictive variance used for the NLL computation is the variance taken over all generated samples. However, I am not entirely clear which timestep $t$ you use when reporting the NLL.

Or, more generally, given a new data point, what is the corresponding uncertainty that CARD would compute for

  1. the case where you assume a Gaussian distribution, as for the NLL in Table 2, and
  2. the case where you do not make that assumption?


XzwHan commented Sep 9, 2023

We would compute NLL at all timesteps (so that we can make plots like Figure 8 to see the change of NLL during the reverse diffusion process); the reported NLL in Table 2 is at $t=0$. You may check out the implementation of store_nll_at_step_t in the same file, and how it is used.

For one new data point $(\boldsymbol{x}', y')$, QICE would only tell you which quantile interval $y'$ falls in. QICE is an empirical distribution-matching metric: it is designed to check how well empirical samples from two distributions match, and a single sample from the true data distribution cannot reflect that distribution very well. A similar statement can be made about NLL.

In other words, both NLL and QICE are metrics for distribution matching, rather than "uncertainty". Aleatoric uncertainty is part of the data-generating true distribution, so when we observe that the learned distribution matches the true distribution well, we can say that the learned distribution has captured the aleatoric uncertainty of the true distribution.
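As a rough illustration of the NLL part (a minimal sketch with hypothetical names, assuming the Gaussian NLL is computed from the per-instance mean and variance of the $y_0$ samples generated at $t = 0$, in the spirit of store_nll_at_step_t):

```python
import numpy as np

def gaussian_nll_from_samples(y_true, y_samples_t0, eps=1e-12):
    """Sketch: NLL under a Gaussian fitted to the samples generated at t = 0.

    y_true:       shape (N,)   observed targets
    y_samples_t0: shape (N, S) y_0 samples from the reverse diffusion, per input
    """
    mu = y_samples_t0.mean(axis=1)            # per-instance predictive mean
    var = y_samples_t0.var(axis=1) + eps      # per-instance predictive variance
    # Average negative log-likelihood of N(mu, var) evaluated at the true target
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (y_true - mu) ** 2 / var)
    return nll.mean()
```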


nilsleh commented Sep 24, 2023

@XzwHan Thank you again for your reply. I think what I am trying to ask is the following: suppose you have fitted your CARD model on the regression training data and now have to make a prediction for a new data point $(x', y')$. How would you answer if someone asked "what is your uncertainty about the prediction $\hat{y}$ that your model generated for this test point"?

In Section 4.2.1 of the paper you discuss instance-level confidence in the classification setting, about which you state earlier:

"We intend to provide an alternative sense of uncertainty, by introducing the idea of model
confidence at the instance level, i.e., how sure the model is about each of its predictions,
through the stochasticity of outputs from a generative model."

Thus, if you interpret predictive uncertainty as "confidence" and are given a single regression test instance $(x', y')$, what would your model's "confidence" be in the regression setting, and how would you quantify it, before moving to any metric for evaluating and comparing predictions?

From the toy examples in Figure 1 of your paper, it is quite impressive to see what a variety of distributions CARD can recover. Clearly, the "Full Circle" experiment could not be modeled by a standard BNN with a Gaussian assumption, and there it also would not make sense to express predictive uncertainty as a variance or standard deviation, since that would just try to fit a Gaussian and yield unreasonable results. However, in cases like the "Log-Log Linear" example, one could train a heteroscedastic model, obtain uncertainty bands at the instance level for each new data point $(x', y')$, and define a confidence level for the model prediction that way. In such a case, would you consider taking the variance of the CARD samples at that new data point (thereby making a Gaussian assumption) as a way to express confidence for that prediction?
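To make the question concrete, this is the kind of per-instance summary I have in mind (a hypothetical sketch; sample_fn stands in for however one draws $y$ samples from a fitted CARD model at a given input):

```python
import numpy as np

def instance_uncertainty(sample_fn, x_new, n_samples=1000):
    """Summarize per-instance uncertainty from generative samples.

    sample_fn: hypothetical callable returning one y sample for input x_new
               (placeholder for however the fitted CARD model is sampled).
    """
    y = np.array([sample_fn(x_new) for _ in range(n_samples)])

    # Gaussian-style summary: mean and standard deviation of the samples
    gaussian_band = (y.mean(), y.std())

    # Assumption-free summary: empirical 95% interval from the sample quantiles
    quantile_band = tuple(np.quantile(y, [0.025, 0.975]))

    return gaussian_band, quantile_band
```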


wrkhard commented Oct 24, 2023

@nilsleh @XzwHan Might one also just 'ensemble' CARD models to get at a full predictive uncertainty? Providing uncertainty bands would be quite useful for downstream applications.


nilsleh commented Nov 6, 2023

@wrkhard I suppose that if you use an ensemble, you also have to make some distributional assumption to define your predictive distribution. For example, Deep Ensembles approximate the predictive distribution as a Gaussian mixture over the predictions of the individual ensemble members. One could make a similar assumption with the samples from the CARD model to obtain uncertainty bands. This kind of assumption is also made in other common approaches such as MC-Dropout.
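For reference, the Deep Ensembles-style moment matching I mean looks roughly like this (a minimal sketch with hypothetical names, assuming each ensemble member outputs a per-instance mean and variance):

```python
import numpy as np

def mixture_moments(member_means, member_vars):
    """Mean and variance of a uniform Gaussian mixture over ensemble members.

    member_means, member_vars: shape (M, N) -- per-member predictive means and
    variances for N test points.
    """
    mu = member_means.mean(axis=0)
    # Law of total variance: E[var_i + mu_i^2] - mu^2
    var = (member_vars + member_means ** 2).mean(axis=0) - mu ** 2
    return mu, var
```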

I suppose the other important distinction one could get into is aleatoric vs. epistemic uncertainty, where ensembles and CARD differ, so there are more things to consider.


XzwHan commented Nov 6, 2023

Hi @nilsleh and @wrkhard, thank you very much for your questions and comments.

In our work, the parameterization of the diffusion model is a deterministic function (the forward noise prediction network $\boldsymbol{\epsilon_{\theta}}$), thus we are not modeling epistemic uncertainty: for the log-log linear regression and log-log cubic regression toy examples in CARD, we are solely modeling aleatoric uncertainty, since heteroscedasticity is still part of the properties of the true distribution $p(\boldsymbol{y}\ |\ \boldsymbol{x})$.

Meanwhile, to answer the question "what is your uncertainty about the prediction that your model generated for this test point", we would need epistemic uncertainty, which for regression tasks would be quite useful in an out-of-distribution setting, or more generally for new data coming from regions where training data is sparse: as @wrkhard mentioned, you could train an ensemble of models and see whether different models give drastically different predictions. A good reference is this review paper about aleatoric vs. epistemic uncertainty: their figures illustrate the concepts well. We also briefly discuss the differences between these two types of uncertainty in our paper: please see Sections A.2.1 and A.2.3.
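As a rough sketch of that ensemble idea (hypothetical names; each model is assumed to produce samples from its own learned $p(\boldsymbol{y}\ |\ \boldsymbol{x})$ at the test input), the disagreement between the models' predictive means acts as an epistemic signal, while the average spread within each model's own samples reflects the aleatoric part:

```python
import numpy as np

def split_uncertainty(sample_sets):
    """Rough aleatoric/epistemic split from an ensemble of generative models.

    sample_sets: shape (M, S) -- S y-samples from each of M independently
    trained models, all at the same test input x'.
    """
    member_means = sample_sets.mean(axis=1)   # (M,)
    member_vars = sample_sets.var(axis=1)     # (M,)

    aleatoric = member_vars.mean()            # average within-model spread
    epistemic = member_means.var()            # disagreement between models
    return aleatoric, epistemic
```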

For our construction of instance-level model confidence for classification, we are hesitant to frame it within the current taxonomy of uncertainty: conceptually it is very similar to epistemic uncertainty, but we only train one deterministic model, instead of training multiple "hypotheses" as is usually done to obtain epistemic uncertainty. Therefore, we simply worded it as "an alternative way of measuring model confidence".

Hope this helps to clarify some parts in our paper!


wrkhard commented Dec 3, 2023

Hi @XzwHan thank you for the response.

@nilsleh Were you able to make progress on the notebook? @XzwHan I agree with nilsleh that an easy-to-use notebook, or clearer instructions in the README on how to use CARD regression with an independent dataset, would be quite valuable.


nilsleh commented Dec 5, 2023

Hi @wrkhard,

Actually, yes. We began writing a UQ library for PyTorch with Lightning, called Lightning-UQ-Box, with the aim of laying the groundwork for an open-source effort to implement a variety of UQ methods that are accessible to practitioners. We also ported over the CARD implementation and have a notebook on our documentation page that can be run in Google Colab (the little rocket icon at the top), where we try to recreate the result for the toy Donut dataset from Figure 1. Let me know if this is helpful for you; we are in the early stages of development, so we appreciate any feedback :)

@XzwHan I tried to distill the scripts card_regression.py and card_classification.py into a Lightning module that is a "minimal" implementation, without all the additional functionality you needed to create the figures and experiments for your paper. Nevertheless, since those were long, nested files, I am not sure whether I implemented everything correctly, so if you have any feedback or find bugs, we would really appreciate it. The main goal is to make your CARD method more accessible to a wider audience (our background is Earth observation data, for example), so I hope you find this effort useful. If you have any comments or feedback, feel free to reach out, and thanks again for your interesting work :)

EDIT: updated link to notebook


wrkhard commented Dec 10, 2023

Hi @nilsleh

Terrific! I'll be sure to pass along your framework. I also work in Earth-based remote sensing, where these techniques are quite important but not always immediately available for use.

@XzwHan thank you again for your wonderful work with CARD!


XzwHan commented Dec 10, 2023

Hi @wrkhard @nilsleh, thank you for your feedback and suggestions!

Would you be willing to elaborate a bit more on the particular tasks with Earth-based data you are working on, for which CARD, or uncertainty-based methods in general, could potentially be helpful (e.g., are they regression or classification, which metrics you would check, and which functions in our code could be most useful)?

Making the code more modular is one of our planned next steps for improving CARD, and we are actively searching for suitable applications for our method, so your perspectives and use cases would be greatly appreciated as we plan the next phase of development.


wrkhard commented Dec 10, 2023

Hi @XzwHan

Many tasks in Earth-based remote sensing and modeling are inverse problems, and there is interest in fast methods that can capture complex posteriors. Even for non-inverse problems, robust UQ is often a requirement for operational science products. If there were a way to include easy-to-use UQ with CARD, such that one may abstain when uncertainty is high, that would be a great feature, I think.

Most problems in my work are regression, and I am often interested in obtaining robust prediction intervals.


wrkhard commented Dec 10, 2023

@XzwHan

Could CARD be used in a similar fashion to this work: https://neurips.cc/virtual/2022/event/56948

There seem to be many connections between normalizing flows and diffusion models.


XzwHan commented Dec 11, 2023

Hi @wrkhard, we introduced the mechanism for obtaining instance-level confidence in the context of classification (Section 4.2.1 of our paper), which is probably most closely related to the goal of "abstain when uncertainty is high". For regression tasks, CARD aims for fidelity to the true underlying distribution, where we do not assume the presence of outliers or noisy data; meanwhile, conformal prediction might also be helpful for the tasks you described.
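For the prediction-interval use case, a split conformal wrapper is simple to sketch (hypothetical names; predict_fn stands in for any point predictor, e.g. the mean of the CARD samples at each input):

```python
import numpy as np

def split_conformal_interval(predict_fn, x_cal, y_cal, x_test, alpha=0.1):
    """Split conformal prediction intervals around an arbitrary point predictor.

    predict_fn:   hypothetical callable mapping an array of inputs to point predictions.
    x_cal, y_cal: held-out calibration data not used for training.
    """
    # Absolute calibration residuals as conformity scores
    scores = np.abs(y_cal - predict_fn(x_cal))
    n = len(scores)
    # Finite-sample corrected quantile of the scores
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

    preds = predict_fn(x_test)
    return preds - q, preds + q   # ~(1 - alpha) coverage interval
```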

For inverse problems, diffusion models could be a reasonable modeling choice due to their ability to learn multimodal distributions, but they might be restricted by their sampling speed; this would no longer be an issue if the recent developments in diffusion distillation could be applied in this setting.
