catalog-based m-test #196

Open
Tracked by #269
wsavran opened this issue Aug 23, 2022 · 3 comments

Comments

@wsavran
Collaborator

wsavran commented Aug 23, 2022

The catalog-based M-test has issues when the ratio of forecasted events to observed events is large. @Serra314, can you upload the figures you made showing this behavior?

@Serra314

We discovered that a difference between the number of events in the simulated catalogues composing a forecast and the number of observations may cause problems for the M-test.

The main reason is that we estimate the probability distribution of the score by calculating the score between the union of the simulated catalogues and each simulated catalogue. If the simulated catalogues have N events on average, we are therefore estimating the score distribution for a sample of size N. If the number of observations differs from N, the score between the union of the simulated catalogues and the observed events comes from a different distribution than the one we estimated from the forecast. This leads to values of γ that, instead of being uniformly distributed, are strongly concentrated around 1 or 0 (depending on whether we are overestimating or underestimating the number of events).
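The effect described above can be reproduced with a small simulation. This is only a sketch using Python's standard library: the magnitude binning, the `score` statistic (squared distance between rescaled log-histograms), the definition of γ as the fraction of simulated scores not exceeding the observed one, and all helper names are assumptions made for illustration, not the pyCSEP implementation.

```python
import math
import random

M_MIN, DM, N_BINS = 4.0, 0.1, 60  # hypothetical magnitude binning

def gr_sample(rng, n, b=1.0):
    """Draw n magnitudes >= M_MIN from a Gutenberg-Richter law with b-value b."""
    return [M_MIN + rng.expovariate(b * math.log(10)) for _ in range(n)]

def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals before time `mean`."""
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n, t = n + 1, t + rng.expovariate(1.0)
    return n

def histogram(mags):
    """Bin magnitudes; the last bin collects everything above the top edge."""
    counts = [0] * N_BINS
    for m in mags:
        counts[min(int((m - M_MIN) / DM), N_BINS - 1)] += 1
    return counts

def score(hist, ref_hist, n_ref):
    """Illustrative statistic: squared distance between log-histograms,
    both rescaled to n_ref events."""
    s_h, s_r = n_ref / sum(hist), n_ref / sum(ref_hist)
    return sum((math.log10(h * s_h + 1) - math.log10(r * s_r + 1)) ** 2
               for h, r in zip(hist, ref_hist))

def gamma(obs_score, null_scores):
    """Fraction of null scores that do not exceed the observed score."""
    return sum(s <= obs_score for s in null_scores) / len(null_scores)

rng = random.Random(1)
n_obs = 100
catalogues = [gr_sample(rng, poisson(rng, 1000)) for _ in range(100)]
union_hist = histogram([m for cat in catalogues for m in cat])

# Null distribution estimated from the simulated catalogues themselves (N ~ 1000).
sim_scores = [score(histogram(cat), union_hist, n_obs) for cat in catalogues]

# Observed samples come from the same GR law but contain only 100 events each.
gammas = [gamma(score(histogram(gr_sample(rng, n_obs)), union_hist, n_obs), sim_scores)
          for _ in range(200)]
print(f"mean gamma = {sum(gammas) / len(gammas):.3f}")
```

Although forecast and observations share the same magnitude distribution here, the 100-event observed samples are noisier than the 1000-event catalogues, so with this definition of γ the values should pile up near 1 rather than spread uniformly.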

One way to solve this is to estimate the score distribution in a bootstrap fashion, using samples drawn from the union of the simulated catalogues instead of the individual simulated catalogues. Sampling from the union a number of events equal to the number of observed events, and calculating the score between each bootstrap sample and the union of the simulated catalogues, yields a sample of score values under the null hypothesis that the forecast and the observations come from the same distribution. The γ values obtained in this way are correct, and the approach can be applied to many different scores.
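A minimal sketch of this bootstrap, assuming the same hypothetical setup as a simple test bed: Gutenberg-Richter magnitudes with b-value 1, Poisson catalogue sizes with mean 1000, and 100 observed events. The `score` statistic, the binning, and the helper names (including `bootstrap_null_scores`) are illustrative assumptions rather than pyCSEP code.

```python
import math
import random

M_MIN, DM, N_BINS = 4.0, 0.1, 60  # hypothetical magnitude binning

def gr_sample(rng, n, b=1.0):
    """Draw n magnitudes >= M_MIN from a Gutenberg-Richter law with b-value b."""
    return [M_MIN + rng.expovariate(b * math.log(10)) for _ in range(n)]

def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals before time `mean`."""
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n, t = n + 1, t + rng.expovariate(1.0)
    return n

def histogram(mags):
    """Bin magnitudes; the last bin collects everything above the top edge."""
    counts = [0] * N_BINS
    for m in mags:
        counts[min(int((m - M_MIN) / DM), N_BINS - 1)] += 1
    return counts

def score(hist, ref_hist, n_ref):
    """Illustrative statistic: squared distance between rescaled log-histograms."""
    s_h, s_r = n_ref / sum(hist), n_ref / sum(ref_hist)
    return sum((math.log10(h * s_h + 1) - math.log10(r * s_r + 1)) ** 2
               for h, r in zip(hist, ref_hist))

def bootstrap_null_scores(rng, union, n_obs, n_boot=500):
    """Scores of n_boot samples of n_obs events drawn with replacement from the union."""
    union_hist = histogram(union)
    return [score(histogram(rng.choices(union, k=n_obs)), union_hist, n_obs)
            for _ in range(n_boot)]

rng = random.Random(2)
n_obs = 100
catalogues = [gr_sample(rng, poisson(rng, 1000)) for _ in range(100)]
union = [m for cat in catalogues for m in cat]
union_hist = histogram(union)

# The null distribution now matches the observed sample size, not the catalogue size.
null_scores = bootstrap_null_scores(rng, union, n_obs)

gammas = []
for _ in range(200):
    obs_score = score(histogram(gr_sample(rng, n_obs)), union_hist, n_obs)
    gammas.append(sum(s <= obs_score for s in null_scores) / len(null_scores))

middle = sum(0.25 < g < 0.75 for g in gammas) / len(gammas)
print(f"mean gamma = {sum(gammas) / len(gammas):.3f}, share in (0.25, 0.75) = {middle:.3f}")
```

Because the null scores depend only on the union and on the number of observed events, they can be computed once and reused for every observed sample of the same size. Under the null hypothesis the γ values should then look roughly uniform, with a mean near 0.5.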

Below is an example in which the simulated magnitudes and the observed ones both come from a GR law with b-value equal to 1. I used observed samples of 100 events, while the number of events in each simulated catalogue comes from a Poisson distribution with mean 1000. I calculated the γ values for 1000 different observed samples against the same forecast. They correctly look uniform in all cases.

[Figure "b1": γ values when forecast and observations both come from a GR law with b-value 1]

These are instead cases where the forecasts come from a GR law with a different b-value, while the observations still come from a GR law with b-value equal to 1. We can see how the distribution of γ values departs from uniformity. The faster it departs from uniformity as the b-value changes, the more sensitive the score is to inconsistencies in the magnitude distribution.

[Figure "Rplot": γ values when the forecasts come from a GR law with a b-value different from 1]
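The sensitivity to a b-value mismatch can be summarised by how often γ falls in the upper tail. The sketch below swaps in a different, deliberately simple score (the absolute difference between the mean magnitude of a sample and of the union) to illustrate that the bootstrap is not tied to one statistic. That score, the 5% threshold, the helper names, and drawing the union directly as 50,000 events (only the union enters the bootstrap) are all assumptions for illustration.

```python
import math
import random

M_MIN = 4.0  # hypothetical completeness magnitude

def gr_sample(rng, n, b):
    """Draw n magnitudes >= M_MIN from a Gutenberg-Richter law with b-value b."""
    return [M_MIN + rng.expovariate(b * math.log(10)) for _ in range(n)]

def mean_score(sample, union_mean):
    """Hypothetical score: absolute difference between mean magnitudes."""
    return abs(sum(sample) / len(sample) - union_mean)

def rejection_rate(rng, b_forecast, n_obs=100, n_union=50_000,
                   n_boot=500, n_rep=200, alpha=0.05):
    """Share of observed samples (b-value 1) whose gamma exceeds 1 - alpha."""
    union = gr_sample(rng, n_union, b_forecast)
    union_mean = sum(union) / len(union)
    # Bootstrap null: samples of n_obs events drawn with replacement from the union.
    null = [mean_score(rng.choices(union, k=n_obs), union_mean)
            for _ in range(n_boot)]
    rejected = 0
    for _ in range(n_rep):
        obs_score = mean_score(gr_sample(rng, n_obs, 1.0), union_mean)
        g = sum(s <= obs_score for s in null) / n_boot
        rejected += g > 1 - alpha
    return rejected / n_rep

rng = random.Random(3)
rates = {b: rejection_rate(rng, b) for b in (1.0, 1.1, 1.5)}
for b, rate in rates.items():
    print(f"forecast b-value {b}: share of gamma > 0.95 = {rate:.2f}")
```

With a matching b-value the share should stay close to the nominal 5%, and it should grow as the forecast b-value moves away from 1. How quickly it grows is one way to compare the sensitivity of different scores.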

@lmizrahi

lmizrahi commented Feb 8, 2024

I have made similar observations with the catalogue-based M-test, S-test, and PL-test. The way I think about it is that if the simulated catalogs over- or underestimate the number of events, their scores will, on average, be "more perfect" and "less perfect", respectively. (Perfect meaning being close to the overall distribution coming from all simulations.) So, I have seen two models that use the same spatial distribution pass the S-test in one case, but fail in another case, because they didn't forecast the same number of events. Same for M- and PL-test.

I think that this can even lead to an M-test being passed when it shouldn't pass, if the number of forecasted events is wrong in the right way.

@Serra314

Serra314 commented Feb 8, 2024

Hi Leila,

Your observations are correct. That is exactly what happens, and the problem is not limited to the M-test. I also believe it is possible to find the "correct" number of forecasted events that makes an incorrect magnitude forecast pass the test (I have not tried, though).

I am currently writing a paper that shows the problem and proposes resampling the union forecast as a possible solution. In practice, instead of using the synthetic catalogues as they are, we create fake synthetic catalogues by sampling the union forecast as many times as there are observations. This solves the problem at least when the magnitude distribution is the same across the different synthetic catalogues (which I believe is the most common case). I'll share a link when we are close to submitting it.

@pabloitu pabloitu mentioned this issue Nov 25, 2024
22 tasks