option to downsample the data to calculate the variogram #29
Comments
@kvanlombeek Would this be different functionality than the n_closest_points option? (Edit: ok, so the goal is to use only a subset of the data for the variogram; I missed the description in the issue title.)
I am keen to understand the kriging community's opinion on this topic. Downsampling, I would think, will degrade the kriging prediction confidence locally (the variance will increase). I cannot immediately see why you would want to do this compared to the n_closest_points option.
I believe they are two different things. I am talking about downsampling the data just to calculate the variogram, not to actually krige. Calculating the variogram is a costly operation because you have to calculate all the pairwise distances, and you don't lose much information if you fit the variogram on only 1,000 points, for example (fitting the variogram only involves estimating three parameters: sill, nugget and range). The n_closest_points option is useful when you do the actual interpolation: it limits the number of surrounding points taken into account when you actually krige, and by then the variogram is already calculated. So by no means do I intend to drop values; on the contrary, the updates I have made locally are to make sure the kriging algorithm works with a large dataset so that I can extract all local effects. (On www.rockestate.be you can see our kriged map.)
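A minimal sketch of that two-step workflow, assuming x, y, z are 1-D NumPy arrays and using PyKrige's OrdinaryKriging; the variogram_parameters / variogram_model_parameters handling shown here follows recent PyKrige versions and may differ in older releases:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

# x, y, z: 1-D arrays holding the full (large) dataset
rng = np.random.default_rng(0)
idx = rng.choice(len(z), size=min(1000, len(z)), replace=False)

# Fit the variogram on a random subset only (pairwise distances on ~1,000 points).
ok_sub = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model="spherical")

# Reuse the fitted parameters for the full dataset, skipping the expensive fit.
ok_full = OrdinaryKriging(
    x, y, z,
    variogram_model="spherical",
    variogram_parameters=list(ok_sub.variogram_model_parameters),
)
```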
OK @kvanlombeek
Sure, that will be fine. In this case, then, sampling (x/y/z) is equivalent to a stochastic approach in a large learning problem. Did I get that right?
Yes, I see how this can be a problem. Down-sampling would certainly be useful, but I would worry that the (random) sample might not be representative of the entire dataset (assuming stationarity of the entire dataset). Maybe we could randomly sub-sample the data multiple times, estimate the variogram each time, then compare all the resulting variograms in order to figure out the most representative model (kind of like the RANSAC algorithm). Maybe there's some approach from the machine learning literature that could be implemented here to accomplish this and to make the computations efficient. @basaks, it seems like you're familiar with this kind of machine learning analysis; what do you think?
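A rough illustration of that repeated sub-sampling idea, again using PyKrige's OrdinaryKriging only for the variogram fit; the helper name and the use of a per-parameter median as the "consensus" are purely illustrative, not an established RANSAC implementation:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

def variogram_params_from_subsets(x, y, z, n_sub=1000, n_repeats=10, seed=0):
    """Fit the variogram model on several random subsets and collect the parameters."""
    rng = np.random.default_rng(seed)
    params = []
    for _ in range(n_repeats):
        idx = rng.choice(len(z), size=min(n_sub, len(z)), replace=False)
        ok = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model="spherical")
        params.append(ok.variogram_model_parameters)
    return np.asarray(params)

# One simple "consensus" choice: the per-parameter median across subsets,
# which could then be passed back via `variogram_parameters=`.
# median_params = np.median(variogram_params_from_subsets(x, y, z), axis=0)
```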
@bsmurphy @kvanlombeek One can manually downsample outside PyKrige and then use PyKrige's functionality. Regarding your concerns about downsampling, one can bin the observations (both spatially and by magnitude) and then sample from each bin to get a representative set for kriging; I actually do very similar operations. I think this is very problem-specific and should be left to the user. One can also go the other way, as in machine learning, i.e., create artificial observations when there are not enough observations. @kvanlombeek, what do you think? I will, however, be very interested to know how you downsample your observations.
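A small NumPy sketch of that binned sampling idea; the number of bins and the per-bin sample size are arbitrary choices for illustration only:

```python
import numpy as np

def stratified_subsample(x, y, z, n_bins=10, per_bin=20, seed=0):
    """Bin observations spatially (and by value) and draw from each bin,
    so the subset used for the variogram stays roughly representative."""
    rng = np.random.default_rng(seed)
    base = n_bins + 2  # np.digitize can return indices 0 .. n_bins + 1
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
    yb = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins))
    zb = np.digitize(z, np.histogram_bin_edges(z, bins=n_bins))
    keys = (xb * base + yb) * base + zb  # one combined bin label per observation
    chosen = []
    for key in np.unique(keys):
        members = np.flatnonzero(keys == key)
        take = min(per_bin, len(members))
        chosen.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(chosen)  # indices of the retained observations
```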
I think we are talking past each other a bit, but your post from yesterday, @basaks, was exactly what I wanted to do. Can I ask all of you how much data you have already fitted this algorithm on? 100, 1,000, 100,000 or 1,000,000 points? Not to show off or anything, just to see whether it worked. I fit the algorithm on around 50k to 200k data points, and then I needed these little tweaks. I propose that I give it a shot, open a PR, and we continue the discussion based on that.
Hi @kvanlombeek, I have one dataset with ~2.2 million observations across all of Australia. For many machine learning models, I use ~50k, sometimes up to 200k, observations.
@kvanlombeek It's great that you are using PyKrige with large datasets and can help with the scaling issues. However, I tend to agree with @basaks and @bsmurphy on this one; there are potentially a lot of ways this could be done, and it can also be problem-specific. Can you not downsample the input X, Y, Z arrays used to train the variogram before providing them to OrdinaryKriging in the way you need? I'm +1 on adding some additional function for downsampling that would make sense in the kriging context (if such a thing exists), and -1 on adding it inside the OrdinaryKriging classes, which already have quite a few parameters IMO. Unless I'm misunderstanding the issue.
Instead of downsampling, I think what is more useful, and seems like a technique already used in the kriging community, is the local variogram approach, i.e. re-estimating the variogram for local neighborhoods.
@rth added the moving window function, which is similar to what you're suggesting, @basaks, except that it assumes a stationary variogram. Adding the extra layer of re-estimating the variogram for local neighborhoods could certainly be done at some level, but it would require a calculation for each moving-window location, so I'm not sure how much it would really boost speed... Anyway, both of these ideas would be nice to implement at some point: local variogram estimation for maximum flexibility, and downsampling in global variogram estimation as a useful tool (which could be put in the kriging tools module)...
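For context, a sketch of how the existing moving-window behaviour is used in PyKrige (the n_closest_points argument of execute, which as far as I know requires the loop backend); the 50-point window and grid sizes are arbitrary:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

# x, y, z: observation arrays for the full dataset
ok = OrdinaryKriging(x, y, z, variogram_model="spherical")
gridx = np.linspace(x.min(), x.max(), 100)
gridy = np.linspace(y.min(), y.max(), 100)

# Moving-window kriging: only the 50 nearest observations are used per grid node,
# while the global (stationary) variogram fitted above is reused everywhere.
zhat, ss = ok.execute("grid", gridx, gridy, backend="loop", n_closest_points=50)
```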
@basaks Feel free to open a new issue for the local variogram model.
Actually, looking into this anew: as I understand it, the issue is that the API currently does not allow calculating the variogram on a random subset of the data and then running all the remaining calculations on the full dataset. I don't think this would be too difficult to implement...
True, it shouldn't be difficult at all to do something really simple like taking a random subset (maybe a couple of times) and computing the variogram estimate. This could be a useful addition when refactoring everything into distinct building blocks as per #56 ...
In GSTools we provide a sample size for variogram estimation. Maybe we could outsource the variogram estimation. |
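For reference, a short sketch of how that sub-sampled variogram estimation looks in recent GSTools versions (the sampling_size / sampling_seed arguments of gstools.vario_estimate; exact argument names may vary between releases):

```python
import gstools as gs

# x, y, z: 1-D coordinate and value arrays for the full dataset
bin_center, gamma = gs.vario_estimate(
    (x, y), z,
    sampling_size=2000,  # estimate the empirical variogram on a random subset
    sampling_seed=42,
)
model = gs.Spherical(dim=2)
model.fit_variogram(bin_center, gamma)  # fit sill, range, nugget to the estimate
```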
When you krige with large datasets (100k data points, for example), the number of pairwise distances blows up. It would be nice if there were a parameter in the function, something like max_n_for_variogram, that limits the number of data points used.
I have personally already implemented this in the ordinary kriging function, as our dataset was too large.