cKDTree optimization #62
6af812c
to
9a9466c
Compare
skgstat/Kriging.py
Outdated
@@ -389,16 +395,23 @@ def _krige(self, idx):
dists = self.transform_dists[idx,:]

# find all points within the search distance
idx = np.where(dists <= self.range)[0]
if isinstance(dists, scipy.sparse.spmatrix):
    idx = np.array([k[1] for k in dists.keys()])
For non-sparse datasets this might actually be a performance bottleneck, and we should use toarray() or some similar solution to speed it up.
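As a rough illustration of the concern above, the sketch below reads the stored column indices of a single sparse row in two ways: with a Python-level loop over `keys()`, as in the diff, and through a one-off conversion to CSR. The matrix `row` and its values are made up for the example; this is not the code from the PR, and no timing claim is implied.

```python
import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical 1 x 6 row of distances; only two entries are stored.
row = dok_matrix((1, 6))
row[0, 1] = 0.3
row[0, 4] = 0.7

# Variant from the diff: iterate over the (row, col) keys in Python.
idx_keys = np.sort(np.array([k[1] for k in row.keys()]))

# Alternative: convert once and read the column indices as an array.
idx_csr = np.sort(row.tocsr().indices)

print(idx_keys, idx_csr)
```

Both variants return the same columns; which one is faster for a given dataset would need to be measured.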
Heyas @mmaelicke! Would you have time to review this? Note: there is a new flag, "sparse". If set to false, all distances are calculated using pdist, which is the fastest way, at least for fairly small datasets, but it quickly eats a lot of RAM (N*M, where N is the number of points kriged to and M is the number of points kriged from). If set to true, distances are calculated using cKDTree only for points within range and stored in a sparse matrix. This takes considerably less storage but is slightly slower for lookup, so it is a disadvantage for smaller datasets. For larger datasets it saves your machine from swapping, which would quickly lower your performance.
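A minimal sketch of the trade-off described above, assuming random 2D points and an arbitrary search radius `max_dist` standing in for the variogram range. It uses `scipy.spatial.distance.cdist` as a stand-in for the dense path and `cKDTree.sparse_distance_matrix` for the sparse path; the variable names are hypothetical and this is not the implementation in the PR.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
src = rng.random((200, 2))   # M points kriged from
dst = rng.random((50, 2))    # N points kriged to
max_dist = 0.1               # assumed search radius

# Dense path: the full N x M distance matrix is held in memory.
dense = cdist(dst, src)

# Sparse path: only pairs closer than max_dist are stored.
sparse = cKDTree(dst).sparse_distance_matrix(
    cKDTree(src), max_dist, output_type="coo_matrix"
)

print("dense entries: ", dense.size)
print("sparse entries:", sparse.nnz)
```

The dense matrix always holds N*M values, while the sparse one grows only with the number of point pairs inside the radius.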
@redhog, awesome, thanks a lot!
Maybe the default for sparse needs to be False... Hm...
Nice work, thanks a lot!
I think we need to sort out the default value for sparse, then I'm completely fine with it.
    selected_dists = dists[0, idx].toarray()[0,:]
else:
    selected_dists = dists[idx]
sorted_idx = np.argsort(selected_dists, kind="stable")
just out of curiosity: why stable sort here?
Because without that I couldn't write a regression test that gave the same result for sparse and non-sparse (when point coords coincided exactly for some pair of points, which some did in my test dataset).
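A small illustration of the point about ties, using made-up distances rather than the test dataset mentioned above. With `kind="stable"`, equal values keep their original relative order, so the result is deterministic when the candidates are presented in the same order.

```python
import numpy as np

# Hypothetical distances with exact ties (e.g. coinciding points).
dists = np.array([0.5, 0.2, 0.5, 0.2, 0.1])

# Stable sort: tied entries stay in ascending index order.
order = np.argsort(dists, kind="stable")
print(order)  # [4 1 3 0 2]
```

NumPy's default sort kind makes no such guarantee for ties, which is presumably why the two code paths could disagree without it.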
This is superseded by #68 which implements the same feature, but in a separate class that can also be used by Variogram.
Only merge the other one. Sorry for coding faster than I talked to you...
I had the other one in parallel, but didn't want to push it due to a bug that I have now squished...
NP. I will just wait until you want me to review or merge something. Or both. :)
Closes #58