Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use KDTrees to support nearest neighbor queries/joins on MultiIndexes? #9365

Closed
shoyer opened this issue Jan 28, 2015 · 5 comments
Closed

Use KDTrees to support nearest neighbor queries/joins on MultiIndexes? #9365

shoyer opened this issue Jan 28, 2015 · 5 comments
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@shoyer
Copy link
Member

shoyer commented Jan 28, 2015

If we're willing to construct KDTrees when necessary, we can support efficient nearest neighbor queries even in multiple dimensions (i.e., on a MultiIndex) and even for unsorted indexes.

For an example of what you can do with this, take a look at this question I just answered on SO: http://stackoverflow.com/a/28186940/809705

See also: 1d nearest neighbor queries (#8845) and an implementation for sorted indexes (#9258).

@shoyer shoyer added Ideas Long-Term Enhancement Discussions Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 28, 2015
@jreback
Copy link
Contributor

jreback commented Jan 28, 2015

I think a limited method might be in-scope for pandas (esp if we are using it for something else). How to prevent this from growing into a full-fledged (which needs support) for arbitrary distance functions (e.g. would this be different from scikit-learn impl?). Should we in effect use their impl for arbitrary things (e.g. import at run-time like we do scipy?) and only have an impl that is needed internally instead?

@shoyer
Copy link
Member Author

shoyer commented Jan 28, 2015

I agree, we definitely don't want to reimplement KDTrees in pandas. We should use scipy or scikit-learn as run-time dependencies for this feature -- both of them have suitable KDTree implementations. The scikit-learn version is slightly more flexible, though, with support for more distance metrics. See here and here for more background.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2015

so just because scipy is already a partial dep (meaning we import in several differnt places) I would lean towards that a bit

@sturlamolden
Copy link

Adding more metrics to cKDTree should not be too difficult, but I have not had time to look at it.

@mroeschke
Copy link
Member

A more comprehensive issue regardinng this topic is in #38650 so closing in favor of that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants