API: Index.get_nearest method #8845

Closed
shoyer opened this issue Nov 18, 2014 · 9 comments · Fixed by #9258
Labels
API Design Datetime Datetime data dtype
Comments

@shoyer
Member

shoyer commented Nov 18, 2014

xref #3004
xref #841
xref #7873
xref #7223
xref #8815

Building on @immerrr's excellent refactor in #8753, I would like to propose adding a get_nearest method to pandas.Index that does nearest neighbor lookups. The idea is that nearest neighbor lookups are usually the desirable/sane thing to do when using inexact indexes. Eventually, we might want to add an alternative "wrapper index" (like IntervalIndex), e.g., NearestNeighborIndex which switches the default behavior; this would be an intermediate step in that direction.

The implementation would be a simple wrapper that calls Index.get_slice_bound twice, once to the left and once to the right. Ideally, this would even work for array-like arguments, though perhaps there should be separate get_nearest_loc and get_nearest_indexer methods.
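
A minimal sketch of the two-sided lookup described above, for illustration only. It substitutes `np.searchsorted` for `Index.get_slice_bound` (whose signature has varied across pandas versions), and the helper name `get_nearest_loc` is hypothetical, borrowed from the proposal rather than from any released API. It assumes a unique, numeric index sorted in ascending order.

```python
import numpy as np
import pandas as pd


def get_nearest_loc(index: pd.Index, label) -> int:
    """Hypothetical helper: position of the value in `index` closest to `label`.

    Assumes `index` is unique, numeric, and sorted in ascending order.
    """
    if len(index) == 0:
        raise KeyError(label)
    values = index.to_numpy()
    # Position of the first value >= label, i.e. the right-hand neighbour.
    right = int(np.searchsorted(values, label, side="left"))
    if right == 0:
        # label lies before the first value: clamp to the first position.
        return 0
    if right == len(values):
        # label lies after the last value: clamp to the last position.
        return len(values) - 1
    left = right - 1
    # Ties go to the left neighbour; that choice is arbitrary in this sketch.
    if label - values[left] <= values[right] - label:
        return left
    return right


idx = pd.Index([1.0, 2.5, 4.0, 7.0])
print(get_nearest_loc(idx, 3.0))
```

Clamping at the ends is only one possible choice; raising an error for out-of-range labels would be equally easy to implement.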

@jreback
Contributor

jreback commented Nov 18, 2014

I added a couple of cross-refs above. This should ideally encompass / reconcile with:

.asof and the concept of snapping to the nearest value.

So having a get_nearest_loc would then allow an almost trivial .asof, as well
as .at_time and .nearest (really all just names for the same concept).

The only tricky part here is that in the time domain you can have a simultaneous asfreq happen, e.g.
you often want to know the nearest 1s to something (but maybe that should be somewhat decoupled, as it's easy enough to simply round a DatetimeIndex to get it, and maybe get_nearest can actually be implemented in a similar way).

@jorisvandenbossche
Member

See also my recently opened issue on the scope of asof: #8815
and also related: #7223

@shoyer Some questions:

  • what is the difference with asof?
    • apart from that asof now only works for DatetimeIndex but the idea of it
    • and apart from the fact that it could be a better name for asof (and a more complete implementation)
  • would it only work for monotonic indices?
  • what would it exactly return? the lower or upper nearest, or the 'nearest' nearest? Or this could be a keyword argument to specify this?

@shoyer
Member Author

shoyer commented Nov 18, 2014

Ah, I knew we had talked about this before. Somehow I forgot about asof (which, I agree, is a little strange).

@jreback Thanks for adding the references! I agree that this should be reconciled with asof. I don't agree that this is quite the same thing as rounding an index or snapping an index to the nearest second -- those would be a transformation on the index, not the indexer.
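
To make the distinction concrete, here is a small sketch contrasting the two operations. It relies on `DatetimeIndex.round` and on `Index.get_indexer(..., method="nearest")` as they exist in current pandas, not on the method proposed in this issue; the timestamps are made up for illustration.

```python
import pandas as pd

times = pd.DatetimeIndex(["2014-11-18 00:00:00.4", "2014-11-18 00:00:01.7"])

# Rounding transforms the index itself: the stored labels change.
rounded = times.round("s")

# A nearest lookup leaves the labels alone and only resolves a query
# to the position of the closest existing label.
query = pd.Timestamp("2014-11-18 00:00:01")
pos = times.get_indexer([query], method="nearest")

print(rounded)
print(pos)
```

After rounding, the original sub-second labels are gone, whereas the nearest lookup returns a position into the untouched index.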

@jorisvandenbossche:

what is the difference with asof?

Yes, this would be quite similar, except for the differences you outline. For example, if it really can't find any matches, it should raise an exception rather than returning NA.

would it only work for monotonic indices?

Yes, I think so, unless there is an exact match. I'm generally 👎 on methods that make it easy to do inefficient things without realizing it.

what would it exactly return? the lower or upper nearest, or the 'nearest' nearest? Or this could be a keyword argument to specify this?

I really would like a method that returns the "nearest" nearest. Returning the lower and upper nearest are both useful things to do, but it would be surprising if they were the default for a method named "nearest". A keyword argument 'side' would work (e.g., idx.get_loc_nearest(123, side='left')), or we could even have another name entirely, e.g., idx.get_loc_before(123).

Note: Based on autocomplete considerations, I am now thinking that the right name would be get_loc_nearest, to emphasize the similarity with get_loc. We could potentially add get_indexer_nearest if/as necessary.
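
For readers on current pandas, the three behaviors discussed here (lower neighbor, upper neighbor, and "nearest" nearest) can be sketched with the `method` argument of `Index.get_indexer`. This is shown as an illustration of the semantics, not as the `side` keyword or the `get_loc_nearest` name floated above, neither of which is assumed to exist.

```python
import pandas as pd

idx = pd.Index([10.0, 20.0, 30.0])
targets = [14.0, 26.0]

# Closest label in either direction.
nearest = idx.get_indexer(targets, method="nearest")
# Last label at or before each target (the "left" or "before" behavior).
before = idx.get_indexer(targets, method="pad")
# First label at or after each target (the "right" or "after" behavior).
after = idx.get_indexer(targets, method="backfill")

print(nearest, before, after)
```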

@hughesadam87

Hey Stephan,

Nice to see this functionality being built in. I have been using a hacked-together version of this for scikit-spectra for a while, and really think anyone who uses float-indexed data will find this extremely useful.

Just to add my two cents, I think that the "nearest nearest" mentality makes the most sense. The keyword side is superfluous because if the user is aware enough to use the keyword, they are also aware enough to just change the value. For example, if I was trying to get data close to 130.5 but I needed to get a value less than 130.5, I'd just index:

get_loc_nearest(130.49)

As opposed to

get_loc_nearest(130.51, side='left')

Or am I misunderstanding?

I also think the name get_nearest() would work, but also like get_loc_nearest()

One issue we ran into is what to do when the user oversteps the bounds of the data: do you raise an error or just return the nearest value? For our purposes, it made more sense to throw an error, but the data was strictly monotonically increasing and had clear upper and lower limits. I guess in the more general case the float index would have no clear limits and would not necessarily be sorted/monotonic.

What would happen in the case of duplicate values in the index?

And that's how you build a bikeshed.

@shoyer
Member Author

shoyer commented Jan 15, 2015

The keyword side is superfluous because if the user is aware enough to use the keyword, they are also aware enough to just change the value. For example, if I was trying to get data close to 130.5 but I needed to get a value less than 130.5, I'd just index: get_loc_nearest(130.49) as opposed to get_loc_nearest(130.51, side='left')

Not sure I follow. Suppose the index in your example is given by pd.Index([129, 131]). get_nearest(130.49) would still return the value corresponding to 131. Basically, this feature is useful for irregularly spaced data. For example, you could use it to return the last record of each hour.
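
The example above can be checked with current pandas, using `Index.get_indexer` as a stand-in for the proposed method. The index values are written as floats here, which is an assumption made to keep the dtype handling simple.

```python
import pandas as pd

idx = pd.Index([129.0, 131.0])

# Nudging the query below 130.5 does not help: 131 is still the closest label.
nearest = idx.get_indexer([130.49], method="nearest")
# Asking explicitly for the label at or before the query gives 129.
before = idx.get_indexer([130.49], method="pad")

print(idx[nearest[0]], idx[before[0]])
```

With irregularly spaced labels, only a directional lookup guarantees a result on the requested side of the query.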

I also think the name get_nearest() would work, but also like get_loc_nearest()

Hmm. We could certainly do get_loc_nearest and get_indexer_nearest instead of a single function. That might make the functionality more obvious.

One issue we ran into is what to do when the user oversteps the bounds of the data: do you raise an error or just return the nearest value? For our purposes, it made more sense to throw an error, but the data was strictly monotonically increasing and had clear upper and lower limits.

For this use case, I think you'll want an IntervalIndex. Currently we just return the nearest value, though we could hypothetically add something like a max_distance argument.
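
As an illustration of what a distance limit could look like: in current pandas, `Index.get_indexer` accepts a `tolerance` argument that plays a role comparable to the hypothetical `max_distance` mentioned above. The sketch below shows both behaviors for an out-of-range query; whether a missing match should raise instead is left open.

```python
import pandas as pd

idx = pd.Index([1.0, 2.0, 3.0])

# Without a limit, an out-of-range query snaps to the closest end of the index.
clamped = idx.get_indexer([10.0], method="nearest")

# With a tolerance, a query that is too far from every label yields -1,
# which a caller could translate into an error.
limited = idx.get_indexer([10.0], method="nearest", tolerance=0.5)

print(clamped, limited)
```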

What would happen in the case of duplicate values in the index?

Not supported in my current PR (the result is ambiguous for looking up an indexer). The index needs to have unique and sorted values (either ascending or descending).
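
A precondition check along these lines could be sketched as follows. The function name `supports_nearest_lookup` is hypothetical; the properties it uses (`is_unique`, `is_monotonic_increasing`, `is_monotonic_decreasing`) are existing `pandas.Index` attributes.

```python
import pandas as pd


def supports_nearest_lookup(index: pd.Index) -> bool:
    """Hypothetical check: unique values, sorted ascending or descending."""
    is_sorted = index.is_monotonic_increasing or index.is_monotonic_decreasing
    return bool(index.is_unique and is_sorted)


print(supports_nearest_lookup(pd.Index([1, 2, 3])))  # ascending, unique
print(supports_nearest_lookup(pd.Index([1, 1, 2])))  # duplicates
```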

@hughesadam87

Cool. Sorry, I haven't had a chance to use the IntervalIndex because I'm still bogged down in 0.14.

I see what you mean about the side argument now. I was stuck in my own use cases I guess, where we generally know our index, but float rounding is the pain point. I.e., our data is Float64Index([120.540, 121.483, ...]) and we just want something like get_loc_nearest(121.0). So I see why the side argument is necessary.

Are you planning to have a nearest indexer that would work like .loc or .iloc, e.g.:

data.loc_nearest[130:140, 30:35.33]

I.e., 2D nearest indexing?

@shoyer
Member Author

shoyer commented Jan 16, 2015

@hugadams IntervalIndex hasn't been merged yet -- still sitting in a PR :). I think something like .loc_nearest as an indexer would be a nice addition, though I don't have any concrete plans yet.

@jreback
Contributor

jreback commented Feb 19, 2015

@shoyer
Member Author

shoyer commented Feb 19, 2015

@jreback yes, but to do that sanely we'll need the MultiIndex version (#9365).

4 participants