API: Index.get_nearest method #8845

shoyer · 2014-11-18T04:41:29Z

xref #3004
xref #841
xref #7873
xref #7223
xref #8815

Building on @immerrr's excellent refactor in #8753, I would like to propose adding a get_nearest method to pandas.Index that does nearest neighbor lookups. The idea is that nearest neighbor lookups are usually the desirable/sane thing to do when using inexact indexes. Eventually, we might want to add an alternative "wrapper index" (like IntervalIndex), e.g., NearestNeighborIndex which switches the default behavior; this would be an intermediate step in that direction.

The implementation would be a simple wrapper that calls Index.get_slice_bound twice, once to the left and once to the right. Ideally, this would even work for array-like arguments, though perhaps there should be separate get_nearest_loc and get_nearest_indexer methods.
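A minimal sketch of that "two bounds, then pick the closer one" idea, in plain Python rather than pandas. The function name nearest_loc and the use of bisect on a sorted list are assumptions for illustration only; this is not the proposed pandas implementation.

```python
from bisect import bisect_left


def nearest_loc(sorted_values, target):
    """Return the position of the value closest to target.

    Assumes sorted_values is an ascending, non-empty list.
    """
    if not sorted_values:
        raise KeyError(target)
    # Right bound: position of the first value >= target.
    right = bisect_left(sorted_values, target)
    # Left bound: position of the last value < target.
    left = right - 1
    if right == len(sorted_values):
        return left  # target lies beyond the last value
    if left < 0:
        return right  # target lies before the first value
    # Pick the closer neighbor; ties go to the left one (an arbitrary choice).
    if target - sorted_values[left] <= sorted_values[right] - target:
        return left
    return right
```

For example, nearest_loc([10, 20, 30], 24) gives position 1 and nearest_loc([10, 20, 30], 26) gives position 2.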


jreback · 2014-11-18T11:22:06Z

I added a couple of cross-refs above. This should ideally encompass / reconcile with:

.asof and the concept of snapping to the nearest.

So having a get_nearest_loc would then allow an almost trivial .asof, as well
as .at_time and .nearest (really all just names for the same concept).

The only tricky part here is that in the time domain you can have a simultaneous asfreq happen, e.g.
you often want to know the nearest 1s to something (but maybe that should be somewhat decoupled, as it's easy enough to simply round a DatetimeIndex to actually get it, and maybe get_nearest could actually be implemented in a similar way).

jorisvandenbossche · 2014-11-18T11:32:56Z

See also my recently opened issue on the scope of asof: #8815
and also related: #7223

@shoyer Some questions:

what is the difference with asof?
- apart from the fact that asof currently only works for DatetimeIndex (I mean the idea of it)
- and apart from the fact that it could be a better name for asof (and a more complete implementation)
would it only work for monotonic indices?
what would it exactly return? the lower or upper nearest, or the 'nearest' nearest? Or this could be a keyword argument to specify this?

shoyer · 2014-11-18T23:20:40Z

Ah, I knew we had talked about this before. Somehow I forgot about asof (which, I agree, is a little strange).

@jreback Thanks for adding the references! I agree that this should be reconciled with asof. I don't agree that this is quite the same thing as rounding an index or snapping an index to the nearest second -- those would be a transformation on the index, not the indexer.
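To illustrate the distinction between transforming the index and looking up an indexer, here is a small plain-Python sketch. The sample values are made up for illustration, and lists stand in for a pandas Index.

```python
# Illustrative, made-up float labels (a list standing in for an Index).
values = [120.54, 121.483, 122.9]

# Transformation on the index: the labels themselves change,
# and rounding can introduce duplicate labels.
rounded = [round(v) for v in values]

# Nearest lookup: the labels stay untouched; only the position
# of the label closest to the target is computed.
target = 121.0
nearest_pos = min(range(len(values)), key=lambda i: abs(values[i] - target))
```

Here rounding maps the first two labels to the same value, while the nearest lookup leaves the index as it is and resolves the target to a single position.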

@jorisvandenbossche:

what is the difference with asof?

Yes, this would be quite similar, except for the differences you outline. For example, if it really can't find any matches, it should raise an exception rather than returning NA.

would it only work for monotonic indices?

Yes, I think so, unless there is an exact match. I'm generally 👎 on methods that make it easy to do inefficient things without realizing it.

what would it exactly return? the lower or upper nearest, or the 'nearest' nearest? Or this could be a keyword argument to specify this?

I really would like a method that returns the "nearest" nearest. Returning the lower and upper nearest are both useful things to do, but it would be surprising if they were the default for a method named "nearest". A keyword argument 'side' would work (e.g., idx.get_loc_nearest(123, side='left')), or we could even have another name entirely, e.g., idx.get_loc_before(123).
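A rough sketch of how such a side keyword could behave, written against a sorted Python list. The signature, the None default, and the choice to raise KeyError when nothing matches are assumptions for illustration, not a settled API.

```python
from bisect import bisect_left, bisect_right


def get_loc_nearest(sorted_values, target, side=None):
    """Hypothetical sketch of a nearest lookup with a 'side' keyword.

    side=None    -> position of the closest value
    side='left'  -> position of the last value <= target
    side='right' -> position of the first value >= target
    """
    lo = bisect_right(sorted_values, target) - 1  # last value <= target
    hi = bisect_left(sorted_values, target)       # first value >= target
    if side == "left":
        if lo < 0:
            raise KeyError(target)
        return lo
    if side == "right":
        if hi == len(sorted_values):
            raise KeyError(target)
        return hi
    if side is not None:
        raise ValueError("side must be None, 'left' or 'right'")
    candidates = [i for i in (lo, hi) if 0 <= i < len(sorted_values)]
    if not candidates:
        raise KeyError(target)
    # On a tie, min() keeps the first candidate, i.e. the left one.
    return min(candidates, key=lambda i: abs(sorted_values[i] - target))
```

With the default, the lookup returns the "nearest" nearest; the one-sided variants raise instead of returning NA when there is no value on the requested side.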

Note: Based on autocomplete considerations, I am now thinking that the right name would be get_loc_nearest, to emphasize the similarity with get_loc. We could potentially add get_indexer_nearest if/as necessary.

hughesadam87 · 2015-01-15T16:50:24Z

Hey Stephan,

Nice to see this functionality being built in. I have been using a hacked-together version of this for scikit-spectra for a while, and really think anyone who uses float-indexed data will find this extremely useful.

Just to add my two cents, I think that the "nearest nearest" mentality makes the most sense. The keyword side is superfluous because if the user is aware enough to use the keyword, they are also aware enough to just change the value. For example, if I was trying to get data close to 130.5 but I needed to get a value less than 130.5, I'd just index:

get_loc_nearest(130.49)

As opposed to

get_loc_nearest(130.51, side='left')

Or am I misunderstanding?

I also think the name get_nearest() would work, but also like get_loc_nearest()

One issue we ran into was what to do when the user oversteps the bounds of the data: do you raise an error or just return the nearest value? For our purposes, it made more sense to throw an error, but the data was strictly monotonically increasing and had clear upper and lower limits. I guess the more general case would be that index floats would have no clear limits and would not necessarily be sorted/monotonic.

What would happen in the case of duplicate values in the index?

And that's how you build a bikeshed.

shoyer · 2015-01-15T23:19:36Z

The keyword side is superfluous because if the user is aware enough to use the keyword, they are also aware enough to just change the value. For example, if I was trying to get data close to 130.5 but I needed to get a value less than 130.5, I'd just index: get_loc_nearest(130.49) as opposed to get_loc_nearest(130.51, side='left')

Not sure I follow. Suppose the index in your example is given by pd.Index([129, 131]). get_nearest(130.49) would still return the value corresponding to 131. Basically, this feature is useful for irregularly spaced data. For example, you could use it to return the last record of each hour.
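The "last record of each hour" case could be sketched like this with the standard library. The timestamps are made up, and last_record_before is a hypothetical helper, not a pandas method.

```python
from bisect import bisect_left
from datetime import datetime


def last_record_before(sorted_times, cutoff):
    """Position of the last timestamp strictly before cutoff, or None."""
    pos = bisect_left(sorted_times, cutoff) - 1
    return pos if pos >= 0 else None


# Irregularly spaced, ascending timestamps (illustrative data).
times = [
    datetime(2015, 1, 15, 9, 5),
    datetime(2015, 1, 15, 9, 48),
    datetime(2015, 1, 15, 10, 2),
    datetime(2015, 1, 15, 10, 59),
    datetime(2015, 1, 15, 11, 30),
]

# End of the 9:00, 10:00 and 11:00 hours.
hour_ends = [datetime(2015, 1, 15, h) for h in (10, 11, 12)]
last_per_hour = [last_record_before(times, end) for end in hour_ends]
```

Nudging the lookup value cannot reproduce this, because the gap to the previous record differs from hour to hour. Note that in this sketch an hour with no records falls back to the last record of an earlier hour.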

I also think the name get_nearest() would work, but also like get_loc_nearest()

Hmm. We could certainly do get_loc_nearest and get_indexer_nearest instead of a single function. That might make the functionality more obvious.

One issue we ran into was what to do when the user oversteps the bounds of the data: do you raise an error or just return the nearest value? For our purposes, it made more sense to throw an error, but the data was strictly monotonically increasing and had clear upper and lower limits.

For this use case, I think you'll want an IntervalIndex. Currently we just return the nearest value, though we could hypothetically add something like a max_distance argument.
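A hedged sketch of what a max_distance argument might look like, again on a sorted Python list. The helper name nearest_within and the KeyError on overshoot are assumptions for illustration; nothing here describes the behavior of the actual PR.

```python
from bisect import bisect_left


def nearest_within(sorted_values, target, max_distance=None):
    """Hypothetical nearest lookup that can reject far-away matches."""
    hi = bisect_left(sorted_values, target)
    candidates = [i for i in (hi - 1, hi) if 0 <= i < len(sorted_values)]
    if not candidates:
        raise KeyError(target)
    best = min(candidates, key=lambda i: abs(sorted_values[i] - target))
    # Without max_distance the closest value always wins, even out of bounds.
    if max_distance is not None and abs(sorted_values[best] - target) > max_distance:
        raise KeyError(target)
    return best
```

Without max_distance, a target far outside the data still resolves to the first or last position; with it, the same lookup raises.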

What would happen in the case of duplicate values in the index?

Not supported in my current PR (the result is ambiguous for looking up an indexer). The index needs to have unique and sorted values (either ascending or descending).
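The "unique and sorted, ascending or descending" requirement amounts to a strict monotonicity check. A plain-Python sketch, with a hypothetical helper name:

```python
def check_unique_monotonic(values):
    """Return 'ascending' or 'descending'; raise ValueError otherwise.

    Strictly increasing or strictly decreasing values are necessarily unique.
    """
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "ascending"
    if all(a > b for a, b in pairs):
        return "descending"
    raise ValueError("nearest lookup needs unique, sorted values")
```

With duplicates such as [1, 1, 2], a lookup of 1 has two equally valid positions, which is the ambiguity mentioned above.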

hughesadam87 · 2015-01-16T00:02:27Z

Cool. Sorry, I haven't had a chance to use the IntervalIndex because I'm still bogged down in 0.14.

I see what you mean about the side argument now. I was stuck in my own use cases I guess, where we generally know our index, but the float rounding is the pain. IE our data is Float64Index([120.540, 121.483, ...]) and we just want something like get_loc_nearest(121.0). So I see why the side argument is necessary.

Are you planning to have a nearest indexer that would work like .loc or .iloc eg:

data.loc_nearest[130:140, 30:35.33]

IE 2D nearest indexing?

shoyer · 2015-01-16T02:00:28Z

@hugadams IntervalIndex hasn't been merged yet -- still sitting in a PR :). I think something like .loc_nearest as an indexer would be a nice addition, though I don't have any concrete plans yet.

jreback · 2015-02-19T20:42:38Z

nice question to show perf of nearest
http://stackoverflow.com/questions/28612773/how-to-speed-up-nearest-search-in-pandas

shoyer · 2015-02-19T20:45:04Z

@jreback yes, but to do that sanely we'll need the MultiIndex version (#9365).

shoyer mentioned this issue Nov 18, 2014

Feature Request: Array indices which understand units astropy/astropy#3053

Open

jreback added the API Design and Datetime labels Nov 18, 2014

jreback added this to the 0.16.0 milestone Nov 18, 2014

This was referenced Nov 18, 2014

closest_time addition to DatetimeIndex #3004

Closed

df.at_time NotImplementedError {asof] #7873

Closed

shoyer mentioned this issue Jan 15, 2015

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258

Merged

shoyer mentioned this issue Jan 28, 2015

Use KDTrees to support nearest neighbor queries/joins on MultiIndexes? #9365

Closed

shoyer closed this as completed in #9258 Feb 23, 2015
