
implement hash for indices #3885

Closed
cpcloud opened this issue Jun 13, 2013 · 21 comments
Labels
API Design Enhancement Ideas Long-Term Enhancement Discussions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@cpcloud
Member

cpcloud commented Jun 13, 2013

would be nice as @jreback says to implement this since indices are supposed to be immutable. currently they try to hash the underlying ndarray which of course fails because it is mutable.

succinct reasoning behind why you need immutability for hashables

python docs quote about using ^ (exclusive or) in the implementation

but see this answer for a way to hash numpy arrays.
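For context, here is a minimal sketch of both points above: an ndarray is unhashable precisely because it's mutable, and the workaround in the vein of the linked answer is to hash the array's bytes (sound only if the array is then treated as immutable):

```python
import hashlib

import numpy as np

# ndarray sets __hash__ to None because arrays are mutable, so hashing fails.
arr = np.array([1, 2, 3])
unhashable = False
try:
    hash(arr)
except TypeError:
    unhashable = True

# Workaround along the lines of the linked answer: hash the raw bytes.
# This is only safe if the array is never mutated afterwards.
digest = hashlib.sha1(arr.tobytes()).hexdigest()
same_digest = hashlib.sha1(np.array([1, 2, 3]).tobytes()).hexdigest()
```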

@cpcloud
Member Author

cpcloud commented Jun 13, 2013

i've got something going here passing index tests but not multiindex. think there's an issue with tuple array copying possibly. although that doesn't give regular indexes with tuples a problem. might need to hash levels and labels as well for multiindexes.

@hayd
Contributor

hayd commented Jun 13, 2013

cc #3884.

@jtratner
Contributor

@cpcloud What's the goal here?

Under the hood, when doing things like key lookups, Python first uses __hash__ to find the right bin for an object, but then uses __eq__ to test whether it's actually the same object. It looks like Index's __eq__ just delegates to the ndarray __eq__, right? So if you used an Index as a key, lookup would end up raising the standard error about bool(ndarray) being undefined. You'd need to change the __eq__ method to be able to take advantage of it as a key. (This is also why it's not vitally important for __hash__ to be unique.)

If you're doing this, you should add some tests for using the hash, like putting indexes in sets and dicts, so you can be sure that no one changes __eq__. Something like these:

def test_copies_have_same_hash():
    df = mkdf(...)
    ind = df.index
    ind2 = df.copy().index
    s = set([ind])
    assert ind in s
    assert ind2 in s

def test_different_indices_can_not_look_each_other_up():
    df = mkdf(...)
    df2 = mkdf(...)  # different length or something
    ind = df.index
    ind2 = df2.index
    d = {ind: 5}
    assert ind in d
    assert ind2 not in d
    assert d[ind] == 5

Then you'd also have to change Index's __eq__ to be something like:

def __eq__(self, other):
    if isinstance(other, Index):
        return (self.values == other.values).all()
    else:
        # either return False or do something else...
        return NotImplemented

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

ah there's a problem because __eq__ is used to resolve hash collisions. maybe we should abandon this, or just table it for someday, since @jtratner is right that we'd have to make __eq__ return a scalar

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

nice catch

@jtratner
Contributor

If there is a need for this, you could create a HashableIndex that has a
__hash__ method and the modified __eq__ method that returns a scalar,
and otherwise is just a subclass of Index (and maybe it just grabs the
underlying array without copying?).
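A rough sketch of that idea (hypothetical code, not the actual pandas implementation): a wrapper whose __eq__ returns a scalar bool so it works as a dict/set key, and whose __hash__ is derived from the underlying data, assuming the data is never mutated:

```python
import hashlib

import numpy as np


class HashableIndex:
    """Sketch of the wrapper described above (hypothetical, not pandas
    code): __eq__ returns a scalar bool so the object works as a dict/set
    key, and __hash__ is derived from the underlying data, which is
    assumed to be immutable."""

    def __init__(self, values, name=None):
        self.values = np.asarray(values)
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, HashableIndex):
            return NotImplemented
        return (self.name == other.name
                and self.values.shape == other.values.shape
                and bool((self.values == other.values).all()))

    def __hash__(self):
        # hash the raw bytes plus the name; cheap and deterministic
        digest = hashlib.sha1(self.values.tobytes()).digest()
        return hash((digest, self.name))


a = HashableIndex([1, 2, 3], name="x")
b = HashableIndex([1, 2, 3], name="x")
c = HashableIndex([4, 5, 6], name="x")
d = {a: "first"}
```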

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

hm that is not a bad idea...Pul Rekwest™!

@jtratner
Contributor

Is your hashing scheme working right now? Otherwise you could just default to
repr


@cpcloud
Member Author

cpcloud commented Jun 14, 2013

the hashing scheme is not working at the moment for multiindexes, and i haven't incorporated your tests above with it.

currently using the sha1 hash of a view of the array as bytes, xor-ed with the sha1 of the index name, dtype, inferred_type, and class name. dtype and inferred_type are probably somewhat redundant, but the dtype hash gives you endianness info, so it should probably stay if we go with this scheme. tbh it sounds like you might know more about this than i do, so feel free to take it...
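The scheme described here might look roughly like this (a sketch with hypothetical names; `inferred_type` and `cls_name` stand in for the Index attributes mentioned above):

```python
import hashlib

import numpy as np


def index_hash(values, name, inferred_type, cls_name):
    """Sketch of the scheme described above (function and argument names
    are hypothetical): sha1 the raw array bytes, then XOR in sha1 digests
    of the index metadata. str(dtype) carries endianness ('<i8' vs '>i8'),
    which is why the dtype term is worth keeping."""
    arr = np.asarray(values)

    def sha1_int(data):
        # reduce a sha1 digest to a Python int so digests can be XOR-ed
        return int.from_bytes(hashlib.sha1(data).digest(), "big")

    h = sha1_int(arr.view(np.uint8).tobytes())
    for meta in (str(name), str(arr.dtype), inferred_type, cls_name):
        h ^= sha1_int(meta.encode("utf-8"))
    return h


h1 = index_hash([1, 2, 3], "x", "integer", "Index")
h2 = index_hash([1, 2, 3], "x", "integer", "Index")
h3 = index_hash([1, 2, 3], "y", "integer", "Index")
```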

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

defaulting to repr will be slow for large indices, since an index's repr is valid python code iirc

@jtratner
Contributor

I don't know a ton, but here's my limited understanding: hashes are sorted
into bins, and it's important for them to be (in general) well distributed,
because you want objects to end up in separate bins. If too many end up in
the same bin (same hash), you get "hash collisions", which make things very
slow because you have to do more equality comparisons (e.g., the issues with
web frameworks where the entire URL / request headers get put into a
dictionary, so a malformed request can cause huge and slow processing time).
So for our purposes we definitely want some kind of unique hash, but given
that under the hood there will still be an equality check, it's probably
better to go for a faster hashing algorithm. Anyways, it'll be really easy
to write a HashableIndex and then you can incorporate it into what you're
doing already :)
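The collision point can be illustrated with a toy class (purely illustrative) whose hash is constant: every key lands in the same bin, so a single dict lookup falls back on a chain of __eq__ comparisons:

```python
class BadHash:
    """Toy class (purely illustrative): a constant __hash__ puts every
    instance in the same bin, so dict lookups degrade to repeated
    __eq__ comparisons instead of a near-constant-time probe."""

    eq_calls = 0  # count how often the dict falls back on __eq__

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # every instance collides

    def __eq__(self, other):
        BadHash.eq_calls += 1
        return isinstance(other, BadHash) and self.value == other.value


d = {BadHash(i): i for i in range(100)}

BadHash.eq_calls = 0
looked_up = d[BadHash(99)]                # a single lookup...
collision_comparisons = BadHash.eq_calls  # ...triggers many __eq__ calls
```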


@cpcloud
Member Author

cpcloud commented Jun 15, 2013

sure, yes, basic properties of hashing. i thought you might be a hashing wizard ;) magic constants always make me die a little inside...

@jreback
Contributor

jreback commented Jun 15, 2013

what's the reason we even want hashability?
what does it gain/allow?

@cpcloud
Member Author

cpcloud commented Jun 15, 2013

that really is the question.

@cpcloud
Member Author

cpcloud commented Jun 15, 2013

consistency with the idea that indexes are immutable is the only thing that comes to mind so far for me... but i could be thinking small here. the original question was about memoization of a frame, but that's really a non-starter anyway, since frames will never be hashable in the python sense

@jtratner
Contributor

@cpcloud yeah, I'm definitely no hashing master or whatever. :P anyways, you can see jtratner/pandas@880a8ea for a framework for a kind of hashable index _class_... Personally, I don't think it really gets you anything special. (also, I might have used too much magic there...not sure)

@ghost

ghost commented Jun 19, 2013

related #2461

@jreback
Contributor

jreback commented Sep 22, 2013

@jtratner I believe with your new identity fixes we can close this?

@jtratner
Contributor

it's not quite the same thing (for example, if you pickled and unpickled an Index, the _id and is_() would change)...still don't see the use for this.

@jreback
Contributor

jreback commented Sep 22, 2013

ok... close? or explicitly define __hash__ to raise, to be explicit about it

@cpcloud
Member Author

cpcloud commented Sep 28, 2013

Index defines __hash__ ... i've just put up #5019 to give a better error message ... closing
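The shape of that resolution, sketched here with a toy class rather than the actual pandas source: define __hash__ explicitly so it raises a clear error instead of the generic ndarray one:

```python
class Index:
    """Toy stand-in (not the pandas source) illustrating the fix: an
    explicit __hash__ that raises a clear, type-specific error."""

    def __hash__(self):
        raise TypeError(f"unhashable type: {type(self).__name__!r}")


try:
    hash(Index())
    message = None
except TypeError as err:
    message = str(err)
```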

@cpcloud cpcloud closed this as completed Sep 28, 2013