
implement hash for indices #3885

Closed
cpcloud opened this issue Jun 13, 2013 · 21 comments
Labels
API Design Enhancement Ideas Long-Term Enhancement Discussions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@cpcloud
Member

cpcloud commented Jun 13, 2013

would be nice as @jreback says to implement this since indices are supposed to be immutable. currently they try to hash the underlying ndarray which of course fails because it is mutable.

succinct reasoning behind why you need immutability for hashables

python docs quote about using ^ (exclusive or) in the implementation

but see this answer for a way to hash numpy arrays.
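For context, here is a minimal sketch of both points above: an ndarray is unhashable precisely because it's mutable, and the workaround in the vein of the linked answer is to hash the array's bytes (sound only if the array is then treated as immutable):

```python
import hashlib

import numpy as np

# ndarray sets __hash__ to None because arrays are mutable, so hashing fails.
arr = np.array([1, 2, 3])
unhashable = False
try:
    hash(arr)
except TypeError:
    unhashable = True

# Workaround along the lines of the linked answer: hash the raw bytes.
# This is only safe if the array is never mutated afterwards.
digest = hashlib.sha1(arr.tobytes()).hexdigest()
same_digest = hashlib.sha1(np.array([1, 2, 3]).tobytes()).hexdigest()
```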

@cpcloud
Member Author

cpcloud commented Jun 13, 2013

i've got something going here passing index tests but not multiindex. think there's an issue with tuple array copying possibly. although that doesn't give regular indexes with tuples a problem. might need to hash levels and labels as well for multiindexes.

@hayd
Contributor

hayd commented Jun 13, 2013

cc #3884.

@jtratner
Contributor

@cpcloud What's the goal here?

Under the hood, when doing things like key lookups, Python first uses __hash__ to find the right bin for an object, but then uses __eq__ to test whether it's actually the same object. It looks like Index's __eq__ just delegates to the ndarray __eq__, right? So if you used an Index as a key, lookup would end up raising the standard error about bool(ndarray) being undefined. You'd need to change the __eq__ method to be able to take advantage of it as a key. (This is also why it's not vitally important for __hash__ to be unique.)

If you're doing this, you should add some tests for using the hash, like putting indexes in sets and dicts, so you can be sure that no one changes __eq__. Something like these:

def test_copies_have_same_hash():
    df = mkdf(...)
    ind = df.index
    ind2 = df.copy().index
    s = set([ind])
    assert ind in s
    assert ind2 in s

def test_different_indices_can_not_look_each_other_up():
    df = mkdf(...)
    df2 = mkdf(...)  # different length or something
    ind = df.index
    ind2 = df2.index
    d = {ind: 5}
    assert ind in d
    assert ind2 not in d
    assert d[ind] == 5

Then you'd also have to change Index's __eq__ to be something like:

def __eq__(self, other):
    if isinstance(other, Index):
        return (self.values == other.values).all()
    else:
        # either return False or do something else...
        return NotImplemented

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

ah there's a problem because __eq__ is used to resolve hash collisions. maybe we should abandon this, or just table it for someday, since @jtratner is right that we'd have to make __eq__ return a scalar

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

nice catch

@jtratner
Contributor

If there is a need for this, you could create a HashableIndex that has a
__hash__ method and the modified __eq__ method that returns a scalar,
and otherwise is just a subclass of Index (and maybe it just grabs the
underlying array without copying?).
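A rough sketch of that idea (hypothetical code, not the actual pandas implementation): a wrapper whose __eq__ returns a scalar bool so it works as a dict/set key, and whose __hash__ is derived from the underlying data, assuming the data is never mutated:

```python
import hashlib

import numpy as np


class HashableIndex:
    """Sketch of the wrapper described above (hypothetical, not pandas
    code): __eq__ returns a scalar bool so the object works as a dict/set
    key, and __hash__ is derived from the underlying data, which is
    assumed to be immutable."""

    def __init__(self, values, name=None):
        self.values = np.asarray(values)
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, HashableIndex):
            return NotImplemented
        return (self.name == other.name
                and self.values.shape == other.values.shape
                and bool((self.values == other.values).all()))

    def __hash__(self):
        # hash the raw bytes plus the name; cheap and deterministic
        digest = hashlib.sha1(self.values.tobytes()).digest()
        return hash((digest, self.name))


a = HashableIndex([1, 2, 3], name="x")
b = HashableIndex([1, 2, 3], name="x")
c = HashableIndex([4, 5, 6], name="x")
d = {a: "first"}
```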

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

hm that is not a bad idea...Pul Rekwest™!

@jtratner
Contributor

Is your hashing scheme working right now? Otherwise you could just default to
repr


@cpcloud
Member Author

cpcloud commented Jun 14, 2013

the hashing scheme is not working at the moment for multiindexes, and i haven't incorporated your tests above with it.

currently using the sha1 hash of a view of the array as bytes, xor-ed with the sha1 of the index name, dtype, inferred_type, and class name. dtype and inferred_type are probably somewhat redundant, but the dtype hash gives you endianness info, so it should probably stay if we go with this scheme. tbh it sounds like you might know more about this than i do, so feel free to take it...
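The scheme described here might look roughly like this (a sketch with hypothetical names; `inferred_type` and `cls_name` stand in for the Index attributes mentioned above):

```python
import hashlib

import numpy as np


def index_hash(values, name, inferred_type, cls_name):
    """Sketch of the scheme described above (function and argument names
    are hypothetical): sha1 the raw array bytes, then XOR in sha1 digests
    of the index metadata. str(dtype) carries endianness ('<i8' vs '>i8'),
    which is why the dtype term is worth keeping."""
    arr = np.asarray(values)

    def sha1_int(data):
        # reduce a sha1 digest to a Python int so digests can be XOR-ed
        return int.from_bytes(hashlib.sha1(data).digest(), "big")

    h = sha1_int(arr.view(np.uint8).tobytes())
    for meta in (str(name), str(arr.dtype), inferred_type, cls_name):
        h ^= sha1_int(meta.encode("utf-8"))
    return h


h1 = index_hash([1, 2, 3], "x", "integer", "Index")
h2 = index_hash([1, 2, 3], "x", "integer", "Index")
h3 = index_hash([1, 2, 3], "y", "integer", "Index")
```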

@cpcloud
Member Author

cpcloud commented Jun 14, 2013

defaulting to repr will be slow for large indices, since an index's repr is valid python code iirc

@jtratner
Contributor

I don't know a ton, but here's my limited understanding: hashes are sorted
into bins, and it's important for them to be (in general) well distributed,
because you want objects to end up in separate bins. If too many end up in
the same bin (same hash), you get "hash collisions", which make things very
slow because you have to do more equality comparisons (e.g., the issues with
web frameworks where the entire URL / request headers get put into a
dictionary, so a malformed request can cause huge and slow processing time).
So for our purposes we definitely want some kind of unique hash, but given
that under the hood there will still be an equality check, it's probably
better to go for a faster hashing algorithm. Anyways, it'll be really easy
to write a HashableIndex and then you can incorporate it into what you're
doing already :)
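The collision point can be illustrated with a toy class (purely illustrative) whose hash is constant: every key lands in the same bin, so a single dict lookup falls back on a chain of __eq__ comparisons:

```python
class BadHash:
    """Toy class (purely illustrative): a constant __hash__ puts every
    instance in the same bin, so dict lookups degrade to repeated
    __eq__ comparisons instead of a near-constant-time probe."""

    eq_calls = 0  # count how often the dict falls back on __eq__

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # every instance collides

    def __eq__(self, other):
        BadHash.eq_calls += 1
        return isinstance(other, BadHash) and self.value == other.value


d = {BadHash(i): i for i in range(100)}

BadHash.eq_calls = 0
looked_up = d[BadHash(99)]                # a single lookup...
collision_comparisons = BadHash.eq_calls  # ...triggers many __eq__ calls
```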


@cpcloud
Member Author

cpcloud commented Jun 15, 2013

sure, yes, basic properties of hashing. i thought you might be a hashing wizard ;) magic constants always make me die a little inside...

@jreback
Contributor

jreback commented Jun 15, 2013

what's the reason we even want hashability?
what does it gain/allow?

@cpcloud
Member Author

cpcloud commented Jun 15, 2013

that really is the question.

@cpcloud
Member Author

cpcloud commented Jun 15, 2013

consistency with the idea that indexes are immutable is the only thing that comes to mind so far for me... but i could be thinking small here. the original question was about memoization of a frame, but that's really a non-starter anyway, since frames will never be hashable in the python sense

@jtratner
Contributor

@cpcloud yeah, I'm definitely no hashing master or whatever. :P anyways, you can see jtratner/pandas@880a8ea for a framework for a kind of hashable index _class_... Personally, I don't think it really gets you anything special. (also, I might have used too much magic there...not sure)

@ghost

ghost commented Jun 19, 2013

related #2461

@jreback
Contributor

jreback commented Sep 22, 2013

@jtratner I believe with your new identity fixes we can close this?

@jtratner
Contributor

it's not quite the same thing (for example, if you pickled and unpickled an Index, the _id and is_() would change)...still don't see the use for this.

@jreback
Contributor

jreback commented Sep 22, 2013

ok... close? or explicitly define __hash__ to raise, to be explicit about it

@cpcloud
Member Author

cpcloud commented Sep 28, 2013

Index defines __hash__ ... i've just put up #5019 to give a better error message ... closing
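The shape of that resolution, sketched here with a toy class rather than the actual pandas source: define __hash__ explicitly so it raises a clear error instead of the generic ndarray one:

```python
class Index:
    """Toy stand-in (not the pandas source) illustrating the fix: an
    explicit __hash__ that raises a clear, type-specific error."""

    def __hash__(self):
        raise TypeError(f"unhashable type: {type(self).__name__!r}")


try:
    hash(Index())
    message = None
except TypeError as err:
    message = str(err)
```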

@cpcloud cpcloud closed this as completed Sep 28, 2013