BUG: Bug in MultiIndex.has_duplicates when having many levels causes an indexer overflow (GH9075) #9077

Closed

jreback
Contributor

@jreback jreback commented Dec 14, 2014

closes #9075

@jreback
Contributor Author

jreback commented Dec 15, 2014

So this is implemented on MultiIndex mainly for a perf benefit (i.e. the cost is on the order of the number of levels, rather than the length of the index).

Should implement the trivial version on Index for compat. Essentially it's just: not self.is_unique
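
A minimal sketch of the trivial fallback described above, assuming a flat index is just a sequence of hashable values. `FlatIndex` is a hypothetical stand-in written for illustration, not the pandas class; it only shows that has_duplicates can be defined as the negation of is_unique.

```python
class FlatIndex:
    """Hypothetical stand-in for a flat (single-level) index."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def is_unique(self):
        # Unique when deduplicating does not shrink the values.
        return len(set(self.values)) == len(self.values)

    @property
    def has_duplicates(self):
        # The trivial version: simply the negation of is_unique.
        return not self.is_unique
```

Unlike the level-based MultiIndex version, this check scales with the length of the index.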

@behzadnouri
Contributor

This can overflow with int64 as well. It should follow something like this, and ideally be factored out into one place.

@jreback
Contributor Author

jreback commented Dec 15, 2014

@behzadnouri you are probably right, but I don't think this is a practical overflow (unlike a groupby, which deals with a theoretical space). Can you come up with an example that actually does overflow? (obviously using int64)
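
One way to look for such an example: the combined index needs room for the product of the level sizes, which does not depend on the length of the index. The sketch below uses plain Python integers (which do not overflow) to check whether a given shape fits into a signed int64. The shapes are made-up illustrations, not taken from the linked issue, and `fits_in_int64` is a hypothetical helper.

```python
import math

INT64_MAX = 2**63 - 1


def fits_in_int64(shape):
    """Return True if every combined label for `shape` fits in a signed int64."""
    # The largest combined label is prod(shape) - 1.
    return math.prod(shape) - 1 <= INT64_MAX


few_levels = (500,) * 3    # 500**3 = 125,000,000
many_levels = (500,) * 8   # 500**8 is about 3.9e21

print(fits_in_int64(few_levels))   # True
print(fits_in_int64(many_levels))  # False
```

Under these assumptions, eight levels of 500 distinct values each already exceed the int64 range, even if the index itself has only a few hundred rows.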

@behzadnouri
Contributor

This commit, which says:

Fix int64 overflow issue when unstacking MultiIndex with many levels (#2616)

The get_group_index function is technically doing the same thing as the first lines of the has_duplicates function.

@jreback
Contributor Author

jreback commented Dec 16, 2014

@behzadnouri

what do you think?

group_index = np.zeros(len(self), dtype='i8')
for i in range(len(shape)):
    stride = np.prod([x for x in shape[i + 1:]], dtype='i8')
-    group_index += self.labels[i] * stride
+    group_index += _ensure_int64(self.labels[i]) * stride
Contributor

this will break if any of the self.labels[i] are -1.
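
To illustrate the problem with -1, here is a pure-Python sketch of the same stride-based combination, using lists of integers instead of NumPy arrays so the arithmetic is easy to follow. `flat_index` is a hypothetical helper, and the labels are made up for the example.

```python
def flat_index(labels, shape):
    """Combine per-level integer labels into one integer per row.

    labels: one list of integer codes per level.
    shape:  number of distinct values in each level.
    """
    n_rows = len(labels[0])
    combined = [0] * n_rows
    for i, level_labels in enumerate(labels):
        # Stride of level i is the product of the sizes of all later levels.
        stride = 1
        for size in shape[i + 1:]:
            stride *= size
        for row, code in enumerate(level_labels):
            combined[row] += code * stride
    return combined


shape = (2, 3)
# Row 0 is (0, 2); row 1 is (1, -1), i.e. its second level is missing.
labels = [[0, 1], [2, -1]]
print(flat_index(labels, shape))  # [2, 2]
```

The two rows are different, yet both map to 2, so a duplicate check on the combined values would report a duplicate that is not there.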

Contributor Author

So if there are any NaNs, what would you do?

Contributor

If any of the labels are -1, we just need to lift the labels and the size by one, just as in the _maybe_lift function, which is part of this PR.
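
A sketch of the lifting idea, assuming -1 marks a missing value: for any level that contains -1, shift its labels up by one and grow its size by one, so that -1 becomes 0 and gets its own slot. `maybe_lift` and `flat_index_lifted` are hypothetical pure-Python helpers written for illustration; they are not the pandas `_maybe_lift` implementation.

```python
def maybe_lift(level_labels, size):
    """Shift labels up by one if any is -1, growing the level size to match."""
    if any(code == -1 for code in level_labels):
        return [code + 1 for code in level_labels], size + 1
    return list(level_labels), size


def flat_index_lifted(labels, shape):
    """Combine per-level labels into one integer per row, after lifting."""
    lifted = [maybe_lift(lab, size) for lab, size in zip(labels, shape)]
    labels = [lab for lab, _ in lifted]
    shape = [size for _, size in lifted]
    combined = [0] * len(labels[0])
    stride = 1
    # Walk levels from last to first so the stride accumulates as we go.
    for level_labels, size in zip(reversed(labels), reversed(shape)):
        for row, code in enumerate(level_labels):
            combined[row] += code * stride
        stride *= size
    return combined


# Row 0 is (0, 2); row 1 is (1, -1).
labels = [[0, 1], [2, -1]]
print(flat_index_lifted(labels, (2, 3)))  # [3, 4]
```

With the second level lifted to size 4, the two distinct rows now map to distinct values. Lifting also enlarges the product of the sizes, so the int64 range concern raised earlier in the thread still applies.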

@jreback
Contributor Author

jreback commented Dec 18, 2014

closing in favor of #9101

@jreback jreback closed this Dec 18, 2014
Development

Successfully merging this pull request may close these issues.

change in MultiIndex.has_duplicates behavior from 0.15.0 -> 0.15.2
2 participants