-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Concat Index
behavior diverts from pandas
#15649
Comments
I don't think we want to support concatenating indexes, but I think cudf does rely on this behavior internally. I have a vague recollection of trying to remove this at one point and realizing that there were dependent changes required to make that happen. It looks like you've started adding deprecations for this in #15650, which is a great way to check this. Since our test suite runs with warnings as errors, if there are internal usages of concatenating indexes we'll see them as test failures there. Let's see how that goes! |
@vyasr assuming we have reasonable test coverage, based on my PR, the areas with dependent code are related to:
|
OK awesome. It looks like you've started trying to catch those warnings in #15650, which is a good way to make sure that you've tracked down all the cases. However, I don't think we want to merge that exactly as is, because if we do then our users will start seeing a lot of warnings from APIs that they can't do anything about. We do want to verify that |
@vyasr that makes sense, i'll set the PR to draft mode until i can work through 2, 3, 4, & 5 listed above (the only failing unit test for 1 directly calls |
Thanks! Feel free to let us know if you run into any issues and need some help. |
based on the above, it seems to me that we need a function which supports the concatenation of
what do you think? |
Good analysis! You've found most of the relevant places, but what we ought to also look at is how dask implements concatenation for pandas indexes. It looks dask uses append for pandas indexes. You can see the implementation in dask for pandas dataframes here. The append method was removed for Series and DataFrame in pandas 2.0, but it still exists for indexes. So what we probably want to do is update the dispatch in the dask_cudf backend that you linked to to use the index append method. In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state. However, any such changes in pandas would be a longer-term thing, so we should be able to proceed with the above plan in the interim. |
thanks so much for the context, @vyasr! i'll plan to proceed with 1 via the usage of index append -- to me, the implementation in dask pandas is ugly. i think it'd be easier to follow if we separated indices from other objects |
IIRC I'm not sure if |
Hmm OK yeah I guess that sort of makes sense. I really wish there were fewer ways in which we needed to special-case for indexes. |
Hi @er-eis have you had a chance to work on this (or any of the other issues for which you opened PRs)? |
@vyasr, i'll try to carve some time out next week to work on this. been slammed with other responsibilities lately |
No problem, just checking in. Thanks for the update! |
things keep popping up and i'm about to start a project that will last at least 6 months, so if someone else has time in the next 6 months they should wrap this up |
As @wence- points out here #15623 (comment)
We allow concatenating indexes whilst
pandas
does not.Do we like this difference in behavior? Do we want parity? Would users be upset if we changed this behavior?
The text was updated successfully, but these errors were encountered: