Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Concat Index behavior diverts from pandas #15649

Open
er-eis opened this issue May 4, 2024 · 14 comments · May be fixed by #15650
Open

[BUG] Concat Index behavior diverts from pandas #15649

er-eis opened this issue May 4, 2024 · 14 comments · May be fixed by #15650
Labels
bug Something isn't working

Comments

@er-eis
Copy link
Contributor

er-eis commented May 4, 2024

As @wence- points out here #15623 (comment)

We allow concatenating indexes whilst pandas does not.

Do we like this difference in behavior? Do we want parity? Would users be upset if we changed this behavior?

@er-eis er-eis added the bug Something isn't working label May 4, 2024
@er-eis er-eis linked a pull request May 4, 2024 that will close this issue
2 tasks
@vyasr
Copy link
Contributor

vyasr commented May 7, 2024

I don't think we want to support concatenating indexes, but I think cudf does rely on this behavior internally. I have a vague recollection of trying to remove this at one point and realizing that there were dependent changes required to make that happen. It looks like you've started adding deprecations for this in #15650, which is a great way to check this. Since our test suite runs with warnings as errors, if there are internal usages of concatenating indexes we'll see them as test failures there. Let's see how that goes!

@er-eis
Copy link
Contributor Author

er-eis commented May 7, 2024

@vyasr assuming we have reasonable test coverage, based on my PR, the areas with dependent code are related to:

  1. cudf Indexes (not a big shocker there)
  2. cudf parquets
  3. dask_cudf parquets
  4. dask_cudf accessor
  5. dask_cudf core

@vyasr
Copy link
Contributor

vyasr commented May 8, 2024

OK awesome. It looks like you've started trying to catch those warnings in #15650, which is a good way to make sure that you've tracked down all the cases. However, I don't think we want to merge that exactly as is, because if we do then our users will start seeing a lot of warnings from APIs that they can't do anything about. We do want to verify that cudf.concat is itself going to throw the warning when users directly use it in a way that will be unsupported in the future, but we want to make sure that users don't see warnings due to our own code internally using deprecated functionality. We should start working through those APIs to see how best to rewrite them so that they don't rely on the deprecated functionality anymore. Does that make sense? Feel free to ask questions here (or on the PR)!

@er-eis
Copy link
Contributor Author

er-eis commented May 9, 2024

@vyasr that makes sense, i'll set the PR to draft mode until i can work through 2, 3, 4, & 5 listed above (the only failing unit test for 1 directly calls cudf.concat, so i think we don't need an investigation there)

@vyasr
Copy link
Contributor

vyasr commented May 15, 2024

Thanks! Feel free to let us know if you run into any issues and need some help.

@er-eis
Copy link
Contributor Author

er-eis commented May 19, 2024

@vyasr & cc @wence-

dask expects to be able to concat Index objects: https://github.com/dask/dask/blob/main/dask/dataframe/dispatch.py#L50

cudf calls dask, which then calls the concat in cudf via this definition: https://github.com/rapidsai/cudf/blob/branch-24.06/python/dask_cudf/dask_cudf/backends.py#L276

based on the above, it seems to me that we need a function which supports the concatenation of cudf.BaseIndex objects. if you agree this is the case, then i think we shouldn't deprecate this functionality. some possibilities that come to mind (i am missing context here):

  1. move the concatenation of cudf.BaseIndex to a function other than reshape's concat. make the func specific to cudf.BaseIndex. make a new backends func to handle this.
  2. leave the functionality in reshape's concat, accept we diverge from pandas functionality for concat
  3. remove the associated calls to dask so we don't need to worry about what they're doing. this seems like something that wouldn't be worth our time
  4. convince the dask team to deprecate their concatenation? this seems like something not worth pursuing.

what do you think?

@vyasr
Copy link
Contributor

vyasr commented May 20, 2024

Good analysis! You've found most of the relevant places, but what we ought to also look at is how dask implements concatenation for pandas indexes. It looks dask uses append for pandas indexes. You can see the implementation in dask for pandas dataframes here. The append method was removed for Series and DataFrame in pandas 2.0, but it still exists for indexes. So what we probably want to do is update the dispatch in the dask_cudf backend that you linked to to use the index append method.

In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state. However, any such changes in pandas would be a longer-term thing, so we should be able to proceed with the above plan in the interim.

@er-eis
Copy link
Contributor Author

er-eis commented May 21, 2024

thanks so much for the context, @vyasr! i'll plan to proceed with 1 via the usage of index append -- to me, the implementation in dask pandas is ugly. i think it'd be easier to follow if we separated indices from other objects

@mroeschke
Copy link
Contributor

In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state.

IIRC pandas.Series.append and pandas.DataFrame.append were deprecated because similar and more performance functionality existed in pandas.concat; however, pandas.concat never worked on pandas.Index objects that why .append exists on pandas.Index.

I'm not sure if pandas.concat will ever support pandas.Index objects though since it's used for "concat pandas objects that have axes"

@vyasr
Copy link
Contributor

vyasr commented May 21, 2024

Hmm OK yeah I guess that sort of makes sense. I really wish there were fewer ways in which we needed to special-case for indexes.

@vyasr
Copy link
Contributor

vyasr commented Jul 19, 2024

Hi @er-eis have you had a chance to work on this (or any of the other issues for which you opened PRs)?

@er-eis
Copy link
Contributor Author

er-eis commented Jul 21, 2024

@vyasr, i'll try to carve some time out next week to work on this. been slammed with other responsibilities lately

@vyasr
Copy link
Contributor

vyasr commented Jul 22, 2024

No problem, just checking in. Thanks for the update!

@er-eis
Copy link
Contributor Author

er-eis commented Aug 3, 2024

things keep popping up and i'm about to start a project that will last at least 6 months, so if someone else has time in the next 6 months they should wrap this up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants