Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove internal usage of core.index.as_index in favor of cudf.Index #15851

Merged
merged 3 commits into from
Jun 5, 2024

Conversation

mroeschke
Copy link
Contributor

Description

cudf.Index.__init__ essentially calls as_index immediately internally. To avoid both from potentially diverging, the public cudf.Index should be preferred to ensure the public behaviors are used internally

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 24, 2024
@mroeschke mroeschke requested a review from a team as a code owner May 24, 2024 00:37
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a time when all Index classes implemented __new__ instead of __init__ as a way to allow the parent Index class to construct indexes of the appropriate type. A lot of the class hierarchy was reworked to simplify the constructor logic and ensure that you could still return the right types. pandas 2.0 simplified this pretty substantially by getting rid of all the subclasses of Index corresponding to numerical and string types. Is it still possible for pd.Index(...) to return an object of a type other than Index itself? IIRC we now have to explicitly construct CategoricalIndex, MultiIndex, and RangeIndex, right? The only caveat is the tupleize_cols argument, which throws a bit of a wrench in things; is that planned to remain long-term in pandas?

Basically what I'm wondering is whether we can go even further than this PR currently does (probably in a follow up) to where we remove as_index altogether (or at least make it an internal only helper function used in specific cases; right now I'm sure it gets used externally by projects like cuspatial and cuml) and also reduce the complexity of the index hierarchy without needing to implement __new__, which adds confusion I'd like to avoid if possible.

@mroeschke
Copy link
Contributor Author

Is it still possible for pd.Index(...) to return an object of a type other than Index itself?

Yeah the pandas.Index constructor can still return the remaining subclasses e.g.

In [1]: import pandas as pd

In [2]: import datetime

In [3]: pd.Index([datetime.datetime.now()])
Out[3]: DatetimeIndex(['2024-06-03 15:29:10.717813'], dtype='datetime64[us]', freq=None)

The only caveat is the tupleize_cols argument, which throws a bit of a wrench in things; is that planned to remain long-term in pandas?

Yeah I imagine so in order to give users a way to construct a pandas.Index of tuple objects instead of performing "inference" and converting the tuples to pandas.MulitIndex levels

Basically what I'm wondering is whether we can go even further than this PR ...

I think once we get other RAPIDS libaries to stop using as_index we can inline it in the cudf.Index constructor, but I think it will have to remain in __new__ (or similar) since that's also how pandas dispatches Index to it's subclasses. I also dislike that pandas.Index can return it's subclasses, but since the int and float subclasses were removed in pandas 2.0 maybe there will be more interest from the core team to have Index just be the type to hold more date types that the subclasses handle

@vyasr
Copy link
Contributor

vyasr commented Jun 4, 2024

It would be nice if we could move towards a world where the Index constructor only returned instances of itself. In the meantime, maybe the best that we can do is subclass index with an intermediate class that overrides __new__ and gets rid of the specializing behavior, that way subclasses can just override __init__. That might alleviate some of the complexity that would otherwise be present when overriding a parent class that implements a nontrivial __new__.

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 22ef063 into rapidsai:branch-24.08 Jun 5, 2024
71 checks passed
@mroeschke mroeschke deleted the ref/as_index/internal branch June 5, 2024 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants