Refactor cudf.Series.__init__ #14450

Merged
17 commits merged into rapidsai:branch-24.02 on Jan 8, 2024

mroeschke
Contributor

@mroeschke mroeschke commented Nov 18, 2023

Description

I found a few issues related to reindexing and name priority when index= and name= are given, so I added unit tests for those.

Additionally, I added some typing to the from_pandas methods.

The main approach here is to collect the potential column from the if/elif branches first, call super().__init__({name: columns}, index=index), and then apply the additional keywords to the result.
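
For readers skimming the thread, the three steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: ToyFrame and ToySeries are hypothetical stand-ins, not cuDF classes, and reindexing against an explicit index= is deliberately left out.

```python
class ToyFrame:
    """Hypothetical base class (not cuDF's): stores named columns and an index."""

    def __init__(self, data, index=None):
        self._data = dict(data)
        length = len(next(iter(self._data.values()), []))
        self._index = list(index) if index is not None else list(range(length))


class ToySeries(ToyFrame):
    """Hypothetical one-column container illustrating the constructor layout."""

    def __init__(self, data=None, index=None, name=None):
        name_from_data = None
        # Step 1: every branch only collects a column (plus optional metadata).
        if data is None:
            column = []
        elif isinstance(data, ToySeries):
            column = list(data.values)
            name_from_data = data.name
            if index is None:
                index = data.index
        else:
            column = list(data)
        # An explicit name= takes priority over a name carried by the data.
        final_name = name if name is not None else name_from_data
        # Step 2: a single call to the base constructor.
        # (Assumption: no reindexing is attempted in this sketch.)
        super().__init__({final_name: column}, index=index)
        # Step 3: apply the remaining keywords to the constructed result.
        self.name = final_name

    @property
    def values(self):
        return next(iter(self._data.values()))

    @property
    def index(self):
        return self._index
```

With this layout, ToySeries(ToySeries([1, 2, 3], name="x"), name="y") ends up named "y", while omitting name= keeps "x".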

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 18, 2023
@mroeschke mroeschke requested a review from a team as a code owner November 18, 2023 03:01
@mroeschke mroeschke changed the title REF: cudf.Series.__init__ cudf.Series.__init__ Nov 28, 2023
@mroeschke mroeschke changed the title cudf.Series.__init__ Refactor cudf.Series.__init__ Nov 28, 2023
name_from_data = data.name
column = as_column(data, nan_as_null=nan_as_null, dtype=dtype)
if isinstance(data, (pd.Series, Series)):
    index, index_from_data = as_index(data.index), index
Contributor

The naming on this line is confusing to me. index is the passed-in argument, and index_from_data is the index extracted from data. However, it seems as if the meanings are swapped on this line?

Contributor Author

Yeah, I can see how this is confusing. I'll swap the naming here throughout the __init__ to make it clearer.
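
To make the two roles concrete, here is a minimal pure-Python sketch of the naming the reviewer describes: index is the caller's argument and index_from_data is whatever index the data carried. The helper align_to_index is hypothetical (it is not part of cuDF), assumes unique labels, and uses None for missing entries where pandas or cuDF would use a null value.

```python
def align_to_index(values, index_from_data, index=None):
    """Return (labels, values) aligned to the requested index.

    index_from_data: labels that came with the data.
    index: labels explicitly requested by the caller (may be None).
    """
    if index is None:
        # Nothing was requested, so the data's own labels are kept.
        return list(index_from_data), list(values)
    # Assumption: labels are unique, so a dict lookup is enough.
    lookup = dict(zip(index_from_data, values))
    # Labels missing from the data are filled with None.
    return list(index), [lookup.get(label) for label in index]
```

For example, data labelled a, b, c and a requested index of c, a, z yields the values for c and a followed by a missing entry for z.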

data = data.astype(dtype)

if isinstance(data, dict):
elif isinstance(data, dict) or data is None:
Contributor

Perhaps separating these two conditions as two branches is clearer?

Contributor Author

I can replace data=None with data={} earlier and simplify this condition.
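
A small sketch of that idea, assuming a hypothetical helper named collect_column (not cuDF code): once None is normalized to an empty dict up front, the dict branch no longer needs the extra "or data is None" check.

```python
def collect_column(data=None):
    """Return (labels, values) for dict, None, or list-like input."""
    # Normalize None first so a single dict branch covers both cases.
    if data is None:
        data = {}
    if isinstance(data, dict):
        return list(data.keys()), list(data.values())
    values = list(data)
    # List-like input gets a default positional index.
    return list(range(len(values))), values
```

Both collect_column() and collect_column({}) then take the same path and return an empty result.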

Comment on lines +647 to +648
if dtype is not None:
    column = column.astype(dtype)
Contributor

After passing dtype into every as_column call above, why do we still need to cast the column type here? Just curious!

Contributor Author

This is mostly defensive for now, as I am not sure whether as_column currently always respects dtype casting (e.g. I found a case recently in https://github.com/rapidsai/cudf/pull/14686/files), but I think this could be removed in the future!
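
The defensive pattern can be illustrated with a toy example. Everything here is an assumption made for illustration: to_column is a hypothetical converter that only honors dtype for list input, standing in for a helper that might not always apply the requested type, and dtype is a plain Python type such as float rather than a real cuDF dtype.

```python
def to_column(values, dtype=None):
    """Hypothetical converter that honors dtype only for list input."""
    if dtype is not None and isinstance(values, list):
        return [dtype(v) for v in values]
    return list(values)


def build_column(values, dtype=None):
    """Convert, then cast again in case the converter skipped the dtype."""
    column = to_column(values, dtype=dtype)
    # Defensive cast: skipped when every element already has the dtype.
    if dtype is not None and not all(isinstance(v, dtype) for v in column):
        column = [dtype(v) for v in column]
    return column
```

The second cast costs little when the converter already did its job, and it can be dropped once the converter is known to respect dtype in every branch.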

if index_from_data is not None:
    # TODO: This there a better way to do this?
    index_from_data = as_index(index_from_data)
    reindexed = self.reindex(index=index_from_data, copy=False)
Contributor

There seems to be a _reindex function that can take inplace=True parameter.

Contributor Author

I just tried this, and there were some test failures around data types not being preserved during the _reindex. I guess I'll keep the current reindex call for now.

Contributor

@isVoid isVoid left a comment

Thanks!

@mroeschke
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c4e5e8c into rapidsai:branch-24.02 Jan 8, 2024
67 checks passed
@mroeschke mroeschke deleted the ref/from_pandas branch January 8, 2024 22:21