-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor cudf.Series.__init__ #14450
Conversation
python/cudf/cudf/core/series.py
Outdated
name_from_data = data.name | ||
column = as_column(data, nan_as_null=nan_as_null, dtype=dtype) | ||
if isinstance(data, (pd.Series, Series)): | ||
index, index_from_data = as_index(data.index), index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming is confusing for me on this line. index
is the passed in argument, index_from_data
is the index extracted from data. However it seems as if the meaning swapped on this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I can see how this is confusing, I'll swap the naming here throughout the __init__
to make it clearer
python/cudf/cudf/core/series.py
Outdated
data = data.astype(dtype) | ||
|
||
if isinstance(data, dict): | ||
elif isinstance(data, dict) or data is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps separating these two conditions as two branches is clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can swap data=None
for data={}
earlier and simplify this condition
if dtype is not None: | ||
column = column.astype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After passing dtype
into every as_column
call above, why do we still need to cast the column type here? Just curious!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly defensive for now as I am not sure if as_column
currently always respects dtype
casting (e.g. I found a case recently in https://github.com/rapidsai/cudf/pull/14686/files) but I think this could be removed in the future!
python/cudf/cudf/core/series.py
Outdated
if index_from_data is not None: | ||
# TODO: This there a better way to do this? | ||
index_from_data = as_index(index_from_data) | ||
reindexed = self.reindex(index=index_from_data, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a _reindex
function that can take inplace=True
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried this and there were some test failures around data types not being preserved during the _reindex
. I guess I'll just do this for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/merge |
Description
I found a few issues related to reindexing and name priority when
index=
andname=
are given, so added unit tests for those.Additionally added some typing in
from_pandas
methodsThe main approach here is to collect the potential column from the
if/elif
branchs first, callsuper().__init__({name: columns}, index=index)
, and then apply the additional keywords to the resultChecklist