Consolidate DataFrame.__init__ logic to prepare data before calling super #14614
if columns is not None:
    as_idx_typ = None
    if isinstance(columns, list) and len(columns) == 0:
        # TODO: Generically, an empty dtype-less container
I think we can't have the concept of a dtype-less column, so does that idea make sense?
as_idx_typ = None
if isinstance(columns, list) and len(columns) == 0:
    # TODO: Generically, an empty dtype-less container
    # TODO: Why does as_index([]) return FloatIndex
Because cudf.core.column.as_column([]) returns a float column.
# mixed typed elements are allowed e.g. [(1, 2), "a"]
columns = list(columns)
question: I think, as you noted elsewhere, that as soon as we have mixed-type column names we can't do many operations (for instance, transposing the frame). Should we disallow this instead?
if not isinstance(
    columns, MultiIndex
) and columns.nunique() != len(columns):
    raise ValueError("Columns cannot contain duplicate values")
question: Why is it safe for columns to be non-unique if the columns are a MultiIndex?
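For reference, the uniqueness check under discussion can be mirrored on a plain list of labels. This is only a sketch using the standard library: `find_duplicate_labels` and `validate_unique` are hypothetical names, not cuDF functions, and the comparison is roughly (not exactly) what `columns.nunique() != len(columns)` does on an index. It does not answer why a MultiIndex is exempt.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def find_duplicate_labels(labels: Sequence[Hashable]) -> List[Hashable]:
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]


def validate_unique(labels: Sequence[Hashable]) -> None:
    """Raise if any label is repeated (hypothetical helper, not cuDF API)."""
    duplicates = find_duplicate_labels(labels)
    if duplicates:
        raise ValueError(
            f"Columns cannot contain duplicate values: {duplicates}"
        )
```

Mixed-type labels such as `[(1, 2), "a"]` pass through this check because it only relies on hashing, not ordering.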
columns = columns.to_pandas()
col_is_rangeindex = isinstance(columns, pd.RangeIndex)
col_is_multiindex = isinstance(columns, pd.MultiIndex)
if not isinstance(columns, pd.MultiIndex):
Suggested change:
- if not isinstance(columns, pd.MultiIndex):
+ if not col_is_multiindex:
result._data.rangeindex = col_was_rangeindex
result._data.multiindex = col_was_multiindex
result._data.label_dtype = col_label_dtype
return result
suggestion: Is frame._slice the only place where we need to care about carrying over this information? It seems like it might be necessary generally. Hence, should we move this to IndexedFrame._gather?
result._data.rangeindex = col_was_rangeindex
result._data.multiindex = col_was_multiindex
result._data.label_dtype = col_label_dtype
Similarly here, should _from_columns_like_self handle this transfer of information?
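To make the idea concrete, here is a toy sketch of centralizing the metadata transfer in one "like self" constructor so that individual operations such as slicing do not have to copy each flag by hand. `Labels`, `like_self`, and `slice_rows` are hypothetical stand-ins written for this comment; they are not cuDF's ColumnAccessor or its real methods.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List, Optional


@dataclass
class Labels:
    """Toy column container; NOT cuDF's ColumnAccessor."""

    data: Dict[Any, List[Any]] = field(default_factory=dict)
    rangeindex: bool = False
    multiindex: bool = False
    label_dtype: Optional[str] = None

    def like_self(self, new_data: Dict[Any, List[Any]]) -> "Labels":
        # Keep every metadata field and swap in only the new column data.
        return replace(self, data=new_data)


def slice_rows(labels: Labels, start: int, stop: int) -> Labels:
    """Slice every column; metadata is carried over by like_self."""
    sliced = {name: values[start:stop] for name, values in labels.data.items()}
    return labels.like_self(sliced)
```

With this shape, adding a new metadata field only requires touching the container, since `like_self` copies whatever fields exist.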
# pandas returns Index[object] while this should be an empty RangeIndex
# for empty df/other
question: Are these pandas bugs that we should mark somehow?
# TODO: This there a better way to do this?
columns_from_data = as_index(columns_from_data)
reindexed = self.reindex(
    columns=columns_from_data.to_pandas(), copy=False
)
question: What are you trying to do conceptually that you would like a better way for?
@@ -665,38 +665,47 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin):
    def __init__(
        self, data=None, index=None, columns=None, dtype=None, nan_as_null=True
    ):
        super().__init__()
        col_is_rangeindex = False
suggestion/discussion point: Even after this heroic refactoring to make things clearer, this __init__ method is still very long. I haven't yet reviewed everything in detail because I found it hard to tell preprocessing that delivers information to a later part of the function apart from preprocessing that produces the final result.
Hence, would it make sense to write the different cases as free functions (or @staticmethods) so that we have something that then looks like:
    if case_a:
        preprocessed_args = handle_case_a(...)
    elif case_b:
        preprocessed_args = handle_case_b(...)
    # or whatever
    super().__init__(preprocessed_args)
WDYT?
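A slightly fuller, runnable version of that shape, as a sketch only: `Prepared`, `_from_none`, `_from_mapping`, `Base`, and `Frame` are all hypothetical names for illustration and do not correspond to cuDF classes or helpers. Each case is handled by a free function that returns the same small result type, so `__init__` itself only dispatches.

```python
from collections.abc import Mapping
from typing import Any, Dict, List, NamedTuple, Optional, Sequence


class Prepared(NamedTuple):
    """Everything the base class needs, computed before super().__init__."""

    columns: Dict[Any, List[Any]]
    index: List[Any]


def _from_none(index: Optional[Sequence[Any]]) -> Prepared:
    return Prepared({}, list(index) if index is not None else [])


def _from_mapping(data: Mapping, index: Optional[Sequence[Any]]) -> Prepared:
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    columns = {name: list(values) for name, values in data.items()}
    return Prepared(columns, list(index) if index is not None else list(range(n_rows)))


class Base:
    def __init__(self, prepared: Prepared) -> None:
        self._data = prepared.columns
        self._index = prepared.index


class Frame(Base):
    def __init__(self, data=None, index=None) -> None:
        # Dispatch only; each helper owns the details of its own case.
        if data is None:
            prepared = _from_none(index)
        elif isinstance(data, Mapping):
            prepared = _from_mapping(data, index)
        else:
            raise TypeError(f"Unsupported data type: {type(data).__name__}")
        super().__init__(prepared)
```

One upside of this layout is that each helper can be unit tested without constructing a frame at all.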
Sorry for the notification noise. I'll reopen this PR to reset
Description
I noticed that DataFrame.__init__ essentially has the following pattern
I find this pattern fairly brittle, and it leads to diverging paths for validation and coercion logic. This refactor essentially creates a ColumnAccessor from the inputs first, passes that to super, and then does post-processing that all the branches can share.
This refactor does not touch the case where data is a list.
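As a rough illustration of the "prepare first, call super once, then share the post-processing" flow, here is a toy sketch. `Base` and `Table` are hypothetical classes, a plain dict stands in for the ColumnAccessor, and the `dtype` cast stands in for whatever shared post-processing applies; none of this is the actual cuDF code.

```python
from collections.abc import Mapping


class Base:
    def __init__(self, columns: dict) -> None:
        self._columns = columns


class Table(Base):
    def __init__(self, data=None, dtype=None) -> None:
        # Step 1: every branch only builds the column container.
        if data is None:
            columns = {}
        elif isinstance(data, Table):
            columns = {k: list(v) for k, v in data._columns.items()}
        elif isinstance(data, Mapping):
            columns = {k: list(v) for k, v in data.items()}
        else:
            raise TypeError(f"Unsupported data type: {type(data).__name__}")

        # Step 2: a single call to super with the prepared container.
        super().__init__(columns)

        # Step 3: post-processing shared by all branches (here, a cast).
        if dtype is not None:
            self._columns = {
                k: [dtype(x) for x in v] for k, v in self._columns.items()
            }
```

Because the cast lives after the branches, a new input type added in step 1 picks it up without any extra code.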
Fixes the following bugs:
- DataFrame(dict) with tuple keys fills with NA instead of an empty string, like pandas
- DataFrame(DataFrame(...), index=, columns=) reindexes like pandas
- DataFrame(dict) with only scalar values raises like pandas
Checklist