Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate DataFrame.__init__ logic to prepare data before calling super #14614

Closed
wants to merge 40 commits into from

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Dec 12, 2023

Description

I noticed that DataFrame.__init__ essentially has the following pattern

super().__init__()

if condition:
    self._data = this
elif condition
    self._data = that

self._data.attribute = other

I find this pattern fairly brittle and leads to diverging paths for validation and coercion logic. This refactor essentials creates a ColumnAccessor from the inputs first and then passes that to super, then does post processing that all the branches can share.

This refactor does not touch when data is a list

Fixes the following bugs:

  • Ensure DataFrame(dict) with tuple keys fill with NA instead of empty string like pandas
  • Ensure DataFrame(DataFrame(...), index=, column=) reindexes like pandas
  • Ensure DataFrame(dict) with only scalar values raises like pandas

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python labels Dec 12, 2023
@mroeschke mroeschke requested a review from a team as a code owner December 12, 2023 03:04
if columns is not None:
as_idx_typ = None
if isinstance(columns, list) and len(columns) == 0:
# TODO: Generically, an empty dtype-less container
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can't have the concept of a dtype-less column, so does that idea make sense?

as_idx_typ = None
if isinstance(columns, list) and len(columns) == 0:
# TODO: Generically, an empty dtype-less container
# TODO: Why does as_index([]) return FloatIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because cudf.core.column.as_column([]) returns a float column.

Comment on lines +681 to +682
# mixed typed elements are allowed e.g. [(1, 2), "a"]
columns = list(columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, I think, as you noted elsewhere as soon as we have mixed type column names, we can't do many operations (like for instance transposing the frame). Should we instead disallow this?

Comment on lines +684 to +687
if not isinstance(
columns, MultiIndex
) and columns.nunique() != len(columns):
raise ValueError("Columns cannot contain duplicate values")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question why is it safe for columns to be non-unique if the columns are a multiindex?

columns = columns.to_pandas()
col_is_rangeindex = isinstance(columns, pd.RangeIndex)
col_is_multiindex = isinstance(columns, pd.MultiIndex)
if not isinstance(columns, pd.MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not isinstance(columns, pd.MultiIndex):
if not col_is_multiindex:

Comment on lines +1917 to +1920
result._data.rangeindex = col_was_rangeindex
result._data.multiindex = col_was_multiindex
result._data.label_dtype = col_label_dtype
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion Is frame._slice the only place where we need to care about carrying over this information? It seems like it might be necessary generally. Hence, should we move this to IndexedFrame._gather?

Comment on lines +1936 to +1938
result._data.rangeindex = col_was_rangeindex
result._data.multiindex = col_was_multiindex
result._data.label_dtype = col_label_dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, should _from_columns_like_self handle this transfer of information?

Comment on lines 7639 to 7640
# pandas returns Index[object] while this should be an empty RangeIndex
# for empty df/other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question are these pandas bugs that we should mark somehow?

Comment on lines +871 to +875
# TODO: This there a better way to do this?
columns_from_data = as_index(columns_from_data)
reindexed = self.reindex(
columns=columns_from_data.to_pandas(), copy=False
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question What are you trying to do conceptually that you would like a better way for?

@@ -665,38 +665,47 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin):
def __init__(
self, data=None, index=None, columns=None, dtype=None, nan_as_null=True
):
super().__init__()
col_is_rangeindex = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion/discussion point Even after this heroic refactoring to make things clearer, this __init__ method is still very long. I haven't yet reviewed everything in detail because I found it quite hard to follow when things are preprocessing to deliver information to a later part of the function compared with preprocessing to produce the final result.

Hence, would it make sense to write the different cases as free functions (or @staticmethods) so that we have something that then looks like:

if case_a:
   preprocessed_args = handle_case_a(...)
elif case_b:
   preprocessed_args = handle_case_b(...)

# or whatever
super().__init__(preprocessed_args)

WDYT?

@mroeschke mroeschke requested review from a team as code owners January 31, 2024 21:46
@mroeschke mroeschke requested review from shrshi and divyegala and removed request for a team January 31, 2024 21:46
Copy link

copy-pr-bot bot commented Jan 31, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue conda Java Affects Java cuDF API. labels Jan 31, 2024
@mroeschke mroeschke changed the base branch from branch-24.02 to branch-24.04 January 31, 2024 21:47
@mroeschke
Copy link
Contributor Author

Sorry for the notification noise. I'll reopen this PR to reset

@mroeschke mroeschke closed this Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants