Consolidate DataFrame.__init__ logic to prepare data before calling super #14614

mroeschke · 2023-12-12T03:04:58Z

Description

I noticed that DataFrame.__init__ essentially has the following pattern:

super().__init__()

if condition:
    self._data = this
elif condition:
    self._data = that

self._data.attribute = other

I find this pattern fairly brittle, and it leads to diverging paths for validation and coercion logic. This refactor essentially creates a ColumnAccessor from the inputs first, passes that to super, and then does post-processing that all the branches can share.

This refactor does not touch the case where data is a list.
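
As a rough illustration of the ordering described above (not the actual cuDF code), the sketch below builds the column data from the inputs first, passes it to the parent constructor, and then runs one shared post-processing step. `BaseFrame`, `SketchFrame`, `_prepare_columns`, and the plain dict standing in for a ColumnAccessor are all hypothetical.

```python
class BaseFrame:
    """Hypothetical stand-in for the parent class; it only stores prepared data."""

    def __init__(self, data=None):
        self._data = {} if data is None else data


class SketchFrame(BaseFrame):
    """Hypothetical frame: prepare the columns, call super, then post-process."""

    def __init__(self, data=None, columns=None):
        # Step 1: every input branch produces the same kind of object
        # (a plain dict here, standing in for a ColumnAccessor).
        prepared = self._prepare_columns(data)
        # Step 2: hand the prepared data to the parent in a single place.
        super().__init__(prepared)
        # Step 3: post-processing shared by all branches. Labels that are
        # missing from the prepared data are filled with None.
        if columns is not None:
            self._data = {name: self._data.get(name) for name in columns}

    @staticmethod
    def _prepare_columns(data):
        if data is None:
            return {}
        if isinstance(data, dict):
            return {key: list(values) for key, values in data.items()}
        if isinstance(data, BaseFrame):
            return dict(data._data)
        raise TypeError(f"unsupported input type: {type(data).__name__}")
```

With this shape, validation and coercion live in one place per input type, and anything that applies to every input runs exactly once after `super().__init__`.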

Fixes the following bugs:

Ensure DataFrame(dict) with tuple keys fills with NA instead of an empty string, like pandas
Ensure DataFrame(DataFrame(...), index=, columns=) reindexes, like pandas
Ensure DataFrame(dict) with only scalar values raises, like pandas
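
For the last item, pandas raises a ValueError when every dict value is a scalar and no index is given. A minimal sketch of such a check follows. The helper names `is_scalar` and `check_not_all_scalars` are hypothetical, the notion of "scalar" is deliberately simplified, and this is not the cuDF implementation.

```python
def is_scalar(value):
    """Simplified scalar test for this sketch: anything that is not a list or tuple."""
    return not isinstance(value, (list, tuple))


def check_not_all_scalars(data, index=None):
    """Raise if a non-empty dict holds only scalars and no index was supplied."""
    if index is None and data and all(is_scalar(v) for v in data.values()):
        raise ValueError("If using all scalar values, you must pass an index")
```

An empty dict and a dict with at least one list-like value both pass the check, as does an all-scalar dict when an index is supplied.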

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


wence- · 2024-01-09T16:45:59Z

python/cudf/cudf/core/dataframe.py

+        if columns is not None:
+            as_idx_typ = None
+            if isinstance(columns, list) and len(columns) == 0:
+                # TODO: Generically, an empty dtype-less container


I think we can't have the concept of a dtype-less column, so does that idea make sense?

wence- · 2024-01-09T16:46:24Z

python/cudf/cudf/core/dataframe.py

+            as_idx_typ = None
+            if isinstance(columns, list) and len(columns) == 0:
+                # TODO: Generically, an empty dtype-less container
+                # TODO: Why does as_index([]) return FloatIndex


Because cudf.core.column.as_column([]) returns a float column.

wence- · 2024-01-09T16:48:44Z

python/cudf/cudf/core/dataframe.py

+                # mixed typed elements are allowed e.g. [(1, 2), "a"]
+                columns = list(columns)


question: I think, as you noted elsewhere, that as soon as we have mixed-type column names we can't do many operations (for instance, transposing the frame). Should we instead disallow this?

wence- · 2024-01-09T16:49:47Z

python/cudf/cudf/core/dataframe.py

+                if not isinstance(
+                    columns, MultiIndex
+                ) and columns.nunique() != len(columns):
+                    raise ValueError("Columns cannot contain duplicate values")


question: Why is it safe for columns to be non-unique if the columns are a MultiIndex?

wence- · 2024-01-09T16:50:23Z

python/cudf/cudf/core/dataframe.py

+                columns = columns.to_pandas()
+                col_is_rangeindex = isinstance(columns, pd.RangeIndex)
+                col_is_multiindex = isinstance(columns, pd.MultiIndex)
+                if not isinstance(columns, pd.MultiIndex):


Suggested change

-                if not isinstance(columns, pd.MultiIndex):
+                if not col_is_multiindex:

wence- · 2024-01-09T17:10:40Z

python/cudf/cudf/core/indexed_frame.py

+            result._data.rangeindex = col_was_rangeindex
+            result._data.multiindex = col_was_multiindex
+            result._data.label_dtype = col_label_dtype
+            return result


suggestion: Is frame._slice the only place where we need to care about carrying over this information? It seems like it might be necessary generally. Hence, should we move this to IndexedFrame._gather?

wence- · 2024-01-09T17:11:16Z

python/cudf/cudf/core/indexed_frame.py

+        result._data.rangeindex = col_was_rangeindex
+        result._data.multiindex = col_was_multiindex
+        result._data.label_dtype = col_label_dtype


Similarly here, should _from_columns_like_self handle this transfer of information?
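
To make the idea of a single transfer point concrete, here is a minimal sketch of a helper that copies the three attributes from the diff in one place. `SketchAccessor` and `copy_label_metadata` are hypothetical names for illustration only; the real ColumnAccessor has a different constructor and more state.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAccessor:
    """Hypothetical stand-in for a column container with label metadata."""

    columns: dict = field(default_factory=dict)
    rangeindex: bool = False
    multiindex: bool = False
    label_dtype: object = None


def copy_label_metadata(source, target):
    """Copy the column-label metadata from source onto target and return target."""
    target.rangeindex = source.rangeindex
    target.multiindex = source.multiindex
    target.label_dtype = source.label_dtype
    return target
```

A method that builds a new frame from an existing one could call such a helper once, instead of each caller repeating the three assignments.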

wence- · 2024-01-09T17:13:47Z

python/cudf/cudf/tests/test_dataframe.py

+        # pandas returns Index[object] while this should be an empty RangeIndex
+        # for empty df/other


question: Are these pandas bugs that we should mark somehow?

wence- · 2024-01-09T17:19:47Z

python/cudf/cudf/core/dataframe.py

+            # TODO: This there a better way to do this?
+            columns_from_data = as_index(columns_from_data)
+            reindexed = self.reindex(
+                columns=columns_from_data.to_pandas(), copy=False
+            )


question: What are you trying to do conceptually that you would like a better way for?

wence- · 2024-01-09T17:23:15Z

python/cudf/cudf/core/dataframe.py

@@ -665,38 +665,47 @@ class DataFrame(IndexedFrame, Serializable, GetAttrGetItemMixin):
    def __init__(
        self, data=None, index=None, columns=None, dtype=None, nan_as_null=True
    ):
-        super().__init__()
+        col_is_rangeindex = False


suggestion/discussion point: Even after this heroic refactoring to make things clearer, this __init__ method is still very long. I haven't yet reviewed everything in detail because I found it quite hard to tell which preprocessing delivers information to a later part of the function and which produces the final result.

Hence, would it make sense to write the different cases as free functions (or @staticmethods) so that we have something that then looks like:

if case_a:
    preprocessed_args = handle_case_a(...)
elif case_b:
    preprocessed_args = handle_case_b(...)
# or whatever
super().__init__(preprocessed_args)

WDYT?
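
A slightly fuller, runnable version of that shape might look like the sketch below. Everything in it is hypothetical (`handle_none`, `handle_dict`, `handle_rows`, `CaseBase`, `CaseFrame`), and the input cases are chosen only to keep the example small. They are not the cases in the actual PR.

```python
def handle_none():
    """No input: start from an empty mapping of columns."""
    return {}


def handle_dict(data):
    """Dict input: copy each value into a list so columns are independent."""
    return {name: list(values) for name, values in data.items()}


def handle_rows(data):
    """List-of-rows input: transpose rows into positionally named columns."""
    return {i: list(col) for i, col in enumerate(zip(*data))}


class CaseBase:
    """Hypothetical parent class that just stores the prepared columns."""

    def __init__(self, columns):
        self._data = columns


class CaseFrame(CaseBase):
    """Hypothetical frame whose __init__ only dispatches and then calls super."""

    def __init__(self, data=None):
        if data is None:
            prepared = handle_none()
        elif isinstance(data, dict):
            prepared = handle_dict(data)
        elif isinstance(data, list):
            prepared = handle_rows(data)
        else:
            raise TypeError(f"unsupported input type: {type(data).__name__}")
        super().__init__(prepared)
```

Each handler can then be read and tested on its own, and `__init__` shrinks to the dispatch plus the single `super().__init__` call.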


copy-pr-bot · 2024-01-31T21:46:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mroeschke · 2024-01-31T21:48:05Z

Sorry for the notification noise. I'll reopen this PR to reset.

mroeschke added 24 commits November 27, 2023 19:06

Start refactoring DataFrame init

d08fba7

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

40268bd

…me_init

Add dataframe reindexing tests, refactor logic

0969065

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

40f2764

…me_init

Fix more logic

2fa5f3a

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

dde5f97

…me_init

Adjust dict logic

89f9280

More bugs in dict and array logic

a4da710

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

d5c2bec

…me_init

Fix mode initialization, remove working xfail now

8a54791

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

210baf8

…me_init

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

05d001e

…me_init

Clean up tests, fix more bugs

36b85cc

Fix more tests, test reindex bug

553fe36

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

df8c261

…me_init

Fix dict like to avoid reindexing

5baac4e

Adjust test_series_data_with_name_with_columns_matching_align

9ce0a69

add comments

5fcce39

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

3f05824

…me_init

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

84ee164

…me_init

Fix some tests and a naming bug

df93b63

pass arguments through colaccessor

77ab160

Remove redundant check

4981b05

Adjust test and add another one with defined behavior

3fdeb87

mroeschke added the improvement, non-breaking, and python labels Dec 12, 2023

mroeschke requested a review from a team as a code owner December 12, 2023 03:04

mroeschke requested review from wence- and brandon-b-miller December 12, 2023 03:04

mroeschke added 11 commits December 13, 2023 13:51

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

9bcb768

…me_init

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

1a7085d

…me_init

Ensure columns are maintained in slicing

baeaa87

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

3de72e7

…me_init

Fix .columns usage, fix for pandas 2.0 in concat

645cc33

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

28947b0

…me_init

Address test failures

d1ce06b

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

2f3c50e

…me_init

Fix mode

c62aaa6

Merge remote-tracking branch 'upstream/branch-24.02' into ref/datafra…

bf9d22f

…me_init

Allow columns to not be an index

498fc75

wence- reviewed Jan 9, 2024


Merge remote-tracking branch 'upstream/branch-24.04' into ref/datafra…

1d06e9d

…me_init

mroeschke requested review from a team as code owners January 31, 2024 21:46

mroeschke requested review from shrshi and divyegala and removed request for a team January 31, 2024 21:46

github-actions bot added the libcudf, CMake, conda, and Java labels Jan 31, 2024

mroeschke changed the base branch from branch-24.02 to branch-24.04 January 31, 2024 21:47

mroeschke closed this Jan 31, 2024
