Consolidate 1D pandas object handling in as_column #14394
Conversation
if cudf.get_option("mode.pandas_compatible"):
    raise NotImplementedError("not supported")
question: What happens to this type of data if we're not in pandas-compat mode? And why is it not supported if we are?
What happens to this type of data if we're not in pandas-compat mode?
We currently convert this correctly to a corresponding cudf type
In [1]: import cudf; import pandas as pd
In [2]: cudf.from_pandas(pd.Series([1], dtype=pd.Int64Dtype()))
Out[2]:
0 1
dtype: int64
And why is it not supported if we are?
We disallowed it for cudf.pandas because we currently cannot round-trip back to pandas correctly: cudf doesn't keep track of whether, e.g., a numpy.int64 or a pandas.Int64Dtype was passed in.
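A minimal pandas-only sketch of why the round trip is ambiguous: both containers hold 64-bit integers, but only the extension dtype is nullable, and a conversion that keeps only the values loses that distinction.

```python
import numpy as np
import pandas as pd

# Same values, two different 64-bit integer dtypes:
s_np = pd.Series([1], dtype=np.int64)        # plain numpy dtype
s_ext = pd.Series([1], dtype=pd.Int64Dtype())  # nullable extension dtype

print(s_np.dtype)   # int64
print(s_ext.dtype)  # Int64
```

If a library stores both as the same device int64 column, converting back to pandas cannot know which of the two original dtypes to restore.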
Ah, makes sense, thanks.
# float16
arbitrary = arbitrary.astype(np.dtype(np.float32))
question: Because cudf doesn't support float16 natively? Does this potentially cause problems in cudf-pandas mode, since the dtype will not be preserved?
Yeah, I believe so. I was migrating this code from below:
arb_dtype = (
cudf.dtype("float32")
if arbitrary.dtype == "float16"
else cudf.dtype(arbitrary.dtype)
)
pandas also does not support float16 (which was made more intentional by raising in pandas 2.0; before that it would sometimes coerce to float32 as well), so I don't think the dtype preservation here is too much of an issue.
suggestion:
Given that the pandas-2 behaviour raises, let us take this opportunity to add a deprecation warning here so that we can also raise once we're supporting pandas-2. (See https://docs.rapids.ai/api/cudf/nightly/developer_guide/contributing_guide/#deprecating-and-removing-code)
Ah so my generalization about float16 in pandas 2.0 isn't entirely correct.
Only pandas Index objects will disallow float16 (IIRC there's no hashtable implementation for this type), but Series and DataFrame objects will continue to allow float16.
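For illustration, that asymmetry can be seen directly; the Index behaviour is version-dependent (pandas 2.x raises, older pandas may coerce instead), so the second part is guarded.

```python
import numpy as np
import pandas as pd

# Series continues to carry float16 data in pandas 2.x...
s = pd.Series([1.0, 2.0], dtype=np.float16)
print(s.dtype)  # float16

# ...while constructing a float16 Index raises under pandas 2.x
# (there is no hashtable implementation for this type).
try:
    pd.Index([1.0, 2.0], dtype=np.float16)
except NotImplementedError as exc:
    print("Index rejected float16:", exc)
```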
Ah, but you can't do many things with a float16 series (e.g. merge doesn't work). I think I would rather raise here (as we do at the moment for non-pandas data):
import cudf
cudf.Series([1, 2, 3], dtype="float16")
# TypeError: Unsupported type float16
Rather than silently upcasting. WDYT? I realise this would be a breaking change (since we currently do upcast in from_pandas).
Yeah, agreed, raising here is better than silent upcasting. I'll make this raise and mark it as a breaking change.
raise TypeError(
    f"Cannot convert a object type of {inferred_dtype}"
)
# TODO: nan_as_na interaction here with from_pandas
Can we open an issue documenting the various TODOs in data ingest so that we have an overview of what is still in-progress? It's unclear to me how much of this needs new implementation in cudf and how much needs thinking about the boundaries.
Yes definitely. Once I get all the existing tests passing I'll make an issue of the edge cases to handle
Thanks!
Should be ready for another review. After another pass I don't think there are any outstanding TODOs.
One minor question and a suggestion to add a deprecation warning, but approving now because this looks in great shape.
if nan_as_null is None or nan_as_null is True:
    data = build_column(buffer, dtype=arbitrary.dtype)
    data = _make_copy_replacing_NaT_with_null(data)
    mask = data.mask
else:
    bool_mask = as_column(~np.isnat(arbitrary))
    mask = as_buffer(bools_to_mask(bool_mask))
This seems like it turns NaT into nulls; is that the intention?
Yeah, I think we still want to consider NaT values when creating the mask (as null values), but not necessarily cast the value to NA, as tested in test_series_np_array_nat_nan_as_null_false. cc @galipremsagar
Yeah, at one point in time we supported both NA and NaT for datetime & timedelta columns. Now we just treat NA as NaT and vice versa for datetime & timedelta columns. With the recent pandas accelerator mode work, we decided to just repr out NA as NaT. So yes, we need to mark the mask when we have NaT's anywhere.
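A small NumPy-only sketch of the mask-building step: NaT entries translate into False (i.e. null) positions in the validity mask, mirroring the `~np.isnat(...)` branch in the snippet above.

```python
import numpy as np

# Datetime data containing a NaT; the validity mask marks it as null.
arr = np.array(["2020-01-01", "NaT", "2020-01-03"], dtype="datetime64[ns]")
valid = ~np.isnat(arr)
print(valid)  # [ True False  True]
```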
if nan_as_null is None or nan_as_null is True:
    data = build_column(buffer, dtype=arbitrary.dtype)
    data = _make_copy_replacing_NaT_with_null(data)
    mask = data.mask
else:
    bool_mask = as_column(~np.isnat(arbitrary))
    mask = as_buffer(bools_to_mask(bool_mask))
Same question here.
Co-authored-by: GALI PREM SAGAR <[email protected]>
/merge
Description

Currently as_column has a few different branches handling pandas objects. This PR consolidates Series, Index and ExtensionArray handling into one if branch so that handling is consistent between the three.

This also disallows float16 types passed in as dtype constructor arguments or typed data arguments.

Checklist