Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate 1D pandas object handling in as_column #14394

Merged

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Nov 10, 2023

Description

Currently as_column has a few different branches handling pandas objects. This PR consolidates Series, Index and ExtensionArray handling into 1 if branch such handling is consistent between the 3.

This also disallows float16 types passed into dtype constructor arguments or typed data arguments

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python labels Nov 10, 2023
@mroeschke mroeschke requested a review from a team as a code owner November 10, 2023 02:30
Comment on lines +2066 to +2067
if cudf.get_option("mode.pandas_compatible"):
raise NotImplementedError("not supported")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: What happens to this type of data if we're not in pandas-compat mode? And why is it not supported if we are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to this type of data if we're not in pandas-compat mode?

We currently convert this correctly to a corresponding cudf type

In [1]: import cudf; import pandas as pd

In [2]: cudf.from_pandas(pd.Series([1], dtype=pd.Int64Dtype()))
Out[2]: 
0    1
dtype: int64

And why is it not supported if we are?

We disallowed it for cudf.pandas since currently we cannot roundtrip back to pandas correctly since cudf doesn't keep track whether a e.g. numpy.int64 or pandas.Int64Dtype was passed in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes sense, thanks.

python/cudf/cudf/core/column/column.py Show resolved Hide resolved
Comment on lines 2097 to 2098
# float16
arbitrary = arbitrary.astype(np.dtype(np.float32))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Because cudf doesn't support float16 natively? Does this potentially cause problems in cudf-pandas mode since the dtype will not be preserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I believe so I was migrating this code below here

               arb_dtype = (
                    cudf.dtype("float32")
                    if arbitrary.dtype == "float16"
                    else cudf.dtype(arbitrary.dtype)
                )

pandas also does not support float16 (which was made more intentional by raising in pandas 2.0, before it would sometimes coerce to float32 also), so I don't think the dtype preservation here isn't too much of an issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Given that the pandas-2 behaviour raises, let us take this opportunity to add a deprecation warning here so that we can also raise once we're supporting pandas-2. (See https://docs.rapids.ai/api/cudf/nightly/developer_guide/contributing_guide/#deprecating-and-removing-code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so my generalization about float16 in pandas 2.0 isn't entirely correct.

Only pandas Index objects will disallow float16 (IIRC there's no hashtable implementation for this type), but Series and DataFrame objects will continue to allow float16.

Copy link
Contributor

@wence- wence- Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but you can't do many things with float16 series (e.g. merge doesn't work).

I think I would rather raise here (as we do at the moment for non-pandas data):

import cudf
cudf.Series([1, 2, 3], dtype="float16")
# TypeError: Unsupported type float16

Rather than silently upcasting. WDYT? I realise this would be a breaking change (since we currently do upcast in from_pandas).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed raising here is better than silent upcasting, I'll make this raise and mark as a breaking change

python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
raise TypeError(
f"Cannot convert a object type of {inferred_dtype}"
)
# TODO: nan_as_na interaction here with from_pandas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open an issue documenting the various TODOs in data ingest so that we have an overview of what is still in-progress? It's unclear to me how much of this needs new implementation in cudf and how much needs thinking about the boundaries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely. Once I get all the existing tests passing I'll make an issue of the edge cases to handle

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mroeschke mroeschke changed the base branch from branch-23.12 to branch-24.02 November 16, 2023 03:09
@mroeschke
Copy link
Contributor Author

Should be ready for another review. After another pass I don't think there's any outstanding todo's

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor question and a suggestion to add a deprecation warning, but approving now because this looks in great shape.

Comment on lines 2097 to 2098
# float16
arbitrary = arbitrary.astype(np.dtype(np.float32))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Given that the pandas-2 behaviour raises, let us take this opportunity to add a deprecation warning here so that we can also raise once we're supporting pandas-2. (See https://docs.rapids.ai/api/cudf/nightly/developer_guide/contributing_guide/#deprecating-and-removing-code)

if nan_as_null is None or nan_as_null is True:
data = build_column(buffer, dtype=arbitrary.dtype)
data = _make_copy_replacing_NaT_with_null(data)
mask = data.mask
else:
bool_mask = as_column(~np.isnat(arbitrary))
mask = as_buffer(bools_to_mask(bool_mask))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it turns NaT into nulls, is that the intention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we still want to consider NaT values when creating the mask (as null values), but not necessarily cast the value to NA as tested in test_series_np_array_nat_nan_as_null_false. cc @galipremsagar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, at one point of time we supported both NA and NaT for datetime & timedelta columns. Now we just treat NA as NaT and vice-versa for datetime & timedelta column. With recent pandas accelerator mode work, we decided to just repr out NA as NAT. So yes we need to mark the mask when we have NAT's anywhere.

if nan_as_null is None or nan_as_null is True:
data = build_column(buffer, dtype=arbitrary.dtype)
data = _make_copy_replacing_NaT_with_null(data)
mask = data.mask
else:
bool_mask = as_column(~np.isnat(arbitrary))
mask = as_buffer(bools_to_mask(bool_mask))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.

@mroeschke mroeschke changed the title REF: Consolidate 1D pandas object handling in as_column Consolidate 1D pandas object handling in as_column Nov 28, 2023
@mroeschke mroeschke added breaking Breaking change and removed non-breaking Non-breaking change labels Nov 28, 2023
@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e15290a into rapidsai:branch-24.02 Nov 29, 2023
67 checks passed
@mroeschke mroeschke deleted the ref/as_column/pandas_handling branch November 29, 2023 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants