
Clean up special casing in as_column for non-typed input #14636

Conversation

mroeschke (Contributor)

Description

Cleans up a lot of special casing for array-like inputs in as_column that are not typed.

The logic now is essentially

if dtype is not None:
    if a pandas type or object type:
        # parse with pandas
    else:
        # parse with arrow
else:
    try:
        # parse with arrow
    except:
        # parse with pandas
# create column
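The dispatch above can be made concrete with a toy, self-contained sketch. The helper names (`parse_with_pandas`, `parse_with_arrow`) are hypothetical placeholders, not the actual cuDF internals, and the "pandas type or object type" check is simplified to `object` only; the arrow stand-in fails on strings to mimic a type-inference error.

```python
def build_column(arbitrary, dtype=None):
    """Toy version of the dtype-driven dispatch described above."""

    def parse_with_pandas(obj):
        return ("pandas", list(obj))

    def parse_with_arrow(obj):
        # Stand-in for pa.array(obj): fail on strings to mimic a
        # type-inference error in arrow.
        if any(isinstance(v, str) for v in obj):
            raise ValueError("arrow could not infer a type")
        return ("arrow", list(obj))

    if dtype is not None:
        if dtype == object:  # simplified "pandas type or object type" check
            return parse_with_pandas(arbitrary)
        return parse_with_arrow(arbitrary)
    try:
        return parse_with_arrow(arbitrary)
    except ValueError:
        return parse_with_pandas(arbitrary)
```

For example, `build_column([1, 2])` goes through arrow, `build_column(["a", "b"])` falls back to pandas, and an explicit `dtype=object` forces the pandas path without any try/except.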

This allows the pandas/arrow code paths to share logic, for consistent handling of different inputs, and roughly aligns with the comment in #14627 (comment). cc @wence-

This also fixes a bug where as_column(list-like, dtype=cudf.CategoricalDtype|IntervalDtype(...)) did not preserve the dtype's specific metadata.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Dec 14, 2023
@mroeschke mroeschke requested a review from a team as a code owner December 14, 2023 22:47
wence- (Contributor) left a comment


I think this is a good direction!

Comment on lines 1914 to 1919:

def can_memoryview(arbitrary: Any) -> bool:
    try:
        memoryview(arbitrary)
        return True
    except TypeError:
        return False
Contributor


suggestion

Suggested change

def can_memoryview(arbitrary: Any) -> bool:
    try:
        memoryview(arbitrary)
        return True
    except TypeError:
        return False

def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None
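For illustration, the suggested helper made self-contained (only stdlib typing imports added), with a short usage check: objects supporting the buffer protocol yield a view, plain Python containers yield None.

```python
from typing import Any, Optional

def as_memoryview(arbitrary: Any) -> Optional[memoryview]:
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None

# bytes supports the buffer protocol; a list does not.
assert isinstance(as_memoryview(b"abc"), memoryview)
assert as_memoryview([1, 2, 3]) is None
```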

Comment on lines 2336 to 2339:

elif can_memoryview(arbitrary):
    data = as_column(
        memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
    )
Contributor


Suggested change

elif can_memoryview(arbitrary):
    data = as_column(
        memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null
    )

elif (view := as_memoryview(arbitrary)) is not None:
    data = as_column(view, dtype=dtype, nan_as_null=nan_as_null)

Then one could merge this handler into the isinstance(arbitrary, memoryview) case as well with:

if isinstance((view := as_memoryview(arbitrary)), memoryview):
    data = as_column(np.asarray(view), dtype=dtype, nan_as_null=nan_as_null)

Which brings me to a further suggestion for this function. One of the issues is that it's very difficult to read the control flow. AIUI, we basically have lots of different branches that check whether we can produce a Column, assign the result to data, and eventually return data. But it's unclear (for example here) when we do data = as_column(memoryview(arb), ...) whether there is any further processing to be done, or if we are now ready to return the final answer.

I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.
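The early-return style being suggested might look like the following sketch; the dispatch targets and return values are placeholders, not cuDF's actual branches.

```python
def as_memoryview(arbitrary):
    try:
        return memoryview(arbitrary)
    except TypeError:
        return None

def dispatch(arbitrary):
    # Each branch returns immediately instead of assigning to `data`
    # and falling through, so the reader knows the branch is terminal.
    if isinstance(arbitrary, memoryview):
        return ("from-memoryview", arbitrary.tobytes())
    if (view := as_memoryview(arbitrary)) is not None:
        return ("coerced-to-memoryview", view.tobytes())
    return ("fallback", arbitrary)

assert dispatch(memoryview(b"ab"))[0] == "from-memoryview"
assert dispatch(b"ab")[0] == "coerced-to-memoryview"
assert dispatch([1, 2])[0] == "fallback"
```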

Contributor Author


I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.

Good point. I'll try to have every branch do return ... for clarity. One aspect I'm not sure about is whether, when dtype= is specified, an "astype" is applied in all branches when the input is already typed.

)
np_type = np_dtype.type
pa_type = np_to_pa_dtype(np_dtype)
from_pandas = True if nan_as_null is None else nan_as_null
Contributor


Suggested change

from_pandas = True if nan_as_null is None else nan_as_null

from_pandas = nan_as_null is None or nan_as_null
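The two expressions are equivalent for the three possible values of nan_as_null, which a quick exhaustive check confirms:

```python
# nan_as_null is either None, True, or False.
for nan_as_null in (None, True, False):
    old = True if nan_as_null is None else nan_as_null
    new = nan_as_null is None or nan_as_null
    assert old == new
```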

):
arbitrary = arbitrary.cast(typ)
except (pa.ArrowInvalid, pa.ArrowTypeError):
if any(is_column_like(val) for val in arbitrary):
Contributor


question: are we guaranteed that arbitrary is iterable at this point?

Contributor Author


I believe so, since an initial pa.array call proxy-checks that. But this would fail here for generators, for example, so I'll make sure this branch casts the input to a list or similar.
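The generator concern comes down to the fact that a generator can only be consumed once: if pa.array consumes it and then raises, a second pass over the same object in the except branch would see nothing. A stdlib-only illustration:

```python
gen = (x for x in [1, 2, 3])
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is now exhausted
assert first_pass == [1, 2, 3]
assert second_pass == []

# Materializing to a list up front avoids the problem:
data = list(x for x in [1, 2, 3])
assert list(data) == [1, 2, 3]
assert list(data) == [1, 2, 3]  # a list can be iterated repeatedly
```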

data = as_column(arbitrary, nan_as_null=nan_as_null)
else:
try:
arbitrary = pa.array(arbitrary, from_pandas=from_pandas)
Contributor


Do we want to go via arrow first and only then try and fall back to pandas? Or should we just go via pandas? I note this PR does not fix #14627 because I think we still go down the same (different) arrow vs pandas codepath.

wence- (Contributor) commented Dec 15, 2023

The logic now is essentially

if dtype is not None:
    if a pandas type or object type:
        # parse with pandas
    else:
        # parse with arrow
else:
    try:
        # parse with arrow
    except:
        # parse with pandas
# create column

question: I still wonder if this is more complicated than it needs to be. If we're in these branches, we have an "arbitrary" object that wasn't a Series/array/arrow-array.

Is there a disadvantage to (in all cases) doing data = parse_with_pandas(arbitrary)? It seems like the logic as shown above could have different behaviour when passing a pandas scalar object in (one that arrow recognises), depending on whether one additionally specifies dtype.

Concretely:

as_column(pd_scalar_obj) # parsed with arrow
as_column(pd_scalar_obj, dtype=pd_scalar_obj.dtype) # parsed with pandas
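The asymmetry described here can be made concrete with a toy version of the description's dispatch; the parser names and the boolean flag are placeholders for illustration only.

```python
def choose_parser(dtype=None, dtype_is_pandas_or_object=False):
    """Toy model of the dtype-driven dispatch from the PR description."""
    if dtype is not None:
        return "pandas" if dtype_is_pandas_or_object else "arrow"
    return "arrow (with pandas fallback)"

# The same input would hit different parsers depending only on whether
# its dtype is passed explicitly:
assert choose_parser() == "arrow (with pandas fallback)"
assert choose_parser(dtype="period[D]", dtype_is_pandas_or_object=True) == "pandas"
```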

Are there objects that arrow can convert that pandas cannot?

mroeschke (Contributor, Author)

Are there objects that arrow can convert that pandas cannot?

Parsing via Arrow first is beneficial for

  1. Returning List or Struct typed arrays. (Though it looks like we need a carve-out for, e.g., a list of Series, which we support as a List type.)
  2. nan_as_null keyword flexibility, and not having an NA-like value always cast the result to float/object:
In [1]: import pandas as pd

In [2]: pd.Series([1, None])
Out[2]: 
0    1.0
1    NaN
dtype: float64

In [3]: import pyarrow as pa

In [4]: pa.array([1, None])
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7face4728580>
[
  1,
  null
]

We could at this point do data introspection for these two cases and parse with arrow if we see either case otherwise always parse with pandas.
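A hedged sketch of what that data introspection could look like, covering the two cases above. The nesting check and is_na helper are simplified stand-ins, not cuDF's actual type-inference logic.

```python
import math

def should_parse_with_arrow(values):
    # Case 1: nested inputs that should become List/Struct columns.
    if any(isinstance(v, (list, dict)) for v in values):
        return True

    # Case 2: NA-like entries, which arrow keeps typed (e.g. int64 with
    # nulls) while pandas would coerce the result to float64/object.
    def is_na(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return any(is_na(v) for v in values)

assert should_parse_with_arrow([[1, 2], [3]])   # nested -> arrow
assert should_parse_with_arrow([1, None])       # NA-like -> arrow
assert not should_parse_with_arrow([1, 2, 3])   # plain -> pandas
```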

rapids-bot bot pushed a commit that referenced this pull request Dec 18, 2023
@mroeschke mroeschke marked this pull request as draft January 11, 2024 20:04
@mroeschke mroeschke changed the base branch from branch-24.02 to branch-24.04 January 31, 2024 22:05
rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2024
…14961)

Broken off of #14636, these cases are strict about a `dtype` being set, so there is no need for them to be in a try/except.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #14961
mroeschke (Contributor, Author)

Going to close in favor of #15276

@mroeschke mroeschke closed this Mar 12, 2024
@mroeschke mroeschke deleted the ref/as_column/misc_list_like_handling branch March 12, 2024 00:55
rapids-bot bot pushed a commit that referenced this pull request Apr 17, 2024
Redo at #14636

Clean up special casing for non-typed inputs to essentially do:

```
try:
     arbitrary = pa.array(arbitrary)
except:
     arbitrary = pd.Series(arbitrary)
return as_column(arbitrary)
```

Additionally, this change matches pandas when parsing string data with a `dtype=datetime64` type, similar to the pandas 2.2 behavior (fail if the resolution of the type doesn't match the string data).

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15276