-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up special casing in as_column for non-typed input #14636
Clean up special casing in as_column for non-typed input #14636
Conversation
…mn/misc_list_like_handling
…mn/misc_list_like_handling
…mn/misc_list_like_handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good direction!
def can_memoryview(arbitrary: Any) -> bool: | ||
try: | ||
memoryview(arbitrary) | ||
return True | ||
except TypeError: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion
def can_memoryview(arbitrary: Any) -> bool: | |
try: | |
memoryview(arbitrary) | |
return True | |
except TypeError: | |
return False | |
def as_memoryview(arbitrary: Any) -> Optional[memoryview]: | |
try: | |
return memoryview(arbitrary) | |
except TypeError: | |
return None |
elif can_memoryview(arbitrary): | ||
data = as_column( | ||
memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif can_memoryview(arbitrary): | |
data = as_column( | |
memoryview(arbitrary), dtype=dtype, nan_as_null=nan_as_null | |
) | |
elif (view := as_memoryview(arbitrary)) is not None: | |
data = as_column(view, dtype=dtype, nan_as_null=nan_as_null) |
Then one could merge this handler into the isinstance(arbitrary, memoryview)
case as well with:
if isinstance((view := as_memoryview(arbitrary)), memoryview):
data = as_column(np.asarray(view), dtype=dtype, nan_as_null=nan_as_null)
Which brings me to a further suggestion for this function. One of the issues is that it's very difficult to read the control flow. AIUI, we basically have lots of different branches to check if we can produce a Column
, assign that to data
and then eventually return data
. But it's unclear (for example here) when we do data = as_column(memoryview(arb), ...)
whether the is any further processing to be done, or if we are now ready to return the final answer.
I think I would prefer if all of these branches return ...
rather than assigning to a variable and going to fallthrough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer if all of these branches return ... rather than assigning to a variable and going to fallthrough.
Good point. I'll try to have every branch do return ...
for clarity. One aspect I'm not sure if when dtype=
is specified that an "astype" is done if the input is typed for all branches
) | ||
np_type = np_dtype.type | ||
pa_type = np_to_pa_dtype(np_dtype) | ||
from_pandas = True if nan_as_null is None else nan_as_null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_pandas = True if nan_as_null is None else nan_as_null | |
from_pandas = nan_as_null is None or nan_as_null |
): | ||
arbitrary = arbitrary.cast(typ) | ||
except (pa.ArrowInvalid, pa.ArrowTypeError): | ||
if any(is_column_like(val) for val in arbitrary): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: are we guaranteed that arbitrary
is iterable at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so since an initial pa.array
call proxy-checks that, but this would fail here for generators for example so I'll make sure this branch casts the input to list or similar
data = as_column(arbitrary, nan_as_null=nan_as_null) | ||
else: | ||
try: | ||
arbitrary = pa.array(arbitrary, from_pandas=from_pandas) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to go via arrow first and only then try and fall back to pandas? Or should we just go via pandas? I note this PR does not fix #14627 because I think we still go down the same (different) arrow vs pandas codepath.
question I still wonder if this is more complicated than it needs to be. If we're in these branches we have an "arbitrary" object that wasn't a Series/array/arrow-array. Is there a disadvantage to (in all cases) doing: Concretely:
Are there objects that arrow can convert that pandas cannot? |
…mn/misc_list_like_handling
Parsing via Arrow first is beneficial for
In [1]: import pandas as pd
In [2]: pd.Series([1, None])
Out[2]:
0 1.0
1 NaN
dtype: float64
In [3]: import pyarrow as pa
In [4]: pa.array([1, None])
Out[4]:
<pyarrow.lib.Int64Array object at 0x7face4728580>
[
1,
null
] We could at this point do data introspection for these two cases and parse with arrow if we see either case otherwise always parse with pandas. |
Implemented @wence- 's suggestion in #14636 (comment) Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #14643
…mn/misc_list_like_handling
…mn/misc_list_like_handling
…mn/misc_list_like_handling
…mn/misc_list_like_handling
…14961) Broken off of #14636, these cases are strict about a `dtype` being set so no need to be in a try except Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #14961
Going to close in favor of #15276 |
Redo at #14636 Clean up special casing for non-typed inputs to essentially do: ``` try: arbitrary = pa.array(arbitrary) except: arbitrary = pd.Series(arbitrary) return as_column(arbitrary) ``` Additionally, this change matches a behavior with pandas that will parse string data with `dtype=datetime64` type similar to the 2.2 behavior (fail if the resolution of the type doesn't match the string data) Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #15276
Description
Cleans up a lot of special casing for array-like inputs in
as_column
that are not typed.The logic now is essentially
Allows sharing of logic in the pandas/arrow code paths for consistency for different inputs and kinda aligns with the comment in #14627 (comment) cc @wence-
This also fixes a bug where
as_column(list-like, dtype=cudf.CategoricalDtype|IntervalDtype(...))
did not keep its specific metadataChecklist