Add arrow schema extraction dispatch #9169
Can you give a minimal reproducer of the dask-cudf failure itself? It might help to add a gpuCI test in this PR to ensure that a breaking change like this isn't made in the future.
Here is the minimal repro:

    import cudf
    import dask_cudf

    df = cudf.DataFrame(
        {
            "a": ["abc", "def"],
            "b": ["a", "z"],
        }
    )
    ddf = dask_cudf.from_cudf(df, 3)
    ddf.to_parquet("sample.parquet")

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
python/dask_cudf/dask_cudf/core.py:278: in to_parquet
return to_parquet(self, path, *args, **kwargs)
../envs/cudfdev/lib/python3.9/site-packages/dask-2022.3.0+153.g256622943.dirty-py3.9.egg/dask/dataframe/io/parquet/core.py:844: in to_parquet
i_offset, fmd, metadata_file_exists, extra_write_kwargs = engine.initialize_write(
../envs/cudfdev/lib/python3.9/site-packages/dask-2022.3.0+153.g256622943.dirty-py3.9.egg/dask/dataframe/io/parquet/arrow.py:516: in initialize_write
inferred_schema = pa.Schema.from_pandas(
pyarrow/types.pxi:1492: in pyarrow.lib.Schema.from_pandas
???
../envs/cudfdev/lib/python3.9/site-packages/pyarrow/pandas_compat.py:525: in dataframe_to_types
values = c.values
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
return func(*args, **kwds)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/single_column_frame.py:118: in values
return self._column.values
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <cudf.core.column.string.StringColumn object at 0x7f6373e9f1c0>
[
"cat",
"dog"
]
dtype: object
@property
def values(self) -> cupy.ndarray:
"""
Return a CuPy representation of the StringColumn.
"""
> raise TypeError("String Arrays is not yet implemented in cudf")
E TypeError: String Arrays is not yet implemented in cudf
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/column/string.py:5321: TypeError

I've added a GPU test for the same in this PR.

This PR relaxes `dask` & `distributed` pinnings for `22.08` development.

Requires: dask/dask#9169

This PR also includes `pyarrow_schema_dispatch` implementation for `dask-cudf`

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/jakirkham

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- https://github.com/jakirkham

URL: #11058

After #9131 was merged, the default value of the `schema` parameter changed from `None` to `"infer"`. This causes a failure when we try to use the parquet writer in `dask_cudf`, because schema inference goes through pandas-specific code. This PR adds a `pyarrow_schema_dispatch` that handles pyarrow schema extraction for each backend.

`pre-commit run --all-files`
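
The idea behind a backend dispatch can be sketched roughly as follows. This is not the code from this PR: it uses `functools.singledispatch` from the standard library rather than dask's own dispatch machinery, and the classes `PandasLikeFrame` and `GpuLikeFrame`, as well as the functions `extract_schema` and `initialize_write`, are hypothetical stand-ins chosen for illustration. The point is only that the writer calls one entry point, and each backend registers its own way of producing a schema, so a GPU-backed frame never hits code that reads host `.values`.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for a pandas-style frame backed by host values."""

    def __init__(self, columns):
        self.columns = columns  # mapping: column name -> list of Python values


class GpuLikeFrame:
    """Hypothetical stand-in for a GPU frame that only exposes dtype metadata."""

    def __init__(self, dtypes):
        self.dtypes = dtypes  # mapping: column name -> dtype name


@singledispatch
def extract_schema(frame):
    """Return a {column name: type name} mapping; fallback for unknown backends."""
    raise TypeError(f"No schema extraction registered for {type(frame).__name__}")


@extract_schema.register(PandasLikeFrame)
def _extract_schema_pandas_like(frame):
    # Host-backed path: inspect the first value of each column.
    return {name: type(values[0]).__name__ for name, values in frame.columns.items()}


@extract_schema.register(GpuLikeFrame)
def _extract_schema_gpu_like(frame):
    # GPU-backed path: rely on dtype metadata only, never touch column values.
    return dict(frame.dtypes)


def initialize_write(frame):
    """Hypothetical writer entry point: it does not need to know the backend."""
    return extract_schema(frame)
```

With this shape, supporting a new backend means registering one more function against its frame type; the writer itself stays unchanged.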