Add arrow schema extraction dispatch #9169

galipremsagar · 2022-06-07T05:18:39Z

After #9131 was merged, the default value of the `schema` parameter changed from `None` to `"infer"`. This causes a failure when the parquet writer is used from dask_cudf, because schema inference goes through pandas-specific code. This PR adds a `pyarrow_schema_dispatch` that handles pyarrow schema extraction for each backend.
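
As a rough illustration of the dispatch idea described above, the sketch below uses the standard-library `functools.singledispatch` rather than dask's own dispatch machinery. The classes `PandasLikeFrame` and `CudfLikeFrame` and the function `schema_dispatch` are hypothetical stand-ins, not the names or signatures added in this PR.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for a CPU-backed dataframe."""

    def __init__(self, dtypes):
        self.dtypes = dtypes  # mapping of column name -> dtype string


class CudfLikeFrame:
    """Hypothetical stand-in for a GPU-backed dataframe."""

    def __init__(self, dtypes):
        self.dtypes = dtypes


@singledispatch
def schema_dispatch(obj):
    # No backend registered for this type: fail loudly instead of
    # silently falling back to pandas-specific code.
    raise TypeError(f"No schema extraction registered for {type(obj).__name__}")


@schema_dispatch.register
def _(obj: PandasLikeFrame):
    # Assumed CPU path: each backend decides how to describe its columns.
    return [("pandas-path", name, dtype) for name, dtype in obj.dtypes.items()]


@schema_dispatch.register
def _(obj: CudfLikeFrame):
    # Assumed GPU path: selected purely by the type of the argument.
    return [("cudf-path", name, dtype) for name, dtype in obj.dtypes.items()]
```

The caller only ever invokes `schema_dispatch(obj)`; which implementation runs depends on the type of `obj`, so a backend can supply its own schema extraction without the caller knowing about it.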

Closes #xxxx
Tests added / passed
Passes pre-commit run --all-files

charlesbluca

Can you give a minimal reproducer of the dask-cudf failure itself? It might help to add a gpuCI test in this PR to ensure that a breaking change like this isn't made in the future.

dask/dataframe/backends.py

galipremsagar · 2022-06-07T14:49:13Z

> Can you give a minimal reproducer of the dask-cudf failure itself? Might help to add a gpuCI test in this PR to ensure that a breaking change like this isn't made in the future

Here is the minimal repro:

import cudf
import dask_cudf

df = cudf.DataFrame(
    {
        "a": ["abc", "def"],
        "b": ["a", "z"],
    }
)
ddf = dask_cudf.from_cudf(df, 3)
ddf.to_parquet("sample.parquet")  # raises the TypeError shown below

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
python/dask_cudf/dask_cudf/core.py:278: in to_parquet
    return to_parquet(self, path, *args, **kwargs)
../envs/cudfdev/lib/python3.9/site-packages/dask-2022.3.0+153.g256622943.dirty-py3.9.egg/dask/dataframe/io/parquet/core.py:844: in to_parquet
    i_offset, fmd, metadata_file_exists, extra_write_kwargs = engine.initialize_write(
../envs/cudfdev/lib/python3.9/site-packages/dask-2022.3.0+153.g256622943.dirty-py3.9.egg/dask/dataframe/io/parquet/arrow.py:516: in initialize_write
    inferred_schema = pa.Schema.from_pandas(
pyarrow/types.pxi:1492: in pyarrow.lib.Schema.from_pandas
    ???
../envs/cudfdev/lib/python3.9/site-packages/pyarrow/pandas_compat.py:525: in dataframe_to_types
    values = c.values
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/single_column_frame.py:118: in values
    return self._column.values
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <cudf.core.column.string.StringColumn object at 0x7f6373e9f1c0>
[
  "cat",
  "dog"
]
dtype: object

    @property
    def values(self) -> cupy.ndarray:
        """
        Return a CuPy representation of the StringColumn.
        """
>       raise TypeError("String Arrays is not yet implemented in cudf")
E       TypeError: String Arrays is not yet implemented in cudf

../envs/cudfdev/lib/python3.9/site-packages/cudf/core/column/string.py:5321: TypeError

I've added a GPU test for the same in this PR.
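
Reading the traceback above, the failure appears to come from pandas-oriented inference code touching `.values` on each column, which the cudf string column refuses. The sketch below mimics that shape in plain Python; `FakeStringColumn`, `infer_types_pandas_style`, and `try_infer` are hypothetical and only imitate the behaviour shown in the traceback, not the actual pyarrow or cudf code.

```python
class FakeStringColumn:
    """Hypothetical column whose `.values` is unsupported,
    like the cudf string column in the traceback."""

    def __init__(self, data):
        self._data = list(data)

    @property
    def values(self):
        # Same error message as in the traceback.
        raise TypeError("String Arrays is not yet implemented in cudf")


def infer_types_pandas_style(columns):
    # Imitates pandas-oriented inference that reads `.values` from every column.
    return {name: type(col.values).__name__ for name, col in columns.items()}


def try_infer(columns):
    # Report the failure as a string so the behaviour is easy to inspect.
    try:
        return infer_types_pandas_style(columns)
    except TypeError as exc:
        return f"failed: {exc}"
```

Under these assumptions, any inference routine that unconditionally reads `.values` fails as soon as it meets a string column, which is why routing schema extraction through a backend-specific dispatch avoids the error.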

dask/dataframe/io/tests/test_parquet.py

This PR relaxes `dask` & `distributed` pinnings for `22.08` development.
Requires: dask/dask#9169
This PR also includes `pyarrow_schema_dispatch` implementation for `dask-cudf`.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/jakirkham

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- https://github.com/jakirkham

URL: #11058

add arrow schema dispatch

f4457e0

github-actions bot added the dataframe, dispatch, and io labels Jun 7, 2022

galipremsagar mentioned this pull request Jun 7, 2022

[REVIEW] Unpin dask & distributed for development rapidsai/cudf#11058

Merged

galipremsagar added 2 commits June 6, 2022 22:36

fix import

752067e

isort

2566229

charlesbluca reviewed Jun 7, 2022

rjzamora reviewed Jun 7, 2022

dask/dataframe/backends.py (outdated, resolved)

galipremsagar added 2 commits June 7, 2022 07:46

add gpu test

4ffd4ac

remove try

ff0bcad

galipremsagar requested review from rjzamora and charlesbluca June 7, 2022 14:49

Update test_parquet.py

30bb5a8

galipremsagar commented Jun 7, 2022

dask/dataframe/io/tests/test_parquet.py (outdated, resolved)

charlesbluca approved these changes Jun 7, 2022

rjzamora reviewed Jun 7, 2022

dask/dataframe/io/tests/test_parquet.py (outdated, resolved)

galipremsagar commented Jun 7, 2022

dask/dataframe/io/tests/test_parquet.py (outdated, resolved)

galipremsagar added 2 commits June 7, 2022 11:13

Update dask/dataframe/io/tests/test_parquet.py

191f89e

address reviews

9ec2a64

galipremsagar requested a review from rjzamora June 7, 2022 16:18

rjzamora approved these changes Jun 7, 2022

rjzamora merged commit 6369cdb into dask:main Jun 7, 2022

fbunt mentioned this pull request Jun 14, 2022

dask 2022.6.0 causes ArrowTypeError in to_parquet geopandas/dask-geopandas#198

Closed
