[BUG] Calling to_pandas() on a dask dataframe containing a 'struct' column can result in a raised PyArrow exception #13305
Comments
I apologize! I missed that you dropped a link to it :)
Here is a more minimal reproducer. It appears to be some interaction between a column of empty structs with nulls and the way dask_cudf initialises the data.

import cudf
import dask_cudf
s = cudf.Series([None, {}])
ds = dask_cudf.from_cudf(s, npartitions=1)
ds.compute()
OK, this is a bug in slicing struct columns:

import cudf
s = cudf.Series([None, {}])
s.iloc[0:2] # => `ArrowInvalid` error
The null mask wrapped up by the cython layer's
An empty struct column (dtype of StructDtype({})) has no children, and hence a base_size of zero. However, it may still have a non-zero size and non-empty null mask. When slicing such a column, the mask size must be transferred over correctly by inspecting the size and offset of the owning column. Previously, we incorrectly determined the sliced column to have a mask buffer of zero bytes in this case. Closes #13305.
Authors:
- Lawrence Mitchell (https://github.com/wence-)
- Ashwin Srinath (https://github.com/shwina)
Approvers:
- Bradley Dice (https://github.com/bdice)
URL: #13315
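To make the description above concrete, here is a rough pure-Python model of the mask-size calculation. It is only a sketch, not cuDF's actual code: the function names are hypothetical, and the 64-byte padding is an assumption about how the validity bitmask allocation is rounded. It contrasts deriving the mask size from base_size alone (zero for an empty struct column) with deriving it from the size and offset of the owning column.

```python
def mask_bytes_required(num_rows: int, padding: int = 64) -> int:
    """Bytes needed for one validity bit per row, rounded up to `padding` bytes.

    The 64-byte padding is an assumption made for this sketch.
    """
    num_bytes = (num_rows + 7) // 8
    return ((num_bytes + padding - 1) // padding) * padding


def sliced_mask_bytes_from_base_size(base_size: int) -> int:
    # Hypothetical model of the old behaviour: only base_size is consulted.
    # An empty struct column has no children, so base_size is 0.
    return mask_bytes_required(base_size)


def sliced_mask_bytes_from_owner(base_size: int, size: int, offset: int) -> int:
    # Hypothetical model of the fix: also inspect the owning column's
    # size and offset, so rows without children still get mask bits.
    return mask_bytes_required(max(base_size, size + offset))


# cudf.Series([None, {}]) has size=2, offset=0, and base_size=0.
old = sliced_mask_bytes_from_base_size(base_size=0)
new = sliced_mask_bytes_from_owner(base_size=0, size=2, offset=0)
print(old, new)
```

Under these assumptions the first calculation yields a zero-byte mask for a column that still has a null row, while the second yields a buffer large enough to hold both validity bits.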
Describe the bug
Calling to_pandas() on a dask dataframe created from a cuDF dataframe containing a 'struct' column can result in a raised PyArrow exception:
Steps/Code to reproduce bug
Using this example file azure_ad_logs.json: https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/tests/tests_data/azure_ad_logs.json
Expected behavior
A pandas dataframe should be returned.
Environment overview (please complete the following information)
Conda environment
grep 'cudf|arrow|dask'
arrow      1.2.3     pypi_0                       pypi
arrow-cpp  10.0.1    ha770c72_14_cuda             conda-forge
cudf       23.02.00  cuda_11_py310_g5ad4a85b9d_0  rapidsai
dask       2023.1.1  pyhd8ed1ab_0                 conda-forge
dask-core  2023.1.1  pyhd8ed1ab_0                 conda-forge
dask-cuda  23.2.1    pyhd8ed1ab_1                 conda-forge
dask-cudf  23.02.00  cuda_11_py310_g5ad4a85b9d_0  rapidsai
libarrow   10.0.1    h255618e_14_cuda             conda-forge
libcudf    23.02.00  cuda11_g5ad4a85b9d_0         rapidsai
pyarrow    10.0.1    py310hc81d9b2_14_cuda        conda-forge
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
Additional context
Add any other context about the problem here.