Object columns cause P2P shuffling errors after v2023.9.2 #8200
Comments
We ran into the same issue. For us it only occurs if we use strings in the columns. This is the minimal example I came up with:

In my case the error only happens with how="outer"; with how="inner" there is no crash.
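A sketch of what such a minimal example might look like, pieced together from the details in this thread (string-valued object columns, an outer merge, P2P shuffling); the column names, data sizes, and cluster setup are assumptions rather than the commenter's original code:

```python
# Hedged sketch, not the original reproducer from this thread.
import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # P2P shuffling runs on the distributed scheduler

    # Keep string/object columns as objects instead of Arrow strings
    dask.config.set({"dataframe.convert-string": False})

    left = dd.from_pandas(
        pd.DataFrame({"key": range(10), "name": [f"row-{i}" for i in range(10)]}),
        npartitions=5,
    )
    right = dd.from_pandas(
        pd.DataFrame({"key": range(5, 15), "other": [f"x-{i}" for i in range(10)]}),
        npartitions=5,
    )

    # Reportedly fails with how="outer" under P2P but not with how="inner".
    # Older Dask spells the keyword shuffle=, newer releases shuffle_method=.
    out = left.merge(right, on="key", how="outer", shuffle_method="p2p")
    print(out.compute())
```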
P2P shuffling is breaking multiple things. Earlier, the merge operation with left_index=True and right_index=True was not working, and now this error is occurring.
@aiudirog
@aiudirog and @sil-lnagel, thanks for reporting your issues with reproducers. I've been able to reproduce this bug and am working on a fix. @aiudirog: Great job digging deeper into this! You are correct that using a single schema from Dask's …
@hendrikmakait Thanks for fixing this! Hopefully in the future we'll be able to completely replace object columns with specialized PyArrow types.

@JaguarPaw2409 I have not. The closest I got was using …
Looks like issues about using complex PyArrow types have already been opened:
The error still persists in v2024.5.2.
@Cognitus-Stuti Do you have an example that triggers the error? I just tested my original example and it didn't error on v2024.5.2 or v2024.8.0.
Describe the issue:

Since v2023.9.2, using object columns with dataframe.convert-string disabled causes a "P2P shuffling <ID> failed during unpack phase" error stemming from a PyArrow error: "Unsupported cast from string to null using function cast_null". I have automatic string conversions disabled because I sometimes need to nest arrays as object columns and I don't want them converted to strings.

Minimal Complete Verifiable Example:
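A minimal sketch of the setup described above (nested arrays stored in an object column, string conversion disabled, a P2P-shuffled merge); the exact data, partition counts, and cluster setup are assumptions, not the original MCVE:

```python
# Hedged sketch, not the original Minimal Complete Verifiable Example.
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # P2P shuffling needs the distributed scheduler

    # Disable automatic string conversion so object columns stay objects
    dask.config.set({"dataframe.convert-string": False})

    left = pd.DataFrame({
        "key": range(8),
        "vec": [np.arange(3) for _ in range(8)],  # object column holding arrays
    })
    right = pd.DataFrame({"key": range(8), "value": range(8)})

    dleft = dd.from_pandas(left, npartitions=4)
    dright = dd.from_pandas(right, npartitions=4)

    # The shuffle keyword is shuffle= in older Dask and shuffle_method= in
    # newer releases; adjust for the installed version.
    result = dleft.merge(dright, on="key", how="outer", shuffle_method="p2p")
    print(result.compute())
```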
Anything else we need to know?:
I bisected the issue and found e57d1c55 to be the cause, specifically this line:

distributed/distributed/shuffle/_arrow.py, line 119 (at e57d1c5)
At this point it has arrays of the correct PyArrow types, but the schema has null for the object columns. If I remove the _copy_table() function and instead add promote=True to the calls to pa.concat_tables(), the example above no longer errors. I suspect the goal of using a single schema object was to avoid any schema collision issues, but maybe in the cases where the schema contains nulls, promote should be used instead, like the old convert_partition() function did?
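To illustrate the difference, here is a standalone PyArrow sketch (not the distributed shuffle code itself; note that promote= is deprecated in favor of promote_options= in newer PyArrow releases):

```python
import pyarrow as pa

# Casting real string data to the null type that an empty meta schema
# reports for object columns reproduces the underlying error.
data = pa.table({"x": pa.array(["a", "b"])})
null_schema = pa.schema([("x", pa.null())])
try:
    data.cast(null_schema)
except pa.ArrowNotImplementedError as exc:
    print(exc)  # Unsupported cast from string to null using function cast_null

# Letting concat_tables unify the schemas instead promotes the null-typed
# (empty) table to the string type of the populated one.
empty = pa.table({"x": pa.array([], type=pa.null())})
combined = pa.concat_tables([empty, data], promote=True)
print(combined.schema)  # x: string
```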
I have also attempted to update my code to use nested arrays declared with a pd.ArrowDtype, but that ironically causes different errors in PyArrow itself when it attempts to parse the datatype back in after serialization.

Environment: