
Object columns cause P2P shuffling errors after v2023.9.2 #8200

Closed
aiudirog opened this issue Sep 21, 2023 · 8 comments · Fixed by #8235
Labels
bug Something is broken regression shuffle

Comments

@aiudirog

Describe the issue:

Since v2023.9.2, using object columns with dataframe.convert-string disabled causes a "P2P shuffling <ID> failed during unpack phase" error stemming from the PyArrow error "Unsupported cast from string to null using function cast_null". I have automatic string conversions disabled because I sometimes need to nest arrays in object columns and don't want them converted to strings.

Minimal Complete Verifiable Example:

import dask
import dask.distributed as dd
import dask.dataframe as ddf
import numpy as np
import pandas as pd


if __name__ == '__main__':
    dask.config.set({"dataframe.convert-string": False})
    df = pd.DataFrame({
        'ints': [1, 2, 3],
        'arrays': [
            np.asarray([1, 2, 3]),
            np.asarray([4, 5, 6]),
            np.asarray([7, 8, 9]),
        ],
    })

    with dd.Client(processes=True) as client:
        ddf.from_pandas(
            df,
            npartitions=1,
        ).groupby(
            'ints',
        ).apply(
            lambda x: x,
            meta=df.iloc[:0],
        ).compute()

Anything else we need to know?:

I bisected the issue and found e57d1c55 to be the cause, specifically this line:

return pa.table(data=arrs, schema=schema)

At this point the arrays have the correct PyArrow types, but the schema has null for the object columns. If I remove the _copy_table() function and instead add promote=True to the calls to pa.concat_tables(), the example above no longer errors.

I suspect the goal of using a single schema object was to avoid schema collision issues, but maybe in cases where the schema contains nulls, promote should be used instead, like the old convert_partition() function did?

I have also attempted to update my code to use nested arrays declared with a pd.ArrowDtype, but that ironically causes different errors in PyArrow itself when it attempts to parse the datatype back in after serialization.

Environment:

  • Dask version: 2023.9.2
  • Python version: Python 3.11.5
  • Operating System: Debian Bookworm (Docker Container)
  • Install method (conda, pip, source): pip & source for bisect
@sil-lnagel

sil-lnagel commented Oct 2, 2023

We ran into the same issue. For us it only occurs if we use strings in the columns. This is the minimal example I came up with:

import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client


if __name__ == '__main__':
    dask.config.set({"dataframe.convert-string": False})
    client = Client("tcp://192.168.179.28:9003")

    d1 = dd.from_pandas(pd.DataFrame({'x':[1,2], 'y':[3, 4]}), chunksize=1000)
    d2 = dd.from_pandas(pd.DataFrame({'x':[1,2], 'z':['hi', 'hey']}), chunksize=1000)

    d1.merge(d2, how="outer", on="x", shuffle='p2p').compute()

In my case the error only happens with how="outer"; with how="inner" there is no crash.

Environment

  • Dask version: 2023.9.2 and 2023.9.3
  • Python version: Python 3.11.5
  • Operating System: Debian Bookworm (Docker Container)
  • Install method (conda, pip, source): pip

@JaguarPaw2409

P2P shuffling is breaking multiple things. Earlier, the merge operation with left_index=True and right_index=True was not working, and now this error is occurring.

@JaguarPaw2409

@aiudirog
Is there any way to have a nested array in a column without disabling dataframe.convert-string?

@hendrikmakait
Member

@aiudirog and @sil-lnagel, thanks for reporting your issues with reproducers. I've been able to reproduce this bug and am working on a fix.

@aiudirog: Great job digging deeper into this! You are correct that using a single schema from Dask's meta was meant to help us avoid schema collisions. It would have also had the benefit of letting us reduce the amount of data sent over the network in the future. However, as you've shown, meta does not contain all the typing information needed for this, so we'll remove this and enforce the dtypes later in the process.

@aiudirog
Author

aiudirog commented Oct 5, 2023

@hendrikmakait Thanks for fixing this! Hopefully in the future we'll be able to completely replace object columns with specialized PyArrow types.

@JaguarPaw2409 I have not. The closest I got was using pd.ArrowDtype(pa.list_(pa.int32())) but that gets stored in the Arrow metadata as "list<item: int32>[pyarrow]" which pandas.core.dtypes.common.pandas_dtype() can't recover.

@Cognitus-Stuti

The error still persists in v2024.5.2

@aiudirog
Author

@Cognitus-Stuti Do you have an example that triggers the error? I just tested my original example and it didn't error on v2024.5.2 or v2024.8.0.
