
Object columns cause P2P shuffling errors after v2023.9.2 #8200

Closed
aiudirog opened this issue Sep 21, 2023 · 8 comments · Fixed by #8235
Labels
bug Something is broken regression shuffle

Comments

@aiudirog

Describe the issue:

Since v2023.9.2, using object columns with dataframe.convert-string disabled causes a "P2P shuffling <ID> failed during unpack phase" error stemming from the PyArrow error "Unsupported cast from string to null using function cast_null". I have automatic string conversions disabled because I sometimes need to nest arrays in object columns and don't want them converted to strings.

Minimal Complete Verifiable Example:

import dask
import dask.distributed as dd
import dask.dataframe as ddf
import numpy as np
import pandas as pd


if __name__ == '__main__':
    dask.config.set({"dataframe.convert-string": False})
    df = pd.DataFrame({
        'ints': [1, 2, 3],
        'arrays': [
            np.asarray([1, 2, 3]),
            np.asarray([4, 5, 6]),
            np.asarray([7, 8, 9]),
        ],
    })

    with dd.Client(processes=True) as client:
        ddf.from_pandas(
            df,
            npartitions=1,
        ).groupby(
            'ints',
        ).apply(
            lambda x: x,
            meta=df.iloc[:0],
        ).compute()

Anything else we need to know?:

I bisected the issue and found e57d1c55 to be the cause, specifically this line:

return pa.table(data=arrs, schema=schema)

At this point the arrays have the correct PyArrow types, but the schema has null for the object columns. If I remove the _copy_table() function and instead add promote=True to the calls to pa.concat_tables(), the example above no longer errors.

I suspect the goal of using a single schema object was to avoid schema collision issues, but maybe in cases where the schema contains nulls, promote should be used instead, like the old convert_partition() function did?

I have also attempted to update my code to use nested arrays declared with a pd.ArrowDtype, but that ironically causes different errors in PyArrow itself when it attempts to parse the datatype back in after serialization.

Environment:

  • Dask version: 2023.9.2
  • Python version: Python 3.11.5
  • Operating System: Debian Bookworm (Docker Container)
  • Install method (conda, pip, source): pip & source for bisect
@sil-lnagel

sil-lnagel commented Oct 2, 2023

We ran into the same issue. For us it only occurs if we use strings in the columns. This is the minimal example I came up with:

import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client


if __name__ == '__main__':
    dask.config.set({"dataframe.convert-string": False})
    client = Client("tcp://192.168.179.28:9003")

    d1 = dd.from_pandas(pd.DataFrame({'x':[1,2], 'y':[3, 4]}), chunksize=1000)
    d2 = dd.from_pandas(pd.DataFrame({'x':[1,2], 'z':['hi', 'hey']}), chunksize=1000)

    d1.merge(d2, how="outer", on="x", shuffle='p2p').compute()

In my case the error only happens with how="outer"; with how="inner" there is no crash.

Environment

  • Dask version: 2023.9.2 and 2023.9.3
  • Python version: Python 3.11.5
  • Operating System: Debian Bookworm (Docker Container)
  • Install method (conda, pip, source): pip

@JaguarPaw2409

P2P shuffling is breaking multiple things. Earlier, the merge operation with left_index=True and right_index=True was not working, and now this error is occurring.

@JaguarPaw2409

@aiudirog
Is there any way to have a nested array in a column without disabling dataframe.convert-string?

@hendrikmakait
Member

@aiudirog and @sil-lnagel, thanks for reporting your issues with reproducers. I've been able to reproduce this bug and am working on a fix.

@aiudirog: Great job digging deeper into this! You are correct that using a single schema from Dask's meta was meant to help us avoid schema collisions. It would have also had the benefit of letting us reduce the amount of data sent over the network in the future. However, as you've shown, meta does not contain all the typing information needed for this, so we'll remove this and enforce the dtypes later in the process.

@aiudirog
Author

aiudirog commented Oct 5, 2023

@hendrikmakait Thanks for fixing this! Hopefully in the future we'll be able to completely replace object columns with specialized PyArrow types.

@JaguarPaw2409 I have not. The closest I got was using pd.ArrowDtype(pa.list_(pa.int32())) but that gets stored in the Arrow metadata as "list<item: int32>[pyarrow]" which pandas.core.dtypes.common.pandas_dtype() can't recover.

@Cognitus-Stuti

The error still persists in v2024.5.2

@aiudirog
Author

@Cognitus-Stuti Do you have an example that triggers the error? I just tested my original example and it didn't error on v2024.5.2 or v2024.8.0.
