pickle function call uses kwargs added in python 3.8 #3843

Closed
samaust opened this issue May 31, 2020 · 4 comments


samaust commented May 31, 2020

A pickle function call uses a kwarg added in Python 3.8 (https://github.com/dask/distributed/blob/master/distributed/protocol/pickle.py#L64).

The setup.py file only requires Python >=3.6 (https://github.com/dask/distributed/blob/master/setup.py#L29).

Python version 3.7.4
dask version 2.17.2
distributed version 2.17.0

Error returned:

    return pickle.loads(x, buffers=buffers)
    TypeError: 'buffers' is an invalid keyword argument for loads()

The function call is:

    return pickle.loads(x, buffers=buffers)

Python 3.8.3
https://docs.python.org/3/library/pickle.html#pickle.loads

Changed in version 3.8: The buffers argument was added.

Python 3.7.7
https://docs.python.org/3.7/library/pickle.html#pickle.loads

Does not have the buffers kwarg.
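For illustration, a minimal sketch of guarding the keyword by interpreter version (the loads_compat helper is hypothetical, not distributed's actual code):

    import sys
    import pickle

    def loads_compat(data, buffers=None):
        # The out-of-band 'buffers' keyword from PEP 574 only exists on Python 3.8+.
        if sys.version_info >= (3, 8):
            return pickle.loads(data, buffers=buffers)
        if buffers:
            # Older interpreters cannot consume out-of-band buffers at all.
            raise TypeError("out-of-band buffers require Python 3.8+ or the pickle5 backport")
        return pickle.loads(data)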

EDIT:
Here is some more information. It looks like the package pickle5 provides a backport for Python 3.5, 3.6 and 3.7.

https://docs.python.org/3/library/pickle.html#pickle-oob
https://www.python.org/dev/peps/pep-0574/
https://www.python.org/dev/peps/pep-0574/#pickle5-pypi
https://pypi.org/project/pickle5/
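On pre-3.8 interpreters, the usual fallback pattern (assuming the pickle5 wheel is installed) is:

    try:
        # Backport of the protocol-5 pickle API for Python 3.5-3.7.
        import pickle5 as pickle
    except ImportError:
        import pickle  # stdlib; the 'buffers' kwarg exists only on 3.8+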

I tried again using Python 3.8.3. This time I get another error at the same line:

  File "c:\project\env38\lib\site-packages\distributed\protocol\pickle.py", line 64, in loads
    return pickle.loads(x, buffers=buffers)
_pickle.UnpicklingError: pickle data was truncated
@samaust samaust changed the title pickle function call uses kwargs added in python 3.8 and the package requires python >=3.6 pickle function call uses kwargs added in python 3.8 May 31, 2020
@jakirkham
Member

Can you please include an MRE?


samaust commented Jun 1, 2020

I figured out a mistake in my code. I was passing a dtype of 'str' instead of 'string' in a call to dd.read_csv. Fixing that solves my issue.

However, I'm still wondering why that mistake triggers that error message.

There's probably a better way but here's how I was passing a string type for all columns.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=4, memory_limit='14GiB')
    client = Client(cluster)

    path = r'c:\somepath\file.csv'  # raw string so the backslashes are not read as escapes
    sep = ';'
    header = 0
    nrows = 10
    index_col = 'timestamp'

    # Get column names from a small sample read with pandas
    df_tmp = pd.read_csv(path, sep=sep, header=header, engine='c', nrows=nrows, low_memory=False)
    cols = list(df_tmp.columns.values)

    # Generate dtypes
    dtypes = {}
    for col in cols:
        # dtypes[col] = 'string'  # Works
        dtypes[col] = 'str'  # Fails

    # Read csv file using dask
    dfsample1 = dd.read_csv(path, blocksize='400MB', sep=sep, header=header, dtype=dtypes,
                            parse_dates=[index_col],
                            date_parser=lambda col: pd.to_datetime(col, utc=True,
                                                                   format='%Y-%m-%d %H:%M:%S'))
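As an aside, a small pandas-only sketch of how the two spellings differ (assuming pandas >= 1.0, where 'string' selects the nullable StringDtype while 'str' falls back to numpy-style string handling):

    import pandas as pd

    s1 = pd.Series(['a', '1'], dtype='string')  # pandas nullable StringDtype
    s2 = pd.Series(['a', '1'], dtype='str')     # numpy-style strings, stored as object
    print(s1.dtype, s2.dtype)                   # -> string object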

I tried to make a more complete MRE but failed. I generated a new csv file of about 1 GiB and passed it to dd.read_csv with the wrong dtype, and it did not trigger the error message. The first column contains a datetime string (such as '2020-01-01 00:00:00') and the others contain strings (a mixture of text and numbers). I don't know if it's the size or the content that triggers the issue.

When instead I set the path to a csv file that's roughly 48.9 GiB and pass the wrong dtype, I get the error message.

I'm running the code on Windows 10 on a system with 64 GiB of RAM. When I look at the diagnostic dashboard, the workers each use about 2 to 3 GiB and the overall system memory usage is low.

I'm wondering:

  • Is there a safety check in place in distributed, dask, or pandas that raises an exception when an invalid dtype is passed to read_csv?
  • Should there be such a check? (See the sketch below.)
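A sketch of what such a pre-flight check might look like (validate_dtypes is a hypothetical helper; note that pandas does accept 'str' as a dtype spelling, so this would not have caught the case above):

    from pandas.api.types import pandas_dtype

    def validate_dtypes(dtypes):
        # Fail fast on dtypes pandas cannot construct, before reading any data.
        for col, dt in dtypes.items():
            try:
                pandas_dtype(dt)
            except TypeError as e:
                raise TypeError(f"column {col!r}: invalid dtype {dt!r}") from e

    validate_dtypes({'a': 'string', 'b': 'str'})  # both spellings pass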

@jakirkham
Member

Those seem like different issues, not really related to the pickle implementation (or Distributed). Could you please raise separate issues on Dask's issue tracker?


samaust commented Jun 1, 2020

Alright, closing since this is unrelated to the pickle implementation.

BTW, I might be wrong about this 'str' vs 'string' dtype. This needs more reading and testing on my part. I'm new to dask and pandas and still learning. I'm trying to read a csv file, do some processing and save it to parquet (using either fastparquet or pyarrow).
