pickle function call uses kwargs added in python 3.8 #3843

Closed
samaust opened this issue May 31, 2020 · 4 comments


samaust commented May 31, 2020

A pickle function call uses a kwarg added in Python 3.8 (https://github.com/dask/distributed/blob/master/distributed/protocol/pickle.py#L64).

The setup.py file only requires Python >=3.6 (https://github.com/dask/distributed/blob/master/setup.py#L29).

Python version 3.7.4
dask version 2.17.2
distributed version 2.17.0

Error returned:

    return pickle.loads(x, buffers=buffers)
    TypeError: 'buffers' is an invalid keyword argument for loads()

The function call is:

    return pickle.loads(x, buffers=buffers)

Python 3.8.3
https://docs.python.org/3/library/pickle.html#pickle.loads

Changed in version 3.8: The buffers argument was added.

Python 3.7.7
https://docs.python.org/3.7/library/pickle.html#pickle.loads

Does not have the buffers kwarg.
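For illustration, a minimal sketch of guarding the keyword by interpreter version (the loads_compat helper is hypothetical, not distributed's actual code):

    import sys
    import pickle

    def loads_compat(data, buffers=None):
        # The out-of-band 'buffers' keyword from PEP 574 only exists on Python 3.8+.
        if sys.version_info >= (3, 8):
            return pickle.loads(data, buffers=buffers)
        if buffers:
            # Older interpreters cannot consume out-of-band buffers at all.
            raise TypeError("out-of-band buffers require Python 3.8+ or the pickle5 backport")
        return pickle.loads(data)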

EDIT:
Here is some more information. It looks like the package pickle5 provides a backport for Python 3.5, 3.6 and 3.7.

https://docs.python.org/3/library/pickle.html#pickle-oob
https://www.python.org/dev/peps/pep-0574/
https://www.python.org/dev/peps/pep-0574/#pickle5-pypi
https://pypi.org/project/pickle5/
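On pre-3.8 interpreters, the usual fallback pattern (assuming the pickle5 wheel is installed) is:

    try:
        # Backport of the protocol-5 pickle API for Python 3.5-3.7.
        import pickle5 as pickle
    except ImportError:
        import pickle  # stdlib; the 'buffers' kwarg exists only on 3.8+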

I tried again using Python 3.8.3. This time I get another error at the same line:

  File "c:\project\env38\lib\site-packages\distributed\protocol\pickle.py", line 64, in loads
    return pickle.loads(x, buffers=buffers)
_pickle.UnpicklingError: pickle data was truncated
@samaust samaust changed the title pickle function call uses kwargs added in python 3.8 and the package requires python >=3.6 pickle function call uses kwargs added in python 3.8 May 31, 2020
@jakirkham
Member

Can you please include an MRE?


samaust commented Jun 1, 2020

I figured out a mistake in my code. I was passing a dtype of 'str' instead of 'string' in a call to dd.read_csv. Fixing that solves my issue.

However, I'm still wondering why that mistake triggers that error message.

There's probably a better way but here's how I was passing a string type for all columns.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=4, memory_limit='14GiB')
    client = Client(cluster)

    path = r'c:\somepath\file.csv'  # raw string so the backslashes are not read as escapes
    sep = ';'
    header = 0
    nrows = 10
    index_col = 'timestamp'

    # Get column names from a small sample read with pandas
    df_tmp = pd.read_csv(path, sep=sep, header=header, engine='c', nrows=nrows, low_memory=False)
    cols = list(df_tmp.columns.values)

    # Generate dtypes
    dtypes = {}
    for col in cols:
        # dtypes[col] = 'string'  # Works
        dtypes[col] = 'str'  # Fails

    # Read csv file using dask
    dfsample1 = dd.read_csv(path, blocksize='400MB', sep=sep, header=header, dtype=dtypes,
                            parse_dates=[index_col],
                            date_parser=lambda col: pd.to_datetime(col, utc=True,
                                                                   format='%Y-%m-%d %H:%M:%S'))
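As an aside, a small pandas-only sketch of how the two spellings differ (assuming pandas >= 1.0, where 'string' selects the nullable StringDtype while 'str' falls back to numpy-style string handling):

    import pandas as pd

    s1 = pd.Series(['a', '1'], dtype='string')  # pandas nullable StringDtype
    s2 = pd.Series(['a', '1'], dtype='str')     # numpy-style strings, stored as object
    print(s1.dtype, s2.dtype)                   # -> string object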

I tried to make a more complete MRE but failed. I generated a new csv file of about 1 GiB and passed it to dd.read_csv with the wrong dtype, and it did not trigger the error message. The first column contains a datetime string (such as '2020-01-01 00:00:00') and the others contain strings (a mixture of text and numbers). I don't know if it's the size or the content that triggers the issue.

When instead I set the path to a csv file that's roughly 48.9 GiB and pass the wrong dtype, I get the error message.

I'm running the code on Windows 10 on a system with 64 GiB of RAM. When I look at the diagnostic dashboard, the workers each use about 2 to 3 GiB and the overall system memory usage is low.

I'm wondering:

  • Is there a safety check in place in distributed, dask, or pandas that raises an exception when an invalid dtype is passed to read_csv?
  • Should there be such a check? (See the sketch below.)
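A sketch of what such a pre-flight check might look like (validate_dtypes is a hypothetical helper; note that pandas does accept 'str' as a dtype spelling, so this would not have caught the case above):

    from pandas.api.types import pandas_dtype

    def validate_dtypes(dtypes):
        # Fail fast on dtypes pandas cannot construct, before reading any data.
        for col, dt in dtypes.items():
            try:
                pandas_dtype(dt)
            except TypeError as e:
                raise TypeError(f"column {col!r}: invalid dtype {dt!r}") from e

    validate_dtypes({'a': 'string', 'b': 'str'})  # both spellings pass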

@jakirkham
Member

Those seem like different issues, not really related to the pickle implementation (or Distributed). Could you please raise separate issues on Dask's issue tracker?


samaust commented Jun 1, 2020

Alright, closing since this is unrelated to the pickle implementation.

BTW, I might be wrong about this 'str' vs 'string' dtype. This needs more reading and testing on my part. I'm new to dask and pandas and still learning. I'm trying to read a csv file, do some processing and save it to parquet (using either fastparquet or pyarrow).
