pickle function call uses kwargs added in python 3.8 #3843
Can you please include an MRE (minimal reproducible example)?
I figured out a mistake in my code: I was passing a dtype of `'str'` (where `'string'` works). However, I'm still wondering why that mistake triggers that error message. There's probably a better way, but here's how I was passing a string type for all columns:

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import pandas as pd

cluster = LocalCluster(n_workers=4, memory_limit='14GiB')
client = Client(cluster)

path = r'c:\somepath\file.csv'
sep = ';'
header = 0
nrows = 10
index_col = 'timestamp'

# Get column names from a small pandas sample
df_tmp = pd.read_csv(path, sep=sep, header=header, engine='c',
                     nrows=nrows, low_memory=False)
cols = list(df_tmp.columns.values)

# Generate dtypes
dtypes = {}
for col in cols:
    # dtypes[col] = 'string'  # Works
    dtypes[col] = 'str'       # Fails

# Read the csv file using dask
dfsample1 = dd.read_csv(
    path, blocksize='400MB', sep=sep, header=header, dtype=dtypes,
    parse_dates=[index_col],
    date_parser=lambda col: pd.to_datetime(col, utc=True,
                                           format='%Y-%m-%d %H:%M:%S'))
```

I tried to make a more complete MRE but failed. I generated a new csv file of about 1 GiB and passed it to `dd.read_csv` with the wrong dtype, and it does not trigger the error message. The first column contains a datetime string (such as '2020-01-01 00:00:00') and the other columns contain strings (a mixture of text and numbers). I don't know whether it's the size or the content that triggers the issue. When instead I set the path to a csv file that's roughly 48.9 GiB and pass the wrong dtype, I do get the error message. I'm running the code on a Windows 10 system with 64 GiB of RAM. When I look at the diagnostic dashboard, the workers each use about 2 to 3 GiB and the overall system memory usage is low.
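For what it's worth, the two spellings are not aliases in pandas: `'string'` selects the pandas extension `StringDtype`, while `'str'` is treated as plain Python `str` and stored as `object` dtype. A minimal sketch (plain pandas, no dask, with made-up data) illustrating the difference:

```python
import pandas as pd

# Hypothetical sample data resembling the csv columns described above
s = pd.Series(['2020-01-01 00:00:00', 'abc', '42'])

# 'string' selects the pandas extension StringDtype
as_string = s.astype('string')

# 'str' maps to plain Python str, which pandas stores as object dtype
as_str = s.astype('str')

print(as_string.dtype)  # string
print(as_str.dtype)     # object
```

Whether this dtype difference alone explains the pickling error is unclear from the thread; it only shows that the two dtype names produce genuinely different column types.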
Those seem like different issues, not really related to the pickle implementation (or to Distributed). Could you please raise separate issues on Dask's issue tracker?
Alright, closing since this is unrelated to the pickle implementation. By the way, I might be wrong about the `'str'` vs `'string'` dtype; that needs more reading and testing on my part. I'm new to dask and pandas and still learning. I'm trying to read a csv file, do some processing, and save it to parquet (using either fastparquet or pyarrow).
A pickle function call uses a keyword argument added in Python 3.8 (https://github.com/dask/distributed/blob/master/distributed/protocol/pickle.py#L64).
The setup.py file only requires Python >= 3.6 (https://github.com/dask/distributed/blob/master/setup.py#L29).
- Python version: 3.7.4
- dask version: 2.17.2
- distributed version: 2.17.0
The error is raised by this call:

```python
return pickle.loads(x, buffers=buffers)
```
- Python 3.8.3: https://docs.python.org/3/library/pickle.html#pickle.loads
- Python 3.7.7: https://docs.python.org/3.7/library/pickle.html#pickle.loads — `pickle.loads` does not have the `buffers` keyword argument
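One way to bridge the two signatures is to guard the `buffers` keyword behind a version check, since it only exists on Python >= 3.8 (pickle protocol 5, PEP 574). This is a minimal sketch of that pattern, not the actual fix applied in distributed:

```python
import pickle
import sys


def loads_compat(data, buffers=None):
    # The `buffers` keyword argument only exists on Python >= 3.8.
    if sys.version_info >= (3, 8):
        return pickle.loads(data, buffers=buffers)
    # On older interpreters, fall back to the plain call; out-of-band
    # buffers are simply not supported there (or use the pickle5 backport).
    return pickle.loads(data)


payload = pickle.dumps({"a": 1})
print(loads_compat(payload))  # {'a': 1}
```

On 3.7 this silently drops out-of-band buffer support rather than raising a `TypeError` about an unexpected keyword argument.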
EDIT: Here is some more information. It looks like the pickle5 package provides a backport of protocol 5 for Python 3.5, 3.6, and 3.7:

- https://docs.python.org/3/library/pickle.html#pickle-oob
- https://www.python.org/dev/peps/pep-0574/
- https://www.python.org/dev/peps/pep-0574/#pickle5-pypi
- https://pypi.org/project/pickle5/
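The usual way to consume that backport is an import shim: prefer the stdlib `pickle` on 3.8+, and fall back to `pickle5` on older interpreters. A sketch of the pattern (assuming `pickle5` is installed on the older interpreters; on 3.8+ this just uses the stdlib):

```python
import sys

if sys.version_info >= (3, 8):
    import pickle  # stdlib pickle already supports protocol 5 / buffers
else:
    try:
        import pickle5 as pickle  # backport for Python 3.5-3.7
    except ImportError:
        import pickle  # no out-of-band buffer support available

obj = {"x": [1, 2, 3]}
assert pickle.loads(pickle.dumps(obj)) == obj
```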
I tried again using Python 3.8.3. This time I get another error at the same line.