"distributed.protocol.core - CRITICAL - Failed to Serialize" with PyArrow 0.13 #2597
Comments
It appears that, in the absence of a serialization mechanism directly in pyarrow, we'll have to write a dask-specific one for these schema objects, at least for now. I can have a go early this coming week, if no one beats me to it. In the meantime @johnbensnyder , you can try to downgrade to pyarrow 0.12, or use fastparquet. |
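For context, the "dask-specific" route mentioned here is the custom-serializer registration described in distributed's serialization documentation. Below is a minimal sketch of that pattern, illustrated on the arrow-level Schema (which does expose IPC round-trip helpers); the parquet-internal schema object discussed later in this thread has no such helpers, which is the actual difficulty:

```python
# Minimal sketch of registering dask-specific serialization for an arrow Schema.
# Illustrative only: exact pyarrow helper locations vary by version.
import pyarrow as pa
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(pa.Schema)
def _serialize_schema(schema):
    # ship the schema as a single IPC-encoded bytes frame
    return {}, [schema.serialize().to_pybytes()]


@dask_deserialize.register(pa.Schema)
def _deserialize_schema(header, frames):
    # rebuild the schema from the IPC bytes
    return pa.ipc.read_schema(pa.py_buffer(frames[0]))
```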
Sorry to add noise to this issue, but is this also similar to this -> #2581 (comment)? If yes, I would like to try to solve it for pytorch, too. |
@muammar , that issue is indeed the same type of thing happening, but it is not directly related to this one, as it's a completely different sort of object causing the problem. |
@johnbensnyder , I have not managed to reproduce this locally; could you specify which versions you are using and how you generated the data, please? |
It is happening for me when I use the latest git version and install |
OK, thanks, updating to current master did indeed reveal the situation you are seeing. At first glance, I do not know how to fix it. The new pyarrow ParquetDataset includes an opaque parquet schema instance (a C++-managed object) as an attribute, which has no de/serialise mechanism on the Python side; apparently an arrow schema instance is no longer enough here (I assume conversion to/from thrift is implemented within C++). I am including this as information for someone who might know how to fix these problems, perhaps on the (py)arrow side. |
My guess is that if you want to engage the pyarrow devs that you should raise a JIRA on their issue tracker. I recommend trying to reproduce the issue using only cloudpickle and pyarrow and then raising there. |
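A minimal reproducer of the kind suggested might look like the following (a sketch, assuming a local parquet file at "data.parquet"):

```python
# Hypothetical minimal reproducer using only cloudpickle and pyarrow
import cloudpickle
import pyarrow.parquet as pq

dataset = pq.ParquetDataset("data.parquet")  # assumed local parquet data
cloudpickle.dumps(dataset)         # reportedly fails on pyarrow 0.13
cloudpickle.dumps(dataset.schema)  # the parquet schema attribute discussed above
```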
Thanks for reporting upstream @martindurant ! |
OK, so we haven't gotten any response on the arrow issue tracker. Any thoughts on how we should handle this? @martindurant if you have any inclination, I would love it if you were willing to handle/track this problem and find some solution (technical, social, or otherwise). |
I have no technical solution from our side, and the cython code for this looks... rather complicated. My initial hunch points to this changeset: apache/arrow@f2fb02b. By experimenting, I could get serialisation of the ParquetDataset working by removing the schema attribute from it and also from its pieces; but the hanging function defs in this diff certainly still fouled pickle, if not cloudpickle. Other connected commits: I don't know who at arrow can help ( @kszucs perhaps?) |
Yes, some suggestions on how we might proceed:
These are things that I might do when trying to resolve this problem |
Sorry, I've missed the notification. I can take a look tomorrow, if that's OK with you. |
(I can also handle this) |
apache/arrow#4156 should resolve the pickling issue for these objects. Note that this is a new feature: pickling wasn't implemented for them before, see https://issues.apache.org/jira/browse/ARROW-5144 |
I am having a very similar problem. Versions in use are:
I haven't been able to find any version combo yet that works... |
^ this is completely without pyarrow 0.13? |
Correct. I was not using |
bcolz is not that widely used, so although there have certainly been some comm changes in distributed, I would suspect this might just be a coincidence. Still, it would be good to work out what in bcolz doesn't serialise, when this started happening, and whether it was a change in bcolz, in distributed, or in something else, like cloudpickle. |
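One way to run that check, as a sketch independent of distributed (the bcolz usage here is an assumption about how the data was built):

```python
# Hedged sketch: does a bcolz container survive a cloudpickle round trip on its own?
import cloudpickle
import numpy as np
import bcolz

arr = bcolz.carray(np.arange(10))
cloudpickle.loads(cloudpickle.dumps(arr))  # if this fails, the problem is in bcolz, not distributed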
The bcolz issue appears to be separate. I recommend raising a separate
issue so that this issue doesn't get off track.
|
I have a Dask DataFrame, and when I call map_partitions on it and, inside every partition, try to call cascaded_union from shapely on a list of Points, it gives me the very same error. |
@pankalos most likely it's a similar, but different issue to this one. You may want to read through https://distributed.dask.org/en/latest/serialization.html, and experiment with why a list of shapely points apparently can't be serialized. |
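That experiment can be as small as the following sketch (hypothetical points, not the reporter's data):

```python
# Hedged sketch: check whether a list of shapely Points survives
# the pickle/cloudpickle machinery that dask relies on
import cloudpickle
from shapely.geometry import Point

points = [Point(x, x + 1.0) for x in range(10)]
roundtripped = cloudpickle.loads(cloudpickle.dumps(points))
print(roundtripped[0].wkt)  # if this fails, the Points themselves are the problem
```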
@pankalos @TomAugspurger For reference, I had the same problem as @pankalos, and downgrading pyarrow to 0.12 fixed it. |
This ought to now work with pyarrow 0.14, if someone would like to try. |
Hi,
and I'm able to serialize and read parquet. |
I have a somewhat similar issue to @pankalos, but with the dask DataFrame read_csv function. The weird thing is that I've used dask (distributed) to create the series of .csv files, but when I try to read them back in (in distributed mode) I get the following error:
I loaded Dask using the following:
Versions: dask = 2.0.0 ('packages': {'required': (('dask', '2.0.0'),). Is this similar? This is a new issue: I've never had this problem before. Any thoughts? |
Hi,
We tried to set we did not manage to get additional information using %debug |
I can confirm that the HDFS file system instances are pickleable, and so are open-file instances that reference them. I made a dataframe from csv with a file on HDFS, as in the example, and it pickles fine with cloudpickle, at least on my current setup. In short, I cannot reproduce this problem. In this issue, we still have no idea what it is that is failing to serialise, and since I cannot reproduce, I'm afraid it's down to the rest of you to try and find out! |
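The check described amounts to something like this sketch (the HDFS path is hypothetical, and an HDFS-capable backend must be installed):

```python
# Hedged sketch of the round-trip check described above
import cloudpickle
import dask.dataframe as dd

df = dd.read_csv("hdfs:///tmp/data/*.csv")      # hypothetical HDFS location
df2 = cloudpickle.loads(cloudpickle.dumps(df))  # succeeds on the setup described above
```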
After running update on the |
I have a similar issue. Downgrading pandas from 0.25 to 0.24.2 worked for me. |
I confirm that the problem seems to be due to pandas 0.25, since I had a similar issue using the Azure connection and as such had no dependency on pyarrow. The problem was solved as indicated by @tsonic. |
@TomAugspurger , could there be something in new pandas that is not serialisable? |
I'll take a look if someone has a reproducible example: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
The following code (in a test.py file) reproduces the error:

```python
import dask.dataframe as dd
from dask.distributed import Client
import pandas as pd

if __name__ == "__main__":
    df = pd.DataFrame(
        data=[("key_%i" % (i // 2), i) for i in range(100000)],
        columns=["key", "value"]
    )
    df.to_csv("./data.csv", index=False)
    c = Client("localhost:8786")
    dd_data = dd.read_csv("./data.csv", blocksize=100000)
    dd_data.groupby("key")["value"].sum().compute()
```

Launched by doing:

```bash
python3 -m venv venv
source venv/bin/activate
pip install pandas==0.25 dask==2.1.0 distributed==2.1.0
dask-scheduler &
dask-worker localhost:8786 &
python test.py
```

The same thing with |
Works with dask 2.2.0+38.g266c314 (master) and pandas 0.25.0 |
I am using conda-forge packages. Pandas 0.25.0 causes the issue shown in @AlexisMignon's comment with dask + dask-distributed 2.3.0. Downgrading pandas to 0.24.2, which also causes dask + dask-distributed to downgrade to 2.2.0, works without the TypeError. |
@hongzmsft do you have a reproducible example? I think we're still looking for one. Alternatively, does the example from #2597 (comment) work for you? |
@TomAugspurger I am using the example from #2597 (comment). The program runs fine with dask=2.2.0, distributed=2.2.0, pandas=0.24.2, python=3.6*. It also runs fine with dask=2.3.0, distributed=2.3.0, pandas=0.25.0, python=3.7.0, but raises a TypeError as reported in this issue with dask=2.3.0, distributed=2.3.0, pandas=0.25.0, python=3.6*. Here is a Dockerfile to reproduce the issue:

```dockerfile
FROM debian:buster-slim
RUN apt-get update && \
    apt-get install -qy wget && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh
COPY test.py /
RUN chmod +x /test.py
ENTRYPOINT ["/test.py"]
RUN /opt/conda/bin/conda install -c conda-forge dask=2.3* distributed=2.3* pandas=0.25.0 python=3.6*
```

A slightly modified test.py from #2597 (comment):

```python
#!/opt/conda/bin/python
import dask.dataframe as dd
from dask.distributed import Client
import pandas as pd

if __name__ == "__main__":
    df = pd.DataFrame(
        data=[("key_%i" % (i // 2), i) for i in range(100000)],
        columns=["key", "value"]
    )
    df.to_csv("./data.csv", index=False)
    c = Client(n_workers=2)
    dd_data = dd.read_csv("./data.csv", blocksize=100000)
    dd_data.groupby("key")["value"].sum().compute()
```

The error is like:
The error seems slightly different from what the original issue is about, but consistent with other responses. |
Thanks for the nice example @hongzmsft and @AlexisMignon! I'm able to reproduce the error. After some digging, it looks like this serialization issue has to do with |
I can confirm that, with the change in cloudpipe/cloudpickle#299, the example in #2597 (comment) runs successfully on Python 3.6 |
Could you please elaborate on how you implemented that change in cloudpickle locally? I've tried manually replacing the modified cloudpickle files from cloudpipe/cloudpickle#299 in my site-packages, since a release that patches that bug isn't available yet, but it didn't work. |
Once there's a new release of
Hopefully that helps @gcoimbra |
I had to upgrade PyArrow from 0.12.1 to 0.14 because dask asked me to. But it worked! |
cloudpickle 1.2.2 was released. |
The problem came back even with cloudpickle 1.2.2. It happens with both PyArrow and fastparquet. Should I post this to dask/dask#5317?
The error message is longer because it happens on other threads too. |
@gcoimbra , as before, it would be really useful to us if you were to go through the values in that list, which are probably also in |
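Concretely, the kind of check being asked for looks like this sketch (suspect_values is a hypothetical stand-in for the list shown in the error output):

```python
# Hedged sketch: try to serialize each value from the failing list and report which ones break
import pickle
import cloudpickle

suspect_values = []  # hypothetical: fill with the items from the failing list
for value in suspect_values:
    for dumps in (pickle.dumps, cloudpickle.dumps):
        try:
            dumps(value)
        except Exception as exc:
            print(dumps, type(value), exc)
```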
It seems that I was causing the problem. The problem doesn't seem to be with PyArrow or fastparquet: it happens when I try to read a CSV with dask.dataframe.read_csv, passing a dict_keys object instead of a list to the optional usecols argument. Dask then tries to serialize the following object (which can be seen in my previous post): ['usecols', dict_keys(...)]. Removing usecols, or passing it as a list (usecols=list(...)), fixes the problem. I'm sorry for the trouble. Do you want me to try to fix the problem and submit a pull request? |
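For anyone hitting the same thing, the pattern described is roughly the following (file name and column names are hypothetical):

```python
# Hedged illustration of the usecols pitfall described above
import dask.dataframe as dd

wanted = {"key": str, "value": int}  # some mapping whose keys name the wanted columns

# passing dict_keys reportedly triggers the serialization failure
# once the graph is sent to the distributed scheduler
bad = dd.read_csv("data.csv", usecols=wanted.keys())

# converting to a plain list avoids it
good = dd.read_csv("data.csv", usecols=list(wanted.keys()))
```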
I'm not sure there's anything to fix: the (pandas) docstring says that |
In any case, I'll close this, since we now know what's going on. |
I got the same error on Jul 10, 2020. How do I solve it? Is this because I have a single laptop, and for distributed we need a cluster of multiple computers?

```python
# imports
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.array as da
import dask_ml
import pyarrow

print([(x.__name__, x.__version__) for x in
       [np, pd, dask, dask_ml, pyarrow]])
```

[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('dask', '2.20.0'), ('dask_ml', '1.5.0'), ('pyarrow', '0.17.1')]

```python
# data
a = da.random.normal(size=(2000, 2000), chunks=(1000, 1000))  # data
res = a.dot(a.T).mean(axis=0)  # operation
res = res.persist()  # start computation in the background

# code
from dask.distributed import Client, progress
client = Client()  # use dask.distributed by default
progress(res)  # watch progress
res.compute()  # convert to final result when done if desired
```

Error:
|
Can't we warn users with a ValueError exception? |
I don't think we're generally in a position to check the types of arguments that we pass on to other functions. |
Reading from Parquet is failing with PyArrow 0.13. Downgrading to PyArrow 0.12.1 seems to fix the problem. I've only encountered this when using the distributed client. Using a Dask dataframe by itself does not appear to be affected.
For example,
Gives
Similarly,
Causes this error
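For readers without the original snippets, here is a hedged sketch of the kind of code being described (not the reporter's exact code; the path and engine choice are assumptions):

```python
# Hedged sketch: reading parquet through the distributed client,
# which reportedly fails with pyarrow 0.13 but works with 0.12.1
from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # local distributed scheduler and workers
df = dd.read_parquet("data.parquet", engine="pyarrow")  # hypothetical path
print(df.head())   # this is where the "Failed to Serialize" error appears
```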