Large delayed DataFrame fails to serialize #842
I'm working with some reasonably large DataFrames (1 million rows, 100 columns, about 700 MB), reading them from a file through delayed. I get consistent errors from merge_frames when trying to call compute. This reproduces the problem:

with error:
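The original snippet and traceback are not shown above. Purely as a hedged sketch of the workflow described in the report (the file name, the single-partition read, and the local cluster are assumptions, not the reporter's actual code), it might have looked roughly like this:

```python
# Hypothetical sketch only -- the reporter's actual snippet was not captured.
# "large.csv", the single-partition read, and the local cluster are assumptions.
import pandas as pd
import dask.dataframe as dd
from dask import delayed
from distributed import Client

client = Client()  # local cluster as a stand-in for the real deployment

# Load the ~700 MB file as a single delayed partition, as described above.
parts = [delayed(pd.read_csv)('large.csv')]
df = dd.from_delayed(parts)

# On python 2.7.12 with distributed 1.15.2, a call like this was reported to
# fail inside merge_frames while moving the large frame between processes.
result = df.compute()
```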
This seems related to #817. Both distributed.protocol.serialize and deserialize work fine on the DataFrame. I also tried making my own serializer as suggested there, but it didn't solve the problem. I also see very different sizes in the two lengths in that assert. I'm using python 2.7.12 and distributed 1.15.2, if it helps.
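For reference, the round trip the reporter describes can be checked directly with distributed.protocol. A minimal sketch (the small frame below is a stand-in for the real ~700 MB data):

```python
# Minimal sketch of the serialize/deserialize round trip mentioned above.
# The small frame here stands in for the reporter's real ~700 MB DataFrame.
import pandas as pd
from distributed.protocol import serialize, deserialize

pdf = pd.DataFrame({'a': range(10), 'b': range(10)})

header, frames = serialize(pdf)          # works fine on the in-memory frame
roundtripped = deserialize(header, frames)
assert roundtripped.equals(pdf)
```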
Comments

I'll try to take a look at this soonish. In the mean time I recommend trying to find a way to break up your computation into smaller pieces. You'll probably want to regardless, for performance reasons.

Thanks for the quick response! I broke up the computation into smaller pieces as you suggested and that has solved it for now.

Verified that this works fine in Python 3. Not that that makes this a non-issue, but it helps to verify that it's likely the known pickle issue.

Python 2 pickle doesn't support values greater than 2 gigabytes. I recommend using either Python 3 or breaking apart your partitions into smaller blocks (which is a good idea anyway). Closing this in favor of #614, which is the general solution to this problem.
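As a rough illustration of the workaround discussed in the comments (keeping each partition small enough that no single pickled frame approaches Python 2's 2 GB pickle limit), a sketch using dask.dataframe with an explicit blocksize might look like the following; the file name and the 64 MB blocksize are assumptions:

```python
# Sketch of the "smaller pieces" workaround discussed in the comments above.
# "large.csv" and the 64 MB blocksize are assumptions for illustration.
import dask.dataframe as dd
from distributed import Client

client = Client()  # local cluster for illustration

# Split the ~700 MB file into ~64 MB blocks, one partition per block, so no
# single serialized partition comes anywhere near the 2 GB pickle limit.
df = dd.read_csv('large.csv', blocksize=64 * 1024 * 1024)

result = df.describe().compute()
```

An alternative with the same effect is to build many small delayed pieces (for example, one per chunk of rows) and pass them all to dd.from_delayed, rather than wrapping the whole file in a single delayed call.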