Large delayed DataFrame fails to serialize #842

Closed
baxen opened this issue Jan 31, 2017 · 4 comments

baxen commented Jan 31, 2017

I'm working with some reasonably large DataFrames (1 million rows, 100 columns, about 700 MB), which I read from a file through delayed. I consistently get errors from merge_frames when I call compute.

This reproduces the problem:

import dask
import pandas as pd
import numpy as np

from distributed import Client


@dask.delayed
def dummy(nrow):
    # a single delayed task that builds the whole 1M-row, 100-column frame at once
    return pd.DataFrame(dict((i, np.random.rand(nrow)) for i in xrange(100)))


c = Client()  # connects to (and starts) a local distributed cluster
df = dummy(int(1e6)).compute()

and fails with this error:

CRITICAL:/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/core.pyc:Failed to deserialize
Traceback (most recent call last):
  File "/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/core.py", line 114, in loads
    fs = merge_frames(head, fs)
  File "/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/utils.py", line 53, in merge_frames
    assert sum(lengths) == sum(map(len, frames))
AssertionError
distributed.utils - ERROR - 

This seems related to #817. Both distributed.protocol.serialize and deserialize work fine on the DataFrame on their own. I also tried writing my own serializer, as suggested there, but it didn't solve the problem. The two sums in that assertion also differ by a large amount.
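A sketch of that kind of round-trip check, calling distributed.protocol.serialize and deserialize directly (this exact snippet isn't from the original report, so treat it as illustrative):

import numpy as np
import pandas as pd

from distributed.protocol import serialize, deserialize

# a smaller frame of the same shape, just to exercise the serializer
df = pd.DataFrame(dict((i, np.random.rand(10000)) for i in range(100)))

# serialize produces a header dict plus a list of byte frames;
# deserialize rebuilds the object from them
header, frames = serialize(df)
assert deserialize(header, frames).equals(df)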

I'm using Python 2.7.12 and distributed 1.15.2, in case that helps.

mrocklin (Member) commented

Python 2's pickle doesn't support values greater than 2 gigabytes. I recommend either using Python 3 or breaking your partitions apart into smaller blocks (which is a good idea anyway). Closing this in favor of #614, which is the general solution to this problem.
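A sketch of that workaround, splitting the one big delayed frame into several smaller delayed pieces and stitching them together with dask.dataframe.from_delayed (the partition count and sizes here are illustrative, not from the original report):

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

from distributed import Client


@dask.delayed
def dummy(nrow):
    # each delayed task now builds one modest-sized partition
    return pd.DataFrame(dict((i, np.random.rand(nrow)) for i in range(100)))


c = Client()

# ten partitions of 100,000 rows each instead of one 1,000,000-row frame,
# so each serialized partition stays far below the Python 2 pickle limit
parts = [dummy(int(1e5)) for _ in range(10)]
df = dd.from_delayed(parts).compute()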

mrocklin (Member) commented

Verified that this works fine in Python 3. That doesn't make it a non-issue, but it does help confirm that this is likely the known pickle limitation.

mrocklin (Member) commented

I'll try to take a look at this soon. In the meantime, I recommend finding a way to break your computation up into smaller pieces; you'll probably want to do that for performance reasons regardless.

baxen (Author) commented Jan 31, 2017

Thanks for the quick response! I broke up the computation into smaller pieces as you suggested and that has solved it for now.
