Large delayed DataFrame fails to serialize #842

Closed
baxen opened this issue Jan 31, 2017 · 4 comments

baxen commented Jan 31, 2017

I'm working with some reasonably large DataFrames (1 million rows, 100 columns, about 700 MB), which I read from a file through delayed. I consistently get errors from merge_frames when I call compute.

This reproduces the problem:

import dask
import pandas as pd
import numpy as np

from distributed import Client


@dask.delayed
def dummy(nrow):
    # a single delayed task that builds the whole 1M-row, 100-column frame at once
    return pd.DataFrame(dict((i, np.random.rand(nrow)) for i in xrange(100)))


c = Client()  # connects to (and starts) a local distributed cluster
df = dummy(int(1e6)).compute()

and fails with this error:

CRITICAL:/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/core.pyc:Failed to deserialize
Traceback (most recent call last):
  File "/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/core.py", line 114, in loads
    fs = merge_frames(head, fs)
  File "/Users/baxen/.virtualenvs/email/lib/python2.7/site-packages/distributed/protocol/utils.py", line 53, in merge_frames
    assert sum(lengths) == sum(map(len, frames))
AssertionError
distributed.utils - ERROR - 

This seems related to #817. Both distributed.protocol.serialize and deserialize work fine on the DataFrame on their own. I also tried writing my own serializer, as suggested there, but it didn't solve the problem. The two sums in that assertion also differ by a large amount.
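A sketch of that kind of round-trip check, calling distributed.protocol.serialize and deserialize directly (this exact snippet isn't from the original report, so treat it as illustrative):

import numpy as np
import pandas as pd

from distributed.protocol import serialize, deserialize

# a smaller frame of the same shape, just to exercise the serializer
df = pd.DataFrame(dict((i, np.random.rand(10000)) for i in range(100)))

# serialize produces a header dict plus a list of byte frames;
# deserialize rebuilds the object from them
header, frames = serialize(df)
assert deserialize(header, frames).equals(df)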

I'm using Python 2.7.12 and distributed 1.15.2, in case that helps.

mrocklin (Member) commented

Python 2's pickle doesn't support values greater than 2 gigabytes. I recommend either using Python 3 or breaking your partitions apart into smaller blocks (which is a good idea anyway). Closing this in favor of #614, which is the general solution to this problem.
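A sketch of that workaround, splitting the one big delayed frame into several smaller delayed pieces and stitching them together with dask.dataframe.from_delayed (the partition count and sizes here are illustrative, not from the original report):

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

from distributed import Client


@dask.delayed
def dummy(nrow):
    # each delayed task now builds one modest-sized partition
    return pd.DataFrame(dict((i, np.random.rand(nrow)) for i in range(100)))


c = Client()

# ten partitions of 100,000 rows each instead of one 1,000,000-row frame,
# so each serialized partition stays far below the Python 2 pickle limit
parts = [dummy(int(1e5)) for _ in range(10)]
df = dd.from_delayed(parts).compute()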

mrocklin (Member) commented

Verified that this works fine in Python 3. That doesn't make it a non-issue, but it does help confirm that this is likely the known pickle limitation.

mrocklin (Member) commented

I'll try to take a look at this soon. In the meantime, I recommend finding a way to break your computation up into smaller pieces; you'll probably want to do that for performance reasons regardless.

baxen (Author) commented Jan 31, 2017

Thanks for the quick response! I broke up the computation into smaller pieces as you suggested and that has solved it for now.
