
Series.values.compute() leads to "TypeError: can't concat buffer to bytearray" #1179

Closed
bluenote10 opened this issue Jun 16, 2017 · 23 comments

@bluenote10
Contributor

With a local dask-scheduler + dask-worker pair running, the following code leads to a crash of the worker:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Executor
e = Executor('127.0.0.1:8786', set_as_default=True)
df = pd.DataFrame({"A": [1, 2, 3] * 10})
ddf = dd.from_pandas(df, npartitions=3)
ddf["A"].values.compute()

The worker crashes with:

Traceback (most recent call last):
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/core.py", line 259, in handle_comm
    result = yield result
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/worker.py", line 439, in get_data
    compressed = yield comm.write(msg)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 292, in wrapper
    result = func(*args, **kwargs)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/comm/tcp.py", line 196, in write
    stream.write(frame)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/iostream.py", line 395, in write
    self._write_buffer += data
TypeError: can't concat buffer to bytearray

Other computations like ddf["A"].compute() or even ddf.values.compute() work fine though.

@bluenote10
Contributor Author

I noticed that I can run ddf["A"].values.compute() successfully if I initialize with client = Client('127.0.0.1:8786') instead of using Executor. However, I then get the worker crash when computing ddf["A"].unique().compute(), but only after successfully computing ddf["A"].values.compute() once. I can't really put my finger on whether it is related to the initialization, these specific operations, or even the order of operations.

@mrocklin
Member

To be clear, Executor is just an alias for Client:

Executor = Client

This runs fine for me. My first guess is that you have a version mismatch. Can you verify that the following does not err:

client.get_versions(check=True)

@bluenote10
Contributor Author

That's the output I get:

{'client': {'host': [('python', '2.7.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '3.13.0-100-generic'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'None'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')],
  'packages': {'optional': [('numpy', '1.13.0'), ('pandas', u'0.20.2')],
   'required': [('dask', u'0.15.0'),
    ('distributed', u'1.17.0'),
    ('msgpack', '0.4.8'),
    ('cloudpickle', '0.3.1'),
    ('toolz', '0.8.2')]}},
 'scheduler': {'host': [['python', '2.7.3.final.0'],
   ['python-bits', 64],
   ['OS', 'Linux'],
   ['OS-release', '3.13.0-100-generic'],
   ['machine', 'x86_64'],
   ['processor', 'x86_64'],
   ['byteorder', 'little'],
   ['LC_ALL', 'None'],
   ['LANG', 'en_US.UTF-8'],
   ['LOCALE', 'None.None']],
  'packages': {'optional': [['numpy', '1.13.0'], ['pandas', u'0.20.2']],
   'required': [['dask', u'0.15.0'],
    ['distributed', u'1.17.0'],
    ['msgpack', '0.4.8'],
    ['cloudpickle', '0.3.1'],
    ['toolz', '0.8.2']]}},
 'workers': {'tcp://10.128.4.209:41229': {'host': [('python', '2.7.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '3.13.0-100-generic'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'None'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'None.None')],
   'packages': {'optional': [('numpy', '1.13.0'), ('pandas', u'0.20.2')],
    'required': [('dask', u'0.15.0'),
     ('distributed', u'1.17.0'),
     ('msgpack', '0.4.8'),
     ('cloudpickle', '0.3.1'),
     ('toolz', '0.8.2')]}}}}

The installation should have up-to-date versions of both dask and distributed. Client, scheduler, and worker are all running from the same venv on the same host.

@mrocklin
Member

I tried reproducing this with an environment like the following and unfortunately everything worked.

conda create -n gh-1179 python=2.7 distributed=1.17.0 numpy=1.13.0 pandas ipython

Any recommendations on what to check that might differ between our environments?

@bluenote10
Contributor Author

Maybe it's related to tornado, since that's where the error is thrown? My tornado version is 4.5.1.

@mrocklin
Member

Same

@mrocklin
Member

There was also a micro-release, 1.17.1, shortly after 1.17.0. It resolved some things around moving memoryviews. I don't think this is likely to affect you, but you might try updating and see if that has an effect.

@bluenote10
Contributor Author

Updated to 1.17.1, but the worker still throws the exception, with slightly modified behavior: on the first attempt I did get the computation result back in the client even though the worker raised the exception. Repeating the command a second time hangs the client. The issue seems to manifest itself in a somewhat erratic way.

Right now I have no idea what could be wrong...

@mrocklin
Member

Let's wait for @pitrou to take a look at the error. He is more familiar with the networking stack than I am (I think he wrote the bytearray code in Tornado) and may have thoughts. Unfortunately I think he's on vacation until Monday.

@bluenote10
Contributor Author

bluenote10 commented Jun 16, 2017

Adding a few more observations: currently the frames in the loop for frame in frames (distributed/comm/tcp.py, line 192) are typically of type str, which is fine to pass on to Tornado's stream.write(frame). For some reason the frames list sometimes contains raw buffer objects, which Tornado can't handle. I can avoid the issue with a dirty fix like:

if "buffer" in type(frame).__name__:
    stream.write(b"{}".format(frame))
else:
    stream.write(frame)

But maybe that is not the right place to fix the issue, and we probably still want to understand why the problem isn't easily reproducible.

Update: maybe this isn't really a fix. It doesn't crash any more, but now my computation results differ from those of the single-machine schedulers.

@mrocklin
Member

If I recall correctly, we recently started allowing buffers and memoryviews through, but only if the stream supports them. Perhaps we're making that judgment incorrectly. This is definitely a situation where @pitrou would know more.

@bluenote10
Contributor Author

Ah, I think I found what is causing the issue: it might be related to allocators. I'm encountering the issue in a setting where the driver uses jemalloc while the scheduler/worker use glibc. If I use glibc everywhere, the issue disappears.

@xhochy

xhochy commented Jun 20, 2017

Same as in the other issue: I cannot reproduce the failures here with a newer jemalloc version / on OSX. I rather suspect that there is some use of uninitialised memory. This is probably not a bug in distributed itself but in one of the libraries it depends on that call into native code.

@bluenote10
Contributor Author

Dang, I'm getting the TypeError: can't concat buffer to bytearray now in a pure glibc setting as well :(. But it now takes more complex code to trigger the issue; the simple example from above works. This is a tough issue.

@pitrou
Member

pitrou commented Jun 20, 2017

@bluenote10, which exact Python version are you using?

@pitrou
Member

pitrou commented Jun 20, 2017

Oops, I see, it should be 2.7.3.final.0.

@pitrou
Member

pitrou commented Jun 20, 2017

Unfortunately, the issue you're having (concatenating a buffer to a bytearray) seems to have been fixed in Python 2.7.4... which was released more than 4 years ago.
I really recommend you upgrade to a recent bugfix release of Python 2.7. Continuum's Anaconda may help you with that.

@pitrou
Member

pitrou commented Jun 20, 2017

For the record:

$ ./python -V
Python 2.7.3
$ ./python -c "b = bytearray(); b+= buffer(b'123'); print(b)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: can't concat buffer to bytearray
$ ./python -V
Python 2.7.4
$ ./python -c "b = bytearray(); b+= buffer(b'123'); print(b)"
123

@bluenote10
Contributor Author

So you are saying that from a networking perspective it does make sense for the frames list to contain a buffer? I'm asking because it can't happen very often (otherwise it would be crashing all the time for me), and my impression was that this was maybe just a consequence of some other problem.

Because if it is a valid value, it should be easy to fix this for Python 2.7.3 by properly converting to e.g. str or bytearray, as sketched below.

It is not so much about fixing the Python version locally on my machine. We still need to run on Debian 7 servers as well, which also ship Python 2.7.3.
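
For Python 2.7.3, a minimal sketch of such a conversion might look like the following (illustration only, not an actual patch; frames and stream are assumed to be the ones in the write loop of distributed/comm/tcp.py mentioned above):

# Sketch only: copy buffer/memoryview frames into plain byte strings before
# writing, since on Python 2.7.3 Tornado's `self._write_buffer += data`
# raises TypeError for buffer objects. bytearray() copies the data via the
# buffer protocol, so this gives up the zero-copy path in exchange for safety.
for frame in frames:
    if isinstance(frame, (buffer, memoryview)):  # `buffer` exists on Python 2 only
        frame = bytes(bytearray(frame))
    stream.write(frame)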

@pitrou
Member

pitrou commented Jun 20, 2017

So you are saying that from a networking perspective it does make sense that the frames list contains a buffer?

It can make sense, yes. It means that in some circumstances we were able to avoid a memory copy and instead passed a view of some existing memory area (most likely the data of a NumPy array). I'm not sure why this only seems to happen sporadically for you, though.
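
A minimal illustration (not distributed's actual serialization code) of how such a zero-copy view can arise from a NumPy array, and why Python 2.7.3 chokes on it:

import numpy as np

arr = np.arange(10, dtype="int64")
view = buffer(arr)   # Python 2: a zero-copy view of the array's memory
b = bytearray()
b += view            # works on Python 2.7.4+, raises TypeError on 2.7.3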

@pitrou
Member

pitrou commented Jun 21, 2017

Because if it is a valid value it should be easy to fix this for Python 2.7.3 by properly converting to e.g. str or bytearray.

It is initially easy, but it then needs to be maintained. We are unlikely to run any CI builds with Python 2.7.3, so it may get broken again unexpectedly. Besides, without wanting to spread FUD, there are many other issues that were fixed in recent 2.7 releases and that may pop up with 2.7.3 (basically anything above this line in Misc/NEWS).

Therefore I'm reluctant to add workarounds for such an old Python version, except if it's part of a commercial support contract.

@mrocklin
Member

mrocklin commented Jul 5, 2017

What is the status of this issue? Should it be closed?

@bluenote10
Contributor Author

The issue still exists, but we can close it for now. Possible workarounds are not using jemalloc with Python 2.7.3, or using the patch posted above. If need be, we can think about a PR either for Dask or maybe Tornado.
