
Series.values.compute() leads to "TypeError: can't concat buffer to bytearray" #1179

Closed
bluenote10 opened this issue Jun 16, 2017 · 23 comments

@bluenote10
Contributor

With a local dask-scheduler + dask-worker pair running, the following code leads to a crash of the worker:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Executor
e = Executor('127.0.0.1:8786', set_as_default=True)
df = pd.DataFrame({"A": [1, 2, 3] * 10})
ddf = dd.from_pandas(df, npartitions=3)
ddf["A"].values.compute()

The worker crashes with:

Traceback (most recent call last):
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/core.py", line 259, in handle_comm
    result = yield result
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/worker.py", line 439, in get_data
    compressed = yield comm.write(msg)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 292, in wrapper
    result = func(*args, **kwargs)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/distributed/comm/tcp.py", line 196, in write
    stream.write(frame)
  File "/home/PHI-TPS/fkeller/.virtualenvs/dask/local/lib/python2.7/site-packages/tornado/iostream.py", line 395, in write
    self._write_buffer += data
TypeError: can't concat buffer to bytearray

Other computations like ddf["A"].compute() or even ddf.values.compute() work fine though.

@bluenote10
Contributor Author

I noticed that I can run ddf["A"].values.compute() successfully if I initialize with client = Client('127.0.0.1:8786') instead of using Executor. However, I then get the worker crash when computing ddf["A"].unique().compute(), but only after successfully computing ddf["A"].values.compute() once. I can't really put my finger on whether it is related to the initialization, these specific operations, or even the order of operations.

@mrocklin
Member

To be clear, Executor is just an alias for Client:

Executor = Client

This runs fine for me. My first guess is that you have a version mismatch. Can you verify that the following does not err:

client.get_versions(check=True)

@bluenote10
Contributor Author

That's the output I get:

{'client': {'host': [('python', '2.7.3.final.0'),
   ('python-bits', 64),
   ('OS', 'Linux'),
   ('OS-release', '3.13.0-100-generic'),
   ('machine', 'x86_64'),
   ('processor', 'x86_64'),
   ('byteorder', 'little'),
   ('LC_ALL', 'None'),
   ('LANG', 'en_US.UTF-8'),
   ('LOCALE', 'en_US.UTF-8')],
  'packages': {'optional': [('numpy', '1.13.0'), ('pandas', u'0.20.2')],
   'required': [('dask', u'0.15.0'),
    ('distributed', u'1.17.0'),
    ('msgpack', '0.4.8'),
    ('cloudpickle', '0.3.1'),
    ('toolz', '0.8.2')]}},
 'scheduler': {'host': [['python', '2.7.3.final.0'],
   ['python-bits', 64],
   ['OS', 'Linux'],
   ['OS-release', '3.13.0-100-generic'],
   ['machine', 'x86_64'],
   ['processor', 'x86_64'],
   ['byteorder', 'little'],
   ['LC_ALL', 'None'],
   ['LANG', 'en_US.UTF-8'],
   ['LOCALE', 'None.None']],
  'packages': {'optional': [['numpy', '1.13.0'], ['pandas', u'0.20.2']],
   'required': [['dask', u'0.15.0'],
    ['distributed', u'1.17.0'],
    ['msgpack', '0.4.8'],
    ['cloudpickle', '0.3.1'],
    ['toolz', '0.8.2']]}},
 'workers': {'tcp://10.128.4.209:41229': {'host': [('python', '2.7.3.final.0'),
    ('python-bits', 64),
    ('OS', 'Linux'),
    ('OS-release', '3.13.0-100-generic'),
    ('machine', 'x86_64'),
    ('processor', 'x86_64'),
    ('byteorder', 'little'),
    ('LC_ALL', 'None'),
    ('LANG', 'en_US.UTF-8'),
    ('LOCALE', 'None.None')],
   'packages': {'optional': [('numpy', '1.13.0'), ('pandas', u'0.20.2')],
    'required': [('dask', u'0.15.0'),
     ('distributed', u'1.17.0'),
     ('msgpack', '0.4.8'),
     ('cloudpickle', '0.3.1'),
     ('toolz', '0.8.2')]}}}}

The installation should have up-to-date versions of both dask and distributed. Client, scheduler, and worker are all running from the same venv on the same host.

@mrocklin
Member

I tried reproducing this with an environment like the following and unfortunately everything worked.

conda create -n gh-1179 python=2.7 distributed=1.17.0 numpy=1.13.0 pandas ipython

Any recommendations on what to check that might differ between our environments?

@bluenote10
Contributor Author

Maybe it's related to tornado, since that's where the error is thrown? My tornado version is 4.5.1.

@mrocklin
Member

Same

@mrocklin
Member

There was also a micro-release, 1.17.1, shortly after 1.17.0. It resolved some things around moving memoryviews. I don't think this is likely to affect you, but you might try updating and see if that has an effect.

@bluenote10
Contributor Author

Updated to 1.17.1, but the worker still throws the exception, with slightly modified behavior: on the first attempt I did get the computation result back in the client even though the worker raised the exception. Repeating the command a second time hangs the client. The issue seems to manifest itself in a somewhat erratic way.

Right now I have no idea what could be wrong...

@mrocklin
Member

Let's wait for @pitrou to take a look at the error. He is more familiar with the networking stack than I am (I think he wrote the bytearray code in Tornado) and may have thoughts. Unfortunately I think he's on vacation until Monday.

@bluenote10
Contributor Author

bluenote10 commented Jun 16, 2017

Adding a few more observations: currently the frames in the loop for frame in frames (distributed/comm/tcp.py, line 192) are typically of type str, which is fine to pass on to Tornado's stream.write(frame). For some reason the frames list sometimes contains raw buffer objects, which Tornado can't handle. I can avoid the issue with a dirty fix like:

if "buffer" in type(frame).__name__:
    stream.write(b"{}".format(frame))
else:
    stream.write(frame)

But maybe that is not the right place to fix the issue, and we probably still want to understand why the problem isn't easily reproducible.

Update: maybe this isn't really a fix. It doesn't crash any more, but now my computation results differ from those of the single-machine schedulers.

@mrocklin
Member

If I recall correctly, we recently started allowing buffers and memoryviews through, but only if the stream supports them. Perhaps we're making that judgment incorrectly. This is definitely a situation where @pitrou would know more.

@bluenote10
Contributor Author

Ah, I think I found what is causing the issue: it might be related to allocators. I'm encountering the issue in a setting where the driver uses jemalloc while the scheduler/worker use glibc. If I use glibc everywhere, the issue disappears.

@xhochy

xhochy commented Jun 20, 2017

Same as in the other issue: I cannot reproduce the failures here with a newer jemalloc version / on OSX. I rather suspect that there is some use of uninitialised memory. This is probably not a bug in distributed itself but in one of the libraries it depends on that call into native code.

@bluenote10
Contributor Author

Dang, I'm getting the TypeError: can't concat buffer to bytearray now in a pure glibc setting as well :(. But it now takes more complex code to trigger the issue; the simple example from above works. This is a tough issue.

@pitrou
Member

pitrou commented Jun 20, 2017

@bluenote10, which exact Python version are you using?

@pitrou
Member

pitrou commented Jun 20, 2017

Oops, I see, it should be 2.7.3.final.0.

@pitrou
Member

pitrou commented Jun 20, 2017

Unfortunately, the issue you're having (concatenating a buffer to a bytearray) seems to have been fixed in Python 2.7.4... which was released more than 4 years ago.
I really recommend you upgrade to a recent bugfix release of Python 2.7. Continuum's Anaconda may help you with that.

@pitrou
Member

pitrou commented Jun 20, 2017

For the record:

$ ./python -V
Python 2.7.3
$ ./python -c "b = bytearray(); b+= buffer(b'123'); print(b)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: can't concat buffer to bytearray
$ ./python -V
Python 2.7.4
$ ./python -c "b = bytearray(); b+= buffer(b'123'); print(b)"
123

@bluenote10
Contributor Author

So you are saying that from a networking perspective it does make sense for the frames list to contain a buffer? I'm asking because it can't happen very often (otherwise it would be crashing all the time for me), and my impression was that this was maybe just a consequence of some other problem.

Because if it is a valid value, it should be easy to fix this for Python 2.7.3 by properly converting to e.g. str or bytearray, as sketched below.

It is not so much about fixing the Python version locally on my machine. We still need to run on Debian 7 servers as well, which also ship Python 2.7.3.
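
For Python 2.7.3, a minimal sketch of such a conversion might look like the following (illustration only, not an actual patch; frames and stream are assumed to be the ones in the write loop of distributed/comm/tcp.py mentioned above):

# Sketch only: copy buffer/memoryview frames into plain byte strings before
# writing, since on Python 2.7.3 Tornado's `self._write_buffer += data`
# raises TypeError for buffer objects. bytearray() copies the data via the
# buffer protocol, so this gives up the zero-copy path in exchange for safety.
for frame in frames:
    if isinstance(frame, (buffer, memoryview)):  # `buffer` exists on Python 2 only
        frame = bytes(bytearray(frame))
    stream.write(frame)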

@pitrou
Member

pitrou commented Jun 20, 2017

So you are saying that from a networking perspective it does make sense that the frames list contains a buffer?

It can make sense, yes. It means that in some circumstances we were able to avoid a memory copy and instead passed a view of some existing memory area (most likely the data of a NumPy array). I'm not sure why this only seems to happen sporadically for you, though.
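
A minimal illustration (not distributed's actual serialization code) of how such a zero-copy view can arise from a NumPy array, and why Python 2.7.3 chokes on it:

import numpy as np

arr = np.arange(10, dtype="int64")
view = buffer(arr)   # Python 2: a zero-copy view of the array's memory
b = bytearray()
b += view            # works on Python 2.7.4+, raises TypeError on 2.7.3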

@pitrou
Member

pitrou commented Jun 21, 2017

Because if it is a valid value it should be easy to fix this for Python 2.7.3 by properly converting to e.g. str or bytearray.

It is initially easy, but it then needs to be maintained. We are unlikely to run any CI builds with Python 2.7.3, so it may get broken again unexpectedly. Besides, without wanting to spread FUD, there are many other issues that were fixed in recent 2.7 releases and that may pop up with 2.7.3 (basically anything above this line in Misc/NEWS).

Therefore I'm reluctant to add workarounds for such an old Python version, except if it's part of a commercial support contract.

@mrocklin
Member

mrocklin commented Jul 5, 2017

What is the status of this issue? Should it be closed?

@bluenote10
Contributor Author

The issue still exists, but we can close it for now. Possible workarounds are not using jemalloc with Python 2.7.3, or using the patch posted above. If need be, we can think about a PR either for Dask or maybe Tornado.
