
Dask-Cudf + XGBoost fails with jit_unspill=True #1136

Closed
ChrisJar opened this issue Feb 28, 2023 · 3 comments · Fixed by #1137
Labels
bug Something isn't working

Comments

@ChrisJar

Training an XGBoost model on a Dask-cuDF DataFrame fails with a serialization error when run on a cluster with jit_unspill=True:

import xgboost as xgb
import cudf
import dask_cudf as dd

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# enable JIT-unspill, which wraps device objects in proxy objects
cluster = LocalCUDACluster(jit_unspill=True)
client = Client(cluster)

df = cudf.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4], "c": [0, 0, 1, 1]})
ddf = dd.from_cudf(df, npartitions=1)
X, y = ddf.drop("c", axis=1), ddf["c"]

clf = xgb.dask.DaskXGBClassifier()
clf.client = client
clf.fit(X, y)

prediction = clf.predict(X).compute()

throws:

Exception: TypeError("can not serialize 'numpy.int64' object")

The same code works as expected when jit_unspill=False.
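
For reference, the failure can be reproduced in isolation: distributed serializes message metadata with msgpack, and msgpack has no handler for NumPy scalar types. A minimal sketch (mine, not from the issue), assuming a numpy.int64 ends up in that metadata:

import msgpack
import numpy as np

# msgpack only packs plain Python types; NumPy scalars are rejected
try:
    msgpack.packb(np.int64(1))
except TypeError as err:
    print(err)  # matches the TypeError reported above

# casting to a plain Python int makes the value packable
assert msgpack.packb(int(np.int64(1))) == msgpack.packb(1)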

Full Traceback

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 19
     16 clf.client = client
     17 clf.fit(X,y)
---> 19 prediction = clf.predict(X).compute()

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/dask/base.py:314, in DaskMethodsMixin.compute(self, **kwargs)
    290 def compute(self, **kwargs):
    291     """Compute this dask collection
    292
    293     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    312     dask.base.compute
    313     """
--> 314     (result,) = compute(self, traverse=False, **kwargs)
    315     return result

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/dask/base.py:599, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    596     keys.append(x.__dask_keys__())
    597     postcomputes.append(x.__dask_postcompute__())
--> 599 results = schedule(dsk, keys, **kwargs)
    600 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/client.py:3136, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   3134     should_rejoin = False
   3135 try:
-> 3136     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3137 finally:
   3138     for f in futures.values():

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/client.py:2305, in Client.gather(self, futures, errors, direct, asynchronous)
   2303 else:
   2304     local_worker = None
-> 2305 return self.sync(
   2306     self._gather,
   2307     futures,
   2308     errors=errors,
   2309     direct=direct,
   2310     local_worker=local_worker,
   2311     asynchronous=asynchronous,
   2312 )

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    336     return future
    337 else:
--> 338     return sync(
    339         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    340     )

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
    403 if error:
    404     typ, exc, tb = error
--> 405     raise exc.with_traceback(tb)
    406 else:
    407     return result

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils.py:378, in sync.<locals>.f()
    376         future = asyncio.wait_for(future, callback_timeout)
    377     future = asyncio.ensure_future(future)
--> 378     result = yield future
    379 except Exception:
    380     error = sys.exc_info()

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/tornado/gen.py:769, in Runner.run(self)
    766     exc_info = None
    768 try:
--> 769     value = future.result()
    770 except Exception:
    771     exc_info = sys.exc_info()

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/client.py:2197, in Client._gather(self, futures, errors, direct, local_worker)
   2195 else:
   2196     self._gather_future = future
-> 2197 response = await future
   2199 if response["status"] == "error":
   2200     log = logger.warning if errors == "raise" else logger.debug

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/client.py:2248, in Client._gather_remote(self, direct, local_worker)
   2245         response["data"].update(data2)
   2247 else:  # ask scheduler to gather data for us
-> 2248     response = await retry_operation(self.scheduler.gather, keys=keys)
   2250 return response

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils_comm.py:434, in retry_operation(coro, operation, *args, **kwargs)
    428 retry_delay_min = parse_timedelta(
    429     dask.config.get("distributed.comm.retry.delay.min"), default="s"
    430 )
    431 retry_delay_max = parse_timedelta(
    432     dask.config.get("distributed.comm.retry.delay.max"), default="s"
    433 )
--> 434 return await retry(
    435     partial(coro, *args, **kwargs),
    436     count=retry_count,
    437     delay_min=retry_delay_min,
    438     delay_max=retry_delay_max,
    439     operation=operation,
    440 )

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils_comm.py:413, in retry(coro, count, delay_min, delay_max, jitter_fraction, retry_on_exceptions, operation)
    411     delay *= 1 + random.random() * jitter_fraction
    412     await asyncio.sleep(delay)
--> 413 return await coro()

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/core.py:1227, in PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc(**kwargs)
   1225     prev_name, comm.name = comm.name, "ConnectionPool." + key
   1226 try:
-> 1227     return await send_recv(comm=comm, op=key, **kwargs)
   1228 finally:
   1229     self.pool.reuse(self.addr, comm)

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/core.py:1011, in send_recv(comm, reply, serializers, deserializers, **kwargs)
   1009     _, exc, tb = clean_exception(**response)
   1010     assert exc
-> 1011     raise exc.with_traceback(tb)
   1012 else:
   1013     raise Exception(response["exception_text"])

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/core.py:820, in _handle_comm()
    818 result = handler(**msg)
    819 if inspect.iscoroutine(result):
--> 820     result = await result
    821 elif inspect.isawaitable(result):
    822     raise RuntimeError(
    823         f"Comm handler returned unknown awaitable. Expected coroutine, instead got {type(result)}"
    824     )

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/scheduler.py:5684, in gather()
   5681 else:
   5682     who_has[key] = []
-> 5684 data, missing_keys, missing_workers = await gather_from_workers(
   5685     who_has, rpc=self.rpc, close=False, serializers=serializers
   5686 )
   5687 if not missing_keys:
   5688     result = {"status": "OK", "data": data}

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils_comm.py:91, in gather_from_workers()
     89 for worker, c in coroutines.items():
     90     try:
---> 91         r = await c
     92     except OSError:
     93         missing_workers.add(worker)

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils_comm.py:434, in retry_operation()
    428 retry_delay_min = parse_timedelta(
    429     dask.config.get("distributed.comm.retry.delay.min"), default="s"
    430 )
    431 retry_delay_max = parse_timedelta(
    432     dask.config.get("distributed.comm.retry.delay.max"), default="s"
    433 )
--> 434 return await retry(
    435     partial(coro, *args, **kwargs),
    436     count=retry_count,
    437     delay_min=retry_delay_min,
    438     delay_max=retry_delay_max,
    439     operation=operation,
    440 )

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/utils_comm.py:413, in retry()
    411     delay *= 1 + random.random() * jitter_fraction
    412     await asyncio.sleep(delay)
--> 413 return await coro()

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/worker.py:2903, in get_data_from_worker()
   2901 comm.name = "Ephemeral Worker->Worker for gather"
   2902 try:
-> 2903     response = await send_recv(
   2904         comm,
   2905         serializers=serializers,
   2906         deserializers=deserializers,
   2907         op="get_data",
   2908         keys=keys,
   2909         who=who,
   2910         max_connections=max_connections,
   2911     )
   2912 try:
   2913     status = response["status"]

File ~/mambaforge/envs/xgboost-2-27/lib/python3.10/site-packages/distributed/core.py:1013, in send_recv()
   1011     raise exc.with_traceback(tb)
   1012 else:
-> 1013     raise Exception(response["exception_text"])
   1014 return response

Exception: TypeError("can not serialize 'numpy.int64' object")

Versions:
Dask-cuDF: 23.04 (also occurs with 23.02)
Dask-CUDA: 23.04 (also occurs with 23.02)
XGBoost: 1.7.1

@pentschev
Member

@madsbk I have the impression that this shouldn't be happening; would you mind taking a look to see whether this is a known limitation or a bug?

madsbk added a commit to madsbk/dask-cuda that referenced this issue Mar 1, 2023
@madsbk
Member

madsbk commented Mar 1, 2023

@ChrisJar, can you confirm that #1137 fixes the issue?

@madsbk madsbk self-assigned this Mar 1, 2023
@madsbk madsbk added the bug Something isn't working label Mar 1, 2023
@ChrisJar
Author

ChrisJar commented Mar 1, 2023

@madsbk Yep, #1137 fixes this! Thanks!

@rapids-bot rapids-bot bot closed this as completed in #1137 Mar 6, 2023
rapids-bot bot pushed a commit that referenced this issue Mar 6, 2023
The proxied object's `name` attribute might contain types not supported by msgpack. We now pickle the fixed attributes when serializing.

Closes #1136

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1137
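
For context, a rough sketch (not the actual dask-cuda change) of the approach the commit message describes: pickle the proxy's fixed attributes into bytes, which msgpack can carry as an opaque payload. The helper names here are hypothetical:

import pickle

# Hypothetical helpers; `fixed_attrs` stands in for ProxyObject metadata such
# as `name`, which may contain values like numpy.int64 that msgpack rejects.
def pack_fixed_attrs(fixed_attrs: dict) -> bytes:
    # pickle handles arbitrary Python objects; the resulting bytes pass
    # through msgpack untouched
    return pickle.dumps(fixed_attrs)

def unpack_fixed_attrs(blob: bytes) -> dict:
    return pickle.loads(blob)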