[BUG] MNMG KMeans fit fails with NCCL errors in multi-worker cluster. #3261

Closed
drobison00 opened this issue Dec 4, 2020 · 7 comments · Fixed by rapidsai/raft#120
Labels
bug (Something isn't working), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features)

Comments

@drobison00
Contributor

drobison00 commented Dec 4, 2020

Describe the bug

Calling fit in the cluster on a known-good dataset results in the following:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-33694609ce08> in <module>
      9            )
     10 
---> 11 km.fit(higgs_df[higgs_df.columns.difference(['label'])])

~/anaconda3/envs/rapids-0.16/lib/python3.8/site-packages/cuml/common/memory_utils.py in cupy_rmm_wrapper(*args, **kwargs)
     54     def cupy_rmm_wrapper(*args, **kwargs):
     55         with cupy_using_allocator(rmm.rmm_cupy_allocator):
---> 56             return func(*args, **kwargs)
     57 
     58     return cupy_rmm_wrapper

~/anaconda3/envs/rapids-0.16/lib/python3.8/site-packages/cuml/dask/cluster/kmeans.py in fit(self, X)
    145                       for idx, wf in enumerate(data.worker_to_parts.items())]
    146 
--> 147         wait_and_raise_from_futures(kmeans_fit)
    148 
    149         comms.destroy()

~/anaconda3/envs/rapids-0.16/lib/python3.8/site-packages/cuml/dask/common/utils.py in wait_and_raise_from_futures(futures)
    152     """
    153     wait(futures)
--> 154     raise_exception_from_futures(futures)
    155     return futures
    156 

~/anaconda3/envs/rapids-0.16/lib/python3.8/site-packages/cuml/dask/common/utils.py in raise_exception_from_futures(futures)
    141     errs = [f.exception() for f in futures if f.exception()]
    142     if errs:
--> 143         raise RuntimeError("%d of %d worker jobs failed: %s" % (
    144             len(errs), len(futures), ", ".join(map(str, errs))
    145             ))

RuntimeError: 3 of 3 worker jobs failed: NCCL error encountered at: file=/opt/conda/envs/rapids/conda-bld/cuml_1603369552644/work/python/_external_repositories/raft/cpp/include/raft/comms/std_comms.hpp line=334: call='ncclBroadcast(buff, buff, count, get_nccl_datatype(datatype), root, nccl_comm_, stream)', Reason=4:invalid argument
Obtained 37 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x43) [0x7f3e197c17f3]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft11logic_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f3e197c1ce9]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZNK4raft5comms9std_comms5bcastEPvmNS0_10datatype_tEiP11CUstream_st+0x1b6) [0x7f3e197c2886]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl18initKMeansPlusPlusIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS4_2mr6device6bufferISC_EERNSI_IcEE+0x475) [0x7f3df37956c5]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl3fitIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiPSB_RSB_Ri+0x254) [0x7f3df378e384]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg3fitERKN4raft8handle_tERKNS0_12KMeansParamsEPKdiiPdRdRi+0x51) [0x7f3df378b5d1]
#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/cluster/kmeans_mg.cpython-38-x86_64-linux-gnu.so(+0x2a119) [0x7f3e18b63119]
#7 in /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x3bf) [0x563dbf3a2a1f]
#8 in /opt/conda/envs/rapids/bin/python(+0x168f30) [0x563dbf3dbf30]
#9 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4e73) [0x563dbf454583]
#10 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x260) [0x563dbf43b490]
#11 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x563dbf43ca14]
#12 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x563dbf3a8c29]
#13 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x563dbf45157a]
#14 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x929) [0x563dbf43bb59]
#15 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x563dbf43ca14]
#16 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x563dbf3a8c29]
#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x563dbf45157a]
#18 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#19 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x563dbf3a89bd]
#20 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x563dbf45157a]
#21 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x563dbf44fbcf]
#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x563dbf3a89bd]
#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x563dbf45157a]
#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x563dbf44fbcf]
#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#29 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x563dbf44fbcf]
#30 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x563dbf43c637]
#31 in /opt/conda/envs/rapids/bin/python(+0xa097f) [0x563dbf31397f]
#32 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x563dbf3a89bd]
#33 in /opt/conda/envs/rapids/bin/python(+0x235e5a) [0x563dbf4a8e5a]
#34 in /opt/conda/envs/rapids/bin/python(+0x1f9bd7) [0x563dbf46cbd7]
#35 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f41b6e4b6db]
#36 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f41b6b74a3f]
, NCCL error encountered at: file=/opt/conda/envs/rapids/conda-bld/cuml_1603369552644/work/python/_external_repositories/raft/cpp/include/raft/comms/std_comms.hpp line=334: call='ncclBroadcast(buff, buff, count, get_nccl_datatype(datatype), root, nccl_comm_, stream)', Reason=4:invalid argument
Obtained 37 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x43) [0x7f4349fc27f3]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft11logic_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f4349fc2ce9]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZNK4raft5comms9std_comms5bcastEPvmNS0_10datatype_tEiP11CUstream_st+0x1b6) [0x7f4349fc3886]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl18initKMeansPlusPlusIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS4_2mr6device6bufferISC_EERNSI_IcEE+0x475) [0x7f43237956c5]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl3fitIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiPSB_RSB_Ri+0x254) [0x7f432378e384]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg3fitERKN4raft8handle_tERKNS0_12KMeansParamsEPKdiiPdRdRi+0x51) [0x7f432378b5d1]
#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/cluster/kmeans_mg.cpython-38-x86_64-linux-gnu.so(+0x2a119) [0x7f4349364119]
#7 in /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x3bf) [0x56221564ba1f]
#8 in /opt/conda/envs/rapids/bin/python(+0x168f30) [0x562215684f30]
#9 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4e73) [0x5622156fd583]
#10 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x260) [0x5622156e4490]
#11 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]
#12 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]
#13 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#14 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x929) [0x5622156e4b59]
#15 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]
#16 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]
#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#18 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#19 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#20 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#21 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#29 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#30 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#31 in /opt/conda/envs/rapids/bin/python(+0xa097f) [0x5622155bc97f]
#32 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#33 in /opt/conda/envs/rapids/bin/python(+0x235e5a) [0x562215751e5a]
#34 in /opt/conda/envs/rapids/bin/python(+0x1f9bd7) [0x562215715bd7]
#35 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f46e45fb6db]
#36 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f46e4324a3f]
, NCCL error encountered at: file=/opt/conda/envs/rapids/conda-bld/cuml_1603369552644/work/python/_external_repositories/raft/cpp/include/raft/comms/std_comms.hpp line=334: call='ncclBroadcast(buff, buff, count, get_nccl_datatype(datatype), root, nccl_comm_, stream)', Reason=4:invalid argument
Obtained 37 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x43) [0x7f124c0597f3]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft11logic_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f124c059ce9]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZNK4raft5comms9std_comms5bcastEPvmNS0_10datatype_tEiP11CUstream_st+0x1b6) [0x7f124c05a886]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl18initKMeansPlusPlusIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS4_2mr6device6bufferISC_EERNSI_IcEE+0x475) [0x7f120f7956c5]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl3fitIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiPSB_RSB_Ri+0x254) [0x7f120f78e384]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg3fitERKN4raft8handle_tERKNS0_12KMeansParamsEPKdiiPdRdRi+0x51) [0x7f120f78b5d1]
#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/cluster/kmeans_mg.cpython-38-x86_64-linux-gnu.so(+0x2a119) [0x7f12358f1119]
#7 in /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x3bf) [0x55f467daaa1f]
#8 in /opt/conda/envs/rapids/bin/python(+0x168f30) [0x55f467de3f30]
#9 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4e73) [0x55f467e5c583]
#10 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x260) [0x55f467e43490]
#11 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x55f467e44a14]
#12 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x55f467db0c29]
#13 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x55f467e5957a]
#14 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x929) [0x55f467e43b59]
#15 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x55f467e44a14]
#16 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x55f467db0c29]
#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x55f467e5957a]
#18 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#19 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x55f467db09bd]
#20 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x55f467e5957a]
#21 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x55f467e57bcf]
#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x55f467db09bd]
#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x55f467e5957a]
#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x55f467e57bcf]
#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#29 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x55f467e57bcf]
#30 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x55f467e44637]
#31 in /opt/conda/envs/rapids/bin/python(+0xa097f) [0x55f467d1b97f]
#32 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x55f467db09bd]
#33 in /opt/conda/envs/rapids/bin/python(+0x235e5a) [0x55f467eb0e5a]
#34 in /opt/conda/envs/rapids/bin/python(+0x1f9bd7) [0x55f467e74bd7]
#35 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f15d297d6db]
#36 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f15d26a6a3f]
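
The aggregated message above repeats the same failure once per worker. A minimal sketch for pulling the failing call and reason code out of such a message follows. It assumes the `call='...'` and `Reason=N:text` layout seen in this one traceback, and `summarize_nccl_errors` is a hypothetical helper, not part of cuML or RAFT.

```python
import re
from collections import Counter

# Matches the "call='<name>(...)', Reason=<code>:<text>" fragment seen in the
# traceback above. The format is assumed from this single message.
_NCCL_ERR = re.compile(
    r"call='(?P<call>[^(]+)\(.*?\)', Reason=(?P<code>\d+):(?P<text>[^\n\\]+)"
)


def summarize_nccl_errors(message):
    """Count (call, reason code, reason text) triples in an error message."""
    return Counter(
        (m["call"], int(m["code"]), m["text"].strip())
        for m in _NCCL_ERR.finditer(message)
    )
```

Applied to the message above, this kind of summary makes it easy to confirm that every worker failed in the same `ncclBroadcast` call with the same reason.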

Steps/Code to reproduce bug

from dask_kubernetes import KubeCluster
from dask.distributed import Client, wait

# Worker and scheduler pods are based on 'rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8'
cluster = KubeCluster(pod_template=worker_pod,
                      scheduler_pod_template=sched_pod)

client = Client(cluster)

import dask_cudf
from cuml.dask.cluster import KMeans

higgs_gcp_path = "gs://dvn-cloudml-examples/HIGGS.csv"
columns = ['label'] + [f'col-{i}' for i in range(1, 29)]
workers = client.has_what().keys()

### This block works fine and produces a cuDF DataFrame that works with various other algorithms, including XGBoost.
higgs_df = dask_cudf.read_csv(higgs_gcp_path, npartitions=len(workers), header=None,
                                 names=columns,
                                 storage_options={'token': "/etc/secrets/keyfile.json"})
higgs_df = client.persist(collections=higgs_df, workers=workers)
wait(higgs_df)
###

km = KMeans(client=client,
            n_clusters=12,
            max_iter=371,
            tol=1e-5,
            oversampling_factor=3,
            max_samples_per_batch=32768/2,
           )

km.fit(higgs_df[higgs_df.columns.difference(['label'])])

Expected behavior
`fit` should return a fitted KMeans model.

Environment details (please complete the following information):

  • Environment location: (Cloud) Google Kubernetes Engine
  • Linux Distro/Architecture: Ubuntu 18.04
  • GPU Model/Driver: 450.xx
  • CUDA: 11.0
  • Method of cuDF & cuML install: rapids containers rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8


@drobison00 drobison00 added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Dec 4, 2020
@drobison00 drobison00 changed the title from "[BUG] MNMG KMeans" to "[BUG] MNMG KMeans Fails With NCCL errors." Dec 4, 2020
@drobison00 drobison00 changed the title from "[BUG] MNMG KMeans Fails With NCCL errors." to "[BUG] MNMG KMeans fit fails with NCCL errors in multi-worker cluster." Dec 4, 2020
@drobison00
Contributor Author

On the worker node

distributed.worker - INFO - Run out-of-band function '_func_init_all'
distributed.core - INFO - Event loop was unresponsive in Worker for 392.12s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - WARNING -  Compute Failed
Function:  _func_fit
args:      (b'K\x8b\x96*\xb9AK\x91\xa6S\x1f@\ru\x14z', [           col-1    col-10    col-11  ...     col-7     col-8     col-9
0       0.802861  1.198475  0.167137  ... -0.245602 -0.579267  0.000000
1       0.569890  1.439077 -0.192272  ...  0.344575  0.415388  0.000000
2       0.489366  1.163222  0.722762  ...  0.433695 -0.683489  2.173076
3       0.777972  1.392493  0.187535  ...  0.916927  0.434237  0.000000
4       1.209873  1.225041  0.635339  ... -1.166516 -1.331000  0.000000
...          ...       ...       ...  ...       ...       ...       ...
367463  0.423483  0.588095  0.941321  ...  1.247664  0.531252  2.173076
367464  0.982576  1.291518  2.290559  ... -1.339807  0.689804  2.173076
367465  0.718311  1.496993  1.204564  ...  1.947757 -1.242854  0.000000
367466  0.830495  1.005717 -1.852349  ... -0.427805  0.628822  2.173076
367467  2.457263  2.290940  2.038974  ...  0.602034  1.355608  0.000000

[367468 rows x 28 columns],            col-1    col-10    col-11  ...     col-7     col-8 
kwargs:    {'n_clusters': 12, 'max_iter': 371, 'tol': 1e-05, 'oversampling_factor': 3, 'max_samples_per_batch': 16384.0}
Exception: RuntimeError("NCCL error encountered at: file=/opt/conda/envs/rapids/conda-bld/cuml_1603369552644/work/python/_external_repositories/raft/cpp/include/raft/comms/std_comms.hpp line=334: call='ncclBroadcast(buff, buff, count, get_nccl_datatype(datatype), root, nccl_comm_, stream)', Reason=4:invalid argument\nObtained 37 stack frames\n#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x43) [0x7f4349fc27f3]\n#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft11logic_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f4349fc2ce9]\n#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZNK4raft5comms9std_comms5bcastEPvmNS0_10datatype_tEiP11CUstream_st+0x1b6) [0x7f4349fc3886]\n#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl18initKMeansPlusPlusIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS4_2mr6device6bufferISC_EERNSI_IcEE+0x475) [0x7f43237956c5]\n#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl3fitIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiPSB_RSB_Ri+0x254) [0x7f432378e384]\n#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg3fitERKN4raft8handle_tERKNS0_12KMeansParamsEPKdiiPdRdRi+0x51) [0x7f432378b5d1]\n#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/cluster/kmeans_mg.cpython-38-x86_64-linux-gnu.so(+0x2a119) [0x7f4349364119]\n#7 in /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x3bf) [0x56221564ba1f]\n#8 in /opt/conda/envs/rapids/bin/python(+0x168f30) [0x562215684f30]\n#9 in 
/opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4e73) [0x5622156fd583]\n#10 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x260) [0x5622156e4490]\n#11 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]\n#12 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]\n#13 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]\n#14 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x929) [0x5622156e4b59]\n#15 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]\n#16 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]\n#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]\n#18 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#19 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]\n#20 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]\n#21 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]\n#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]\n#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]\n#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]\n#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#29 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]\n#30 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]\n#31 in /opt/conda/envs/rapids/bin/python(+0xa097f) 
[0x5622155bc97f]\n#32 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]\n#33 in /opt/conda/envs/rapids/bin/python(+0x235e5a) [0x562215751e5a]\n#34 in /opt/conda/envs/rapids/bin/python(+0x1f9bd7) [0x562215715bd7]\n#35 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f46e45fb6db]\n#36 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f46e4324a3f]\n")

@drobison00
Contributor Author

drobison00 commented Dec 7, 2020

Also seeing the same errors for PCA.

from dask.distributed import wait
from cuml.dask.datasets import blobs
from cuml.dask.decomposition import PCA

X_cudf, y_cudf, centers = blobs.make_blobs(n_samples=1000,
                                 n_features=128,
                                 centers=10,
                                 return_centers=True,
                                 n_parts=n_workers,
                                 shuffle=True, client=client)

wait(X_cudf)

pca = PCA(n_components=5, whiten=False)
x_pca = pca.fit_transform(X_cudf)
x_pca.compute()

@hcho3
Contributor

hcho3 commented Dec 7, 2020

@drobison00 Do you think this issue is specific to Kubernetes? I'm interested to see if the issue can be reproduced with LocalCUDACluster.

@hcho3 hcho3 added the Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features) label and removed the ? - Needs Triage (Need team to review and classify) label Dec 7, 2020
@drobison00
Contributor Author

@hcho3 LocalCUDACluster works as expected, but it doesn't seem like a good comparison because it doesn't exercise any multi-node functionality.

I'm planning to test on a bare-metal (non-Kubernetes) environment this week; however, I'm skeptical that it's K8s related. Once the worker pods are up, they're not terribly different from any other machine. Also, I can train an XGBoost model in the cluster on the full dataset without any issue.

@drobison00
Contributor Author

This also appears to affect TruncatedSVD, so I'm guessing it's more of a framework-level issue.

from cuml.dask.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components=n_workers)
xt = tsvd.fit_transform(X_higgs)
xt.compute()

@drobison00 drobison00 self-assigned this Dec 12, 2020
@drobison00
Contributor Author

Discussed this with Corey and did some additional NCCL debugging. The workers appear to be attempting to connect directly back to the client during the fit process and timing out. Looking into solutions now.
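
A reachability problem of this kind can be checked with a plain TCP probe. The sketch below assumes the host and port of the process being connected to are already known; `can_reach` is a hypothetical helper written for illustration and is not part of cuML, Dask, or NCCL.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With a Dask client, a function like this could be executed on every worker with `client.run(can_reach, host, port)`, which returns one result per worker.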

@drobison00
Contributor Author

It appears that NCCL requires every rank to be able to connect back to the 'root' node where a communicator is created. That is not possible when the uniqueId for the communicator is created on the client and the workers are isolated in a remote cluster.

Looking into a solution that would allow us to create the comm object on the scheduler, which would allow for the most flexibility.
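
The reasoning can be illustrated with a toy model. This is not NCCL's API: `UniqueId`, `init_comm`, and the node names are all hypothetical, and the network is reduced to a set of allowed (source, destination) pairs. It only shows why the node that creates the ID has to be reachable from every worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniqueId:
    """Toy stand-in for a communicator ID that records where it was created."""
    root: str


def init_comm(uid, workers, can_reach):
    """Succeed only if every worker can reach the node that created the ID.

    can_reach is a set of (source, destination) pairs describing the network.
    """
    blocked = [w for w in workers if (w, uid.root) not in can_reach]
    if blocked:
        raise ConnectionError(f"{blocked} cannot reach root {uid.root!r}")
    return f"communicator rooted at {uid.root} with {len(workers)} ranks"


# Hypothetical topology: workers can reach the scheduler inside the cluster,
# but not the client, which sits outside it.
workers = ["worker-0", "worker-1", "worker-2"]
network = {(w, "scheduler") for w in workers}
```

In this model, `init_comm(UniqueId("scheduler"), workers, network)` succeeds, while `init_comm(UniqueId("client"), workers, network)` raises `ConnectionError`, which mirrors the motivation for creating the ID on the scheduler.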
