[BUG] MNMG KMeans fit fails with NCCL errors in multi-worker cluster. #3261
On the worker node:

```
distributed.worker - INFO - Run out-of-band function '_func_init_all'
distributed.core - INFO - Event loop was unresponsive in Worker for 392.12s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.worker - WARNING - Compute Failed
Function: _func_fit
args: (b'K\x8b\x96*\xb9AK\x91\xa6S\x1f@\ru\x14z', [ col-1 col-10 col-11 ... col-7 col-8 col-9
0 0.802861 1.198475 0.167137 ... -0.245602 -0.579267 0.000000
1 0.569890 1.439077 -0.192272 ... 0.344575 0.415388 0.000000
2 0.489366 1.163222 0.722762 ... 0.433695 -0.683489 2.173076
3 0.777972 1.392493 0.187535 ... 0.916927 0.434237 0.000000
4 1.209873 1.225041 0.635339 ... -1.166516 -1.331000 0.000000
... ... ... ... ... ... ... ...
367463 0.423483 0.588095 0.941321 ... 1.247664 0.531252 2.173076
367464 0.982576 1.291518 2.290559 ... -1.339807 0.689804 2.173076
367465 0.718311 1.496993 1.204564 ... 1.947757 -1.242854 0.000000
367466 0.830495 1.005717 -1.852349 ... -0.427805 0.628822 2.173076
367467 2.457263 2.290940 2.038974 ... 0.602034 1.355608 0.000000
[367468 rows x 28 columns], col-1 col-10 col-11 ... col-7 col-8
kwargs: {'n_clusters': 12, 'max_iter': 371, 'tol': 1e-05, 'oversampling_factor': 3, 'max_samples_per_batch': 16384.0}
Exception: RuntimeError("NCCL error encountered at: file=/opt/conda/envs/rapids/conda-bld/cuml_1603369552644/work/python/_external_repositories/raft/cpp/include/raft/comms/std_comms.hpp line=334: call='ncclBroadcast(buff, buff, count, get_nccl_datatype(datatype), root, nccl_comm_, stream)', Reason=4:invalid argument
Obtained 37 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft9exception18collect_call_stackEv+0x43) [0x7f4349fc27f3]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZN4raft11logic_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f4349fc2ce9]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/raft/dask/common/comms_utils.cpython-38-x86_64-linux-gnu.so(_ZNK4raft5comms9std_comms5bcastEPvmNS0_10datatype_tEiP11CUstream_st+0x1b6) [0x7f4349fc3886]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl18initKMeansPlusPlusIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERNS4_2mr6device6bufferISC_EERNSI_IcEE+0x475) [0x7f43237956c5]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg4impl3fitIdiEEvRKN4raft8handle_tERKNS0_12KMeansParamsEPKT_iiPSB_RSB_Ri+0x254) [0x7f432378e384]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML6kmeans3opg3fitERKN4raft8handle_tERKNS0_12KMeansParamsEPKdiiPdRdRi+0x51) [0x7f432378b5d1]
#6 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/cluster/kmeans_mg.cpython-38-x86_64-linux-gnu.so(+0x2a119) [0x7f4349364119]
#7 in /opt/conda/envs/rapids/bin/python(_PyObject_MakeTpCall+0x3bf) [0x56221564ba1f]
#8 in /opt/conda/envs/rapids/bin/python(+0x168f30) [0x562215684f30]
#9 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4e73) [0x5622156fd583]
#10 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x260) [0x5622156e4490]
#11 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]
#12 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]
#13 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#14 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x929) [0x5622156e4b59]
#15 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x594) [0x5622156e5a14]
#16 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x2e9) [0x562215651c29]
#17 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#18 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#19 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#20 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#21 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#22 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#23 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#24 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#25 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e6a) [0x5622156fa57a]
#26 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#27 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#28 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#29 in /opt/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x4bf) [0x5622156f8bcf]
#30 in /opt/conda/envs/rapids/bin/python(_PyFunction_Vectorcall+0x1b7) [0x5622156e5637]
#31 in /opt/conda/envs/rapids/bin/python(+0xa097f) [0x5622155bc97f]
#32 in /opt/conda/envs/rapids/bin/python(PyObject_Call+0x7d) [0x5622156519bd]
#33 in /opt/conda/envs/rapids/bin/python(+0x235e5a) [0x562215751e5a]
#34 in /opt/conda/envs/rapids/bin/python(+0x1f9bd7) [0x562215715bd7]
#35 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f46e45fb6db]
#36 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f46e4324a3f]
")
```
Also seeing the same errors for PCA:

```python
from dask.distributed import wait
from cuml.dask.datasets import blobs
from cuml.dask.decomposition import PCA

X_cudf, y_cudf, centers = blobs.make_blobs(n_samples=1000,
                                           n_features=128,
                                           centers=10,
                                           return_centers=True,
                                           n_parts=n_workers,
                                           shuffle=True,
                                           client=client)
wait(X_cudf)

pca = PCA(n_components=5, whiten=False)
x_pca = pca.fit_transform(X_cudf)
x_pca.compute()
```
@drobison00 Do you think this issue is specific to Kubernetes? I'm interested to see if the issue can be reproduced with `LocalCUDACluster`.
@hcho3 `LocalCUDACluster` works as expected, but it doesn't seem like a good comparison, because it doesn't exercise any multi-node functionality. I'm planning to test on a bare-metal (non-Kubernetes) environment this week; however, I'm skeptical it's K8s-related. Once the worker pods are up, they're not terribly different from any other machine. Also, I can train an XGBoost model in the cluster on the full dataset without any issue.
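For reference, the single-node comparison described above might look something like the following sketch. This is not the reporter's exact script; `X` is a placeholder for any distributed dataset built as in the snippets elsewhere in this thread:

```python
# Single-node comparison: LocalCUDACluster keeps every worker on one machine,
# so NCCL traffic never has to cross a node boundary.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from cuml.dask.cluster import KMeans

cluster = LocalCUDACluster()
client = Client(cluster)

# ... build a distributed dataframe/array X as in the snippets above ...

kmeans = KMeans(n_clusters=12)
kmeans.fit(X)  # succeeds in the single-node case, per the comment above
```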
Also appears to affect TruncatedSVD, so I'm guessing it's more of a framework-level issue:

```python
from cuml.dask.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components=n_workers)
xt = tsvd.fit_transform(X_higgs)
xt.compute()
```
Discussed this with Corey and did some additional NCCL debugging. It looks like a problem with the workers attempting to connect directly back to the client during the fit process and timing out. Looking into solutions now.
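For anyone trying to reproduce this kind of diagnosis: NCCL's documented `NCCL_DEBUG` / `NCCL_DEBUG_SUBSYS` environment variables are the usual way to surface the failing connection attempts. A minimal sketch, assuming the variables are set on each worker before NCCL first initializes (i.e., before the first `fit` call); `scheduler_address` is a placeholder:

```python
import os
from dask.distributed import Client

client = Client(scheduler_address)  # placeholder: address of the remote scheduler

def enable_nccl_debug():
    # Must run before NCCL initializes on the worker; NCCL reads these
    # environment variables when the communicator is created.
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on init and network setup

# Run on every worker; subsequent fit() calls will then log NCCL's
# connection attempts, including the connect-back to the root rank.
client.run(enable_nccl_debug)
```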
It appears that NCCL requires the ability to connect back to the 'root' node where a communicator's uniqueId is created. That isn't possible in use cases where the uniqueId is created on the client and the workers are isolated in a remote cluster. Looking into a solution that would allow us to create the comm object on the scheduler, which would allow for the most flexibility.
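To make the proposed direction concrete, here is a rough sketch (not the actual fix) of moving the uniqueId generation onto the scheduler with Dask's `run_on_scheduler`. The `get_unique_id` import is a hypothetical stand-in for whatever wrapper around `ncclGetUniqueId` the RAFT/cuML comms layer exposes:

```python
from dask.distributed import Client

client = Client(scheduler_address)  # placeholder: address of the remote scheduler

def _create_nccl_unique_id():
    # Hypothetical wrapper around ncclGetUniqueId; the real binding lives in
    # the RAFT/cuML comms layer. The key point is *where* this runs: on the
    # scheduler, which every worker can already reach, rather than on a
    # client that workers may not be able to connect back to.
    from some_nccl_binding import get_unique_id  # stand-in import
    return get_unique_id()

# The scheduler generates the id and the workers bootstrap their communicator
# from it; the NCCL connect-back target is now a node that is reachable from
# inside the cluster.
unique_id = client.run_on_scheduler(_create_nccl_unique_id)
```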
Describe the bug
Calling fit in the cluster on a known-good data set fails with the NCCL error shown in the worker log above.
Steps/Code to reproduce bug
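The original reproduction snippet did not survive extraction; the following is a reconstruction rather than the reporter's exact script. It is a minimal sketch assuming a multi-node dask-cuda cluster reachable at a placeholder `scheduler_address`, with the hyperparameters taken from the `kwargs` in the worker log above:

```python
from dask.distributed import Client, wait
from cuml.dask.cluster import KMeans
from cuml.dask.datasets import blobs

client = Client(scheduler_address)  # placeholder: address of the remote scheduler
n_workers = len(client.scheduler_info()["workers"])

# Any distributed dataset reproduces it; blobs is just a convenient stand-in
# for the known-good data set mentioned above.
X, y = blobs.make_blobs(n_samples=1_000_000, n_features=28,
                        centers=12, n_parts=n_workers, client=client)
wait(X)

# Hyperparameters taken from the failing call's kwargs in the log above.
kmeans = KMeans(n_clusters=12, max_iter=371, tol=1e-05,
                oversampling_factor=3, max_samples_per_batch=16384)
kmeans.fit(X)  # raises RuntimeError: NCCL error ... ncclBroadcast ... invalid argument
```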
Expected behavior
fit should return a fitted KMeans model.
Environment details