
[BUG] Agglomerative clustering encounters cudaErrorInvalidValue: invalid argument #4424

Open
lzhang282 opened this issue Dec 4, 2021 · 5 comments
Labels
? - Needs Triage, bug, inactive-30d, inactive-90d

Comments

@lzhang282

Describe the bug
AgglomerativeClustering.fit() fails with a CUDA error (cudaErrorInvalidValue: invalid argument); reproduction code and full traceback below.

Environment details (please complete the following information):

  • Cloud: Databricks runtime 9.1LTS
  • Linux Distro/Architecture: Ubuntu 18.04 amd64
  • GPU Model/Driver: V100 and driver 396.44
  • CUDA: 11.0
  • CUML: 0.19
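
(For anyone reproducing this, a minimal sketch for capturing these versions from inside the runtime, using standard cuML/CuPy attributes, so reports compare like with like.)

import cuml
import cupy

print("cuML:", cuml.__version__)
print("CuPy:", cupy.__version__)
# CUDA runtime version as an integer, e.g. 11000 for CUDA 11.0
print("CUDA runtime:", cupy.cuda.runtime.runtimeGetVersion())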

Sample code to reproduce the error:

import cudf
import cupy
from cuml.cluster import AgglomerativeClustering
from cuml.datasets import make_blobs

n_samples = 10000
n_features = 2

n_clusters = 10
random_state = 0

# generate data
device_data, device_labels = make_blobs(n_samples=n_samples,
                                        n_features=n_features,
                                        centers=n_clusters,
                                        random_state=random_state,
                                        cluster_std=0.1)

device_data = cudf.DataFrame(device_data)
device_labels = cudf.Series(device_labels)

# agglomerative hierarchical clustering
hc_cuml = AgglomerativeClustering(n_clusters=n_clusters, affinity="euclidean",
                                  linkage="single", connectivity="knn",
                                  n_neighbors=10)
hc_cuml.fit(device_data)

Error message:


RuntimeError Traceback (most recent call last)
in
22 # agglomerative hierarchical clustering
23 hc_cuml = AgglomerativeClustering(n_clusters=n_clusters, affinity="euclidean", linkage="single",connectivity='knn',n_neighbors=10)
---> 24 hc_cuml.fit(device_data)

/databricks/python/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
407 target_val=target_val)
408
--> 409 return func(*args, **kwargs)
410
411 @wraps(func)

cuml/cluster/agglomerative.pyx in cuml.cluster.agglomerative.AgglomerativeClustering.fit()

RuntimeError: CUDA error encountered at: file=raft/src/raft/cpp/include/raft/cudart_utils.h line=205: call='cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 49 stack frames
#0 in /databricks/python/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f7a61c26af6]
#1 in /databricks/python/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f7a61c27259]
#2 in /databricks/python/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft4copyIiEEvPT_PKS1_mP11CUstream_st+0x138) [0x7f7a61c3e088]
#3 in /databricks/python/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9hierarchy6detail21build_dendrogram_hostIifEEvRKNS_8handle_tEPKT_S8_PKT0_mPS6_RN3rmm14device_uvectorIS9_EERNSE_IS6_EE+0x4cb) [0x7f7a61f09e8b]
#4 in /databricks/python/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9hierarchy14single_linkageIifLNS0_15LinkageDistanceE1EEEvRKNS_8handle_tEPKT0_mmNS_8distance12DistanceTypeEPNS0_14linkage_outputIT_S6_EEim+0x6c4) [0x7f7a61efd114]
#5 in /databricks/python/lib/python3.8/site-packages/cuml/cluster/agglomerative.cpython-38-x86_64-linux-gnu.so(+0x29bdc) [0x7f7a4ded9bdc]
#6 in /databricks/python/bin/python(PyObject_Call+0x255) [0x55e8204062b5]

@lzhang282 added the ? - Needs Triage and bug labels on Dec 4, 2021
@cjnolet (Member) commented Dec 6, 2021

@lzhang282,

Thank you for opening this issue. There have been a few releases since cuML 0.19 that fixed several bugs in the agglomerative clustering code. I'm not able to reproduce this on the most recent version (22.02 at the time of writing). Are you able to try a more recent version?
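
(A diagnostic sketch that may help narrow this down, assuming connectivity also accepts "pairwise" as in the current cuML docs: if the pairwise path succeeds where "knn" fails, the bug is likely in the knn connectivity code.)

# hypothetical cross-check: does the error follow the 'knn' connectivity path?
hc_pairwise = AgglomerativeClustering(n_clusters=n_clusters,
                                      affinity="euclidean",
                                      linkage="single",
                                      connectivity="pairwise")
hc_pairwise.fit(device_data)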

@lzhang282 (Author) commented

@cjnolet Thank you for the quick response. I am aware that there have been a few more releases after 0.19, but 0.19 is explicitly pinned in https://github.com/rapidsai/cloud-ml-examples/blob/main/databricks/docker/rapids-spec.txt. Could you pinpoint the places that need to be changed in order to use the latest version? I have to build a customized image to run on Databricks. Thanks!
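
(For anyone else building a custom image, a sketch only: I haven't verified the actual contents of rapids-spec.txt, but assuming it pins RAPIDS packages one per line, every 0.19 pin would need to be bumped together, since cuML, cuDF, and the other RAPIDS libraries must come from the same release.)

# hypothetical rapids-spec.txt edit; exact package names/format may differ
cudf=21.12   # was: cudf=0.19
cuml=21.12   # was: cuml=0.19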

@github-actions

github-actions bot commented Jan 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Apr 5, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@ilkersigirci

ilkersigirci commented Aug 25, 2023

Encountered the same error with the latest cuML version, 23.8.0 (using a Tesla P100 16GB). In a linked comment it is said that the problem occurs because of old hardware. Is that actually the case? Is there any progress on fixing it?
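
(Until this is resolved, a workaround sketch, not a confirmed fix: catch the CUDA failure and fall back to scikit-learn's CPU implementation, assuming the input is a CuPy array.)

import cupy as cp
from cuml.cluster import AgglomerativeClustering as cuAgglomerative
from sklearn.cluster import AgglomerativeClustering as skAgglomerative

def fit_agglomerative(X, n_clusters=10):
    try:
        # GPU path; the reported cudaErrorInvalidValue surfaces as RuntimeError
        model = cuAgglomerative(n_clusters=n_clusters, linkage="single")
        model.fit(X)
    except RuntimeError as err:
        print(f"GPU clustering failed ({err}); retrying on CPU")
        model = skAgglomerative(n_clusters=n_clusters, linkage="single")
        model.fit(cp.asnumpy(X))  # move data back to host memory
    return model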
