raft::cuda_error when CUDA_VISIBLE_DEVICE is set to multiple GPUs and a random forest is created #5314

Open
zhaoyewei opened this issue Mar 31, 2023 · 1 comment
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

zhaoyewei commented Mar 31, 2023

When os.environ["CUDA_VISIBLE_DEVICE"]="0", or any other single device, it works well.
But when I set CUDA_VISIBLE_DEVICE=0,1,2,3, I get a cuda_error.
What happened?

The error information is as follows:

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/opt/conda/conda-bld/work/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 13 stack frames
#0 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7f2476f04b8b]
#1 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f2476f0627d]
#2 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_20MSEObjectiveFunctionIffiEEE15assignWorkspaceEPcS5_+0x2d9) [0x7f24778689c9]
#3 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_20MSEObjectiveFunctionIffiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKfSF_iiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIfiEE+0x2b2) [0x7f2477868f22]
#4 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT12DecisionTree3fitIffEESt10shared_ptrINS0_16TreeMetaDataNodeIT_T0_EEERKN4raft8handle_tEP11CUstream_stPKS5_iiPKS6_PN3rmm14device_uvectorIiEEiNS0_18DecisionTreeParamsEmRKNS0_9QuantilesIS5_iEEi+0x185) [0x7f2477889d05]
#5 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(+0xd939fd) [0x7f24778989fd]
#6 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x97738) [0x7f5112ad5738]
#7 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f5112af21f3]
#8 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x42244) [0x7f5112a80244]
#9 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x4194a) [0x7f5112a7f94a]
#10 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x95f82) [0x7f5112ad3f82]
#11 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5113d2e609]
#12 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f5113aef133]
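
The two setups described in the report (one listed device versus four) can be checked before any CUDA-backed library is imported. The sketch below is only an illustration, not the reporter's code: `visible_devices` and `describe` are hypothetical helper names. It relies on one fact about the CUDA runtime, namely that the variable it reads is spelled `CUDA_VISIBLE_DEVICES` (plural). The report spells it `CUDA_VISIBLE_DEVICE`; whether the actual script used that spelling is not known from the thread, so the helper merely flags it.

```python
import os

# Name read by the CUDA runtime, and the spelling that appears in the report.
DOCUMENTED_NAME = "CUDA_VISIBLE_DEVICES"
REPORT_NAME = "CUDA_VISIBLE_DEVICE"


def visible_devices(env=None):
    """Return the entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, in which case the CUDA
    runtime applies no restriction and every device stays visible.
    """
    env = os.environ if env is None else env
    raw = env.get(DOCUMENTED_NAME)
    if raw is None:
        return None
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def describe(env=None):
    """Hypothetical helper: summarize the device setup for a bug report."""
    env = os.environ if env is None else env
    notes = []
    # Flag the singular spelling only when the plural one is missing.
    if REPORT_NAME in env and DOCUMENTED_NAME not in env:
        notes.append(f"{REPORT_NAME} is set but {DOCUMENTED_NAME} is not")
    devices = visible_devices(env)
    if devices is None:
        notes.append("no restriction set")
    else:
        notes.append(f"{len(devices)} device(s) listed: {devices}")
    return notes


if __name__ == "__main__":
    # Assumption: the variable is assigned through os.environ, as in the
    # report, and before any CUDA-backed library is imported.
    os.environ[DOCUMENTED_NAME] = "0,1,2,3"
    for note in describe():
        print(note)
```

Attaching the output of such a check to the report would show which devices the process was actually allowed to see when the error occurred.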

zhaoyewei added the "? - Needs Triage" and "bug" labels Mar 31, 2023
Member

dantegd commented Apr 7, 2023

Thanks for the issue, @zhaoyewei. When you're trying to use multiple GPUs, are you using the RF present in cuml.dask? Also, do you happen to have the code that is throwing the error? We'll try to reproduce and debug this issue.
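
A reproducer of the kind requested here might look like the sketch below. It is a guess under stated assumptions, not the reporter's code. The regressor is chosen only because frame #2 of the trace names `MSEObjectiveFunction`. The data shape, `n_estimators` value, and device list are invented. `rf_modules_available` and `run_single_process_rf` are hypothetical helper names. The sketch uses the single-process `cuml.ensemble.RandomForestRegressor`; the multi-GPU estimator the question refers to lives under `cuml.dask.ensemble` and additionally needs a Dask cluster, which is not shown.

```python
import importlib.util
import os


def rf_modules_available():
    """Report whether cuML and dask_cuda can be imported in this environment."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("cuml", "dask_cuda")
    }


def run_single_process_rf(n_rows=1000, n_cols=20, seed=0):
    """Fit the single-process cuML regressor on random float32 data.

    Imports are local so that the environment variable set by the caller
    is already in place when the CUDA-backed libraries load.
    """
    import numpy as np
    from cuml.ensemble import RandomForestRegressor

    rng = np.random.default_rng(seed)
    X = rng.random((n_rows, n_cols), dtype=np.float32)
    y = rng.random(n_rows, dtype=np.float32)
    model = RandomForestRegressor(n_estimators=10)  # invented parameter value
    model.fit(X, y)
    return model


if __name__ == "__main__":
    # Assumption: four devices, as in the report. The CUDA runtime reads the
    # plural name CUDA_VISIBLE_DEVICES.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
    available = rf_modules_available()
    print(available)
    if available["cuml"]:
        run_single_process_rf()
        print("fit finished without raising")
    else:
        print("cuml is not installed; skipping the fit")
```

Stating which of the two estimators was used, together with a script of roughly this size, would answer both questions in the comment.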
