When os.environ["CUDA_VISIBLE_DEVICES"]="0" is set, or when only one device is present, it works well.
But when I set CUDA_VISIBLE_DEVICES=0,1,2,3, I get a cuda_error.
What happened?
The error information is as follows:
terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/opt/conda/conda-bld/work/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 13 stack frames
#0 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7f2476f04b8b]
#1 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f2476f0627d]
#2 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(ZN2ML2DT7BuilderINS0_20MSEObjectiveFunctionIffiEEE15assignWorkspaceEPcS5+0x2d9) [0x7f24778689c9]
#3 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT7BuilderINS0_20MSEObjectiveFunctionIffiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKfSF_iiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIfiEE+0x2b2) [0x7f2477868f22]
#4 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML2DT12DecisionTree3fitIffEESt10shared_ptrINS0_16TreeMetaDataNodeIT_T0_EEERKN4raft8handle_tEP11CUstream_stPKS5_iiPKS6_PN3rmm14device_uvectorIiEEiNS0_18DecisionTreeParamsEmRKNS0_9QuantilesIS5_iEEi+0x185) [0x7f2477889d05]
#5 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(+0xd939fd) [0x7f24778989fd]
#6 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x97738) [0x7f5112ad5738]
#7 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f5112af21f3]
#8 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x42244) [0x7f5112a80244]
#9 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x4194a) [0x7f5112a7f94a]
#10 in /home/zhaoyewei/conda/envs/dl/lib/python3.10/site-packages/mkl/../../../libiomp5.so(+0x95f82) [0x7f5112ad3f82]
#11 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f5113d2e609]
#12 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f5113aef133]
Thanks for the issue, @zhaoyewei. When you're trying to use multiple GPUs, are you using the RF present in cuml.dask? Also, do you happen to have the code that is throwing the error? We'll try to reproduce and debug this issue.
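For reference, the two device configurations described in the report can be sketched as below. This is only an illustration of how GPU visibility is typically restricted from Python, not the reporter's actual code: the helper name set_visible_gpus is hypothetical. It assumes the variable is set before any CUDA library (such as cuml) is imported in the process, and uses the spelling CUDA_VISIBLE_DEVICES that the CUDA runtime reads. It does not reproduce the error by itself.

```python
import os


def set_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA libraries imported later.

    Hypothetical helper for illustration; it only writes the environment
    variable and does not touch any GPU.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Configuration reported to work: a single visible device.
set_visible_gpus([0])

# Configuration reported to fail: four visible devices.
# set_visible_gpus([0, 1, 2, 3])

# GPU libraries would be imported only after this point, so that they
# see the restricted device list.
```

A short script along these lines, followed by the estimator construction and the fit call, is the kind of reproducer that would help narrow down whether the single-GPU estimator or the cuml.dask one is in use.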