[QST] CudaErrorInvalidConfiguration Error when query number is out of 6e4? #2072

Closed
FakeEnd opened this issue Dec 26, 2023 · 3 comments
Labels
question Further information is requested

Comments

FakeEnd commented Dec 26, 2023

What is your question?
When I use CAGRA with a query set and an index of 100000 vectors each, the search fails with a cudaErrorInvalidConfiguration error.

I tried adjusting the parameters of SearchParams, but it still fails.

import cupy as cp
import torch

from pylibraft.common import DeviceResources
from pylibraft.neighbors import cagra

batch_size = 1024
n_samples = 200
n_features = 64
# n_queries = 1000
k = 3
dataset = torch.randn(batch_size, n_samples, n_features, dtype=torch.float32).cuda()

handle = DeviceResources()
build_params = cagra.IndexParams(metric="sqeuclidean")

dataset_2d = dataset.view(batch_size * n_samples, n_features).clone()
# Offset every row of batch i by i * 100 so that the batches do not overlap.
separate_constant = torch.arange(batch_size).view(-1, 1).repeat(1, n_samples).view(-1, 1).cuda() * 100
dataset_2d += separate_constant

index = cagra.build(build_params, dataset_2d, handle=handle)
distances, neighbors = cagra.search(cagra.SearchParams(),
                                     index, dataset_2d,
                                     k, handle=handle)

Error:

  File "/archive/bioinformatics/Zhou_lab/shared/jjin/project/ANN/ANN.py", line 60, in forward
    distances, neighbors = cagra.search(cagra.SearchParams(),
  File "handle.pyx", line 225, in pylibraft.common.handle.auto_sync_handle.wrapper
  File "/archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/common/outputs.py", line 83, in wrapper
    ret_value = f(*args, **kwargs)
  File "cagra.pyx", line 746, in pylibraft.neighbors.cagra.cagra.search
  File "cagra.pyx", line 747, in pylibraft.neighbors.cagra.cagra.search
RuntimeError: CUDA error encountered at: file=/__w/raft/raft/cpp/include/raft/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=953: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidConfiguration:invalid configuration argument
Obtained 64 stack frames
#0 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/../libraft.so(_ZN4raft9exception18collect_call_stackEv+0x81) [0x2aab792c6001]
#1 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/../libraft.so(_ZN4raft10cuda_errorC1ERKSs+0x14b) [0x2aab792c6b0b]
#2 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/../libraft.so(_ZN4raft9neighbors5cagra6detail17single_cta_search14select_and_runILj8ELj128EfjfNS0_9filtering24none_cagra_sample_filterEEEvNSt12experimental6mdspanIKT1_NS7_7extentsIlJLm18446744073709551615ELm18446744073709551615EEEENS7_13layout_strideENS_20host_device_accessorINS7_16default_accessorISA_EELNS_11memory_typeE1EEEEENS8_IKT2_SC_NS7_12layout_rightENSE_INSF_ISL_EELSH_1EEEEEPSK_PT3_PSA_jPSL_PjjjjjlSQ_mmjmjmmmmT4_P11CUstream_st+0x978) [0x2aab795c7a58]
#3 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/../libraft.so(_ZN4raft9neighbors5cagra6detail17single_cta_search6searchILj8ELj128EfjfNS0_9filtering24none_cagra_sample_filterEEclERKNS_9resourcesENSt12experimental6mdspanIKfNSB_7extentsIlJLm18446744073709551615ELm18446744073709551615EEEENSB_13layout_strideENS_20host_device_accessorINSB_16default_accessorISD_EELNS_11memory_typeE1EEEEENSC_IKjSF_NSB_12layout_rightENSH_INSI_ISN_EELSK_1EEEEEPjPfPSD_jPSN_SS_jS6_+0x196) [0x2aab79aedae6]
#4 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/../libraft.so(_ZN4raft9neighbors5cagra6searchIfjEEvRKNS_9resourcesERKNS1_13search_paramsERKNS1_5indexIT_T0_EENSt12experimental6mdspanIKSA_NSF_7extentsIlJLm18446744073709551615ELm18446744073709551615EEEENSF_12layout_rightENS_20host_device_accessorINSF_16default_accessorISH_EELNS_11memory_typeE1EEEEENSG_ISB_SJ_SK_NSL_INSM_ISB_EELSO_1EEEEENSG_IfSJ_SK_NSL_INSM_IfEELSO_1EEEEE+0x306) [0x2aab79b027b6]
#5 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/neighbors/cagra/cagra.cpython-39-x86_64-linux-gnu.so(+0x4573f) [0x2aabacab773f]
#6 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyObject_Call+0xb4) [0x5057d4]
#7 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#8 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#9 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyFunction_Vectorcall+0xd4) [0x4f7e54]
#10 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyVectorcall_Call+0x87) [0x505a57]
#11 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/lib/python3.9/site-packages/pylibraft/common/handle.cpython-39-x86_64-linux-gnu.so(+0x4486a) [0x2aab779fc86a]
#12 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
#13 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x53f6) [0x4ece26]
#14 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#15 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x50508d]
#16 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyObject_Call+0xb4) [0x5057d4]
#17 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#18 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#19 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x50508d]
#20 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyObject_Call+0xb4) [0x5057d4]
#21 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#22 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#23 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyFunction_Vectorcall+0xd4) [0x4f7e54]
#24 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_FastCallDictTstate+0x1a5) [0x4f0015]
#25 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_Call_Prepend+0x66) [0x502d86]
#26 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x5cbd93]
#27 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
#28 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x53f6) [0x4ece26]
#29 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#30 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x50508d]
#31 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyObject_Call+0xb4) [0x5057d4]
#32 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#33 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#34 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x50508d]
#35 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(PyObject_Call+0xb4) [0x5057d4]
#36 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#37 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#38 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyFunction_Vectorcall+0xd4) [0x4f7e54]
#39 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_FastCallDictTstate+0x1a5) [0x4f0015]
#40 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_Call_Prepend+0x66) [0x502d86]
#41 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x5cbd93]
#42 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
#43 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x53f6) [0x4ece26]
#44 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4f8123]
#45 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x505121]
#46 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#47 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#48 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyFunction_Vectorcall+0xd4) [0x4f7e54]
#49 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x505121]
#50 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#51 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#52 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_FastCallDictTstate+0x13e) [0x4effae]
#53 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_Call_Prepend+0x66) [0x502d86]
#54 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x5cbd93]
#55 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
#56 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x4b5a) [0x4ec58a]
#57 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4f8123]
#58 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x505121]
#59 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
#60 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x4e6b2a]
#61 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyFunction_Vectorcall+0xd4) [0x4f7e54]
#62 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3() [0x505121]
#63 in /archive/bioinformatics/Zhou_lab/shared/jjin/software/anaconda3/envs/ANN/bin/python3(_PyEval_EvalFrameDefault+0x3e14) [0x4eb844]
@FakeEnd FakeEnd added the question Further information is requested label Dec 26, 2023

FakeEnd commented Dec 26, 2023

A simpler demo that reproduces this bug:

import torch

from pylibraft.common import DeviceResources
from pylibraft.neighbors import cagra

batch_size = 1024
n_samples = 200
n_features = 64
# n_queries = 1000
k = 3
dataset = torch.randn(batch_size * n_samples, n_features, dtype=torch.float32).cuda()

handle = DeviceResources()
build_params = cagra.IndexParams(metric="sqeuclidean")

index = cagra.build(build_params, dataset, handle=handle)
distances, neighbors = cagra.search(cagra.SearchParams(),
                                     index, dataset,
                                     k, handle=handle)

FakeEnd commented Dec 27, 2023

Here are more test examples.

This example does computation successfully:

n_features = 64
k = 3
dataset = torch.randn(500000, n_features, dtype=torch.float32).cuda()
query = torch.randn(60000, n_features, dtype=torch.float32).cuda()
handle = DeviceResources()
build_params = cagra.IndexParams(metric="sqeuclidean")

index = cagra.build(build_params, dataset, handle=handle)
distances, neighbors = cagra.search(cagra.SearchParams(),
                                     index, query,
                                     k, handle=handle)

This example fails:

n_features = 64
k = 3
dataset = torch.randn(500000, n_features, dtype=torch.float32).cuda()
query = torch.randn(70000, n_features, dtype=torch.float32).cuda()
handle = DeviceResources()
build_params = cagra.IndexParams(metric="sqeuclidean")

index = cagra.build(build_params, dataset, handle=handle)
distances, neighbors = cagra.search(cagra.SearchParams(),
                                     index, query,
                                     k, handle=handle)

@FakeEnd FakeEnd changed the title [QST] CudaErrorInvalidConfiguration Error when using cagre to calculate in large dataset? [QST] CudaErrorInvalidConfiguration Error when query number is out of 6e4? Dec 27, 2023
lowener (Contributor) commented Jan 5, 2024

This issue can be solved by setting the max_queries parameter in cagra.SearchParams().

distances, neighbors = cagra.search(cagra.SearchParams(max_queries=10000),
                                     index, query,
                                     k, handle=handle)

According to the documentation, max_queries is selected automatically when it is set to 0, but the auto-selected value is not bounded (see here). A bounds check should be added.
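
As a rough illustration of what limiting the batch size does, the queries can also be split by hand and searched one slice at a time. The sketch below is not pylibraft code: `search_in_chunks`, `search_fn`, and `brute_force_search` are hypothetical names, and the brute-force CPU search only stands in for the real call so that the helper can run without a GPU. The default chunk size of 10000 mirrors the `max_queries` value used above.

```python
import numpy as np


def search_in_chunks(search_fn, queries, k, chunk_size=10000):
    """Run search_fn on slices of at most chunk_size queries and stack the results.

    search_fn is a hypothetical callable taking (query_batch, k) and returning
    (distances, neighbors) as host arrays.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(queries) == 0:
        raise ValueError("queries must not be empty")
    distances, neighbors = [], []
    for start in range(0, len(queries), chunk_size):
        d, n = search_fn(queries[start:start + chunk_size], k)
        distances.append(np.asarray(d))
        neighbors.append(np.asarray(n))
    return np.concatenate(distances), np.concatenate(neighbors)


def brute_force_search(dataset):
    """CPU stand-in for the real search call, used only to exercise the helper."""
    def search_fn(batch, k):
        # Squared Euclidean distance from every query in the batch to every dataset row.
        d2 = ((batch[:, None, :] - dataset[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argsort(d2, axis=1)[:, :k]
        return np.take_along_axis(d2, idx, axis=1), idx
    return search_fn


rng = np.random.default_rng(0)
data = rng.standard_normal((50, 4)).astype(np.float32)
distances, neighbors = search_in_chunks(brute_force_search(data), data, k=3, chunk_size=7)
print(distances.shape, neighbors.shape)
```

With pylibraft, `search_fn` would presumably wrap `cagra.search(...)` and copy the returned device arrays to the host before returning them; that conversion is left out here. Setting `max_queries` as shown above is the simpler option, since the library then does the batching itself.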

@FakeEnd FakeEnd closed this as completed Jan 8, 2024
rapids-bot bot pushed a commit that referenced this issue Jan 9, 2024
Fix for #2072: CAGRA search launches a thread per query in single-CTA mode. The maximum number of threads is 65535, so the `max_queries` auto selection should be bounded by this number.

Authors:
  - Micka (https://github.com/lowener)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2081
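
The bound described in this commit message also accounts for the two test cases above: 60000 queries stay under 65535, while 70000 exceed it. The following is only a sketch of that idea in Python, not the code from #2081. `effective_max_queries` is a hypothetical helper, and passing explicit non-zero values through unchanged is an assumption made for this illustration.

```python
MAX_QUERIES_LIMIT = 65535  # limit quoted in the commit message


def effective_max_queries(max_queries, n_queries, limit=MAX_QUERIES_LIMIT):
    """Hypothetical helper: resolve max_queries with a bounded auto-select mode."""
    if max_queries == 0:
        # Auto-select mode: use the number of queries, but never exceed the limit.
        return min(n_queries, limit)
    # Assumption: an explicit value is passed through unchanged in this sketch.
    return max_queries


for n_queries in (60000, 70000):
    print(n_queries, effective_max_queries(0, n_queries))
# prints:
# 60000 60000
# 70000 65535
```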