faiss::gpu::runMatrixMult failure #34

hellolovetiger · 2017-03-10T02:25:02Z

The full log:
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted (core dumped)

I have successfully run demo_ivfpq_indexing_gpu, which I think the faiss was installed successfully.

The text was updated successfully, but these errors were encountered:

wickedfoo · 2017-03-10T02:26:37Z

Possibility that you ran out of GPU memory?

wickedfoo · 2017-03-10T02:27:07Z

What were you trying to run?

hellolovetiger · 2017-03-10T02:38:33Z

train data shape: (2000000, 1000)
base data shape: (20000000, 1000)
query data shape: (1000000, 1000)
data type: float32

my code:

index = faiss.index_factory(d, "OPQ16_512,IVF4096,PQ16")
co = faiss.GpuClonerOptions()
co.usePrecomputed = False
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happends here

My GPU memory is 8GB. I just tried the bench bench_gpu_sift1m.py, the same error.

wickedfoo · 2017-03-10T02:53:17Z

index.add(xb) # error happends here

instead of giving all of the (20000000, 1000) at once, try giving it in chunks of (10000, 1000) or so.
This is a issue that will be fixed at some point, the GPU side is less friendly unless you handle chunking the input beforehand, but eventually we'll handle that automatically.

wickedfoo · 2017-03-10T02:54:08Z

Only GpuIndexFlat* handles passing large amounts of data all at once for add or search at present.

hellolovetiger · 2017-03-10T02:59:26Z

I see. Actually I used numpy.memmap to load the data. Sorry, could you give me some guidance on how to chunk the input data that can be loaded with index.add?

hellolovetiger · 2017-03-10T03:08:32Z

Also, I notice that my GPU memory occupation in training is always about 20%. That's strange.

hellolovetiger · 2017-03-10T03:25:15Z

Just made some changes on the bench code bench_gpu_sift1m.py, still the same error. Populating top 10000 not work, either. Seems it is not memory issue. Maybe there is something wrong with the CUBLAS. By the way, do you have a plan to publish an official docker image to avoid some problems caused by installation?

#################################################################
#  Approximate search experiment
#################################################################

print "============ Approximate search"

index = faiss.index_factory(d, "IVF4096,PQ64")

# faster, uses more memory
# index = faiss.index_factory(d, "IVF16384,Flat")

co = faiss.GpuClonerOptions()

# here we are using a 64-byte PQ, so we must set the lookup tables to
# 16 bit float (this is due to the limited temporary memory).
co.useFloat16 = True

index = faiss.index_cpu_to_gpu(res, 0, index, co)

print "train"

index.train(xt)

print "add vectors to index"

index.add(xb[:10000])

mdouze · 2017-03-10T03:43:54Z

Hi
Note that the code above will not work for 1000-dim data (because 1000 is not a multiple of 64).
We do not have plans for a Docker image.

hellolovetiger · 2017-03-10T04:01:13Z

Hi, mdouze. Above code is from bench_gpu_sift1m.py. I used the data from http://corpus-texmex.irisa.fr/, following the instruction in https://github.com/facebookresearch/faiss/tree/master/benchs. I just wanted to check if the bench code works. Turn out to be the same error with my own.

mdouze · 2017-03-10T04:09:38Z

Ok, so this is the exact script bench_gpu_sift1m.py applied to the SIFT1M dataset and not your 20M*1000-dim dataset, correct?
On which type of GPU are you running this?

hellolovetiger · 2017-03-10T04:12:26Z

Yes for your first question.
My GPU is GeForce GTX 1080

mdouze · 2017-03-10T04:22:36Z

It could be the same bug as issue #8. Unfortunately we do not have the hardware to reproduce it, so we would be grateful if you could narrow down the error for us:

Does it still crash in the add?
If yes, could you add fewer vectors until it does not crash any more?
could you set co.usePrecomputed = false and test again?
could you reduce the 2 numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?

wickedfoo · 2017-03-10T05:57:10Z

You can also try running cuda-memcheck on the bench_gpu_sift1m.py to see if anything gets printed out that does not look like the following:

========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaPointerGetAttributes. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x126239]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e44]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]
=========
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x11de53]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e65]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]

Another thing is to try resetting the GPU via nvidia-smi and trying again.

Also, you could try and investigate which CUDA shared libraries it is trying to load, to see if there is a mismatch if you have multiple CUDA SDK versions installed.

wickedfoo · 2017-03-10T05:59:36Z

Also, I notice that my GPU memory occupation in training is always about 20%. That's strange.

Faiss GPU reserves about 18% of available GPU memory up front for scratch space. This amount is controllable via StandardGpuResources, but it will run slower if you decrease it by a lot (due to cudaMalloc/cudaFree overhead). 1-2 GB of scratch space seems to be appropriate for most workloads.

hellolovetiger · 2017-03-10T06:11:43Z

For your questions:

could you add fewer vectors until it does not crash any more?
It will always crash no matter how small the number of vectors is.
could you set co.usePrecomputed = false and test again?
It works. But it doesn't work for my own code. I will give more tries.
could you reduce the 2 numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?
It will fail if setting co.usePrecomputed = True

Some other infos:
ldd gpu/test/demo_ivfpq_indexing_gpu ==>

linux-vdso.so.1 =>  (0x00007ffcc0066000)
libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007fa709dfd000)
liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007fa709661000)
libcublas.so.8.0 => /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0 (0x00007fa706cb1000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa706aa9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa70688b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa706687000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa706383000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa70607d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fa705e6e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa705c58000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa705893000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa70b606000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00007fa70408a000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fa703d70000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fa703b34000)

hellolovetiger · 2017-03-10T06:33:27Z

@wickedfoo

here is cuda-memcheck result (setting co.usePrecomputed = True):

============ Approximate search
train
add vectors to index
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted========= Error: process didn't terminate successfully
========= Internal error (7)
========= No CUDA-MEMCHECK results found

resetting the GPU via nvidia-smi doesn't work
There is only one CUDA SDK: V8.0.44

wickedfoo · 2017-03-10T16:47:12Z

Are you compiling with clang or gcc?

hellolovetiger · 2017-03-11T02:36:03Z

gcc

mdouze · 2017-03-12T23:25:59Z

I believe this is related to the GPU, which is similar to issue #8

yhpku · 2017-03-13T12:05:23Z

I meet the same problem. My GPU is TITAN X. I want to index 1000000 512 dimension vectors using faiss.GpuIndexFlatL2. Then it will meet this issue. But if I cut the number 1000000 to 500000, it will be normal. It seems the max number of vectors is 500000. Because 60*0000 vectors will also cause this problem. The following is my code:
d = 1000000 # dimension
nb = 512 # database size
nq = 1000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02

res = faiss.StandardGpuResources()

index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
# index = faiss.IndexFlatL2(d)   # build the index
print index.is_trained
index.add(xb)

print index.ntotal
print (' build the index time %f ms' % ((time.time() - time_1) * 1000))

time_1 = time.time()
k = 1                       # we want to see 4 nearest neighbors
D, I = index.search(xc[:1], k)
print (' search time %f ms' % ((time.time() - time_1) * 1000))

hellolovetiger · 2017-03-13T13:47:17Z

@yhpku , Thanks. I tried GTX 1080 and Titan X, both failed. Seems yours is caused by OOM. IndexFlatL2 will load all the data all at once for add or search. So, maybe 500000 is the upper limitation for Titan X. You can try IndexIVFPQ, which compresses the stored vectors with a lossy compression.

mdouze · 2017-03-13T14:10:00Z

Hi @yhpku, in the code above you use 512 vectors in 1M dimensions. Is this what you want?

yhpku · 2017-03-13T14:12:22Z

@mdouze，that's not. I means 1M vectors in 512 dimensions

mdouze · 2017-03-13T14:12:23Z

@hellolovetiger, Titan X should work. Does bench_gpu_sift1m.py crash on Titan X? What error?

mdouze · 2017-03-13T14:46:59Z

@yhpku, please fix your code then.

hellolovetiger · 2017-03-14T03:13:17Z

On Titan X,
For demo_ivfpq_indexing_gpu, the error is:

Adding the vectors to the index
Segmentation fault (core dumped)

For bench_gpu_sift1m.py,

============ Approximate search
train
WARNING clustering 100000 points to 4096 centroids: please provide at least 159744 training points
add vectors to index
Segmentation fault (core dumped)

The error will be gone if setting co.usePrecomputed = False

For my own code:

#train data shape: (2000000, 1000)
#base data shape: (20000000, 1000)
#query data shape: (1000000, 1000)
#data type: float32

index = faiss.index_factory(d, "OPQ16_512,IVF1024,PQ16")
co = faiss.GpuClonerOptions()
co.useFloat16 = False
co.usePrecomputed = False
co.indicesOptions = faiss.INDICES_CPU
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happends here

The error is:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 5669326848 B, highwater 5669326848 B)
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted (core dumped)

When I cut the base data from 20M to 3M, the error becomes:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 6144000000 B, highwater 6144000000 B)
Faiss assertion err == cudaSuccess failed in char* faiss::gpu::StackDeviceMemory::Stack::getAlloc(size_t, cudaStream_t) at utils/StackDeviceMemory.cpp:71Aborted (core dumped)

Seems it becomes a memory issue.
By the way, GpuIndexIVFPQ will still encounter memory issue if the base vectors is too big?

wickedfoo · 2017-03-14T03:24:15Z

@hellolovetiger,

You are running out of GPU memory. Do not try and add so many vectors at once. 3M * 1000 * sizeof(float) is 12 GB.

Try adding the vectors in chunks of 10000 to 50000 instead.

wickedfoo · 2017-03-14T03:25:12Z

After adding to the index, the vectors will then be compressed via PQ, and then you can add more. But, before compression, each vector takes 4000 bytes of memory ( = 1000 * sizeof(float)), not 16 bytes (PQ16).

wickedfoo · 2017-03-14T03:26:01Z

Problems with attempting to add large CPU resident vectors all at once will be fixed internally at some point. But in the meantime you will have to incrementally add them.

hellolovetiger · 2017-03-14T03:38:26Z

Got it. Thanks, @wickedfoo . It is better to add these infos to wiki. 😃

yhpku · 2017-03-14T05:25:31Z

@mdouze ,I am sorry , this is a typing error. The actual code is as follows. And the error output is, "Faiss assertion err == cudaSuccess failed in faiss::gpu::StackDeviceMemory::Stack::~Stack() at utils/StackDeviceMemory.cpp:54Aborted (core dumped)".

time_1 = time.time()
d = 512                           # dimension
nb = 700000                      # database size
nq = 1000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02
res = faiss.StandardGpuResources()
index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
print index.is_trained
index.add(xb)
print index.ntotal
print (' build the index time %f ms' % ((time.time() - time_1) * 1000))
time_1 = time.time()
D, I = index.search(xc[:2], 1)
print (D)
print (I)
print (' search time %f ms' % ((time.time() - time_1) * 1000))

mdouze · 2017-03-29T11:44:56Z

Closing this issue now, because the discussion derived. Please open a new one if it is blocking.

anty-zhang · 2019-11-29T02:33:02Z

Recently, I started to use faiss and met the same problem. I found many issues and tried almost all the solutions mentioned above, but failed to find a solution.

At last, I found different CUDA versions shown by nvcc and nvdia-smi, so I adjust the nvcc verion to match the nvidia-smi, and luckily it works at last. So, Note that the nvcc version must be consistent with the nvdia-smi version.

my mismatch nvcc and nvdia-smi

If you met the same problem throgh compile faiss, this may help you.

choose the best CUDA Toolkit version is here.
the difference between nvcc and nvidia-smi is here.

my env
my makefile

zhangxinyu-xyz · 2021-10-09T13:31:42Z

Recently, I started to use faiss and met the same problem. I found many issues and tried almost all the solutions mentioned above, but failed to find a solution.

At last, I found different CUDA versions shown by nvcc and nvdia-smi, so I adjust the nvcc verion to match the nvidia-smi, and luckily it works at last. So, Note that the nvcc version must be consistent with the nvdia-smi version.

my mismatch nvcc and nvdia-smi

If you met the same problem throgh compile faiss, this may help you.

choose the best CUDA Toolkit version is here. the difference between nvcc and nvidia-smi is here.

my env my makefile

You are lucky. Unfortunately, it does not work when I tried to use the faiss-gpu on cuda 11.1.

tigert1998 · 2022-12-27T15:03:25Z

conda install -c conda-forge faiss-gpu
This fix it for me.

sayfulloh11 · 2023-09-15T02:38:39Z

conda install -c conda-forge faiss-gpu

Hi,
I tried the same command
thank you it resolved my problem.

…ex-macro Use impl_concurrent_index macro

mdouze added the bug label Mar 12, 2017

mdouze added the cant-repro label Mar 29, 2017

mdouze closed this as completed Mar 29, 2017

eddienewton mentioned this issue Feb 19, 2020

faiss.index_cpu_to_gpu slow on Conda build #1110

Closed

4 tasks

mqnfred pushed a commit to mqnfred/faiss that referenced this issue Oct 23, 2023

Merge pull request facebookresearch#34 from ava57r/use-concurrent-ind…

5202bb2

…ex-macro Use impl_concurrent_index macro

faiss::gpu::runMatrixMult failure #34

faiss::gpu::runMatrixMult failure #34

Comments

hellolovetiger commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017 • edited Loading

wickedfoo commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017 • edited Loading

hellolovetiger commented Mar 10, 2017 • edited Loading

mdouze commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017 • edited Loading

mdouze commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017

mdouze commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017

hellolovetiger commented Mar 10, 2017

wickedfoo commented Mar 10, 2017

hellolovetiger commented Mar 11, 2017

mdouze commented Mar 12, 2017

yhpku commented Mar 13, 2017

hellolovetiger commented Mar 13, 2017

mdouze commented Mar 13, 2017

yhpku commented Mar 13, 2017

mdouze commented Mar 13, 2017

mdouze commented Mar 13, 2017

hellolovetiger commented Mar 14, 2017

wickedfoo commented Mar 14, 2017

wickedfoo commented Mar 14, 2017

wickedfoo commented Mar 14, 2017

hellolovetiger commented Mar 14, 2017

yhpku commented Mar 14, 2017

mdouze commented Mar 29, 2017

anty-zhang commented Nov 29, 2019 • edited Loading

zhangxinyu-xyz commented Oct 9, 2021

tigert1998 commented Dec 27, 2022

sayfulloh11 commented Sep 15, 2023

hellolovetiger commented Mar 10, 2017 •

edited

Loading

hellolovetiger commented Mar 10, 2017 •

edited

Loading

hellolovetiger commented Mar 10, 2017 •

edited

Loading

hellolovetiger commented Mar 10, 2017 •

edited

Loading

anty-zhang commented Nov 29, 2019 •

edited

Loading