CUDA 11 (libfaiss) conda package triggers JIT compilation on Turing GPUs #36

dantegd opened this issue Mar 11, 2021 · 5 comments · Fixed by #37
dantegd commented Mar 11, 2021

Issue: Installing the current CUDA 11 conda package (libfaiss in particular) on machines with Turing GPUs (tested on an RTX 8000 and a 2070S) triggers JIT compilation on the first call that uses GPU resources, causing a delay of minutes. It works fine on Ampere GPUs (tested on a 3080), and it also works fine on CUDA 10.2 with Turing. The packages are:

faiss                     1.7.0           py38cuda110h60a57df_4_cuda    conda-forge
faiss-proc                1.0.0                      cuda    conda-forge
libfaiss                  1.7.0           cuda110h8045045_4_cuda    conda-forge
libfaiss-avx2             1.7.0           cuda110h1234567_4_cuda    conda-forge

Reproduced with the following code:

Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> d = 64                           # dimension
>>> nb = 100000                      # database size
>>> nq = 10000                       # nb of queries
>>> np.random.seed(1234)             # make reproducible
>>> xb = np.random.random((nb, d)).astype('float32')
>>> xb[:, 0] += np.arange(nb) / 1000.
>>> xq = np.random.random((nq, d)).astype('float32')
>>> xq[:, 0] += np.arange(nq) / 1000.
>>> import faiss
>>> res = faiss.StandardGpuResources() 
>>> index_flat = faiss.IndexFlatL2(d)
>>> gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
# (here I am stuck waiting for minutes...) 
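
To put a number on the delay rather than eyeballing it, a small timing helper can wrap the same calls as the session above. This is only a sketch: `time_call` and `move_flat_index_to_gpu` are hypothetical names introduced here, and the faiss calls are exactly the ones from the reproducer.

```python
import time


def time_call(fn):
    """Call fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def move_flat_index_to_gpu(d=64, device=0):
    """Repeat the step from the reproducer that appears to hang."""
    # Imported lazily so time_call stays usable without faiss installed.
    import faiss

    res = faiss.StandardGpuResources()
    index_flat = faiss.IndexFlatL2(d)
    gpu_index = faiss.index_cpu_to_gpu(res, device, index_flat)
    # Return res as well so the GPU resources outlive the index using them.
    return res, gpu_index
```

Running `time_call(move_flat_index_to_gpu)` in a fresh process should show the first-call cost. The CUDA driver normally caches JIT results, so a later run may be much faster unless the cache is cleared or disabled (for example with `CUDA_CACHE_DISABLE=1`).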

This was an issue that we saw first in cuML (that uses FAISS): rapidsai/cuml#3602

Details about conda and system (conda info):
$ conda info
active environment : ns0311-110
    active env location : /home/galahad/miniconda3/envs/ns0311-110
            shell level : 2
       user config file : /home/galahad/.condarc
 populated config files :
          conda version : 4.9.2
    conda-build version : not installed
         python version :
       virtual packages : __cuda=11.2=0
       base environment : /home/galahad/miniconda3  (writable)
           channel URLs :
          package cache : /home/galahad/miniconda3/pkgs
       envs directories : /home/galahad/miniconda3/envs
               platform : linux-64
             user-agent : conda/4.9.2 requests/2.24.0 CPython/3.8.5 Linux/5.8.0-44-generic ubuntu/20.04.2 glibc/2.31
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

cc @viclafargue @hcho3 @jakirkham

I'm guessing you have already looked at nvidia-smi or used other profiling tools to see what is going on. It would be interesting to include that info if you have it. Maybe that sheds light on where things are getting stuck.

Hey, sorry to hear this isn't working well. I'll happily admit that the GPU build options around real vs. virtual, JIT, PTX, etc. are a bit over my head. I've often received help on this from the NVIDIA folks (e.g. @kkraus14 & @teju85 helping out in #1). With the upstream build system changing to CMake, I've done the best I can based on the CMake documentation, but it's possible that I'm doing this wrong - pertinent parts are in

My goal was building for maximum compatibility, and at the time, PTX JIT compilation was recommended to me. If this should be removed and/or amended somehow, I'll happily accept PRs (or guidance on what to do).

teju85 commented Mar 12, 2021

Looking at this line of code in the faiss feedstock, for CUDA 11.0 it does compile for sm_75 architectures.

@dantegd can you please try to disassemble the libfaiss binary and check if it does have sm_75 kernels compiled?

dantegd commented Mar 12, 2021

@h-vetinari thanks for the response! Indeed, @teju85's advice is the best recommendation, and it is what faiss did explicitly in version 1.6.3 with the prior build system, so faiss 1.6.3 works smoothly for all its intended archs. I think the issue was a minor mix-up in the usage of CMake's fairly recent CMAKE_CUDA_ARCHITECTURES feature to accomplish the same thing (its semantics are not super obvious if one is used to the direct nvcc gencode flags; I still get them wrong the first time, every time). As of right now, the libfaiss conda package for 11.2, for example, is generated with:


which causes it to include device code only for compute 80 (which Ampere cards such as the 3070/80/90 can run natively) and PTX for the architectures under it. That is what triggers JIT compilation when, say, Turing (75) or Pascal (60s) loads it. We can also inspect the library (as @teju85 recommended):

(ns0311-110) ➜  lib cuobjdump -lelf
ELF file    1: libfaiss.1.sm_80.cubin
ELF file    2: libfaiss.2.sm_80.cubin
ELF file    3: libfaiss.3.sm_80.cubin
ELF file    4: libfaiss.4.sm_80.cubin
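
The same check can be scripted. The sketch below parses text in the format printed by `cuobjdump -lelf` above and reports whether a given compute capability has native code. The function names are hypothetical, and the compatibility rule encoded here is the usual CUDA one: a cubin runs on devices with the same major version and an equal or higher minor version.

```python
import re
from typing import Set

# Matches the trailing ".sm_NN.cubin" in lines such as
# "ELF file    1: libfaiss.1.sm_80.cubin".
_CUBIN_ARCH = re.compile(r"\.sm_(\d+)\.cubin\s*$")


def native_archs(listing: str) -> Set[int]:
    """Collect the sm_NN architectures named in a cuobjdump -lelf listing."""
    archs = set()
    for line in listing.splitlines():
        match = _CUBIN_ARCH.search(line)
        if match:
            archs.add(int(match.group(1)))
    return archs


def runs_without_jit(device_cc: int, archs: Set[int]) -> bool:
    """True if at least one embedded cubin can run natively on device_cc.

    device_cc and the entries of archs use the two-digit form (75, 80, 86).
    """
    return any(
        arch // 10 == device_cc // 10 and arch <= device_cc for arch in archs
    )


listing = """\
ELF file    1: libfaiss.1.sm_80.cubin
ELF file    2: libfaiss.2.sm_80.cubin
ELF file    3: libfaiss.3.sm_80.cubin
ELF file    4: libfaiss.4.sm_80.cubin
"""
archs = native_archs(listing)
print(archs, runs_without_jit(75, archs), runs_without_jit(86, archs))
```

For the listing above this reports that only sm_80 is present, that a Turing card (75) has no native code and must fall back to PTX, and that a 3080 (86) can run the sm_80 binaries directly, which matches the behaviour described in this issue.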

Now most RAPIDS libraries are in the process of migrating to CMAKE_CUDA_ARCHITECTURES (as opposed to manually injecting the flags in our older, version-based CMake scripts). cuDF has already migrated, and what we did there was to use:


This gives what is (if I'm not mistaken) our intended result: device code for all supported archs (so that supported GPUs can just use cuDF without a long JIT compilation step), plus PTX for 80 so that a future GPU with compute capability 90+ would be able to JIT compile it and still use cuDF. (PTX is only forward compatible, so this would not cover an older arch such as 50.) Inspecting the library, we can see:

(ns0311-110) ➜  lib cuobjdump -lelf
ELF file    1: libcudf.1.sm_60.cubin
ELF file    2: libcudf.2.sm_70.cubin
ELF file    3: libcudf.3.sm_75.cubin
ELF file    4: libcudf.4.sm_80.cubin
ELF file    5: libcudf.5.sm_60.cubin
ELF file    6: libcudf.6.sm_70.cubin
ELF file    7: libcudf.7.sm_75.cubin
ELF file    8: libcudf.8.sm_80.cubin

This is similar to how faiss 1.6.3 was built:

lib cuobjdump -lelf
ELF file    1: GpuIndex.sm_35.cubin
ELF file    2: GpuIndex.sm_50.cubin
ELF file    3: GpuIndex.sm_52.cubin
ELF file    4: GpuIndex.sm_60.cubin
ELF file    5: GpuIndex.sm_61.cubin
ELF file    6: GpuIndex.sm_70.cubin
ELF file    7: GpuIndex.sm_75.cubin
ELF file    8: GpuIndex.sm_80.cubin
ELF file    9: GpuIndexBinaryFlat.sm_35.cubin
ELF file   10: GpuIndexBinaryFlat.sm_50.cubin
ELF file   11: GpuIndexBinaryFlat.sm_52.cubin
ELF file   12: GpuIndexBinaryFlat.sm_60.cubin
ELF file   13: GpuIndexBinaryFlat.sm_61.cubin
ELF file   14: GpuIndexBinaryFlat.sm_70.cubin
ELF file   15: GpuIndexBinaryFlat.sm_75.cubin
ELF file   16: GpuIndexBinaryFlat.sm_80.cubin

The difference in the .cubin names comes from how CMake forms the target as opposed to faiss's older build system, if I'm not mistaken. The important part is that we have binaries for all supported archs, so JIT compilation is not an issue for 1.6.3.
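
To make the real vs. virtual distinction concrete, here is a rough model of how CMAKE_CUDA_ARCHITECTURES entries translate into embedded code, following the CMake documentation: `NN-real` embeds only device code (SASS), `NN-virtual` embeds only PTX, and a bare `NN` embeds both. The two architecture lists in the example are made up for illustration and are not the actual feedstock or cuDF settings; the function names are hypothetical as well.

```python
def embedded_code(cuda_architectures):
    """Return (sass_archs, ptx_archs) for a semicolon-separated arch list."""
    sass, ptx = set(), set()
    for entry in cuda_architectures.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith("-real"):
            sass.add(int(entry[: -len("-real")]))
        elif entry.endswith("-virtual"):
            ptx.add(int(entry[: -len("-virtual")]))
        else:
            # A bare number requests both device code and PTX.
            sass.add(int(entry))
            ptx.add(int(entry))
    return sass, ptx


def load_path(device_cc, sass, ptx):
    """Classify how a device with compute capability device_cc gets its code."""
    # Device code runs on the same major version with an equal or higher minor.
    if any(a // 10 == device_cc // 10 and a <= device_cc for a in sass):
        return "native"
    # PTX can be JIT compiled only for an equal or newer compute capability.
    if any(a <= device_cc for a in ptx):
        return "jit"
    return "unsupported"


# Hypothetical list: device code for every supported arch, PTX for the newest.
sass, ptx = embedded_code("60-real;70-real;75-real;80")
print(load_path(75, sass, ptx), load_path(90, sass, ptx))

# Hypothetical list: PTX for older archs, device code only for the newest.
sass, ptx = embedded_code("60-virtual;70-virtual;75-virtual;80-real")
print(load_path(75, sass, ptx), load_path(86, sass, ptx))
```

With the first list a Turing card gets native code and only a future 90+ device needs JIT. With the second, Turing falls back to JIT while Ampere runs natively, which is the pattern reported in this issue.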

So that was a very verbose way of describing the solution proposed in #37

Thanks for the analysis!

