
pqCodeDistances read out of bounds of pqCentroids in loadingThreads subset #1421

Closed · XuanYang-cn opened this issue Sep 23, 2020 · 2 comments

@XuanYang-cn commented Sep 23, 2020
I've recently become very interested in faiss and have been running some tests on the faiss GPU IVFPQ index with 128-dimensional vectors, M=32 sub-quantizers, and 256 centroids per sub-quantizer. Here are some questions.

template <typename OutCodeT,
          typename CentroidT,
          int DimsPerSubQuantizer,
          bool L2Distance>
__global__ void
__launch_bounds__(288, 3)
pqCodeDistances(Tensor<float, 2, true> queries,
                int queriesPerBlock,
                Tensor<CentroidT, 2, true> coarseCentroids,
...
  // Threads beyond the first codesPerSubQuantizer are dedicated loading threads
  bool isLoadingThread = threadIdx.x >= codesPerSubQuantizer;
  int loadingThreadId = threadIdx.x - codesPerSubQuantizer;
...
  // Each thread caches the PQ centroid for one code in registers
  float subQuantizerData[DimsPerSubQuantizer];
  auto code = threadIdx.x;  // line 51
...
#pragma unroll
  for (int i = 0; i < DimsPerSubQuantizer; ++i) {  // lines 56-59
    subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();  // line 58
  }

line 51:  auto code = threadIdx.x;

line 58:  subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();

In the test there are 288 threads per block but only 256 centroids per sub-quantizer, which means code can be as large as 287 at line 51. What confuses me most is why line 58 doesn't go wrong when pqCentroids (32x4x256) is read out of bounds (whenever code >= 256), and wouldn't it fail at some point?

The 2nd question is: why do the loading threads need subQuantizerData at all, given that they never use it?

The 3rd question is: would it be better if lines 56-59 were moved to line 155? Why or why not?

@wickedfoo (Contributor) commented Oct 3, 2020

  1. Re: line 51, yes, this seems to be a bug; there should be a bounds check here (see the sketch after this list). It would typically not crash because the allocation lies within the temporary memory buffer managed by StandardGpuResources, which is allocated in PQScanMultiPassNoPrecomputed-inl.cuh at line 593, so an out-of-bounds read still hits allocated memory, it just reads garbage. In the case where we allocate overflow memory via cudaMalloc, it could crash. I will fix this; thank you very much for finding this issue!

  2. The loading threads are a subset of the block's threads that allow double buffering of data (a generic version of the pattern is sketched below). All threads in a block have the same register allocation; there is no way to segment the register file within a block, and subQuantizerData is stored in registers.

  3. It would not be better, because the data would then be reloaded on every iteration of the loops at lines 82 and 125. It only needs to be loaded once.
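
Regarding the bounds check in item 1, a minimal sketch of the kind of guard implied (variable names follow the kernel excerpt above; the actual fix that landed in faiss may differ):

// Only the first codesPerSubQuantizer threads hold a valid code;
// guard the register load so the extra loading threads do not read
// past the end of pqCentroids.
if (code < codesPerSubQuantizer) {
#pragma unroll
  for (int i = 0; i < DimsPerSubQuantizer; ++i) {
    subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();
  }
}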
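
Regarding item 2, here is a generic double-buffering pattern of the sort described. This is an illustrative sketch, not the faiss kernel; the 288 = 256 + 32 thread split mirrors the launch bounds above, but all names and sizes are assumptions:

#include <cuda_runtime.h>

constexpr int kCompute = 256; // compute threads per block
constexpr int kLoaders = 32;  // dedicated loading threads per block
constexpr int kTile = 256;    // elements per tile

// Launch with blockDim.x == kCompute + kLoaders (288).
__global__ void doubleBuffered(const float* in, float* out, int numTiles) {
  __shared__ float buf[2][kTile];
  bool isLoader = threadIdx.x >= kCompute;
  int loaderId = threadIdx.x - kCompute;

  // Loading threads prime buffer 0 with the first tile.
  if (isLoader) {
    for (int i = loaderId; i < kTile; i += kLoaders) {
      buf[0][i] = in[i];
    }
  }
  __syncthreads();

  for (int tile = 0; tile < numTiles; ++tile) {
    int cur = tile & 1;
    if (isLoader) {
      // Loading threads prefetch the next tile into the other buffer...
      if (tile + 1 < numTiles) {
        for (int i = loaderId; i < kTile; i += kLoaders) {
          buf[cur ^ 1][i] = in[(tile + 1) * kTile + i];
        }
      }
    } else {
      // ...while compute threads consume the current one.
      out[tile * kTile + threadIdx.x] = buf[cur][threadIdx.x] * 2.0f;
    }
    __syncthreads();
  }
}

Every thread in the block pays for the same per-thread register footprint, which is why the loading threads still carry subQuantizerData in registers even though they never touch it.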

@wickedfoo changed the title from "Question about faiss::gpu::pqCodeDistances" to "pqCodeDistances read out of bounds of pqCentroids in loadingThreads subset" on Oct 3, 2020
facebook-github-bot pushed a commit that referenced this issue Oct 6, 2020
Summary:
This diff removes a long-standing limitation of GpuIndexIVFPQ: only a limited set of dimensions per sub-quantizer was supported when not using precomputed codes. This is part of the general cleanup and extension/optimization that I am performing on the GPU PQ code.

Now, we keep the same specialized distance computations as before, but if we attempt to use a number of dimensions per sub-quantizer that is not specialized, we fall back to a general implementation based on batch matrix multiplication for computing PQ distances per code.

The batch MM PQ distance computation is enabled automatically if you use an odd number of dimensions per sub-quantizer (say, 7, 11, 53, ...). It can also be manually enabled via the `useMMCodeDistance` option in `GpuIndexIVFPQConfig` for testing purposes, though the result should be within some epsilon of the other implementation.
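
For reference, a minimal usage sketch (illustrative only; the dimensions and constructor arguments here are assumptions, while `GpuIndexIVFPQConfig` and `useMMCodeDistance` are the options named above):

#include <faiss/gpu/GpuIndexIVFPQ.h>
#include <faiss/gpu/StandardGpuResources.h>

int main() {
  faiss::gpu::StandardGpuResources res;

  // Force the batch-MM code distance path; 56 dims / 8 sub-quantizers
  // = 7 dims per sub-quantizer would also select it automatically.
  faiss::gpu::GpuIndexIVFPQConfig config;
  config.useMMCodeDistance = true;

  faiss::gpu::GpuIndexIVFPQ index(
      &res, /*dims=*/56, /*nlist=*/1024,
      /*subQuantizers=*/8, /*bitsPerCode=*/8,
      faiss::METRIC_L2, config);

  // ... train and add vectors as with any other faiss index ...
  return 0;
}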

This diff also removes the iterated GEMM wrapper. I honestly don't know why I was using it instead of `cublasGemmStridedBatchedEx`; maybe I couldn't find that function, or this was originally implemented against a much older version of CUDA. The iterated GEMM call was used in a few other places (e.g., precomputed code computation). Now, these (and the PQ distance computation) use batch MM, which is a single CUDA call.

This diff also adds stream synchronization to the temporary memory manager, as the fallback PQ distance computation needs to use temporary memory, and there were too many buffers to pre-allocate them all.

It also fixes the bug in #1421.

Reviewed By: mdouze

Differential Revision: D24130629

fbshipit-source-id: 1c8bc53c86d0523832ad89c8bd4fa4b5fc187cae
@wickedfoo (Contributor) commented:
Fixed in 9b007c7
