
pqCodeDistances read out of bounds of pqCentroids in loadingThreads subset #1421

Closed · XuanYang-cn opened this issue Sep 23, 2020 · 2 comments

@XuanYang-cn commented Sep 23, 2020
I've recently become very interested in faiss and have been running some tests on the faiss GPU IVFPQ index with 128-dimensional vectors, M=32 sub-quantizers, and 256 centroids per sub-quantizer. Here are some questions.

template <typename OutCodeT,
          typename CentroidT,
          int DimsPerSubQuantizer,
          bool L2Distance>
__global__ void
__launch_bounds__(288, 3)
pqCodeDistances(Tensor<float, 2, true> queries,
                int queriesPerBlock,
                Tensor<CentroidT, 2, true> coarseCentroids,
...
  // Threads beyond the first codesPerSubQuantizer are dedicated loading threads
  bool isLoadingThread = threadIdx.x >= codesPerSubQuantizer;
  int loadingThreadId = threadIdx.x - codesPerSubQuantizer;
...
  // Each thread caches the PQ centroid for one code in registers
  float subQuantizerData[DimsPerSubQuantizer];
  auto code = threadIdx.x;  // line 51
...
#pragma unroll
  for (int i = 0; i < DimsPerSubQuantizer; ++i) {  // lines 56-59
    subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();  // line 58
  }

line 51:  auto code = threadIdx.x;

line 58:  subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();

In the test there are 288 threads per block but only 256 centroids per sub-quantizer, which means code can be as large as 287 at line 51. What confuses me most is why line 58 doesn't go wrong when pqCentroids (32x4x256) is read out of bounds (whenever code >= 256), and wouldn't it fail at some point?

The 2nd question is: why do the loading threads need subQuantizerData at all, given that they never use it?

The 3rd question is: would it be better if lines 56-59 were moved to line 155? Why or why not?

@wickedfoo (Contributor) commented Oct 3, 2020

  1. Re: line 51, yes, this seems to be a bug; there should be a bounds check here (see the sketch after this list). It would typically not crash because the allocation lies within the temporary memory buffer managed by StandardGpuResources, which is allocated in PQScanMultiPassNoPrecomputed-inl.cuh at line 593, so an out-of-bounds read still hits allocated memory, it just reads garbage. In the case where we allocate overflow memory via cudaMalloc, it could crash. I will fix this; thank you very much for finding this issue!

  2. The loading threads are a subset of the block's threads that allow double buffering of data (a generic version of the pattern is sketched below). All threads in a block have the same register allocation; there is no way to segment the register file within a block, and subQuantizerData is stored in registers.

  3. It would not be better, because the data would then be reloaded on every iteration of the loops at lines 82 and 125. It only needs to be loaded once.
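
Regarding the bounds check in item 1, a minimal sketch of the kind of guard implied (variable names follow the kernel excerpt above; the actual fix that landed in faiss may differ):

// Only the first codesPerSubQuantizer threads hold a valid code;
// guard the register load so the extra loading threads do not read
// past the end of pqCentroids.
if (code < codesPerSubQuantizer) {
#pragma unroll
  for (int i = 0; i < DimsPerSubQuantizer; ++i) {
    subQuantizerData[i] = pqCentroids[subQuantizer][i][code].ldg();
  }
}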
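
Regarding item 2, here is a generic double-buffering pattern of the sort described. This is an illustrative sketch, not the faiss kernel; the 288 = 256 + 32 thread split mirrors the launch bounds above, but all names and sizes are assumptions:

#include <cuda_runtime.h>

constexpr int kCompute = 256; // compute threads per block
constexpr int kLoaders = 32;  // dedicated loading threads per block
constexpr int kTile = 256;    // elements per tile

// Launch with blockDim.x == kCompute + kLoaders (288).
__global__ void doubleBuffered(const float* in, float* out, int numTiles) {
  __shared__ float buf[2][kTile];
  bool isLoader = threadIdx.x >= kCompute;
  int loaderId = threadIdx.x - kCompute;

  // Loading threads prime buffer 0 with the first tile.
  if (isLoader) {
    for (int i = loaderId; i < kTile; i += kLoaders) {
      buf[0][i] = in[i];
    }
  }
  __syncthreads();

  for (int tile = 0; tile < numTiles; ++tile) {
    int cur = tile & 1;
    if (isLoader) {
      // Loading threads prefetch the next tile into the other buffer...
      if (tile + 1 < numTiles) {
        for (int i = loaderId; i < kTile; i += kLoaders) {
          buf[cur ^ 1][i] = in[(tile + 1) * kTile + i];
        }
      }
    } else {
      // ...while compute threads consume the current one.
      out[tile * kTile + threadIdx.x] = buf[cur][threadIdx.x] * 2.0f;
    }
    __syncthreads();
  }
}

Every thread in the block pays for the same per-thread register footprint, which is why the loading threads still carry subQuantizerData in registers even though they never touch it.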

@wickedfoo changed the title from "Question about faiss::gpu::pqCodeDistances" to "pqCodeDistances read out of bounds of pqCentroids in loadingThreads subset" on Oct 3, 2020
facebook-github-bot pushed a commit that referenced this issue Oct 6, 2020
Summary:
This diff removes a long-standing limitation of GpuIndexIVFPQ: only a limited set of dimensions per sub-quantizer was supported when not using precomputed codes. This is part of the general cleanup and extension/optimization that I am performing on the GPU PQ code.

Now, we keep the same specialized distance computations as before, but if we attempt to use a number of dimensions per sub-quantizer that is not specialized, we fall back to a general implementation based on batch matrix multiplication for computing PQ distances per code.

The batch MM PQ distance computation is enabled automatically if you use an odd number of dimensions per sub-quantizer (say, 7, 11, 53, ...). It can also be manually enabled via the `useMMCodeDistance` option in `GpuIndexIVFPQConfig` for testing purposes, though the result should be within some epsilon of the other implementation.
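
For reference, a minimal usage sketch (illustrative only; the dimensions and constructor arguments here are assumptions, while `GpuIndexIVFPQConfig` and `useMMCodeDistance` are the options named above):

#include <faiss/gpu/GpuIndexIVFPQ.h>
#include <faiss/gpu/StandardGpuResources.h>

int main() {
  faiss::gpu::StandardGpuResources res;

  // Force the batch-MM code distance path; 56 dims / 8 sub-quantizers
  // = 7 dims per sub-quantizer would also select it automatically.
  faiss::gpu::GpuIndexIVFPQConfig config;
  config.useMMCodeDistance = true;

  faiss::gpu::GpuIndexIVFPQ index(
      &res, /*dims=*/56, /*nlist=*/1024,
      /*subQuantizers=*/8, /*bitsPerCode=*/8,
      faiss::METRIC_L2, config);

  // ... train and add vectors as with any other faiss index ...
  return 0;
}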

This diff also removes the iterated GEMM wrapper. I honestly don't know why I was using it instead of `cublasGemmStridedBatchedEx`; maybe I couldn't find that function, or this was originally implemented against a much older version of CUDA. The iterated GEMM call was used in a few other places (e.g., precomputed code computation). Now, these (and the PQ distance computation) use batch MM, which is a single CUDA call.

This diff also adds stream synchronization to the temporary memory manager, as the fallback PQ distance computation needs to use temporary memory, and there were too many buffers to pre-allocate them all.

It also fixes the bug in #1421.

Reviewed By: mdouze

Differential Revision: D24130629

fbshipit-source-id: 1c8bc53c86d0523832ad89c8bd4fa4b5fc187cae
@wickedfoo (Contributor) commented:
Fixed in 9b007c7
