Sharded index behaves differently on different GPUs when integrated with fairseq code #2297

Open

lishaojun412 opened this issue Apr 14, 2022 · 1 comment

lishaojun412 commented Apr 14, 2022

Summary

inclusive_scan takes a long time, and the time differs between the two shard GPUs.

Platform

OS: Ubuntu 18.04

Faiss version: 1.7.2

Installed from: compiled

Faiss compilation options:

    cmake -B build \
        -DFAISS_ENABLE_GPU=ON \
        -DFAISS_ENABLE_C_API=ON \
        -DFAISS_ENABLE_PYTHON=ON \
        -DBUILD_TESTING=ON \
        -DCMAKE_CUDA_FLAGS="-gencode arch=compute_75,code=sm_75" \
        -DPython_EXECUTABLE=/usr/bin/python3.6 \
        .

Running on:

  • CPU
  • GPU: T4 × 2

Interface:

  • C++
  • Python

Reproduction instructions

I measured the time spent in the source code and have two questions (a per-shard timing sketch follows the list):

  1. The inclusive_scan call takes the most time of all the steps, about 90% of the total. Is that normal?
  2. I have a 20 GB index stored on the two T4 GPUs as 2 shards, and I search it from within the fairseq code. The inclusive_scan call behaves differently on the two shard GPUs: on the GPU that also runs the fairseq inference code it takes about 3 ms, on the other it takes about 7 ms. The difference is large and strange. Without the fairseq integration, the two shard GPUs behave the same, about 3-4 ms each.
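
To check whether the slowdown is tied to one particular GPU, each shard can also be timed on its own from Python. This is only a minimal sketch, assuming index is the sharded GPU index built as in the snippet further below and queries is a contiguous float32 numpy array (e.g. queries.detach().cpu().float().numpy()); the time_shard helper is my own name, not a faiss API:

    import time
    import faiss

    def time_shard(shard_index, queries, k, n_runs=10):
        # Warm-up search so one-time allocations are not counted.
        shard_index.search(queries, k)
        t0 = time.perf_counter()
        for _ in range(n_runs):
            shard_index.search(queries, k)
        return (time.perf_counter() - t0) / n_runs * 1000  # ms per search

    # Each shard holds part of the dataset on one GPU, so the per-shard
    # timings are directly comparable.
    for i in range(index.count()):
        shard = faiss.downcast_index(index.at(i))
        print("shard %d: %.3f ms" % (i, time_shard(shard, queries, k)))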

The inclusive_scan code in IVFUtils.cu, with my timing print added:

    double t0 = getmillisecs();

    thrust::inclusive_scan(
            thrust::cuda::par(alloc).on(stream),
            prefixSumOffsets.data(),
            prefixSumOffsets.data() + totalSize,
            prefixSumOffsets.data());
    CUDA_TEST_ERROR();

    double t1 = getmillisecs();
    printf("#%.3f %d ", t1 - t0, getCurrentDevice());

The search code used in the fairseq integration:

index = faiss.read_index(args.indexfile, faiss.IO_FLAG_ONDISK_SAME_DIR)
co = faiss.GpuMultipleClonerOptions()
co.useFloat16LookupTables = True
co.useFloat16 = True
co.useFloat16CoarseQuantizer = True
co.usePrecomputed = False
co.indicesOptions = 2  # GpuIndicesOptions: 0 = CPU, 1 = IVF, 2 = 32-bit, 3 = 64-bit GPU indices
co.verbose = True
co.shard = True  # shard the dataset across the GPUs instead of replicating it
co.shard_type = 2
vres, vdev = make_vres_vdev(0, 2)  # helper from the faiss GPU benchmark scripts; vres/vdev are not used below
index.threaded = True
index = faiss.index_cpu_to_gpus_list(index, co=co, gpus=[0, 1])
faiss.downcast_index(index.at(0)).nprobe = 4
faiss.downcast_index(index.at(1)).nprobe = 4
dists, knns = index.search(queries.detach().cpu().float().numpy(), k)
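
As an aside, nprobe can also be set on all GPU shards in one call through the parameter-space API instead of downcasting each shard. A small sketch, assuming index is the sharded GPU index created above:

    import faiss

    # Propagate nprobe to every shard of the sharded GPU index.
    ps = faiss.GpuParameterSpace()
    ps.initialize(index)
    ps.set_index_parameter(index, "nprobe", 4)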

lishaojun412 commented Apr 15, 2022

I ran the bench_gpu_sift1m.py script with the time print on one T4 GPU and found that the thrust::inclusive_scan time alternates like this (nprobe=8), printed as #ms device_id:

    #0.330 0 #1.074 0 #0.330 0 #1.074 0 #0.330 0 #1.074 0 ......
