Sharded index behaves differently on different GPUs when integrated with fairseq code #2297

Open

lishaojun412 opened this issue Apr 14, 2022 · 1 comment

lishaojun412 commented Apr 14, 2022

Summary

inclusive_scan takes a long time, and the time differs between the two shard GPUs.

Platform

OS: Ubuntu 18.04

Faiss version: 1.7.2

Installed from: compiled

Faiss compilation options:

    cmake -B build \
        -DFAISS_ENABLE_GPU=ON \
        -DFAISS_ENABLE_C_API=ON \
        -DFAISS_ENABLE_PYTHON=ON \
        -DBUILD_TESTING=ON \
        -DCMAKE_CUDA_FLAGS="-gencode arch=compute_75,code=sm_75" \
        -DPython_EXECUTABLE=/usr/bin/python3.6 \
        .

Running on:

  • CPU
  • GPU: T4 × 2

Interface:

  • C++
  • Python

Reproduction instructions

I measured the time spent in the source code and have two questions (a per-shard timing sketch follows the list):

  1. The inclusive_scan call takes the most time of all the steps, about 90% of the total. Is that normal?
  2. I have a 20 GB index stored on the two T4 GPUs as 2 shards, and I search it from within the fairseq code. The inclusive_scan call behaves differently on the two shard GPUs: on the GPU that also runs the fairseq inference code it takes about 3 ms, on the other it takes about 7 ms. The difference is large and strange. Without the fairseq integration, the two shard GPUs behave the same, about 3-4 ms each.
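
To check whether the slowdown is tied to one particular GPU, each shard can also be timed on its own from Python. This is only a minimal sketch, assuming index is the sharded GPU index built as in the snippet further below and queries is a contiguous float32 numpy array (e.g. queries.detach().cpu().float().numpy()); the time_shard helper is my own name, not a faiss API:

    import time
    import faiss

    def time_shard(shard_index, queries, k, n_runs=10):
        # Warm-up search so one-time allocations are not counted.
        shard_index.search(queries, k)
        t0 = time.perf_counter()
        for _ in range(n_runs):
            shard_index.search(queries, k)
        return (time.perf_counter() - t0) / n_runs * 1000  # ms per search

    # Each shard holds part of the dataset on one GPU, so the per-shard
    # timings are directly comparable.
    for i in range(index.count()):
        shard = faiss.downcast_index(index.at(i))
        print("shard %d: %.3f ms" % (i, time_shard(shard, queries, k)))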

The inclusive_scan code in IVFUtils.cu, with my timing print added:

    double t0 = getmillisecs();

    thrust::inclusive_scan(
            thrust::cuda::par(alloc).on(stream),
            prefixSumOffsets.data(),
            prefixSumOffsets.data() + totalSize,
            prefixSumOffsets.data());
    CUDA_TEST_ERROR();

    double t1 = getmillisecs();
    printf("#%.3f %d ", t1 - t0, getCurrentDevice());

The search code used in the fairseq integration:

index = faiss.read_index(args.indexfile, faiss.IO_FLAG_ONDISK_SAME_DIR)
co = faiss.GpuMultipleClonerOptions()
co.useFloat16LookupTables = True
co.useFloat16 = True
co.useFloat16CoarseQuantizer = True
co.usePrecomputed = False
co.indicesOptions = 2  # GpuIndicesOptions: 0 = CPU, 1 = IVF, 2 = 32-bit, 3 = 64-bit GPU indices
co.verbose = True
co.shard = True  # shard the dataset across the GPUs instead of replicating it
co.shard_type = 2
vres, vdev = make_vres_vdev(0, 2)  # helper from the faiss GPU benchmark scripts; vres/vdev are not used below
index.threaded = True
index = faiss.index_cpu_to_gpus_list(index, co=co, gpus=[0, 1])
faiss.downcast_index(index.at(0)).nprobe = 4
faiss.downcast_index(index.at(1)).nprobe = 4
dists, knns = index.search(queries.detach().cpu().float().numpy(), k)
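
As an aside, nprobe can also be set on all GPU shards in one call through the parameter-space API instead of downcasting each shard. A small sketch, assuming index is the sharded GPU index created above:

    import faiss

    # Propagate nprobe to every shard of the sharded GPU index.
    ps = faiss.GpuParameterSpace()
    ps.initialize(index)
    ps.set_index_parameter(index, "nprobe", 4)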

lishaojun412 commented Apr 15, 2022

I ran the bench_gpu_sift1m.py script with the time print on one T4 GPU and found that the thrust::inclusive_scan time alternates like this (nprobe=8), printed as #ms device_id:

    #0.330 0 #1.074 0 #0.330 0 #1.074 0 #0.330 0 #1.074 0 ......
