[QST] 23.06 IVFPQ VS Faiss 1.7.2 #1621

Open
xxxdh opened this issue Jun 29, 2023 · 4 comments
Labels: question (Further information is requested)

Comments

xxxdh commented Jun 29, 2023

Hello, I am trying to compare the performance of the IVF-PQ implementations in RAFT 23.06 and FAISS 1.7.2 on an NVIDIA T4 with CUDA 11.8.

The tested data size is:

- Train size: 100000
- Base size: 1000000
- Query size: 5000
- Dimension: 256

And I run the RAFT IVF-PQ like this:

```cpp
index_params.n_lists = 1024;
index_params.pq_bits = 8;
index_params.pq_dim  = 32;
index_params.metric  = raft::distance::DistanceType::L2Expanded;
search_params.n_probes = 128;
topK = 100;
auto index = raft::neighbors::ivf_pq::build<float, std::uint32_t>(handle_, index_params, base_view);
// indices_out_view and dists_out_view both have shape (QuerySize, topK).
raft::neighbors::ivf_pq::search(handle_, search_params, index, query_view, indices_out_view, dists_out_view);
```
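Putting it together, this is roughly how I wire the calls up (a sketch, mirroring the types in the snippet above; `base_dev`/`query_dev` stand for my device buffers and `n_base`/`n_queries` for the sizes listed earlier):

```cpp
#include <cstdint>

#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/neighbors/ivf_pq.cuh>

raft::device_resources handle_;

raft::neighbors::ivf_pq::index_params index_params;
index_params.n_lists = 1024;
index_params.pq_bits = 8;
index_params.pq_dim  = 32;
index_params.metric  = raft::distance::DistanceType::L2Expanded;

// base_dev and query_dev are device buffers already holding the data.
auto base_view = raft::make_device_matrix_view<const float, std::uint32_t>(
    base_dev.data(), n_base, d);
auto index =
    raft::neighbors::ivf_pq::build<float, std::uint32_t>(handle_, index_params, base_view);

raft::neighbors::ivf_pq::search_params search_params;
search_params.n_probes = 128;

// Output buffers of shape (n_queries, topK), as described above.
auto query_view = raft::make_device_matrix_view<const float, std::uint32_t>(
    query_dev.data(), n_queries, d);
auto indices_out = raft::make_device_matrix<std::uint32_t, std::uint32_t>(handle_, n_queries, topK);
auto dists_out   = raft::make_device_matrix<float, std::uint32_t>(handle_, n_queries, topK);

raft::neighbors::ivf_pq::search(handle_, search_params, index, query_view,
                                indices_out.view(), dists_out.view());
```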

The results are listed below:

| Search batch | QPS (FAISS-GPU) | QPS (RAFT) |
|---:|---:|---:|
| 1 | 2770 | 2093 |
| 4 | 5581 | 5809 |
| 8 | 7032 | 7672 |
| 16 | 10632 | 9075 |
| 32 | 12654 | 7013 |
| 64 | 14221 | 10307 |
| 128 | 15496 | 10073 |
| 1024 | 15871 | 12252 |

As shown in the table, I don't see any significant QPS improvement from RAFT (it is even slower in some cases). So I am wondering: is this result expected, or is there something wrong with my usage?

xxxdh added the question label on Jun 29, 2023
achirkin (Contributor) commented
Hi, thanks for the report! Could you please tell us which parameters you used when calling FAISS?

We've been reworking some memory allocation internals lately; this could affect the performance, but it should be fixed in the coming release. To see if this is the case, you can set the RMM memory resource to a pool, perhaps with a somewhat large initial size:

```cpp
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
...
auto pool_mr = rmm::mr::pool_memory_resource(
    rmm::mr::get_current_device_resource(), initial_pool_size
);
rmm::mr::set_current_device_resource(&pool_mr);
```
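Here `initial_pool_size` is just a byte count you pick up front, for example:

```cpp
// Example value only: pre-allocate 2 GiB; the pool can still grow on demand.
std::size_t initial_pool_size = std::size_t{2} * 1024 * 1024 * 1024;
```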

xxxdh commented Jul 4, 2023

Thanks for your reply!
I call the FAISS IVF-PQ using the code below:

```python
import faiss

nlist = 1024
m = 8      # number of PQ sub-quantizers
k = 100
d = 256
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per code
index.train(xb)
index.add(xb)
index.nprobe = 128
D, I = index.search(query, k)
```

I added the code you provided and RAFT became a bit faster than before:

| Batch | QPS (FAISS-GPU) | QPS (RAFT, end to end) |
|---:|---:|---:|
| 1 | 2770 | 4952 |
| 4 | 5581 | 8296 |
| 8 | 7032 | 9868 |
| 32 | 12654 | 11221 |
| 128 | 15496 | 12129 |
| 1024 | 15871 | 12485 |

However, I found a new issue: FAISS's search function includes copying the queries from host to device, searching on the GPU, and copying the results back from device to host, whereas RAFT's search function does not include the memory copies. So, for fairness, I also include the memory-copy time when calculating the search QPS for RAFT. I search the queries batch by batch and do the memory copies batch by batch as well; the results are listed in the table above.

I also measured the time spent on the host-to-device memcpy, the search, and the device-to-host memcpy by adding a sync_stream call after each part (if I am doing this correctly); the results are listed below:

| Batch | Host to device (ms) | Search (ms) | Device to host (ms) |
|---:|---:|---:|---:|
| 1 | 36.2 | 924 | 107 |
| 4 | 11.6 | 596 | 30.4 |
| 8 | 6.7 | 514 | 15.6 |
| 32 | 2.9 | 464 | 5.3 |
| 128 | 1.8 | 433 | 2.4 |
| 1024 | 1.4 | 422 | 1.5 |

As shown in the first table, when the search batch is small, RAFT performs better than FAISS, but when the search batch becomes larger, FAISS gets the better results. Is there still anything wrong with my test?
Also, according to the second table, the search time stops decreasing as the search batch grows once the batch is relatively large (>128). Is this the expected result?

For better understanding, I have also listed some of my code below:

```cpp
rmm::device_uvector<float> distance_dev(QuerySize * topK, stream_);
rmm::device_uvector<std::uint32_t> indices_dev(QuerySize * topK, stream_);

// CAL T0: start the timer here.
for (int batch : batches) {
  for (size_t i = 0; i < QuerySize; i += batch) {
    // Send a batch of queries from host to device.
    raft::update_device(query_dev.data() + i * d, query.data() + i * d, batch * d, stream_);
    // Search batch by batch. The row counts of query_view, dists_out_view, and
    // indices_out_view are set to min(batch, QuerySize - i).
    raft::neighbors::ivf_pq::search(handle_, search_params, index, query_view, indices_out_view, dists_out_view);
    // Send the batch of distance and index results from device to host.
    raft::update_host(distances_IVFPQ.data() + i * topK, distance_dev.data() + i * topK, batch * topK, stream_);
    raft::update_host(indices_IVFPQ.data() + i * topK, indices_dev.data() + i * topK, batch * topK, stream_);
  }
}
// CAL T1: stop the timer here.
```

QPS is calculated using T1 - T0.
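And this is roughly how I timed the three phases for the second table (a sketch of the inner-loop body; `handle_.sync_stream()` blocks the host until the stream's queued work finishes, so each host-side timestamp captures the full device time of the preceding phase):

```cpp
#include <chrono>

auto t0 = std::chrono::steady_clock::now();

raft::update_device(query_dev.data() + i * d, query.data() + i * d, batch * d, stream_);
handle_.sync_stream();  // wait for the host-to-device copy
auto t1 = std::chrono::steady_clock::now();

raft::neighbors::ivf_pq::search(handle_, search_params, index, query_view, indices_out_view, dists_out_view);
handle_.sync_stream();  // wait for the search kernels
auto t2 = std::chrono::steady_clock::now();

raft::update_host(distances_IVFPQ.data() + i * topK, distance_dev.data() + i * topK, batch * topK, stream_);
raft::update_host(indices_IVFPQ.data() + i * topK, indices_dev.data() + i * topK, batch * topK, stream_);
handle_.sync_stream();  // wait for the device-to-host copies
auto t3 = std::chrono::steady_clock::now();

// t1 - t0: host-to-device;  t2 - t1: search;  t3 - t2: device-to-host.
```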

achirkin commented Jul 4, 2023

Thanks for the detailed update! I'll be digging into this more, but for now I can share a quick bit of info observed independently in other benchmarks: it looks like RAFT's IVF-PQ has performance problems specifically for k in the range 65..128 (GPU kernel register-pressure issues). This is either a regression or an oversight from the time we last ran benchmarks.

xxxdh commented Jul 24, 2023

Hello, I read some of the test results in the slide deck you posted, [S51945] Improving Dense Text Retrieval Accuracy with Approximate Nearest Neighbor Search (https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51945/). According to your results, RAFT can be up to 8.0x faster than FAISS at small batch sizes and 2.4x faster at large batch sizes. I tested with the settings below:

| Parameter | My setting | Your setting |
|---|---|---|
| GPU | L4 | A100 |
| Data | Deep10M | Deep500M |
| Dim | 96 | 96 |
| Clusters | 1024 | 200K |
| PQ_Dim | 48 | 48 |
| PQ_Bit | 8 | 8 |
| topK | 100 | 100 |
| n_probe | 128 | 8~512 |

And I got the results:

| Batch | RAFT QPS | FAISS QPS | RAFT/FAISS |
|---:|---:|---:|---:|
| 1 | 1827.17 | 1736.3 | 1.05 |
| 8 | 5831.05 | 2402.49 | 2.43 |
| 128 | 10268.46 | 3002.34 | 3.42 |
| 1024 | 12007.7 | 3043.98 | 3.94 |

As you can see, I get more than 3x the FAISS throughput with large batches, which is consistent with your results. But I can only get 1~2x with small batches, which is not consistent with the results in the slides. I ran all the tests using the code I provided in my last comment. Is there anything wrong with my tests?

By the way, do you support FP16 data in brute-force search and IVF-PQ?

Thanks!

rapids-bot pushed a commit that referenced this issue on Aug 11, 2023:

Fix occasional slowdown of the compute-similarity kernels:

1. Allow using the fused version for small-work-size cases.
2. Switch to the warpsort implementation that uses fewer registers at the expense of shared memory.

Addresses (at least partially) #1621

Authors:
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1726