Add random subsampling for IVF methods #2077
Conversation
The build time slightly increases when random subsampling is enabled. Here are measurements with IVF-Flat on H100 with subsets of the DEEP dataset:
[build-time table not preserved]
There is also a slight variation in recall. DEEP-100M, IVF-Flat:
[recall table not preserved]
IVF-PQ for the same dataset:
[recall table not preserved]
Nice change! Is there a reason why subsample is added to the raft::spatial::knn::detail::utils namespace? It is used in raft::neighbors, so it would make sense to add it under raft::neighbors::detail::utils.
To establish a baseline for the expected variation, I have run tests with the original RAFT code (fixed-stride subsampling), but with input data where the vectors are shuffled. Index building and search were run with 10 different permutations of the input, and statistics on the recall variation are presented in the tables below; recall values are in percent (%). The dataset is DEEP-10M and the subsample ratio is 10. When the number of clusters is larger (while keeping the number of probes fixed), we see a larger variation in the results.
[recall-variation tables not preserved: IVF-Flat 1k clusters; IVF-Flat 10k clusters; IVF-PQ 1k clusters, dim_pq=64, pq_bits=5; IVF-PQ 1k clusters, dim_pq=64, pq_bits=8; IVF-PQ 10k clusters, dim_pq=96, pq_bits=8]
I have investigated random subsampling versus shuffling the input data while keeping fixed-stride subsampling. In both cases the recall value fluctuates with a std < 0.14. Comparing the mean recall of the two methods over 10 iterations, the difference between the average recall values is between 0.01% and 0.03%, so I would conclude that there is no significant change in recall. Example results: 10 build/search iterations were run with different random seeds. The table shows recall values and their variation for (orig) the dataset file shuffled and (this PR) random subsampling; the last column is the difference between the mean recall values. All recall values are in percent. The table shows results for DEEP-10M; results for bigANN-10M are similar.
[recall table not preserved]
Thanks @lowener for the review!
> why subsample is added to raft::spatial::knn::detail::utils namespace?

Currently the helper function is still in the old namespace. I have started a separate branch to move the ann_utils.cuh file to the neighbors namespace, and I will submit a separate PR for 24.04.
Currently the overhead of the random subsampling is larger than ideal; I am running tests to quantify it.
Overhead reduced. I switched to a batchwise copy of the data: this reduces the time spent allocating temporary buffers and makes it possible to overlap gathering the vectors with the host-to-device copies (sketched below). The relative overhead is larger for IVF-Flat than for IVF-PQ, because IVF-PQ index building has more work to do, so preparing the training set takes a smaller fraction of the build time. When the training set is a smaller fraction of the dataset, the overhead becomes smaller still. The data is gathered into a contiguous buffer before copying to the device, which can be slightly faster than the strided copy using cudaMemcpy2D that we had earlier: in the best case the H2D copies are slightly accelerated and completely overlap with the data gathering, which can lead to a ~1% build time speedup.
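To make the batchwise overlap concrete, here is a minimal sketch of the scheme described above: rows are gathered into one pinned staging buffer while the previous batch is transferred on a CUDA stream. The function names, the batch size, and the double-buffering details are illustrative assumptions, not the exact RAFT implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <cuda_runtime.h>

// Gather one batch of selected rows into a contiguous staging buffer.
// (In the real implementation this gather itself can be parallelized;
// see the OpenMP sketch in the PR description below.)
static void gather_batch(const float* dataset, const int64_t* rows,
                         int64_t n, int64_t dim, float* out) {
  for (int64_t i = 0; i < n; ++i) {
    std::memcpy(out + i * dim, dataset + rows[i] * dim, dim * sizeof(float));
  }
}

// Copy the selected training rows to the device in batches, overlapping the
// CPU gather of batch i+1 with the H2D transfer of batch i (double buffering).
void copy_trainset_batched(const float* dataset, const int64_t* rows,
                           int64_t n_train, int64_t dim,
                           float* d_trainset, cudaStream_t stream) {
  const int64_t batch_rows = 1 << 16;  // rows per batch, a tuning parameter
  float* staging[2];
  cudaEvent_t copied[2];
  for (int b = 0; b < 2; ++b) {
    cudaMallocHost(&staging[b], batch_rows * dim * sizeof(float));  // pinned
    cudaEventCreate(&copied[b]);
    cudaEventRecord(copied[b], stream);  // both buffers start out free
  }
  int b = 0;
  for (int64_t off = 0; off < n_train; off += batch_rows, b ^= 1) {
    int64_t n = std::min(batch_rows, n_train - off);
    cudaEventSynchronize(copied[b]);  // wait until buffer b may be reused
    gather_batch(dataset, rows + off, n, dim, staging[b]);
    cudaMemcpyAsync(d_trainset + off * dim, staging[b],
                    n * dim * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copied[b], stream);  // buffer b is free once this fires
  }
  cudaStreamSynchronize(stream);
  for (int bb = 0; bb < 2; ++bb) {
    cudaFreeHost(staging[bb]);
    cudaEventDestroy(copied[bb]);
  }
}
```

Because cudaMemcpyAsync from pinned memory returns immediately, the CPU starts gathering the next batch while the GPU pulls the previous one, which is what allows the gather cost to hide behind the transfer.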
Thanks for the PR @tfeher! Especially for the comprehensive analysis of the perf impact and for reducing the size of the ivf_pq_build.cuh file :)
A couple of small things regarding the use of new raft primitives below.
Thanks @achirkin for the review, I have addressed the issues.
Thanks for the updates, LGTM!
/merge
This reverts commit 9c35f73.
Random sampling of the training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method. Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization. Using that, we can now enable random sampling for the IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077). Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.
Authors:
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)
Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
URL: #122
While building IVF-Flat or IVF-PQ indices, we usually subsample the dataset to create a smaller training set for k-means clustering. Until now this subsampling was done with a fixed stride; this PR changes it to random subsampling (see the sketch below).
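For illustration, a minimal host-side sketch of the two selection strategies. The helper names are hypothetical, and the actual implementation differs: in particular, a full shuffle of all row indices, as below, would be too memory-hungry for large datasets, which is why the memory-efficient sampling method from rapidsai/raft#2155 mentioned above is used instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Old behavior: take every k-th row of the dataset, k = n_rows / n_train.
std::vector<int64_t> stride_indices(int64_t n_rows, int64_t n_train) {
  std::vector<int64_t> idx(n_train);
  int64_t stride = n_rows / n_train;
  for (int64_t i = 0; i < n_train; ++i) idx[i] = i * stride;
  return idx;
}

// New behavior: draw n_train distinct rows uniformly at random.
std::vector<int64_t> random_indices(int64_t n_rows, int64_t n_train,
                                    uint64_t seed) {
  std::vector<int64_t> idx(n_rows);
  std::iota(idx.begin(), idx.end(), int64_t{0});
  std::mt19937_64 rng(seed);
  std::shuffle(idx.begin(), idx.end(), rng);
  idx.resize(n_train);  // keep the first n_train shuffled indices
  return idx;
}
```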
The input is always randomized, even if all the vectors of the dataset are used.
Random sampling adds an overhead that is proportional to the training set size. If the dataset is in host memory, this overhead can be partially or completely masked by the H2D transfer, and it is small compared to the k-means training itself.
To completely overlap the random sampling of the data with the H2D copies, we use OpenMP parallelization to increase the effective bandwidth of gathering the data, as sketched below.
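A minimal sketch of such a parallel gather, assuming a row-major float dataset in host memory; the function name and signature are illustrative, not the exact RAFT code. Each thread copies a disjoint subset of the selected rows into the contiguous staging buffer, raising the effective host memory bandwidth so the gather can keep up with the transfer.

```cpp
#include <cstdint>
#include <cstring>

// Gather the selected rows into a contiguous output buffer.
// Compile with -fopenmp (or equivalent) to enable the parallel loop.
void gather_rows_omp(const float* dataset,    // [n_rows, dim], host memory
                     const int64_t* row_ids,  // [n_out] rows to gather
                     int64_t n_out, int64_t dim,
                     float* out) {            // [n_out, dim], contiguous
#pragma omp parallel for
  for (int64_t i = 0; i < n_out; ++i) {
    std::memcpy(out + i * dim, dataset + row_ids[i] * dim,
                dim * sizeof(float));
  }
}
```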