
[FEA] IVF index building with pinned H2D transfer #2106

Open
tfeher opened this issue Jan 22, 2024 · 1 comment
Labels
feature request (New feature or request), Vector Search

Comments

tfeher (Contributor) commented Jan 22, 2024

Is your feature request related to a problem? Please describe.
For IVF-Flat and IVF-PQ index building, large datasets are provided in host memory or as an mmap-ed file. After the cluster centers are trained, both methods stream through the whole dataset twice. Currently there is no overlap between the host-to-device copies and the accompanying data processing on the GPU.

Describe the solution you'd like
Use pinned buffers to copy the data to the GPU and overlap the transfers with GPU-side computation, as in the sketch below.
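
A minimal sketch of this pattern (illustrative CUDA C++, not RAFT code; `process_chunk`, the buffer count, and the launch parameters are placeholders): two pinned staging buffers and two streams are cycled so that the CPU-side gather and H2D transfer of one chunk overlap the GPU work on the previous one.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

// Hypothetical stand-in for the per-chunk GPU work (e.g. k-means predict).
__global__ void process_chunk(const float* chunk, size_t n) { /* ... */ }

void stream_dataset(const float* host_data, size_t n_total, size_t chunk_elems)
{
  float*       staging[2];
  float*       dev[2];
  cudaStream_t streams[2];
  for (int b = 0; b < 2; ++b) {
    cudaMallocHost((void**)&staging[b], chunk_elems * sizeof(float));  // pinned
    cudaMalloc((void**)&dev[b], chunk_elems * sizeof(float));
    cudaStreamCreate(&streams[b]);
  }

  for (size_t i = 0; i * chunk_elems < n_total; ++i) {
    int b = static_cast<int>(i & 1);
    // Drain the round that used this buffer pair two chunks ago before
    // overwriting its pinned staging area.
    cudaStreamSynchronize(streams[b]);
    size_t off = i * chunk_elems;
    size_t n   = std::min(chunk_elems, n_total - off);
    // CPU-side copy into pinned memory; runs while the other stream's
    // transfer and kernel are still executing on the GPU.
    std::memcpy(staging[b], host_data + off, n * sizeof(float));
    // H2D copy and kernel are stream-ordered, so the kernel always sees
    // the completed transfer.
    cudaMemcpyAsync(dev[b], staging[b], n * sizeof(float),
                    cudaMemcpyHostToDevice, streams[b]);
    process_chunk<<<256, 256, 0, streams[b]>>>(dev[b], n);
  }

  cudaDeviceSynchronize();
  for (int b = 0; b < 2; ++b) {
    cudaFreeHost(staging[b]);
    cudaFree(dev[b]);
    cudaStreamDestroy(streams[b]);
  }
}
```

Because the source of each `cudaMemcpyAsync` is pinned, the transfer is truly asynchronous and the copy engine can run it concurrently with the kernel on the other stream.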

Additional context

  • Since the dataset can be larger than the physical (host) memory of the system, it is not possible to load the whole dataset into pinned memory.

  • Index subsampling already uses pinned buffers to overlap vector gathering and H2D copies (commit 5485557)

IVF-Flat and IVF-PQ stream through the whole dataset here:

  • assign vectors to cluster centers (k-means predict): IVF-Flat, IVF-PQ
  • copy vectors to their respective clusters (additionally encoding the vectors and mapping them to a specific layout): IVF-Flat, IVF-PQ

We use `batch_load_iterator` to copy the data from host to device in batches. Ideally, we could improve the batch load iterator to prefetch the data into a pinned buffer; a hypothetical sketch follows.
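
One possible shape for such a prefetching loader (the class name, interface, and buffering policy below are all hypothetical, not the actual `batch_load_iterator` API): batch i+1 is staged into a pinned buffer and its transfer enqueued while the caller is still consuming batch i.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

// Illustrative only: a prefetching, double-buffered batch loader in the
// spirit of batch_load_iterator. Not RAFT code.
class pinned_batch_loader {
 public:
  pinned_batch_loader(const float* src, size_t n_total, size_t batch)
    : src_(src), n_total_(n_total), batch_(batch)
  {
    for (int b = 0; b < 2; ++b) {
      cudaMallocHost((void**)&pinned_[b], batch * sizeof(float));  // pinned
      cudaMalloc((void**)&dev_[b], batch * sizeof(float));
      cudaStreamCreate(&streams_[b]);
    }
    stage(0);  // start the first transfer eagerly
  }

  ~pinned_batch_loader()
  {
    for (int b = 0; b < 2; ++b) {
      cudaStreamSynchronize(streams_[b]);
      cudaFreeHost(pinned_[b]);
      cudaFree(dev_[b]);
      cudaStreamDestroy(streams_[b]);
    }
  }

  // Returns the device pointer and stream for batch i, and kicks off the
  // staging of batch i+1. The caller must launch its kernels on the
  // returned stream so they are ordered after the batch's H2D copy.
  const float* next(size_t i, size_t* n, cudaStream_t* stream)
  {
    if ((i + 1) * batch_ < n_total_) stage(i + 1);
    int b   = static_cast<int>(i & 1);
    *n      = std::min(batch_, n_total_ - i * batch_);
    *stream = streams_[b];
    return dev_[b];
  }

 private:
  void stage(size_t i)
  {
    int b = static_cast<int>(i & 1);
    // Wait until the copy and the caller's kernels from two batches ago
    // have drained before overwriting this pinned buffer.
    cudaStreamSynchronize(streams_[b]);
    size_t off = i * batch_;
    size_t n   = std::min(batch_, n_total_ - off);
    std::memcpy(pinned_[b], src_ + off, n * sizeof(float));  // CPU gather
    cudaMemcpyAsync(dev_[b], pinned_[b], n * sizeof(float),
                    cudaMemcpyHostToDevice, streams_[b]);
  }

  const float* src_;
  size_t n_total_, batch_;
  float* pinned_[2];
  float* dev_[2];
  cudaStream_t streams_[2];
};
```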

tfeher added the feature request and Vector Search labels Jan 22, 2024
tfeher (Contributor, Author) commented Jan 22, 2024

Tagging @abc99lr who plans to work on this, and @achirkin for visibility.

rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this issue Jul 31, 2024
Currently, in IVF index building (both IVF-Flat and IVF-PQ), the large dataset usually resides in pageable host memory or an mmap-ed file. In both cases, after the cluster centers are trained, the entire dataset needs to be copied to the GPU twice -- once for assigning vectors to clusters, and once for copying vectors to their corresponding clusters. Both copies are done using `batch_load_iterator` in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current `batch_load_iterator` implementation cannot overlap kernels with memory copies. This PR adds support for prefetching with `cudaMemcpyAsync` on pageable memory. We achieve kernel/copy overlap by launching the kernel first, followed by prefetching the next chunk.

We benchmarked the change on L40S. The results show a 3%-21% speedup in index building without impacting search recall (recall differences of about 1-2% are similar to run-to-run variance).
algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup
-- | -- | -- | -- | -- | --
IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03
IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21
IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #230
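
The ordering the commit message describes can be sketched as follows (an illustration under the stated assumptions, not the actual cuVS implementation): because `cudaMemcpyAsync` from pageable memory may block the calling host thread, the kernel for the already-resident chunk is launched first and the prefetch of the next chunk is issued afterwards, letting the transfer overlap the running kernel.

```cpp
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical stand-in for the per-chunk work (e.g. cluster assignment).
__global__ void process_chunk(const float* chunk, size_t n) { /* ... */ }

void build_pass(const float* pageable_src, size_t n_total, size_t chunk_elems)
{
  float*       dev[2];
  cudaStream_t compute, copy;
  cudaEvent_t  done[2];
  cudaMalloc((void**)&dev[0], chunk_elems * sizeof(float));
  cudaMalloc((void**)&dev[1], chunk_elems * sizeof(float));
  cudaStreamCreate(&compute);
  cudaStreamCreate(&copy);
  cudaEventCreate(&done[0]);
  cudaEventCreate(&done[1]);

  // Bring in the first chunk; nothing to overlap with yet.
  size_t n0 = std::min(chunk_elems, n_total);
  cudaMemcpyAsync(dev[0], pageable_src, n0 * sizeof(float),
                  cudaMemcpyHostToDevice, copy);
  cudaStreamSynchronize(copy);

  for (size_t i = 0; i * chunk_elems < n_total; ++i) {
    int    b = static_cast<int>(i & 1);
    size_t n = std::min(chunk_elems, n_total - i * chunk_elems);
    // 1) Launch the kernel on the chunk that is already resident ...
    process_chunk<<<256, 256, 0, compute>>>(dev[b], n);
    cudaEventRecord(done[b], compute);
    // 2) ... then prefetch the next chunk. Even if this call blocks the
    //    host (pageable source), the transfer overlaps the kernel above.
    size_t next_off = (i + 1) * chunk_elems;
    if (next_off < n_total) {
      size_t next_n = std::min(chunk_elems, n_total - next_off);
      // Do not overwrite dev[b ^ 1] until the kernel that read it is done.
      cudaStreamWaitEvent(copy, done[b ^ 1], 0);
      cudaMemcpyAsync(dev[b ^ 1], pageable_src + next_off,
                      next_n * sizeof(float), cudaMemcpyHostToDevice, copy);
      // The next iteration's kernel needs this copy to be complete.
      cudaStreamSynchronize(copy);
    }
  }
  cudaStreamSynchronize(compute);
  for (int b = 0; b < 2; ++b) { cudaFree(dev[b]); cudaEventDestroy(done[b]); }
  cudaStreamDestroy(compute);
  cudaStreamDestroy(copy);
}
```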
divyegala pushed a commit to divyegala/cuvs that referenced this issue Aug 7, 2024