random sampling of dataset rows with improved memory utilization #2155

Merged: 19 commits, Mar 19, 2024

Contributor

@tfeher tfeher commented Feb 5, 2024

Random sampling of the training set for IVF methods was reverted (#2144) because of its large memory utilization (#2141).

This PR reduces the memory consumption of subsampling: it is now O(n_train), where n_train is the size of the subsampled dataset.

This PR adds the following new APIs:

  • random::excess_sampling (TODO: may just be called sample_without_replacement)
  • matrix::sample_rows
  • matrix::gather for host input matrix
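
To illustrate the idea behind these APIs, here is a minimal host-side sketch in plain C++. It is not the RAFT implementation: the function names `excess_sample_indices` and `gather_rows`, the excess amount, and the retry loop are all assumptions made for illustration. The sketch assumes that "excess sampling" means drawing slightly more indices than needed with replacement, removing duplicates, and keeping a shuffled subset, so that the working memory stays proportional to the number of sampled rows rather than to the full dataset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not the RAFT API: draw n_samples distinct row indices
// from [0, n_rows) while holding only O(n_samples) indices in memory.
std::vector<std::int64_t> excess_sample_indices(std::int64_t n_rows,
                                                std::int64_t n_samples,
                                                std::mt19937_64& rng)
{
  if (n_samples < 0 || n_samples > n_rows) {
    throw std::invalid_argument("n_samples must be in [0, n_rows]");
  }
  std::vector<std::int64_t> picked;
  if (n_samples == 0) { return picked; }

  // Assumed excess: about 10% extra draws plus a small constant.
  const auto n_excess = static_cast<std::size_t>(n_samples + n_samples / 10 + 16);
  picked.reserve(n_excess);
  std::uniform_int_distribution<std::int64_t> dist(0, n_rows - 1);

  while (picked.size() < static_cast<std::size_t>(n_samples)) {
    // Top up with draws made with replacement, then drop duplicates.
    while (picked.size() < n_excess) { picked.push_back(dist(rng)); }
    std::sort(picked.begin(), picked.end());
    picked.erase(std::unique(picked.begin(), picked.end()), picked.end());
  }
  // The survivors are sorted; shuffle before truncating so that the kept
  // subset is not biased toward small indices.
  std::shuffle(picked.begin(), picked.end(), rng);
  picked.resize(static_cast<std::size_t>(n_samples));
  return picked;
}

// Hypothetical host-side gather: copy the selected rows of a row-major
// matrix (n_cols values per row) into a new contiguous buffer.
std::vector<float> gather_rows(const std::vector<float>& data,
                               std::int64_t n_cols,
                               const std::vector<std::int64_t>& indices)
{
  assert(n_cols > 0);
  const auto cols = static_cast<std::size_t>(n_cols);
  std::vector<float> out;
  out.reserve(indices.size() * cols);
  for (std::int64_t row : indices) {
    const auto offset = static_cast<std::size_t>(row) * cols;
    assert(offset + cols <= data.size());
    out.insert(out.end(), data.begin() + offset, data.begin() + offset + cols);
  }
  return out;
}
```

In this sketch the index buffer never grows beyond the excess size, and the gathered output holds only the sampled rows. The retry loop is cheap when n_samples is much smaller than n_rows; when the two are close, a different strategy (for example, shuffling all indices) would be more appropriate.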

@tfeher tfeher added enhancement New feature or request non-breaking Non-breaking change Vector Search labels Feb 5, 2024
@tfeher tfeher self-assigned this Feb 5, 2024
@github-actions github-actions bot added the cpp label Feb 5, 2024
@github-actions github-actions bot added the CMake label Mar 12, 2024
@tfeher tfeher changed the title IVF random sampling with improved memory utilization. random sampling of dataset rows with improved memory utilization. Mar 13, 2024
@tfeher tfeher changed the title random sampling of dataset rows with improved memory utilization. random sampling of dataset rows with improved memory utilization Mar 13, 2024
@tfeher tfeher marked this pull request as ready for review March 13, 2024 09:00
@tfeher tfeher requested review from a team as code owners March 13, 2024 09:00
@tfeher tfeher added improvement Improvement / enhancement to an existing function and removed enhancement New feature or request labels Mar 13, 2024
Contributor

@achirkin achirkin left a comment

Thanks, Tamas, for the PR! Looks good overall and I like the idea of the excess sampling. I just have a few minor questions.

cpp/bench/prims/random/subsample.cu
cpp/include/raft/random/detail/rng_impl.cuh
cpp/include/raft/spatial/knn/detail/ann_utils.cuh
cpp/test/random/excess_sampling.cu
cpp/include/raft/matrix/detail/gather.cuh
Contributor Author

@tfeher tfeher left a comment

Thanks @achirkin for the review, I have addressed the issues.

Contributor Author

tfeher commented Mar 18, 2024

/rerun tests

Contributor

@achirkin achirkin left a comment

Thanks for the updates, Tamas, LGTM!

Member

@benfred benfred left a comment

lgtm!

Member

benfred commented Mar 19, 2024

/merge

@rapids-bot rapids-bot bot merged commit 0b9692b into rapidsai:branch-24.04 Mar 19, 2024
71 checks passed
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this pull request Aug 1, 2024
Random sampling of the training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method.

Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization. With that in place, we can now re-enable random sampling for IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077).

Random subsampling has measurable overhead for IVF-Flat, so it is enabled only for IVF-PQ.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #122
divyegala pushed a commit to divyegala/cuvs that referenced this pull request Aug 7, 2024