Serialize only knn graph with CAGRA bench #1743

Closed

tfeher
Contributor

@tfeher tfeher commented Aug 16, 2023

This PR changes how the CAGRA ANN benchmark saves the index. We save only the graph, because the dataset is usually available in a separate file.

@tfeher tfeher requested a review from a team as a code owner August 16, 2023 18:21
@github-actions github-actions bot added the cpp label Aug 16, 2023
@@ -129,15 +135,81 @@ void RaftCagra<T, IdxT>::set_search_param(const AnnSearchParam& param)
template <typename T, typename IdxT>
void RaftCagra<T, IdxT>::save(const std::string& file) const
{
  raft::neighbors::cagra::serialize(handle_, file, *index_);
  // 1 orig serialization: save both dataset and knn graph into the file
Member

@benfred and I were literally just talking about this. We have also agreed in the past that serializing the dataset should be optional, and I believe we also converged on providing an update_dataset() method right on the index to alleviate the awkwardness of having to accept a dataset at deserialization time.

Contributor Author

Alternatively, we can just construct the index from the two arrays, as done here:
https://github.com/rapidsai/raft/pull/1743/files#diff-5f08a0aeb75c8884f5142d218e97ec859de8cc0cffd98d8c08f0e18a45655da0R211-R212

Saving the knn-graph mdspan is trivial with the current helpers, but we could add the following code block as a helper to make it easy to load an mdspan:
https://github.com/rapidsai/raft/pull/1743/files#diff-5f08a0aeb75c8884f5142d218e97ec859de8cc0cffd98d8c08f0e18a45655da0R175-R181

Member

I have a PR up (#1781) that doesn't write out the dataset on serialize (by using the include_dataset param from #1755) and uses update_dataset to set the dataset after deserialize.

benfred added a commit to benfred/raft that referenced this pull request Aug 27, 2023
As an alternative to rapidsai#1743, this uses the `include_dataset=False` param
in cagra::serialize to avoid writing the dataset to disk with the index.
This lets us avoid writing a second copy of the dataset, since it is
available in a separate file already
rapids-bot bot pushed a commit that referenced this pull request Aug 28, 2023
As an alternative to #1743, this uses the `include_dataset=False` param in cagra::serialize to avoid writing the dataset to disk with the index. This lets us avoid writing a second copy of the dataset, since it is available in a separate file already

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1781
@tfeher
Contributor Author

tfeher commented Sep 4, 2023

Alternative solution provided in #1781. Closing this.

@tfeher tfeher closed this Sep 4, 2023
3 participants