[FEA] Improve CAGRA serialization #1729

Closed
lowener opened this issue Aug 9, 2023 · 0 comments · Fixed by #1755
Labels
feature request New feature or request

Comments

@lowener
Contributor

lowener commented Aug 9, 2023

CAGRA should write its data type at the beginning of the serialization file, just like IVF Flat, so that the Python side can deserialize it more easily.
#1717 introduces a temporary fix on the Python side, which is needed because the dtype is not at a fixed offset in the serialization file.

@lowener added the "feature request" (New feature or request) label Aug 9, 2023
benfred added a commit to benfred/raft that referenced this issue Aug 18, 2023
This changes the serialization format of saved CAGRA instances in two ways:

* The dtype is now written in the first 4 bytes of the index, to match
the IVF methods and to make it easier to deduce the dtype from Python (rapidsai#1729)
* Writing out the dataset with the index is now optional. Many use cases
already have the dataset written out separately, so this makes it possible
to save disk space by not writing an extra copy of the input dataset.
If the `include_dataset=false` option is given, you have to call `index.update_dataset`
to set the dataset yourself after loading
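To illustrate why a fixed-offset dtype helps the Python side, here is a minimal sketch of peeking at the first 4 bytes of a file before choosing a typed loader. It assumes the tag is a NumPy-style type string (such as `<f4`) padded with null bytes to 4 bytes; that encoding, the `KNOWN_DTYPES` table, and the helper names `peek_dtype` and `write_fake_index` are illustrative assumptions, not the library's API.

```python
import os
import tempfile

# Assumed mapping from NumPy-style type strings to dtype names.
# That the 4-byte prefix uses this encoding is an assumption for illustration.
KNOWN_DTYPES = {"<f4": "float32", "|i1": "int8", "|u1": "uint8"}


def peek_dtype(path):
    """Return the dtype name stored in the first 4 bytes of a file."""
    with open(path, "rb") as f:
        raw = f.read(4)
    if len(raw) < 4:
        raise ValueError("file too short to contain a dtype tag")
    # Drop the assumed null/space padding before decoding the tag.
    tag = raw.rstrip(b"\x00 ").decode("ascii")
    if tag not in KNOWN_DTYPES:
        raise ValueError(f"unrecognized dtype tag: {tag!r}")
    return KNOWN_DTYPES[tag]


def write_fake_index(path, tag, payload=b""):
    """Write a stand-in file: a 4-byte padded tag followed by opaque bytes."""
    with open(path, "wb") as f:
        f.write(tag.encode("ascii").ljust(4, b"\x00"))
        f.write(payload)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.bin")
        write_fake_index(path, "|u1", payload=b"rest of the index")
        print(peek_dtype(path))
```

Because the tag sits at offset 0, the reader never has to parse any other part of the file to decide which typed deserializer to call.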
@benfred linked a pull request Aug 18, 2023 that will close this issue
rapids-bot bot pushed a commit that referenced this issue Aug 21, 2023
This changes the serialization format of saved CAGRA indices in two ways:

* The dtype is now written in the first 4 bytes of the serialized file, to match the IVF methods and to make it easier to deduce the dtype from Python (#1729)
* Writing out the dataset with the index is now optional. Many use cases already have the dataset written out separately, so this makes it possible to save disk space by not writing an extra copy of the input dataset. If the `include_dataset=false` option is given, you have to call `index.update_dataset` to set the dataset yourself after loading
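The save-without-dataset workflow described above can be sketched with a toy format. Everything here is hypothetical: `ToyIndex`, `save`, `load`, and the byte layout are stand-ins written for illustration and do not reflect the real CAGRA file format or bindings. Only the ideas (dtype tag first, optional dataset, calling `update_dataset` after loading) come from the description.

```python
import io
import struct


class ToyIndex:
    """Hypothetical stand-in for a graph index; not the real CAGRA class."""

    def __init__(self, graph, dataset=None):
        self.graph = graph      # list of equal-length neighbor-id lists
        self.dataset = dataset  # list of float rows, or None

    def update_dataset(self, dataset):
        self.dataset = dataset


def save(index, stream, include_dataset=True):
    """Write a toy layout: dtype tag, header, graph, then optional dataset."""
    rows, degree = len(index.graph), len(index.graph[0])
    write_data = include_dataset and index.dataset is not None
    stream.write(b"<f4\x00")  # 4-byte dtype tag comes first
    stream.write(struct.pack("<IIB", rows, degree, int(write_data)))
    for neighbors in index.graph:
        stream.write(struct.pack(f"<{degree}I", *neighbors))
    if write_data:
        dim = len(index.dataset[0])
        stream.write(struct.pack("<I", dim))
        for row in index.dataset:
            stream.write(struct.pack(f"<{dim}f", *row))


def load(stream):
    """Read the toy layout back; dataset is None if it was not saved."""
    stream.read(4)  # skip the dtype tag in this float32-only sketch
    rows, degree, has_data = struct.unpack("<IIB", stream.read(9))
    graph = [list(struct.unpack(f"<{degree}I", stream.read(4 * degree)))
             for _ in range(rows)]
    dataset = None
    if has_data:
        (dim,) = struct.unpack("<I", stream.read(4))
        dataset = [list(struct.unpack(f"<{dim}f", stream.read(4 * dim)))
                   for _ in range(rows)]
    return ToyIndex(graph, dataset)


if __name__ == "__main__":
    vectors = [[0.5, 1.0], [2.0, -4.0]]
    index = ToyIndex([[1], [0]], vectors)
    buf = io.BytesIO()
    save(index, buf, include_dataset=False)  # smaller file, no vectors
    restored = load(io.BytesIO(buf.getvalue()))
    restored.update_dataset(vectors)         # caller supplies the dataset
```

The trade-off is the one the description names: the file is smaller, but the caller becomes responsible for attaching the same dataset the index was built from.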

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1755