Remove dataset from CAGRA index #1435
Conversation
Thanks @enp1s0 for the PR! This PR is addressing the following two issues:
If we want to solve these issues by removing the dataset from the index, then this PR does a clean job in doing so. But removing the dataset from the index is not the only way to solve these problems. Alternative option:
The question is: what shall the index represent? For the IVF methods in RAFT, the index is a self-contained object that stores everything that is needed for the search (including the dataset). Removing the dataset from the CAGRA index would assign a different meaning to the index: a helper structure for search, which might or might not need the dataset. Both are valid approaches. I slightly prefer the self-contained index. What is your opinion @cjnolet?
To resolve style issues, you could use the pre-commit hooks.
Hi, in the case of IVF-Flat and IVF-PQ, there is a strong dependency between the dataset and the rest of the index, so I think it is a reasonable solution to keep everything as one file. For example, the order of the dataset changes based on the clustering result, and in the case of PQ, the encoding result changes depending on the PQ code bit length used.

However, in the case of graph-based ANNS, the dependency between the dataset and the graph is not that strong. For example, a graph of low degree tends to have low accuracy, while a graph of high degree tends to have low search speed. Therefore, we create graphs of various degrees to find the one that best satisfies the performance requirements (size, accuracy, and speed), but we do not need to change the dataset at that time.

Conversely, there is a need to use the same graph but change the dataset. Suppose the data type of the original dataset is fp32. If the data type is changed to a lower-precision type such as fp16, the search speed will increase, but the search accuracy will decrease. How much accuracy is lost depends on the dataset, so there is a need to try multiple data types, but there is no need to change the graph just because the data type of the dataset changed.

Thus, since the dependency between dataset and graph is not that strong in graph-based ANNS, and since there are advantages to keeping them separate, I would prefer to have the dataset and graph as separate files.
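To make the decoupling argument above concrete, here is a minimal sketch (an editorial illustration, not part of the original comment) of pairing one dataset with graphs of different degrees. It reuses the index constructor shown in the serialization example later in this thread; `res`, `dataset`, `knn_graph_deg32`, and `knn_graph_deg64` are assumed to already exist.

```cpp
// Sketch only: one dataset shared by two CAGRA indices whose graphs have
// different degrees. The constructor mirrors the example further down this
// thread; the *_deg32 / *_deg64 graph variables are hypothetical.
auto index_deg32 = raft::neighbors::cagra::index<float, uint32_t>(
  res,
  raft::distance::DistanceType::L2Expanded,
  raft::make_const_mdspan(dataset.view()),
  raft::make_const_mdspan(knn_graph_deg32.view()));

auto index_deg64 = raft::neighbors::cagra::index<float, uint32_t>(
  res,
  raft::distance::DistanceType::L2Expanded,
  raft::make_const_mdspan(dataset.view()),
  raft::make_const_mdspan(knn_graph_deg64.view()));

// The reverse direction (same graph, a lower-precision copy of the dataset)
// would look the same with the dataset argument swapped instead.
```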
@tfeher I also prefer the self-contained index.

@anaruse I definitely see the argument that the dataset doesn't have to be coupled completely to the graph, but this does still assume the graph is representative of the dataset, correct? Right now that coupling is enforced through the index itself holding a reference to the dataset. One concern I have with removing the dataset from the index altogether is that it makes it too easy for a user to pass a completely different dataset to the search function. I think we can design this in a way that is more immediately obvious to users, making it clear what they intend to do while better avoiding the potential for accidentally misusing our APIs.

I propose we keep the dataset view on the index but provide a function that takes an index and produces a new index with a different dataset. Something like the following:

```cpp
template <typename T, typename IdxT, typename T_old>
index<T, IdxT> update_dataset(raft::resources const& handle,
                              index<T_old, IdxT> const& index,
                              T const& dataset);
```

This function would return a new index with all of the members of the old index (copied or referenced directly, depending on needs), but with the new dataset. Another thing we can do, which might be even cleaner, is provide a method right on the index itself that can produce a new index with a different dataset:

```cpp
auto new_index = index.update_dataset(new_dataset);
```
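For illustration only, a hedged usage sketch of the two proposed variants; neither function exists yet at this point in the discussion, and `handle`, `old_index`, and `fp16_dataset` are made-up names.

```cpp
// Hypothetical usage of the proposed API (names and signatures are not final):
// swap in a lower-precision copy of the dataset while keeping the same graph.
auto fp16_index  = update_dataset(handle, old_index, fp16_dataset);  // free-function form
auto fp16_index2 = old_index.update_dataset(fp16_dataset);           // member-function form
```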
@cjnolet @tfeher Thank you for your comments. The alternative approach @anaruse mentioned is to offer the functionality to manipulate the graph degree, dataset types, and other features dynamically after deserializing the index file.
Thanks everyone for the comments!
Separating the graph from the dataset can be done while saving / loading the data. Here is an example:

```cpp
#include "../../bench/ann/src/common/dataset.h"

#include <raft/core/detail/mdspan_numpy_serializer.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/serialize.hpp>
#include <raft/neighbors/cagra.cuh>

#include <fstream>
#include <iostream>

int main()
{
  using namespace raft;
  using namespace raft::neighbors;

  // (res and params were implicit in the original snippet)
  raft::device_resources res;
  cagra::index_params params;

  // Load a dataset using a custom dataset loader (e.g. loader for the bigANN input type)
  raft::bench::ann::BinFile<float> dataset_file("dataset_filename", "r");
  size_t n_rows;
  int dim;
  dataset_file.get_shape(&n_rows, &dim);
  auto dataset = make_host_matrix<float, uint32_t>(n_rows, dim);
  dataset_file.read(dataset.data_handle());

  // Train an index
  auto index = cagra::build(res, params, make_const_mdspan(dataset.view()));

  // Serialize the graph only
  std::ofstream of("cagra_graph", std::ios::out | std::ios::binary);
  serialize_mdspan(res, of, index.graph());
  of.close();

  // Deserialize the graph; read the numpy header first to recover the graph shape
  std::ifstream is("cagra_graph", std::ios::in | std::ios::binary);
  raft::detail::numpy_serializer::header_t header = raft::detail::numpy_serializer::read_header(is);
  is.seekg(0);  // rewind, deserialize_mdspan reads the header again
  auto knn_graph = make_host_matrix<uint32_t, uint32_t>(header.shape[0], header.shape[1]);
  deserialize_mdspan(res, is, knn_graph.view());
  is.close();

  // Create an index from the knn_graph and the dataset
  auto loaded_index = cagra::index<float, uint32_t>(res,
                                                    raft::distance::DistanceType::L2Expanded,
                                                    make_const_mdspan(dataset.view()),
                                                    knn_graph.view());
  return 0;
}
```

In the last step, the constructor creates memory copies of the dataset and graph. Ideally, we would only store a reference in case these arrays are already in device memory; in that case, constructing the index would be a cheap operation: just wrapping the pointers (mdspans) into the index structure. This should be addressed in a separate issue, so that we can focus on disk copies here. To make this nicer, we should have

The example above avoids creating unnecessary disk copies of the dataset. We shall discuss whether we need further convenience functions to simplify these steps. Please let me know what you think.
This goes one step further than just optimizing disk copies, and it is definitely an interesting idea to explore. Could you provide more information?
@tfeher Thank you for the comment.
I'll make a new issue as it is a bit independent of this PR. The approach you have proposed can address my concern while not changing the in-memory index format. (Sorry for missing your proposal earlier.)
Additionally, we could use the
I think I like this approach, but I wonder if we should make it even a little easier for the user and store a boolean in the serialized index format that denotes whether or not the dataset was serialized along with the index. That way a user who has created a moderately small index could just serialize and deserialize the entire index and dataset, while a user with a larger dataset can use the flag in the serialize function to turn off serializing the dataset. For deserialization, we can provide two functions: one that accepts the single serialized file and deserializes the dataset if it was serialized, and another implementation of the deserialize function that accepts two files, a serialized index and a dataset. Of course, if a user invokes the deserialize that only accepts the serialized index and the file does not contain the dataset, they could still use `update_dataset()` afterwards.
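One possible reading of the suggestion above, sketched as hypothetical signatures; this is not an existing RAFT API at this point in the discussion.

```cpp
// Hypothetical serialization API implied by the comment above.
// A boolean in the file records whether the dataset was written with the graph.
void serialize(raft::resources const& handle,
               const std::string& filename,
               const raft::neighbors::cagra::index<float, uint32_t>& index,
               bool include_dataset = true);

// Single-file overload: restores the dataset only if it was serialized.
raft::neighbors::cagra::index<float, uint32_t> deserialize(raft::resources const& handle,
                                                           const std::string& filename);

// Two-file overload: serialized index in one file, dataset in another.
raft::neighbors::cagra::index<float, uint32_t> deserialize(raft::resources const& handle,
                                                           const std::string& index_filename,
                                                           const std::string& dataset_filename);
```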
@cjnolet I am not convinced that the deserialize function shall load the dataset. I would expect the dataset to be stored in a custom format (e.g. the format used by big-ann-benchmarks, or HDF5, etc.), and we probably do not want to deal with these details.
@tfeher, the more I think about this, the more I believe we will likely never need to serialize the dataset with the index. If the user was able to provide the dataset when training the index, they'll be able to provide the dataset after deserializing the index. I just want to make sure we're keeping the dataset coupled to the index itself and not requiring it as an argument to the search function.
@enp1s0 we are getting close to burndown for 23.06 and I'm just checking in on the status of these changes. I believe we were discussing adding an `update_dataset()` function.
@cjnolet yes, we will add `update_dataset()`.
This PR aims to improve the workflow when dealing with large datasets. When experimenting with different versions of the knn-graph, we might want to construct indices with the same dataset (see #1435 for further discussion). If the dataset is already in device memory (and rows are properly aligned / padded), then we only store a reference to the dataset, so multiple indices can refer to the same dataset. Similarly, when `knn_graph` is a device array, we store only a reference. Additionally, this PR adds `update_dataset` and `update_graph` methods to the index. Closes #1479

Authors:
- Tamas Bela Feher (https://github.com/tfeher)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1494
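For reference, a short hedged sketch of how the `update_dataset` / `update_graph` methods mentioned above might be used; the variable names (`res`, `index`, `device_dataset`, `new_graph`) are placeholders, and the exact mdspan types may differ from the final API.

```cpp
// Sketch: swap the dataset and the graph of an existing CAGRA index.
// If the arrays are device arrays with a suitable layout, only references are
// stored; otherwise the data is copied into owning storage inside the index.
index.update_dataset(res, raft::make_const_mdspan(device_dataset.view()));
index.update_graph(res, raft::make_const_mdspan(new_graph.view()));
```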
Description
This PR removes the dataset from the current CAGRA index format.
Background
We create some indices using CAGRA, serialize them, and compare their performance. However, the current CAGRA index format includes the dataset, which means that every time we serialize the index, we also store a copy of the dataset with it. As a result, this can take up a large amount of storage space.
Test
The existing tests cover this change.