
Remove dataset from CAGRA index #1435

Closed
wants to merge 3 commits

Conversation

enp1s0
Member

@enp1s0 enp1s0 commented Apr 20, 2023

Description

This PR removes the dataset from the current CAGRA index format.

Background

We create some indices using CAGRA, serialize them, and compare their performance. However, the current CAGRA index format includes the dataset, which means that every time we serialize the index, we also store a copy of the dataset with it. As a result, serialization can consume a large amount of storage space.

Test

The existing tests cover this change.

@enp1s0 enp1s0 requested a review from a team as a code owner April 20, 2023 10:25
@rapids-bot

rapids-bot bot commented Apr 20, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the cpp label Apr 20, 2023
@cjnolet
Member

cjnolet commented Apr 20, 2023

/ok to test

@tfeher
Contributor

tfeher commented Apr 20, 2023

Thanks @enp1s0 for the PR!

This PR is addressing the following two issues:

  • We create a copy in device memory, which might not be necessary, as discussed in CAGRA #1375 (comment).

  • If the dataset is already on disk then it is unnecessary to save it with the index. This is particularly problematic if we want to generate multiple graphs for the same dataset to experiment with hyper parameters.

If we want to solve these issues by removing the dataset from the index, then this PR does a clean job in doing so.

But removing the dataset from the index is not the only way to solve these problems. Alternative option:

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

The question is what shall the index represent? For the IVF methods in RAFT, the index is a self contained object that stores everything that is needed for the search (including the dataset). Removing the dataset from the CAGRA index would assign a different meaning to the index: a helper structure for search, which might or might not need the dataset. Both are valid approaches. I slightly prefer the self contained index. What is your opinion @cjnolet?

@tfeher tfeher added improvement Improvement / enhancement to an existing function breaking Breaking change labels Apr 20, 2023
@tfeher
Contributor

tfeher commented Apr 20, 2023

To resolve style issues, you could use the pre-commit hooks.

@cjnolet cjnolet assigned cjnolet and enp1s0 and unassigned cjnolet Apr 20, 2023
@anaruse
Contributor

anaruse commented Apr 21, 2023

Hi,

In the case of IVFflat and IVFPQ, there is a strong dependency between the dataset and the rest of the "index", so I think it is a reasonable solution to keep everything as one file. For example, the order of the dataset changes based on the clustering result, and in the case of PQ, the encoding result changes depending on the PQ code bit length used.

However, in the case of graph-based ANNS, the dependency between the dataset and the graph is not that strong.

For example, in graph-based ANNS, a graph of low degree tends to have low accuracy, while a graph of high degree tends to be slow. Therefore, we create graphs of various degrees to find the one that best satisfies the performance requirements (size, accuracy, and speed), but the dataset does not need to change during this process.

Conversely, there is also a need to use the same graph but change the dataset. Suppose the data type of the original dataset is fp32. If the data type is changed to a lower-precision type such as fp16, the speed will increase, but the search accuracy will decrease. How much accuracy is lost depends on the dataset, so there is a need to try multiple data types; however, there is no need to change the graph just because the data type of the dataset is changed.

Thus, since the dependency between dataset and graph is not that strong in graph-based ANNS, and since there are advantages to keeping them separate, I would prefer to have the dataset and graph as separate files.

@cjnolet
Member

cjnolet commented Apr 24, 2023

I slightly prefer the self contained index. What is your opinion @cjnolet?

@tfeher I also prefer the self-contained index.

@anaruse I definitely see the argument that the dataset doesn't have to be coupled completely to the graph but this does still assume the graph is representative of the dataset, correct? Right now coupling is enforced through the index itself having a reference to the dataset. One concern I have with removing the dataset from the index altogether is that it makes it too easy for a user to pass in a completely different dataset to search, which is no longer representative of the graph in the index.

I think we can design this in a way which will be more immediately obvious to users, making it clear to the user what they are intending to do while better avoiding the potential for a user accidentally misusing our APIs. I propose we keep the dataset view on the index but provide a function that takes an index and produces a new index with a different dataset. Something like the following:

template <typename T, typename IdxT, typename T_old>
index<T, IdxT> update_dataset(raft::resources const& handle,
                              index<T_old, IdxT> const& index,
                              T const& dataset);

This function will return a new index with all of the members of the old index (copied or referenced directly, depending on needs), but with the new dataset. This makes the user's intention very clear: when they want to swap out the dataset, they need to make sure the dataset they pass in is representative of the original graph. Otherwise, they can use search() on whichever index they prefer.

Another thing we can do, which might be even cleaner, is provide a method right on the index itself that can produce a new index with a different dataset:

auto new_index = index.update_dataset(new_dataset);
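For illustration, the pattern described above can be sketched in isolation. The struct and function below are toy stand-ins, not the RAFT API: the graph is a flat adjacency list and the dataset a flat row-major matrix, and `update_dataset` returns a new index that reuses the old graph topology with a different dataset payload.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for cagra::index (illustrative only).
template <typename T>
struct toy_index {
  std::vector<uint32_t> graph;  // knn graph (n_rows x degree), unchanged by update
  std::vector<T> dataset;       // row-major dataset the graph was built from
  size_t n_rows = 0, dim = 0, degree = 0;
};

// Produce a new index that reuses the old graph but carries a different
// dataset (e.g. an fp16/int8 re-encoding of the same vectors). The graph
// topology is copied as-is; only the dataset payload changes.
template <typename T_new, typename T_old>
toy_index<T_new> update_dataset(const toy_index<T_old>& old_index,
                                std::vector<T_new> new_dataset, size_t new_dim) {
  assert(new_dataset.size() == old_index.n_rows * new_dim);
  toy_index<T_new> out;
  out.graph   = old_index.graph;  // same neighborhood structure
  out.n_rows  = old_index.n_rows;
  out.degree  = old_index.degree;
  out.dim     = new_dim;
  out.dataset = std::move(new_dataset);
  return out;
}
```

The point of the sketch is that swapping the dataset is an explicit, named operation, so a user cannot accidentally search a graph with an unrelated dataset.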

@anaruse
Contributor

anaruse commented Apr 25, 2023

I discussed this with @enp1s0 and we came up with an idea that could solve or at least alleviate various issues, so I think we will close this PR and @enp1s0 will submit an issue or PR based on that idea.

@enp1s0
Member Author

enp1s0 commented Apr 25, 2023

@cjnolet @tfeher Thank you for your comments.
I completely agree with you that the current index format is effective in preventing API misuse. Furthermore, the update_dataset function seems to be a suitable solution for modifying the dataset.
However, I would like to address my original concern, which is the need to store a copy of the dataset each time we serialize the index.
As an example, the Microsoft Turing-ANNS 1B benchmark dataset has a size of 400GB, and I believe it would be inconvenient for users to make multiple copies of such a large dataset.
1B-scale datasets are typically more than 100GB in size.
(Benchmark dataset examples: https://big-ann-benchmarks.com/#bench-datasets ; DatasetSize = sizeof(Datatype) * Dimension * 1B)
Therefore, I believe separating the graph and dataset is the best solution concerning the storage size.
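The size formula above can be checked directly; assuming the Microsoft Turing-ANNS dataset is 1B fp32 vectors of dimension 100, it reproduces the 400GB figure:

```cpp
#include <cstdint>

// DatasetSize = sizeof(Datatype) * Dimension * NumVectors
constexpr uint64_t dataset_bytes(uint64_t elem_size, uint64_t dim, uint64_t n) {
  return elem_size * dim * n;
}

// Microsoft Turing-ANNS: 1B float32 vectors, 100 dimensions -> 400 GB
static_assert(dataset_bytes(sizeof(float), 100, 1000000000ULL) == 400000000000ULL,
              "matches the ~400GB figure quoted above");
```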

The alternative approach @anaruse mentioned is to offer the functionality to manipulate the graph degree, dataset types, and other features dynamically after deserializing the index file.
To compare the performance of different graph configurations, we create a large graph initially and serialize it.
Upon deserialization, we measure the performance while altering the configurations using the provided functionality.
This approach requires making only one copy of the dataset, thereby eliminating the need for multiple copies.
If there is no problem with this approach, I will close this PR and submit a new one for the functionality.

@tfeher
Contributor

tfeher commented Apr 25, 2023

Thanks everyone for the comments!

However, I would like to address my original concern, which is the need to store a copy of the dataset each time we serialize the index. [...] I believe separating the graph and dataset is the best solution concerning the storage size.

Separating the graph from the dataset can be done while saving / loading the data. Here is an example:

#include "../../bench/ann/src/common/dataset.h"
#include <iostream>
#include <raft/core/detail/mdspan_numpy_serializer.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/serialize.hpp>
#include <raft/neighbors/cagra.cuh>

// Load a dataset using a custom dataset loader (e.g. loader for bigANN input type)
raft::bench::ann::BinFile<float> dataset_file("dataset_filename", "r");
size_t n_rows;
int dim;
dataset_file.get_shape(&n_rows, &dim);
auto dataset = make_host_matrix<float, uint32_t>(n_rows, dim);
dataset_file.read(dataset.data_handle());

// Train an index
auto index = cagra::build(res, params, make_const_mdspan(dataset.view()));

// Serialize graph only
std::ofstream of("cagra_graph", std::ios::out | std::ios::binary);
serialize_mdspan(res, of, index.graph());
of.close();

// Deserialize graph
std::ifstream is("cagra_graph", std::ios::in | std::ios::binary);
raft::detail::numpy_serializer::header_t header = raft::detail::numpy_serializer::read_header(is);
is.seekg(0);  // rewind
auto knn_graph = make_host_matrix<uint32_t, uint32_t>(header.shape[0], header.shape[1]);
deserialize_mdspan(res, is, knn_graph.view());
is.close();

// Create index from knn_graph and dataset
auto loaded_index = cagra::index<float, uint32_t>(res,
                                                  raft::distance::DistanceType::L2Expanded,
                                                  make_const_mdspan(dataset.view()),
                                                  knn_graph.view());

In the last step, the constructor creates memory copies of the dataset and graph. Ideally, we would only store a reference in case these arrays are already in device memory. In that case constructing the index would be a cheap operation: just wrapping the pointers (mdspans) into the index structure. This should be addressed in a separate issue, so that we can focus on disk copies here.

To make this nicer, we should have a deserialize_host_matrix() utility to encapsulate the steps under "Deserialize graph".

The example above avoids creating unnecessary disk copies of the dataset. We shall discuss whether we need further convenience functions to simplify these steps. Please let me know what you think.
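For illustration, such a utility could hide the header-then-payload round trip behind one call. The sketch below is standalone and uses a simplified header of two uint64 extents rather than RAFT's NumPy serializer; the struct and function names are illustrative, not the RAFT API.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Minimal host-side matrix: two extents plus a row-major payload,
// e.g. a CAGRA knn graph of neighbor ids.
struct host_matrix {
  uint64_t n_rows = 0, n_cols = 0;
  std::vector<uint32_t> data;
};

// Write the shape header followed by the payload.
void serialize_host_matrix(std::ostream& os, const host_matrix& m) {
  os.write(reinterpret_cast<const char*>(&m.n_rows), sizeof(m.n_rows));
  os.write(reinterpret_cast<const char*>(&m.n_cols), sizeof(m.n_cols));
  os.write(reinterpret_cast<const char*>(m.data.data()),
           static_cast<std::streamsize>(m.data.size() * sizeof(uint32_t)));
}

// Read the header first to size the allocation, then the payload -- the same
// two steps done by hand in the example above (read_header + deserialize_mdspan).
host_matrix deserialize_host_matrix(std::istream& is) {
  host_matrix m;
  is.read(reinterpret_cast<char*>(&m.n_rows), sizeof(m.n_rows));
  is.read(reinterpret_cast<char*>(&m.n_cols), sizeof(m.n_cols));
  m.data.resize(m.n_rows * m.n_cols);
  is.read(reinterpret_cast<char*>(m.data.data()),
          static_cast<std::streamsize>(m.data.size() * sizeof(uint32_t)));
  return m;
}
```

With this, "Deserialize graph" collapses to a single `deserialize_host_matrix(is)` call and the caller never touches the header format.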

alternative approach @anaruse mentioned is to offer the functionality to manipulate the graph degree, dataset types, and other features dynamically after deserializing the index file.

This goes one step further than just optimizing disk copies, and it is definitely an interesting idea to explore. Could you provide more information?

@enp1s0
Member Author

enp1s0 commented Apr 25, 2023

@tfeher Thank you for the comment.

This goes one step further than just optimizing disk copies, and it is definitely an interesting idea to explore. Could you provide more information?

I'll make a new issue as it is a bit independent of this PR.

The approach you have proposed can address my concern without changing the in-memory index format. (Sorry for overlooking your proposal earlier.)

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

Additionally, we could use the update_dataset function to incorporate the dataset into the index when the index file does not contain one.
What do you think about this approach, @cjnolet ?

@cjnolet
Member

cjnolet commented Apr 25, 2023

  • The serialize method could have a flag to prevent saving the dataset.
  • The deserialize method could take two files (graph and dataset). Alternatively we could have a constructor that takes a graph filename and a dataset array.
  • If the dataset is already in device memory, then the index could store a reference to that, instead of creating a copy.

I think I like this approach, but I wonder if we should make it even a little easier for the user and store a boolean in the serialized index format that denotes whether or not the dataset was serialized along with the index. That way a user who might have created a moderately small index could just serialize and deserialize the entire index and dataset while the user with a larger dataset can use the flag in the serialize function to turn off serializing the dataset.

For deserialization, we can provide two functions: one that accepts the single serialized file and deserializes the dataset if it was serialized. Another implementation of the deserialize function can accept two files: a serialized index and a dataset. Of course, if a user invokes the deserialize overload that only accepts the serialized index and it didn't include the dataset, they could still use update_dataset to add the dataset to the deserialized index.
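A minimal sketch of this scheme, with an assumed (not RAFT's actual) wire format: a flag byte records whether the dataset payload follows the graph, so one deserialize entry point handles both cases.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Illustrative deserialized index: graph plus an optional dataset.
struct toy_serialized_index {
  std::vector<uint32_t> graph;
  std::vector<float> dataset;
  bool has_dataset = false;
};

// Serialize the graph, and the dataset only when requested; the leading
// flag byte records which choice was made.
void serialize(std::ostream& os, const std::vector<uint32_t>& graph,
               const std::vector<float>& dataset, bool include_dataset) {
  const uint8_t flag = include_dataset ? 1 : 0;
  os.write(reinterpret_cast<const char*>(&flag), 1);
  const uint64_t g = graph.size();
  os.write(reinterpret_cast<const char*>(&g), sizeof(g));
  os.write(reinterpret_cast<const char*>(graph.data()),
           static_cast<std::streamsize>(g * sizeof(uint32_t)));
  if (include_dataset) {
    const uint64_t d = dataset.size();
    os.write(reinterpret_cast<const char*>(&d), sizeof(d));
    os.write(reinterpret_cast<const char*>(dataset.data()),
             static_cast<std::streamsize>(d * sizeof(float)));
  }
}

// One deserialize entry point inspects the flag: small indices round-trip in
// one file, large ones come back dataset-less and can be completed later.
toy_serialized_index deserialize(std::istream& is) {
  toy_serialized_index out;
  uint8_t flag = 0;
  is.read(reinterpret_cast<char*>(&flag), 1);
  uint64_t g = 0;
  is.read(reinterpret_cast<char*>(&g), sizeof(g));
  out.graph.resize(g);
  is.read(reinterpret_cast<char*>(out.graph.data()),
          static_cast<std::streamsize>(g * sizeof(uint32_t)));
  out.has_dataset = (flag == 1);
  if (out.has_dataset) {
    uint64_t d = 0;
    is.read(reinterpret_cast<char*>(&d), sizeof(d));
    out.dataset.resize(d);
    is.read(reinterpret_cast<char*>(out.dataset.data()),
            static_cast<std::streamsize>(d * sizeof(float)));
  }
  return out;
}
```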

@tfeher
Contributor

tfeher commented Apr 26, 2023

Another implementation of the deserialize function can accept two files: a serialized index and a dataset.

@cjnolet I am not convinced that the deserialize function should load the dataset. I would expect the dataset to be stored in a custom format (e.g. the format used by big-ann-benchmarks, HDF5, etc.), and we probably do not want to deal with these details.

@cjnolet
Member

cjnolet commented Apr 26, 2023

I am not convinced that the deserialize function shall load the dataset

@tfeher, the more I think about this, we will likely never need to serialize the dataset with the index. If the user was able to provide the dataset when training the index, they'll be able to provide the dataset after deserializing the index. I just want to make sure we're keeping the dataset coupled to the index itself and not requiring it as an argument to the search function.

@cjnolet
Member

cjnolet commented May 15, 2023

@enp1s0 we are getting close to burndown for 23.06 and I'm just checking in on the status of these changes. I believe we were discussing adding an update_dataset() method to the cagra::index but I'm not sure if you are planning to implement that change in this PR or another one.

@enp1s0
Member Author

enp1s0 commented May 16, 2023

@cjnolet yes, we will add update_dataset. I'll make another PR for that, so this PR is no longer necessary. Thank you all for participating in the discussion!

@enp1s0 enp1s0 closed this May 16, 2023
rapids-bot bot pushed a commit that referenced this pull request Jul 20, 2023
This PR aims to improve the workflow when dealing with large datasets. When experimenting with different versions of the knn-graph, we might want to construct indices with the same dataset (see #1435 for further discussion).

If the dataset is already in device memory (and rows are properly aligned / padded), then we only store a reference to the dataset, therefore multiple indices can refer to the same dataset. Similarly, when `knn_graph` is a device array, then store only a reference. 

Additionally, this PR adds `update_dataset` and `update_graph` methods to the index.
Closes #1479

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1494