Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom patch to support multi-vector #1

Closed
wants to merge 1 commit into from
Closed

Conversation

heemin32
Copy link
Owner

Temporary PR to make it easy to review opensearch-project/k-NN#1358

@navneet1v
Copy link

can we add unit test for this change?

faiss/impl/HNSW.cpp Outdated Show resolved Hide resolved
faiss/IndexIDMap.cpp Outdated Show resolved Hide resolved
// Completed collection of result. Run post processor.
collector->finalize(nres, I);
// Collector completed its task. Release all resource of the collector.
collectorFactory->deleteCollector(collector);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? as these are classes their object should get destructed once these objects go out of scope.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends how resource is created. At least, the factory create collector with new which need to be released manually.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

objects created with new for sure gets deallocated. Its the pointers that create the issue. For pointers it is recommended to use unique_pointer.

faiss/impl/HNSW.cpp Outdated Show resolved Hide resolved
@heemin32
Copy link
Owner Author

can we add unit test for this change?

It will be covered by existing unit test. For some additional scenario where we are utilizing id mapper or extending result collector, I would like to write a unit test in OpenSearch side to reduce the patch size.

@navneet1v
Copy link

can we add unit test for this change?

It will be covered by existing unit test. For some additional scenario where we are utilizing id mapper or extending result collector, I would like to write a unit test in OpenSearch side to reduce the patch size.

make sense. So, can you paste the response of unit test runs, as the tests are not running as part of github actions.

@heemin32 heemin32 force-pushed the extension_v2 branch 2 times, most recently from cb36f5a to c59bc53 Compare December 22, 2023 21:57
@heemin32
Copy link
Owner Author

heemin32 commented Dec 22, 2023

Test result are in facebookresearch#3178

@@ -10,6 +10,7 @@
#include <faiss/Index.h>
#include <faiss/IndexBinary.h>
#include <faiss/impl/IDSelector.h>
#include <faiss/impl/ResultCollectorFactory.h>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this included here if its not used in the interface?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me move it to cpp file.

#include <faiss/MetricType.h>
#include <faiss/utils/Heap.h>

/** ResultCollector is intended to define how to collect search result */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more detail here? Maybe add an example

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


/** Encapsulates a set of ids to handle. */
struct ResultCollector {
// For each result, collect method is called to store result

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add comment on what each of the args are for?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

namespace faiss {

/** ResultCollectorFactory to create a ResultCollector object */
struct ResultCollectorFactory {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be overridden? How will this be customized?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be overridden.

@@ -131,6 +150,10 @@ void IndexIDMapTemplate<IndexT>::search(
sel_change.set(params_non_const, &this_idtrans);
}
}

if (params && params->col && !params->col->id_map) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the purpose of !params->col->id_map?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if id_map is already set, we don't want to replace it with a new one which is same implementation as id selector.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if ID is already set, then are we not setting col_change?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col_change is just a method to update id_map inside parameter. If id_map is already there, no need to call the method.

// RAII object to reset the id_map parameter in ResultCollectorFactory object
// This object make sure to reset the id_map parameter in ResultCollectorFactory
// once the program exist current method scope.
struct ScopedColChange {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have any threading implications?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no.

@@ -111,8 +112,8 @@ void HNSW::print_neighbor_stats(int level) const {
level,
nb_neighbors(level));
size_t tot_neigh = 0, tot_common = 0, tot_reciprocal = 0, n_node = 0;
#pragma omp parallel for reduction(+: tot_neigh) reduction(+: tot_common) \
reduction(+: tot_reciprocal) reduction(+: n_node)
#pragma omp parallel for reduction(+ : tot_neigh) reduction(+ : tot_common) \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we revert this if we are not touching it?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't changed it. Let me rebase the code.

const std::vector<int64_t>* id_map;

// Create a new ResultCollector object
virtual ResultCollector* newCollector() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should methods be snake_case instead of camelCase?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++, the convention for method names is typically camel case.

faiss/impl/ResultCollector.h Show resolved Hide resolved
* @param bh_val search result, distances from query
* @param bh_ids search result, ids of vectors
* @param val distance from query for current vector
* @param ids id of current vector

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name as id?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep the parameter name consistent with faiss original code.

faiss/impl/ResultCollector.h Show resolved Hide resolved
@heemin32 heemin32 closed this Dec 27, 2023
heemin32 pushed a commit that referenced this pull request May 22, 2024
…Lists (facebookresearch#3327)

Summary:
Pull Request resolved: facebookresearch#3327

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address #1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518

fbshipit-source-id: 95470c7449160488d2b45b024d134cbc037a2083
heemin32 pushed a commit that referenced this pull request Jul 8, 2024
…ookresearch#3527)

Summary:
Pull Request resolved: facebookresearch#3527

**Context**
Design Doc: [Faiss Benchmarking](https://docs.google.com/document/d/1c7zziITa4RD6jZsbG9_yOgyRjWdyueldSPH6QdZzL98/edit)

**In this diff**
1. Be able to reference codec and index from blobstore (bucket & path) outside the experiment
2. To support #1, naming is moved to descriptors.
3. Build index can be written as well.
4. You can run benchmark with train and then refer it in index built and then refer index built in knn search. Index serialization is optional. Although not yet exposed through index descriptor.
5. Benchmark can support index with different datasets sizes
6. Working with varying dataset now support multiple ground truth. There may be small fixes before we could use this.
7. Added targets for bench_fw_range, ivf, codecs and optimize.

**Analysis of ivf result**: D58823037

Reviewed By: algoriddle

Differential Revision: D57236543

fbshipit-source-id: ad03b28bae937a35f8c20f12e0a5b0a27c34ff3b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants