[MOD-6738] IndexComputer #535

Merged
merged 40 commits into from
Oct 21, 2024

@meiravgri meiravgri commented Aug 29, 2024

This PR introduces two new components of VecSimIndexAbstract: a preprocessors container and a distance calculator.

Preprocessor container

The PreprocessorsContainerAbstract API supports processing blobs for storage, for query (graph search), or for both.
It is also responsible for copying the original blob if it needs to be processed, and for wrapping the copy with the appropriate deleter.
PreprocessorsContainerAbstract preprocessing includes alignment of a query blob.
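
As a rough illustration of the copy, deleter, and alignment behavior described above, here is a minimal sketch. The names BlobDeleter, BlobPtr, isAligned and alignedQueryBlob are hypothetical and are not taken from the PR; the sketch also assumes a C++17 toolchain that provides std::aligned_alloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>

// Hypothetical names; this is not the VecSim API.
// The deleter frees the pointer only when it refers to a copy we allocated.
struct BlobDeleter {
    bool owns;
    void operator()(void *p) const {
        if (owns) std::free(p);
    }
};
using BlobPtr = std::unique_ptr<void, BlobDeleter>;

inline bool isAligned(const void *p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Returns the original blob (not owned) when it is already aligned;
// otherwise returns an aligned copy that is freed when the handle dies.
inline BlobPtr alignedQueryBlob(const void *blob, std::size_t size,
                                std::size_t alignment) {
    // The alignment must be a non-zero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (isAligned(blob, alignment)) {
        return BlobPtr(const_cast<void *>(blob), BlobDeleter{false});
    }
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    std::size_t padded = (size + alignment - 1) / alignment * alignment;
    void *copy = std::aligned_alloc(alignment, padded);
    if (copy == nullptr) throw std::bad_alloc();
    std::memcpy(copy, blob, size);
    return BlobPtr(copy, BlobDeleter{true});
}
```

With this shape the caller never needs to know whether a copy was made: the handle either borrows the user's memory or owns an aligned copy, and in both cases it cleans up correctly when it goes out of scope.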

Multi preprocessors

MultiPreprocessorsContainer extends PreprocessorsContainerAbstract by holding an array of pointers to PreprocessorInterface objects, each responsible for a different processing step.
Currently, we only have CosinePreprocessor, which normalizes vectors if the index type is Cosine. Calling CosinePreprocessor::preprocess with storage and query blobs that point to the same memory address results in a single normalization call, and the returned blobs point to the same memory.

NOTE: In the tiered index, we assume that vectors are processed before being inserted into the backend index, so the frontend index is of type VecSimMetric_Cosine but does not internally hold a cosine preprocessor.

The processed blobs have a scoped lifetime and are released automatically. They are assumed to be copied if their lifetime needs to be extended (for storage purposes, for example).
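
The single-normalization behavior and the scoped lifetime of processed blobs can be sketched roughly as follows. ProcessedBlobs, normalizeInPlace and cosinePreprocess are hypothetical names, and the normalizeCalls counter exists only to make the behavior observable; the real CosinePreprocessor::preprocess signature is not shown in this PR description.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch; names and signatures are assumptions, not the VecSim API.
struct ProcessedBlobs {
    std::shared_ptr<std::vector<float>> storage; // released when the last handle dies
    std::shared_ptr<std::vector<float>> query;
    int normalizeCalls = 0; // illustration only
};

inline void normalizeInPlace(std::vector<float> &v) {
    float sumSq = 0.0f;
    for (float x : v) sumSq += x * x;
    float norm = std::sqrt(sumSq);
    if (norm == 0.0f) return; // leave zero vectors untouched
    for (float &x : v) x /= norm;
}

// Copies the input before normalizing, so the caller's memory is never modified.
inline ProcessedBlobs cosinePreprocess(const float *storageBlob,
                                       const float *queryBlob, std::size_t dim) {
    assert(storageBlob != nullptr && queryBlob != nullptr);
    ProcessedBlobs out;
    out.storage = std::make_shared<std::vector<float>>(storageBlob, storageBlob + dim);
    normalizeInPlace(*out.storage);
    out.normalizeCalls = 1;
    if (queryBlob == storageBlob) {
        // Same input memory: reuse the normalized copy instead of redoing the work.
        out.query = out.storage;
    } else {
        out.query = std::make_shared<std::vector<float>>(queryBlob, queryBlob + dim);
        normalizeInPlace(*out.query);
        out.normalizeCalls = 2;
    }
    return out;
}
```

A caller that wants to keep a processed blob beyond the scope of the returned struct would copy the vector into its own storage, which matches the assumption stated above.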

Distance Calculator

The distance calculator is defined according to the distance function signature.
It holds the distance function of the abstract index.
The distance calculation API of all Distance Calculator classes is: calc_dist(v1,v2,dim), but internally they will call the distance function according to the template signature.
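
A minimal sketch of a calculator that is templated on the distance function signature and exposes calc_dist(v1, v2, dim) might look like this. DistanceCalculatorSketch, L2Func and l2Squared are hypothetical names used only for illustration; only the calc_dist(v1, v2, dim) shape comes from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch; class and function names are assumptions, not VecSim's.
template <typename DistType, typename DistFuncType>
class DistanceCalculatorSketch {
public:
    explicit DistanceCalculatorSketch(DistFuncType f) : dist_func(f) {}

    // Uniform API seen by the index, whatever the underlying signature is.
    DistType calc_dist(const void *v1, const void *v2, std::size_t dim) const {
        return dist_func(v1, v2, dim);
    }

private:
    DistFuncType dist_func; // the distance function held for the index
};

// One possible signature: two opaque blobs plus the dimension.
using L2Func = float (*)(const void *, const void *, std::size_t);

// Squared Euclidean distance over float vectors.
inline float l2Squared(const void *a, const void *b, std::size_t dim) {
    const float *x = static_cast<const float *>(a);
    const float *y = static_cast<const float *>(b);
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        float d = x[i] - y[i];
        sum += d * d;
    }
    return sum;
}
```

A calculator for a different function signature would be a different instantiation (or specialization) of the template, while callers keep using the same calc_dist call.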

Index API changes

  • An index of type VecSimIndexAbstract is responsible for preprocessing a blob before performing any operation.
  • This includes adding a new vector and processing a query before searching the index.
  • All *Wrapper functions were removed.
  • As for the tiered index, it is assumed that the backend index receives a blob that was preprocessed by the frontend index. The backend index can perform additional preprocessing if needed.
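
To illustrate the idea that the index itself preprocesses blobs on both the insert path and the query path, here is a toy sketch. ToyIndex, normalizeCopy and topOne are hypothetical names, and the brute-force inner-product search is an assumption chosen for brevity, not the PR's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch; not the VecSimIndexAbstract API.
using Vec = std::vector<float>;
using Preprocess = std::function<Vec(const Vec &)>;

// Returns a unit-length copy of v (zero vectors are returned unchanged).
inline Vec normalizeCopy(const Vec &v) {
    float sumSq = 0.0f;
    for (float x : v) sumSq += x * x;
    float norm = std::sqrt(sumSq);
    Vec out = v;
    if (norm != 0.0f)
        for (float &x : out) x /= norm;
    return out;
}

class ToyIndex {
public:
    explicit ToyIndex(Preprocess p) : preprocess_(std::move(p)) {}

    // Storage path: the index preprocesses the blob before keeping it.
    void addVector(const Vec &blob) { stored_.push_back(preprocess_(blob)); }

    // Query path: the index preprocesses the query before searching.
    // Returns the position of the stored vector with the largest inner product.
    std::size_t topOne(const Vec &queryBlob) const {
        assert(!stored_.empty());
        Vec q = preprocess_(queryBlob);
        std::size_t best = 0;
        float bestScore = dot(stored_[0], q);
        for (std::size_t i = 1; i < stored_.size(); ++i) {
            float s = dot(stored_[i], q);
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }

    const Vec &stored(std::size_t i) const { return stored_[i]; }

private:
    static float dot(const Vec &a, const Vec &b) {
        float s = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
    Preprocess preprocess_;
    std::vector<Vec> stored_;
};
```

Callers pass raw blobs to both addVector and topOne; no separate wrapper function is needed to normalize them first.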

computer is used to process blobs and calc distance
it has a DistanceCalculator to call when calculating distance

abstract index object expects IndexComputer to be passed in the ctor

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 97.56757% with 9 lines in your changes missing coverage. Please review.

Project coverage is 97.02%. Comparing base (f08c051) to head (04ca716).
Report is 6 commits behind head on main.

Files with missing lines                                 Patch %   Lines
src/VecSim/spaces/computer/preprocessors.h               89.18%    4 Missing ⚠️
...rc/VecSim/spaces/computer/preprocessor_container.h    96.05%    3 Missing ⚠️
src/VecSim/index_factories/brute_force_factory.cpp       96.55%    1 Missing ⚠️
src/VecSim/index_factories/hnsw_factory.cpp              97.61%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #535      +/-   ##
==========================================
- Coverage   97.16%   97.02%   -0.14%     
==========================================
  Files          94      100       +6     
  Lines        4862     5307     +445     
==========================================
+ Hits         4724     5149     +425     
- Misses        138      158      +20     


call it from batch iterator

batch iterator stores index dim
call indexComputer instead.

implement alignment in indexComputer::preprocessQuery and use it to align blobs before query
introduce IndexComputerExtended that can have an array of preprocessors
implement preprocess that preprocesses both query and storage blobs

add addvectorctx and hnswaddvectorctx
TieredHNSWIndex::insertVectorToHNSW is now responsible for preprocessing the blob according to the hnsw preprocessor

HNSW: introduce indexvector(blob, label, state) that inserts a stored vector to the graph.

Tiered index: preprocessing for query is done according to the frontend index first.

Factory:
HNSW: added is_normalized arg to decide if we need IP or cosine
computerExtended has flags indicating if preprocessing is for query or storage or both
the flags are updated in addPreprocessor.
…reprocess queries by the backend index using wrapper API
addPreprocessor returns -1 if it failed to add the preprocessor (due to lack of capacity)

IndexComputerExtended preprocess* falls back to IndexComputerBasic if no preprocessors have been added yet

added tests to tst_common and test_tiered

vec_sim_index:

removed using spaces::dist_func_t;

factory:
moved is_normalized to new_index instead of to the abstract params, so the index will be initialized with the user-set metric.

added namespace MemoryUtils for definitions related to memory allocations

added UNUSED macro to vec_sim_common
@meiravgri meiravgri requested a review from GuyAv46 September 11, 2024 13:00
@meiravgri meiravgri changed the title from "intiail implmnetation of IndexComputer and DistanceCalculator" to "[MOD-6738] IndexComputer" Sep 12, 2024
@meiravgri meiravgri requested a review from alonre24 September 12, 2024 09:16
@alonre24 alonre24 removed their request for review September 12, 2024 16:56
use calcDistance instead of indexComputer->calcDistance

move version to the end of the arguments list of a hnsw serialized index ctor

use auto in variable declarations when possible

add assert to allocate_force_aligned that alignment is not 0
use n_preprocessors as a template argument instead of dynamic allocation on preprocessors
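
The commits above mention a fixed-capacity preprocessors container whose addPreprocessor returns -1 when it is full, a fallback to basic behavior when no preprocessors were added, and a capacity given as a template argument. A rough sketch under those assumptions follows. PreprocessorSketch, ScaleByTwo and ContainerSketch are hypothetical names, and the success return value (the new count) is an assumption, since the commit messages only specify the failure value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; not the PR's classes.
struct PreprocessorSketch {
    virtual ~PreprocessorSketch() = default;
    virtual void apply(std::vector<float> &blob) const = 0;
};

// Toy processing step used only to make the effect visible.
struct ScaleByTwo : PreprocessorSketch {
    void apply(std::vector<float> &blob) const override {
        for (float &x : blob) x *= 2.0f;
    }
};

// Capacity is a template argument, so no dynamic allocation is needed for the slots.
template <std::size_t Capacity>
class ContainerSketch {
public:
    // Returns the new number of preprocessors (assumption), or -1 when full.
    int addPreprocessor(const PreprocessorSketch *p) {
        assert(p != nullptr);
        if (count_ == Capacity) return -1;
        slots_[count_++] = p;
        return static_cast<int>(count_);
    }

    // With no preprocessors this degenerates to the basic behavior: a plain copy.
    std::vector<float> preprocess(const std::vector<float> &blob) const {
        std::vector<float> out = blob;
        for (std::size_t i = 0; i < count_; ++i) slots_[i]->apply(out);
        return out;
    }

private:
    std::array<const PreprocessorSketch *, Capacity> slots_{};
    std::size_t count_ = 0;
};
```

The container does not own the preprocessors in this sketch; ownership in the actual code may differ.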
@meiravgri meiravgri requested a review from GuyAv46 September 14, 2024 04:33
Instead, the index holds the components separately:
indexCalculator for distance calculations and
PreprocessorsContainer to preprocess user data

Index ctor expects a struct of all the components needed to initialize the index
currently the struct contains indexCalculator and PreprocessorsContainer
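
The constructor change described above can be pictured with a small sketch in which the index receives one struct bundling its components. IndexComponentsSketch and IndexSketch are hypothetical names, and std::function stands in for the real calculator and container types, which are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch; the member types are stand-ins, not the PR's classes.
using Vec = std::vector<float>;

struct IndexComponentsSketch {
    std::function<float(const Vec &, const Vec &)> indexCalculator; // distance
    std::function<Vec(const Vec &)> preprocessors;                  // user data processing
};

class IndexSketch {
public:
    // The ctor takes every component it needs in a single struct.
    explicit IndexSketch(IndexComponentsSketch c) : components_(std::move(c)) {
        assert(components_.indexCalculator && components_.preprocessors);
    }

    // Preprocess the raw query with one component, then measure with the other.
    float distanceToRaw(const Vec &stored, const Vec &rawQuery) const {
        return components_.indexCalculator(stored, components_.preprocessors(rawQuery));
    }

private:
    IndexComponentsSketch components_;
};
```

Bundling the components in a struct means a new component can be added later without changing the constructor signature of every index type.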
@meiravgri meiravgri force-pushed the meiravg_introduce_computer branch from ef8e781 to 44a16c5 Compare October 21, 2024 04:48
add preprocessors to the tiered index that preprocess blobs if needed.
this commit will be reverted because the tiered index's job is to manage the safe transfer of blobs between the frontend and backend indexes.
Any operations on the data should be the responsibility of the index that stores it.
alonre24 previously approved these changes Oct 21, 2024
@alonre24 alonre24 left a comment

Let's go!!

@meiravgri meiravgri added this pull request to the merge queue Oct 21, 2024
@meiravgri meiravgri removed this pull request from the merge queue due to a manual request Oct 21, 2024
@meiravgri meiravgri requested a review from alonre24 October 21, 2024 14:04
@meiravgri meiravgri enabled auto-merge October 21, 2024 14:07
@meiravgri meiravgri added this pull request to the merge queue Oct 21, 2024
Merged via the queue into main with commit 35c26b3 Oct 21, 2024
19 checks passed
@meiravgri meiravgri deleted the meiravg_introduce_computer branch October 21, 2024 15:30