Skip to content

garyfeng/embedding_vector_search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

embedding_vector_search

prototyping global search for embedding vectors

Introduction

The objective of this repo is to prototype a global search tool for embedding vectors of the scale of roughly:

  • 1-10M vectors
  • 200-2k dimensions
  • Top 10 results, with recall of ???
  • SLA < 1 sec

Datasets

For testing, we use the following strategy

Synthetic datasets

Generate random vectors using scikit-learn, see examples in https://github.com/erikbern/ann-benchmarks/blob/14df47df72bb1deee5d9e68616ba03edf65d4680/ann_benchmarks/datasets.py#L473

Parameters:

Open source datasets

See https://github.com/erikbern/ann-benchmarks for available test vectors, in HDF5 format. Choose some that are closer to our needs.

Proprietary datasets

We can generate our own embedding vectors.

Loading/Enrolling

We need to load the embedding vectors to a proper data structure, either in memory or in a database. For the application we have in mind, we also need to link the vectors to some metadata, presummably in a JSON object.

Search

We conduct a 10-NN search over the full list that is filtered based on the metadata.

About

prototyping global search for embedding vectors

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published