Benchmark against FAISS & nmslib? #4

jaanli · 2024-04-14T11:30:36Z

Such a benchmark would be super helpful to decide which in-browser use cases are flexible enough :)

https://github.com/nmslib/hnswlib

For example, I have a few databases ready to go:

20 years of census data - https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
15 million hospital claims - https://onefact.github.io/synthetic-healthcare-data/
All of NYC real estate - https://jaanli.github.io/new-york-real-estate/

And I really want to visualize the 30,000+ Mandarin characters by their phono-semantic specificity/etymological origins on a map.

All of these require high-dimensional similarity search, but at very different scales, so the UI/UX interactions (e.g. very early ones from 2017 here: https://jaan.io/food2vec-augmented-cooking-machine-intelligence/) will be constrained by the queries per second this DuckDB extension can sustain.
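To make "queries per second" concrete, here is a minimal timing sketch. It is not tied to any particular backend: `measure_qps` and `brute_force_search` are hypothetical names introduced for illustration, and the brute-force inner-product search is only a stand-in for whatever index (the DuckDB extension, FAISS, hnswlib) would actually be measured.

```python
import heapq
import time
from typing import Callable, List, Sequence

Vector = Sequence[float]


def brute_force_search(data: Sequence[Vector], query: Vector, k: int) -> List[int]:
    """Exact top-k by inner product; a placeholder for a real index lookup."""
    scored = ((sum(a * b for a, b in zip(row, query)), i) for i, row in enumerate(data))
    return [i for _, i in heapq.nlargest(k, scored)]


def measure_qps(
    search_fn: Callable[[Vector, int], List[int]],
    queries: Sequence[Vector],
    k: int = 10,
    warmup: int = 5,
    repeats: int = 3,
) -> float:
    """Return the best queries-per-second observed over `repeats` full passes."""
    # Warm-up queries are run but not timed (caches, lazy initialisation).
    for q in queries[:warmup]:
        search_fn(q, k)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q, k)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return len(queries) / max(best, 1e-12)
```

Any backend can be plugged in by wrapping its single-query call as `search_fn(query, k)`; running the same query set through each backend keeps the comparison like for like.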

Hope that makes sense, and happy to help! 🙏 super exciting that this is now feasible!!

jaanli · 2024-04-19T21:13:33Z

In case further motivation is needed, here are the types of algorithms I need to benchmark: https://github.com/google-deepmind/xtr - the FAISS parts are here: https://github.com/google-deepmind/xtr/blob/main/xtr_evaluation_on_beir_miracl.ipynb

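    # Excerpt from inside a method in the notebook; `all_token_embeds` and
    # `num_tokens` are defined earlier there.
    import faiss

    ds = 128            # embedding dimension
    num_clusters = 50   # number of IVF cells
    code_size = 64      # number of PQ sub-quantizers
    quantizer = faiss.IndexFlatIP(ds)
    opq_matrix = faiss.OPQMatrix(ds, code_size)
    opq_matrix.niter = 10
    # IVF-PQ index with 4 bits per sub-quantizer code, scored by inner product
    sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 4, faiss.METRIC_INNER_PRODUCT)
    # Apply the OPQ transform to vectors before they reach the IVF-PQ index
    index = faiss.IndexPreTransform(opq_matrix, sub_index)
    index.train(all_token_embeds[:num_tokens])
    index.add(all_token_embeds[:num_tokens])

    class FaissSearcher(object):
        def __init__(self, index):
            self.index = index

        def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
            # FAISS returns (scores, ids); callers here expect (ids, scores)
            scores, top_ids = self.index.search(query_embeds, final_num_neighbors)
            return top_ids, scores

    self.searcher = FaissSearcher(index)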
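Since the index above is approximate (IVF plus product quantization), a speed comparison only means something alongside a recall measurement. A minimal sketch of recall@k, assuming the neighbor ids from the approximate index and from an exact search are available as lists of lists; `recall_at_k` is a hypothetical helper, not part of FAISS or the notebook:

```python
from typing import List, Sequence


def recall_at_k(approx_ids: Sequence[Sequence[int]],
                exact_ids: Sequence[Sequence[int]],
                k: int) -> float:
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    if not exact_ids or k <= 0:
        return 0.0
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        # Order within the top-k is ignored; only set overlap counts.
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))


# Two queries, k = 3: the first shares 2 of 3 neighbors, the second 1 of 3.
example: List[List[int]] = [[1, 2, 3], [4, 5, 6]]
truth: List[List[int]] = [[1, 2, 9], [7, 8, 4]]
print(recall_at_k(example, truth, 3))  # 0.5
```

Reporting queries per second at a fixed recall level (for example, recall@10 of 0.95) is a common way to keep comparisons between differently tuned indexes fair.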

JAicewizard · 2024-09-09T13:37:44Z

Hello, I ran some benchmarks comparing VSS against the FAISS extension and posted them here: https://github.com/arjenpdevries/faiss/blob/main/README.md
This URL will die soon, but once it is merged the results will be in the general README. TL;DR: VSS is about 2-3 times slower than FAISS for a single query on 8.8M data points of dimension 1536.
