
[META] Add performance and accuracy benchmarks for Neural search Features #430

Open
navneet1v opened this issue Oct 10, 2023 · 5 comments
Labels
backlog: All the backlog features should be marked with this label

Comments

@navneet1v
Collaborator

navneet1v commented Oct 10, 2023

Description

The aim of this issue is to add performance and accuracy benchmarks for the different features of the Neural Search plugin.

Tasks

  • List all the datasets that will be used in the benchmarking
  • Performance Benchmarks for ingestion with different processors
    • Text Embedding Processor
    • Sparse Encoding Processor
    • Text and Image Embedding Processor
  • Performance Benchmarks for different Query Clauses
    • Neural Query Clause
    • Sparse Encoding Query Clause
    • Hybrid Query Clause
    • Hybrid Query using Bool Query Clause
  • Accuracy Benchmarks (see the metric sketch after this list)
    • Neural Query Clause
    • Sparse Encoding Query Clause
    • Hybrid Query Clause
    • Hybrid Query using Bool Query Clause
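
For the accuracy tasks above, a minimal sketch of the metric side is shown below. It assumes a BEIR-style qrels mapping and a hypothetical `run_query` callable that wraps whichever query clause is under test; only the recall@k logic itself is the point here.

```python
# Minimal accuracy-benchmark sketch. `run_query`, the dataset layout, and k are
# assumptions for illustration; only the metric computation is the point.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def evaluate(run_query, queries, qrels, k=10):
    """queries: {query_id: query_text}; qrels: {query_id: {doc_id: relevance}}.
    run_query(query_text, k) is a hypothetical callable returning an ordered
    list of document ids from a neural, neural_sparse, hybrid, or bool search."""
    scores = []
    for query_id, query_text in queries.items():
        retrieved = run_query(query_text, k)
        relevant = {doc_id for doc_id, rel in qrels.get(query_id, {}).items() if rel > 0}
        scores.append(recall_at_k(retrieved, relevant, k))
    return sum(scores) / len(scores) if scores else 0.0
```

Because only `run_query` changes between runs, the same harness can score each of the query clauses listed above on the same dataset.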
navneet1v added the backlog label and removed the untriaged label Oct 10, 2023
navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Oct 10, 2023
navneet1v changed the title from "Add performance and accuracy benchmarks for Neural search Features" to "[META] Add performance and accuracy benchmarks for Neural search Features" Oct 10, 2023
@jmazanec15
Member

I'm wondering if, as part of this, we should add search relevance metrics/workloads to OSB? For instance, for the text-based queries, one key question this will answer is when to use which approach and what the tradeoffs are. We could have a generic OSB run where the input/output stays constant (like the BEIR datasets) and we just change the internal implementation. When a new method comes in (e.g., a reranker, or different combination logic such as RRF), we can just plug it into the OSB configuration, run the test, and determine where it stacks up.
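
As a rough illustration of that "fixed input/output, swap the internals" idea (not part of the original comment), the builders below show how the clauses discussed in this issue could be interchanged for a constant query text. The field names (`passage_text`, `passage_embedding`, `passage_sparse`) and model ids are placeholders, and the hybrid clause assumes a search pipeline with a normalization-processor is configured separately.

```python
# Query-body builders so that only the clause under test changes between runs.
# Field names and model ids are placeholders, not defined anywhere in this issue.

def neural_body(query_text, k=10):
    return {"query": {"neural": {"passage_embedding": {
        "query_text": query_text, "model_id": "<dense-model-id>", "k": k}}}}

def neural_sparse_body(query_text):
    return {"query": {"neural_sparse": {"passage_sparse": {
        "query_text": query_text, "model_id": "<sparse-model-id>"}}}}

def hybrid_body(query_text, k=10):
    # Assumes a search pipeline with a normalization-processor combines the scores.
    return {"query": {"hybrid": {"queries": [
        {"match": {"passage_text": {"query": query_text}}},
        {"neural": {"passage_embedding": {
            "query_text": query_text, "model_id": "<dense-model-id>", "k": k}}}]}}}

def bool_body(query_text, k=10):
    # "Hybrid via bool": the same sub-queries combined by a plain bool/should clause.
    return {"query": {"bool": {"should": [
        {"match": {"passage_text": {"query": query_text}}},
        {"neural": {"passage_embedding": {
            "query_text": query_text, "model_id": "<dense-model-id>", "k": k}}}]}}}
```

A new method (a reranker, RRF-style combination, and so on) would then just be one more builder plugged into the same benchmark run.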

@navneet1v
Collaborator Author

@jmazanec15 the idea is for this to be a high-level issue for adding the benchmarks. What should be used to run the benchmarks, OSB or something else, has not been decided yet, and I have left it open. If we start using OSB then yes, we need to get search relevance metrics into OSB, but we should work with the OSB team to provide a capability to capture these custom metrics.

@sam-herman

> @jmazanec15 the idea is for this to be a high-level issue for adding the benchmarks. What should be used to run the benchmarks, OSB or something else, has not been decided yet, and I have left it open. If we start using OSB then yes, we need to get search relevance metrics into OSB, but we should work with the OSB team to provide a capability to capture these custom metrics.

+1. I think the first priority is to come up with benchmarks that help provide a baseline for search quality.
Regarding OSB as an implementation platform, I'm not so sure. It is implemented in Ruby and focuses on stress testing, while we are trying to define quality metrics. For that, even small datasets can do just fine, and we could run them as part of IT tests; an embedded JMH framework would seem like a more native solution to the task.

@jmazanec15
Member

> +1. I think the first priority is to come up with benchmarks that help provide a baseline for search quality.

Yes, definitely agree with this.

> It is implemented in Ruby and focuses on stress testing, while we are trying to define quality metrics.

OSB is actually in Python, so it should be more friendly with existing datasets.

> For that, even small datasets can do just fine, and we could run them as part of IT tests; an embedded JMH framework would seem like a more native solution to the task.

That's interesting. I'm not super familiar with it, but it could make sense; it'd be nice to have as an integ test. I guess I like OSB because it would (1) be easier to integrate into automated performance testing infrastructure and metric publishing, and (2) let users test relevance on their own clusters more easily (i.e., just point the OSB workload or a custom workload at their cluster and let it run). But maybe it makes sense to do both.

@navneet1v
Collaborator Author

A PR has been added for the text_embedding benchmarks: https://github.com/opensearch-project/opensearch-benchmark-workloads/pull/232/files
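
For readers new to that workload, the ingestion path a text_embedding benchmark exercises looks roughly like the sketch below. The pipeline name, index name, field map, vector dimension, and model id are placeholders, the opensearch-py client is assumed, and the workload in the PR above remains the authoritative version.

```python
# Sketch of the text_embedding ingestion path: an ingest pipeline with the
# text_embedding processor set as the index default_pipeline, then a timed bulk
# load. All names, the model id, and the dimension are placeholders.
import time
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.ingest.put_pipeline(id="nlp-ingest-pipeline", body={
    "processors": [{"text_embedding": {
        "model_id": "<dense-model-id>",
        "field_map": {"passage_text": "passage_embedding"}}}]})

client.indices.create(index="benchmark-index", body={
    "settings": {"index": {"knn": True, "default_pipeline": "nlp-ingest-pipeline"}},
    "mappings": {"properties": {
        "passage_text": {"type": "text"},
        "passage_embedding": {"type": "knn_vector", "dimension": 768}}}})

def timed_bulk(docs):
    """Return the seconds taken to bulk-index docs through the embedding pipeline."""
    start = time.monotonic()
    helpers.bulk(client, ({"_index": "benchmark-index", "_source": doc} for doc in docs))
    return time.monotonic() - start
```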
