Support Benchmarking K-NN Plugin and Vectorsearch Workload #103

Closed
jmazanec15 opened this issue Jan 12, 2022 · 6 comments
Labels
enhancement New feature or request Medium Priority

Comments

@jmazanec15
Member

Is your feature request related to a problem? Please describe.

The k-NN plugin adds support for a new field type, knn_vector, which can be thought of as an array of floating point numbers. The plugin then also adds support for running approximate k nearest neighbor search on these fields.

For benchmarking the plugin, we are interested in several metrics including:

  1. Query latency
  2. Indexing throughput
  3. Refresh time
  4. Recall (the fraction of neighbors returned by an approximate search that are also among the ground-truth nearest neighbors)
  5. Training latency
  6. Native memory footprint
  7. Disk utilization
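
To make the recall metric (item 4) concrete, here is a minimal Python sketch. It assumes every query requests `k` neighbors and that both the results and the ground truth are lists of document ids ordered by distance. The function name and the sample ids are hypothetical and are not taken from the perf-tool.

```python
from typing import Sequence


def recall_at_k(returned: Sequence[Sequence[str]],
                ground_truth: Sequence[Sequence[str]],
                k: int) -> float:
    """Aggregate recall@k over a set of queries.

    returned[i] holds the ids the approximate search gave for query i;
    ground_truth[i] holds the exact nearest-neighbor ids for query i.
    """
    if len(returned) != len(ground_truth):
        raise ValueError("need one ground-truth list per query")

    hits = 0
    total = 0
    for got, truth in zip(returned, ground_truth):
        truth_k = set(truth[:k])
        # Count returned ids that are true top-k neighbors.
        hits += len(truth_k.intersection(got[:k]))
        # Denominator: number of true neighbors that could have been found.
        total += len(truth_k)
    return hits / total if total else 0.0


# Two queries, k = 3: the first finds 2 of 3 true neighbors, the second 1 of 3.
approx = [["a", "b", "c"], ["d", "e", "f"]]
exact = [["a", "b", "x"], ["d", "y", "z"]]
print(recall_at_k(approx, exact, k=3))
```

Summing hits and totals across queries, rather than averaging per-query ratios, keeps the result well defined when a ground-truth list has fewer than `k` entries.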

Currently, we collect these metrics with our own custom code: https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool. We built our own tool because we needed functionality that was not easily available in Rally/OpenSearch Benchmark: computing recall from a set of queries, integrating our own custom APIs and metrics, using datasets in alternative formats, etc.

Describe the solution you'd like

We would prefer to use OpenSearch Benchmark to collect the metrics above so that we don't have to maintain our own tool and customers don't have to adopt another tool besides OpenSearch Benchmark. I saw that #98 was created, and I imagine we may need it to reach our goal. I would be interested in helping contribute this feature.

Describe alternatives you've considered

The alternative is to continue to use our own benchmarking tool. However, this has several drawbacks mentioned above.

@jmazanec15
Member Author

@travisbenedict @achitojha I'm working on adding the training component from k-NN to a custom runner (i.e. train-model).

I would like a user to be able to specify the body for the train API in a file and parametrize it from the workload-params:

(train-body.json)

{
    "training_index": "{{ training_index }}",
    "training_field": "{{ training_field }}",
    "dimension": {{ dimension }},
    "max_training_vector_count": {{ max_training_vector_count | default(8) }},
    "search_size": {{ search_size | default(8) }},
    "description": "My model",
    "method": {
        "name":"ivf",
        "engine":"faiss",
        "space_type": "l2",
        "parameters":{
            "nlist": {{ nlists | default(8) }},
            "encoder":{
                "name":"pq",
                "parameters":{
                    "code_size": {{ code_size | default(8) }}
                }
            }
        }
    }
}

I know that this works when defining indices, but is there a way for this to work for arbitrary custom JSON file parameters?

Docs: https://opensearch.org/docs/latest/search-plugins/knn/api/#train-model

@travisbenedict
Contributor

I'm not 100% familiar with your use case, but I think you should be able to define this operation in the operations file for your workload and parameterize that. You can see an example of this in the nyc_taxis workload.
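
As a rough illustration of how such an operation could be consumed, the sketch below shows a custom runner that takes the already-parameterized train body from the operation's params and posts it to the train API. The `register(registry)` hook and `register_runner` call follow the custom-runner convention that OpenSearch Benchmark inherited from Rally. The `model_id` and `body` param names are assumptions made for this example, and the endpoint path is taken from the k-NN API docs linked above, so check it against your plugin version.

```python
async def train_model(opensearch, params):
    """Custom runner sketch: submit a k-NN training request.

    `params` is the operation's parameter dict after templating, so the
    workload params have already been substituted into the body.
    """
    model_id = params["model_id"]  # hypothetical param name
    body = params["body"]          # hypothetical param name: rendered train body

    # Train API path as given in the k-NN plugin API documentation.
    await opensearch.transport.perform_request(
        "POST",
        f"/_plugins/_knn/models/{model_id}/_train",
        body=body,
    )
    # Report one completed operation to the benchmark framework.
    return {"weight": 1, "unit": "ops"}


def register(registry):
    # Expose the runner under the operation type used in the workload.
    registry.register_runner("train-model", train_model, async_runner=True)
```

An operation with `"operation-type": "train-model"` in the operations file would then reach this runner, with its templated fields arriving in `params`.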

@jmazanec15
Member Author

Thanks @travisbenedict, that makes sense.

@jmazanec15
Member Author

I submitted a PR to add index load tests for k-NN into the repo: opensearch-project/k-NN#364. For now, I think it makes sense to keep them in that repo as they will continue to evolve. Please take a look and let me know what you think.

@jmazanec15
Member Author

@travisbenedict I opened another PR that adds querying functionality to the k-NN runners and param source: opensearch-project/k-NN#409. I am having trouble getting the recall metric to show up in the results. Would you be able to take a look?

@IanHoang changed the title from "Support benchmarking k-NN plugin" to "Support Benchmarking K-NN Plugin and Vector Workload" on Apr 10, 2024
@IanHoang changed the title from "Support Benchmarking K-NN Plugin and Vector Workload" to "Support Benchmarking K-NN Plugin and Vectorsearch Workload" on Apr 10, 2024
@github-project-automation (bot) moved this to Roadmap Project Backlog in OpenSearch Benchmark Roadmap on Aug 30, 2024
@IanHoang
Collaborator

Closing this as OSB now supports the vectorsearch workload.


4 participants