Add vectorsearch training workload #333

Merged (3 commits) on Jul 18, 2024

finnroblin (Contributor) commented on Jun 24, 2024

Description

Adds the train-test test procedure to the vectorsearch workload to benchmark kNN methods that require training, such as faiss IVF. Please see issue #332 for context.

This PR adds a schedule that trains kNN models using the train-knn-model operation proposed in OSB PR 556, and it depends on the operation runners in that PR. It also requires an additional index in the vectorsearch workload.json to hold the training data.
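For readers unfamiliar with model training in the k-NN plugin, here is a minimal sketch of the kind of request a train-knn-model step would ultimately send. It assumes the plugin's Train API (`POST /_plugins/_knn/models/{model_id}/_train`) and its documented body fields. The helper name `build_train_request`, the model, index, and field names, and the `nlist` value are hypothetical placeholders, not values taken from this PR or from the OSB runner.

```python
import json


def build_train_request(model_id, training_index, training_field, dimension,
                        nlist=128, space_type="l2"):
    """Return (http_method, path, body) for a k-NN train-model request.

    Hypothetical helper for illustration only; it is not part of OSB.
    """
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    path = f"/_plugins/_knn/models/{model_id}/_train"
    body = {
        # Index and field that hold the training vectors.
        "training_index": training_index,
        "training_field": training_field,
        "dimension": dimension,
        # Method that needs training: faiss IVF learns nlist centroids.
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": space_type,
            "parameters": {"nlist": nlist},
        },
    }
    return "POST", path, body


if __name__ == "__main__":
    # All names below are assumptions chosen for the example.
    http_method, path, body = build_train_request(
        model_id="example-model",
        training_index="train_index",
        training_field="train_field",
        dimension=128,
    )
    print(http_method, path)
    print(json.dumps(body, indent=2))
```

This is why the schedule needs a separate training index: the model is trained from vectors that must already be indexed and refreshed before the train request is issued.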

On my branch, the train-test workload runs against the faiss-sift-128 dataset without breaking backwards compatibility with the other vectorsearch workloads. Please feel free to clone my forks (OSB, OSB Workload) to investigate workload behavior, as there are no unit tests in the OSB workloads framework.

Issues Resolved

Closes #332

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Sample Output:

> export PARAMS=opensearch-benchmark-workloads/vectorsearch/params/train/train-faiss-sift-128-l2-sq.json
> opensearch-benchmark execute-test --target-hosts $ENDPOINT \
    --workload-path /Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch --workload-params $PARAMS \
    --pipeline benchmark-only \
    --kill-running-processes \
    --test-procedure train-test

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: c9954f19-26a8-48bb-9f18-b0b6605aab76
[INFO] Executing test with workload [vectorsearch], test_procedure [train-test] and provision_config_instance ['external'] with version [3.0.0-SNAPSHOT].

Running delete-train-index                                                     [100% done]
Running create-train-index                                                     [100% done]
Running custom-vector-bulk-train                                               [100% done]
Running refresh-train-index                                                    [100% done]
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running custom-vector-bulk                                                     [100% done]
Running refresh-target-index                                                   [100% done]
Running delete-model                                                           [100% done]
Running train-knn-model                                                        [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                     Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                          |    15.1107 |    min |
|             Min cumulative indexing time across primary shards |                          | 0.00368333 |    min |
|          Median cumulative indexing time across primary shards |                          |    7.55535 |    min |
|             Max cumulative indexing time across primary shards |                          |     15.107 |    min |
|            Cumulative indexing throttle time of primary shards |                          |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                          |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                          |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                          |          0 |    min |
|                        Cumulative merge time of primary shards |                          |    3.43095 |    min |
|                       Cumulative merge count of primary shards |                          |         16 |        |
|                Min cumulative merge time across primary shards |                          |          0 |    min |
|             Median cumulative merge time across primary shards |                          |    1.71548 |    min |
|                Max cumulative merge time across primary shards |                          |    3.43095 |    min |
|               Cumulative merge throttle time of primary shards |                          |   0.505767 |    min |
|       Min cumulative merge throttle time across primary shards |                          |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                          |   0.252883 |    min |
|       Max cumulative merge throttle time across primary shards |                          |   0.505767 |    min |
|                      Cumulative refresh time of primary shards |                          |     0.4279 |    min |
|                     Cumulative refresh count of primary shards |                          |         34 |        |
|              Min cumulative refresh time across primary shards |                          |    0.00125 |    min |
|           Median cumulative refresh time across primary shards |                          |    0.21395 |    min |
|              Max cumulative refresh time across primary shards |                          |    0.42665 |    min |
|                        Cumulative flush time of primary shards |                          |    0.03595 |    min |
|                       Cumulative flush count of primary shards |                          |          1 |        |
|                Min cumulative flush time across primary shards |                          |          0 |    min |
|             Median cumulative flush time across primary shards |                          |   0.017975 |    min |
|                Max cumulative flush time across primary shards |                          |    0.03595 |    min |
|                                        Total Young Gen GC time |                          |      4.022 |      s |
|                                       Total Young Gen GC count |                          |       2405 |        |
|                                          Total Old Gen GC time |                          |          0 |      s |
|                                         Total Old Gen GC count |                          |          0 |        |
|                                                     Store size |                          |    1.98823 |     GB |
|                                                  Translog size |                          |   0.174298 |     GB |
|                                         Heap used for segments |                          |          0 |     MB |
|                                       Heap used for doc values |                          |          0 |     MB |
|                                            Heap used for terms |                          |          0 |     MB |
|                                            Heap used for norms |                          |          0 |     MB |
|                                           Heap used for points |                          |          0 |     MB |
|                                    Heap used for stored fields |                          |          0 |     MB |
|                                                  Segment count |                          |         36 |        |
|                                                 Min Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                                Mean Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                              Median Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                                 Max Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                        50th percentile latency | custom-vector-bulk-train |    43.5641 |     ms |
|                                        90th percentile latency | custom-vector-bulk-train |    46.3634 |     ms |
|                                       100th percentile latency | custom-vector-bulk-train |      46.49 |     ms |
|                                   50th percentile service time | custom-vector-bulk-train |    43.5641 |     ms |
|                                   90th percentile service time | custom-vector-bulk-train |    46.3634 |     ms |
|                                  100th percentile service time | custom-vector-bulk-train |      46.49 |     ms |
|                                                     error rate | custom-vector-bulk-train |          0 |      % |
|                                                 Min Throughput |       custom-vector-bulk |    8894.83 | docs/s |
|                                                Mean Throughput |       custom-vector-bulk |    11858.3 | docs/s |
|                                              Median Throughput |       custom-vector-bulk |    10465.9 | docs/s |
|                                                 Max Throughput |       custom-vector-bulk |    30396.6 | docs/s |
|                                        50th percentile latency |       custom-vector-bulk |    101.675 |     ms |
|                                        90th percentile latency |       custom-vector-bulk |    137.139 |     ms |
|                                        99th percentile latency |       custom-vector-bulk |    277.051 |     ms |
|                                      99.9th percentile latency |       custom-vector-bulk |    2109.04 |     ms |
|                                     99.99th percentile latency |       custom-vector-bulk |    2827.03 |     ms |
|                                       100th percentile latency |       custom-vector-bulk |    2890.82 |     ms |
|                                   50th percentile service time |       custom-vector-bulk |    101.609 |     ms |
|                                   90th percentile service time |       custom-vector-bulk |    137.125 |     ms |
|                                   99th percentile service time |       custom-vector-bulk |    277.253 |     ms |
|                                 99.9th percentile service time |       custom-vector-bulk |    2109.04 |     ms |
|                                99.99th percentile service time |       custom-vector-bulk |    2827.03 |     ms |
|                                  100th percentile service time |       custom-vector-bulk |    2890.82 |     ms |
|                                                     error rate |       custom-vector-bulk |          0 |      % |
|                                                 Min Throughput |             delete-model |       84.7 |  ops/s |
|                                                Mean Throughput |             delete-model |       84.7 |  ops/s |
|                                              Median Throughput |             delete-model |       84.7 |  ops/s |
|                                                 Max Throughput |             delete-model |       84.7 |  ops/s |
|                                       100th percentile latency |             delete-model |    11.6162 |     ms |
|                                  100th percentile service time |             delete-model |    11.6162 |     ms |
|                                                     error rate |             delete-model |          0 |      % |
|                                                 Min Throughput |          train-knn-model |        1.1 |  ops/s |
|                                                Mean Throughput |          train-knn-model |        1.1 |  ops/s |
|                                              Median Throughput |          train-knn-model |        1.1 |  ops/s |
|                                                 Max Throughput |          train-knn-model |        1.1 |  ops/s |
|                                       100th percentile latency |          train-knn-model |    909.219 |     ms |
|                                  100th percentile service time |          train-knn-model |    909.219 |     ms |
|                                                     error rate |          train-knn-model |          0 |      % |
|                                                 Min Throughput |           warmup-indices |       3.39 |  ops/s |
|                                                Mean Throughput |           warmup-indices |       3.39 |  ops/s |
|                                              Median Throughput |           warmup-indices |       3.39 |  ops/s |
|                                                 Max Throughput |           warmup-indices |       3.39 |  ops/s |
|                                       100th percentile latency |           warmup-indices |    294.256 |     ms |
|                                  100th percentile service time |           warmup-indices |    294.256 |     ms |
|                                                     error rate |           warmup-indices |          0 |      % |
|                                                 Min Throughput |             prod-queries |      56.65 |  ops/s |
|                                                Mean Throughput |             prod-queries |      56.65 |  ops/s |
|                                              Median Throughput |             prod-queries |      56.65 |  ops/s |
|                                                 Max Throughput |             prod-queries |      56.65 |  ops/s |
|                                        50th percentile latency |             prod-queries |    8.57323 |     ms |
|                                        90th percentile latency |             prod-queries |    11.1135 |     ms |
|                                        99th percentile latency |             prod-queries |     116.16 |     ms |
|                                       100th percentile latency |             prod-queries |    215.067 |     ms |
|                                   50th percentile service time |             prod-queries |    8.57323 |     ms |
|                                   90th percentile service time |             prod-queries |    11.1135 |     ms |
|                                   99th percentile service time |             prod-queries |     116.16 |     ms |
|                                  100th percentile service time |             prod-queries |    215.067 |     ms |
|                                                     error rate |             prod-queries |          0 |      % |
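When comparing runs, it can help to pull individual numbers out of a summary table like the one above. The following is a rough sketch that parses the pipe-delimited rows into dictionaries. The helper names `parse_summary` and `lookup` are hypothetical, and the sample rows are copied from the table above with the column padding shortened.

```python
def parse_summary(text):
    """Parse pipe-delimited summary rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4:
            continue
        metric, task, value, unit = cells
        try:
            number = float(value)
        except ValueError:
            # Header and separator rows carry no numeric value; skip them.
            continue
        rows.append({"metric": metric, "task": task,
                     "value": number, "unit": unit})
    return rows


def lookup(rows, metric, task=""):
    """Return the value for a (metric, task) pair, or raise KeyError."""
    for row in rows:
        if row["metric"] == metric and row["task"] == task:
            return row["value"]
    raise KeyError((metric, task))


SAMPLE = """
|                   Metric |            Task |   Value | Unit |
|-------------------------:|----------------:|--------:|-----:|
|            Segment count |                 |      36 |      |
| 100th percentile latency | train-knn-model | 909.219 |   ms |
|  50th percentile latency |    prod-queries | 8.57323 |   ms |
"""

if __name__ == "__main__":
    rows = parse_summary(SAMPLE)
    print(lookup(rows, "100th percentile latency", "train-knn-model"))
    print(lookup(rows, "50th percentile latency", "prod-queries"))
```

Keying on both metric and task matters here because the same metric name (for example, "100th percentile latency") appears once per task.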



Comment on lines 12 to 13
"target_index_num_vectors": 1000,

Member

Can we remove "target_index_num_vectors" from the param file?

}
],
"corpora": [
{
"name": "cohere",
"base-url": "https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings",
"target-index": "{{ target_index_name }}",
Contributor Author

Calling out here that this target-index param is not used anywhere in the workload, but it's necessary due to OSB validation. I'm not sure what the solution is, but I opened an issue about this.

@finnroblin finnroblin requested a review from VijayanB June 26, 2024 21:30
Collaborator

@IanHoang IanHoang left a comment

LGTM

@IanHoang IanHoang added the backport 2 (Backport to the "2" branch), backport 1, and backport 3 (Backport to the "3" branch) labels on Jul 2, 2024
Collaborator

@IanHoang IanHoang left a comment

@finnroblin Overall, LGTM. As per best practices specified in the README, please provide a sample summary output of train-test in the PR description.

Collaborator

@IanHoang IanHoang left a comment

LGTM

@IanHoang IanHoang merged commit 29d9715 into opensearch-project:main Jul 18, 2024
2 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 18, 2024
* Add vectorsearch training workload

Signed-off-by: Finn Roblin <[email protected]>

* Addressed Vijay feedback and ignores error if model DNE

Signed-off-by: Finn Roblin <[email protected]>

* Added documentation to VS readme

Signed-off-by: Finn Roblin <[email protected]>

---------

Signed-off-by: Finn Roblin <[email protected]>
(cherry picked from commit 29d9715)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
IanHoang pushed a commit that referenced this pull request Jul 18, 2024
* Add vectorsearch training workload

* Addressed Vijay feedback and ignores error if model DNE

* Added documentation to VS readme

---------

(cherry picked from commit 29d9715)

Signed-off-by: Finn Roblin <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels: backport 2 (Backport to the "2" branch), backport 3 (Backport to the "3" branch)
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add Train Model KNN Workload
3 participants