Support hdf5 files in bulk operation #620

Closed

finnroblin
Contributor

@finnroblin finnroblin commented Aug 17, 2024

Description

Adds HDF5 file support for bulk ingestion. HDF5 files store vector datasets in a non-JSON format, so @VijayanB wrote separate parameter operations to send vectors to the bulk API. This PR adds vector support to OSB's bulk operation itself. This helps vector search benchmarking because the bulk operation supports additional features, and it reduces the number of features that are specific to vector search.
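
For readers unfamiliar with the bulk API, here is a minimal sketch of how vectors could be turned into a bulk request body. It is not the code in this PR: `build_bulk_body` and its parameters are hypothetical names, the vectors are plain Python lists rather than an HDF5 dataset, and the increasing IDs are only an assumption about what `generate-increasing-vector-ids` is meant to do.

```python
import json
from typing import Iterable, List, Sequence


def build_bulk_body(
    vectors: Iterable[Sequence[float]],
    index: str,
    field: str,
    start_id: int = 0,
) -> str:
    """Render vectors as an NDJSON body for the _bulk API.

    Each vector becomes two lines: an action line and a document line.
    IDs count up from start_id (an assumption, see the note above).
    """
    lines: List[str] = []
    for offset, vector in enumerate(vectors):
        action = {"index": {"_index": index, "_id": str(start_id + offset)}}
        document = {field: [float(x) for x in vector]}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([[0.1, 0.2], [0.3, 0.4]], "target_index", "target_field")
print(body, end="")
```

With `bulk-size` set to 5, each request body would carry five such action/document pairs.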

Testing

  • New functionality includes testing

Unit tests and manual verification. I modified the cohere 1k document entry in the corpus definition to include the information needed for the bulk operation.

Steps taken for manual verification:
Parameter file:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 768,
    "target_index_space_type": "l2",
    
    "target_index_bulk_size": 5,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_indexing_clients": 10,
    "target_index_bulk_index_data_set_corpus": "cohere",
    
    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 300,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["_id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_corpus": "cohere",
    "query_count": 100
}

Bulk schedule:

{
    "operation": {
        "name": "delete-target-index",
        "operation-type": "delete-index",
        "only-if-exists": true,
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "create-target-index",
        "operation-type": "create-index",
        "index": "{{ target_index_name | default('target_index') }}"
    }
},
{
    "operation": {
        "name": "bulk",
        "operation-type": "bulk",
        "bulk-size": 5,
        "data_set_format": "{{ target_index_bulk_index_data_set_format | default('hdf5') }}",
        "source_format": "hdf5",
        "index": "target_index",
        "field": "target_field",
        "vector_dataset_context": "index",
        "corpora": ["cohere"]
    },
    "clients": {{ target_index_bulk_indexing_clients | default(1)}}
},
{
    "name" : "refresh-target-index",
    "operation" : "refresh-target-index"
}

Corpus changes:

"corpora": [
    {
      "name": "cohere",
      "base-url": "https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings",
      "target-index": "{{ target_index_name }}",
      "documents": [
        {
          "source-file": "documents-1k.hdf5.bz2",
          "source-format": "hdf5",
          "document-count": 1000,
          "generate-increasing-vector-ids": true,
          "id-field-name": "_id",
          "vector-field-name": "target_field"
        }
      ]
    },
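
As a rough sanity check on the settings above, this sketch works out how many bulk requests the 1,000-document corpus produces with a bulk size of 5 and 10 clients. The even split across clients is an assumption made for illustration; the PR does not describe how OSB partitions the dataset, and `bulk_request_plan` is a hypothetical helper.

```python
import math
from typing import Tuple


def bulk_request_plan(document_count: int, bulk_size: int, clients: int) -> Tuple[int, int]:
    """Return (total bulk requests, requests per client).

    Assumes documents are split evenly across clients, which may not
    match how OSB actually partitions the data.
    """
    total_requests = math.ceil(document_count / bulk_size)
    per_client = math.ceil(total_requests / clients)
    return total_requests, per_client


total, per_client = bulk_request_plan(document_count=1000, bulk_size=5, clients=10)
print(total, per_client)  # 200 20
```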

bulk-procedure:

{
    "name": "bulk-procedure",
    "default": false,
    "schedule": [
       {{ benchmark.collect(parts="common/bulk-schedule.json") }},

       {{ benchmark.collect(parts="common/search-only-schedule.json") }}
    ]
},

Result:

(.venv) finnrobl@80a9970f4597 opensearch-benchmark % export PARAMS=/Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch/params/bulk-params.json
(.venv) finnrobl@80a9970f4597 opensearch-benchmark % opensearch-benchmark execute-test --target-hosts $ENDPOINT \
    --workload-path /Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch --workload-params $PARAMS \
    --pipeline benchmark-only \
    --kill-running-processes \
    --test-procedure bulk-procedure

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: e8307702-7dda-4a30-8b87-6f2fc1834ecb
[INFO] Executing test with workload [vectorsearch], test_procedure [bulk-procedure] and provision_config_instance ['external'] with version [3.0.0-SNAPSHOT].

[WARNING] merges_total_time is 16 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] indexing_total_time is 7 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 63 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] flush_total_time is 120 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running bulk                                                                   [100% done]
Running refresh-target-index                                                   [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |           Task |       Value |   Unit |
|---------------------------------------------------------------:|---------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                |   0.0371833 |    min |
|             Min cumulative indexing time across primary shards |                |           0 |    min |
|          Median cumulative indexing time across primary shards |                | 0.000116667 |    min |
|             Max cumulative indexing time across primary shards |                |   0.0370667 |    min |
|            Cumulative indexing throttle time of primary shards |                |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                |           0 |    min |
|                        Cumulative merge time of primary shards |                | 0.000266667 |    min |
|                       Cumulative merge count of primary shards |                |           1 |        |
|                Min cumulative merge time across primary shards |                |           0 |    min |
|             Median cumulative merge time across primary shards |                |           0 |    min |
|                Max cumulative merge time across primary shards |                | 0.000266667 |    min |
|               Cumulative merge throttle time of primary shards |                |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                |           0 |    min |
|                      Cumulative refresh time of primary shards |                |  0.00468333 |    min |
|                     Cumulative refresh count of primary shards |                |          12 |        |
|              Min cumulative refresh time across primary shards |                |           0 |    min |
|           Median cumulative refresh time across primary shards |                |     0.00105 |    min |
|              Max cumulative refresh time across primary shards |                |  0.00363333 |    min |
|                        Cumulative flush time of primary shards |                |       0.002 |    min |
|                       Cumulative flush count of primary shards |                |           2 |        |
|                Min cumulative flush time across primary shards |                |           0 |    min |
|             Median cumulative flush time across primary shards |                |           0 |    min |
|                Max cumulative flush time across primary shards |                |       0.002 |    min |
|                                        Total Young Gen GC time |                |        0.01 |      s |
|                                       Total Young Gen GC count |                |           1 |        |
|                                          Total Old Gen GC time |                |           0 |      s |
|                                         Total Old Gen GC count |                |           0 |        |
|                                                     Store size |                |   0.0173898 |     GB |
|                                                  Translog size |                |   0.0150675 |     GB |
|                                         Heap used for segments |                |           0 |     MB |
|                                       Heap used for doc values |                |           0 |     MB |
|                                            Heap used for terms |                |           0 |     MB |
|                                            Heap used for norms |                |           0 |     MB |
|                                           Heap used for points |                |           0 |     MB |
|                                    Heap used for stored fields |                |           0 |     MB |
|                                                  Segment count |                |          10 |        |
|                                                 Min Throughput |           bulk |     1640.19 | docs/s |
|                                                Mean Throughput |           bulk |     1640.19 | docs/s |
|                                              Median Throughput |           bulk |     1640.19 | docs/s |
|                                                 Max Throughput |           bulk |     1640.19 | docs/s |
|                                        50th percentile latency |           bulk |     17.3579 |     ms |
|                                        90th percentile latency |           bulk |     45.1002 |     ms |
|                                        99th percentile latency |           bulk |     83.7313 |     ms |
|                                       100th percentile latency |           bulk |      88.521 |     ms |
|                                   50th percentile service time |           bulk |     17.3579 |     ms |
|                                   90th percentile service time |           bulk |     45.1002 |     ms |
|                                   99th percentile service time |           bulk |     83.7313 |     ms |
|                                  100th percentile service time |           bulk |      88.521 |     ms |
|                                                     error rate |           bulk |           0 |      % |
|                                                 Min Throughput | warmup-indices |       36.24 |  ops/s |
|                                                Mean Throughput | warmup-indices |       36.24 |  ops/s |
|                                              Median Throughput | warmup-indices |       36.24 |  ops/s |
|                                                 Max Throughput | warmup-indices |       36.24 |  ops/s |
|                                       100th percentile latency | warmup-indices |     27.4253 |     ms |
|                                  100th percentile service time | warmup-indices |     27.4253 |     ms |
|                                                     error rate | warmup-indices |           0 |      % |
|                                                 Min Throughput |   prod-queries |       149.9 |  ops/s |
|                                                Mean Throughput |   prod-queries |       149.9 |  ops/s |
|                                              Median Throughput |   prod-queries |       149.9 |  ops/s |
|                                                 Max Throughput |   prod-queries |       149.9 |  ops/s |
|                                        50th percentile latency |   prod-queries |     3.36225 |     ms |
|                                        90th percentile latency |   prod-queries |      4.6824 |     ms |
|                                        99th percentile latency |   prod-queries |     58.3903 |     ms |
|                                       100th percentile latency |   prod-queries |     109.023 |     ms |
|                                   50th percentile service time |   prod-queries |     3.36225 |     ms |
|                                   90th percentile service time |   prod-queries |      4.6824 |     ms |
|                                   99th percentile service time |   prod-queries |     58.3903 |     ms |
|                                  100th percentile service time |   prod-queries |     109.023 |     ms |
|                                                     error rate |   prod-queries |           0 |      % |
|                                                  Mean recall@k |   prod-queries |        0.37 |        |
|                                                  Mean recall@1 |   prod-queries |        0.07 |        |


--------------------------------
[INFO] SUCCESS (took 63 seconds)
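
For context on the `Mean recall@k` and `Mean recall@1` rows, here is a minimal sketch of the common definition of recall@k: the fraction of the true top-k neighbors that appear in the top-k returned results. Whether OSB computes it exactly this way is an assumption, and `recall_at_k` is a hypothetical helper, not an OSB function.

```python
from typing import Sequence


def recall_at_k(returned_ids: Sequence[str], true_neighbor_ids: Sequence[str], k: int) -> float:
    """Fraction of the true top-k neighbors found in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(true_neighbor_ids[:k])
    if not truth:
        return 0.0
    # Set intersection avoids counting a duplicated result twice.
    hits = len(set(returned_ids[:k]) & truth)
    return hits / len(truth)


returned = ["3", "7", "9", "1"]
ground_truth = ["7", "1", "4", "8"]
print(recall_at_k(returned, ground_truth, k=4))  # 0.5
print(recall_at_k(returned, ground_truth, k=1))  # 0.0
```

The reported mean would then be the average of this value over all queries in the run.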

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@finnroblin finnroblin marked this pull request as ready for review August 19, 2024 20:46
@finnroblin finnroblin changed the title from "[Draft] Initial vector bulk hdf5 implementation" to "Support hdf5 files in bulk operation" Aug 27, 2024
Collaborator

@IanHoang IanHoang left a comment

@finnroblin Can you address merge conflicts?

@IanHoang
Collaborator

Closing this due to lack of activity.

@IanHoang IanHoang closed this Oct 28, 2024