From bf4ae72c7b714a0e67abc6fdcb64fc7f28945cd1 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Tue, 30 Jan 2024 11:50:02 -0600 Subject: [PATCH] Add understanding workloads section (#6164) * Add understanding workloads section. Signed-off-by: Naarcha-AWS * Add additional anatomy sections Signed-off-by: Naarcha-AWS * Add section headers Signed-off-by: Naarcha-AWS * Fix link Signed-off-by: Naarcha-AWS * Fix typos Signed-off-by: Naarcha-AWS * Change example to fix build error. Signed-off-by: Naarcha-AWS * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Melissa Vagi Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Melissa Vagi Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Co-authored-by: Melissa Vagi Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update anatomy-of-a-workload.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Fix build errors Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update anatomy-of-a-workload.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update anatomy-of-a-workload.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update concepts.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update index.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Naarcha-AWS 
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Melissa Vagi Co-authored-by: Nathan Bower --- _benchmark/reference/workloads/index.md | 2 +- _benchmark/user-guide/concepts.md | 150 +--- .../anatomy-of-a-workload.md | 742 ++++++++++++++++++ .../choosing-a-workload.md | 29 + .../understanding-workloads/index.md | 14 + 5 files changed, 789 insertions(+), 148 deletions(-) create mode 100644 _benchmark/user-guide/understanding-workloads/anatomy-of-a-workload.md create mode 100644 _benchmark/user-guide/understanding-workloads/choosing-a-workload.md create mode 100644 _benchmark/user-guide/understanding-workloads/index.md diff --git a/_benchmark/reference/workloads/index.md b/_benchmark/reference/workloads/index.md index 655ada92d9..1dd609cacb 100644 --- a/_benchmark/reference/workloads/index.md +++ b/_benchmark/reference/workloads/index.md @@ -16,7 +16,7 @@ A workload is a specification of one or more benchmarking scenarios. A workload This section provides a list of options and examples you can use when customizing or using a workload. -For more information about what comprises a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/concepts#anatomy-of-a-workload). +For more information about what comprises a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/anatomy-of-a-workload/). ## Workload examples diff --git a/_benchmark/user-guide/concepts.md b/_benchmark/user-guide/concepts.md index ade9fe53b6..5fd6d2e7dd 100644 --- a/_benchmark/user-guide/concepts.md +++ b/_benchmark/user-guide/concepts.md @@ -11,7 +11,7 @@ Before using OpenSearch Benchmark, familiarize yourself with the following conce ## Core concepts and definitions -- **Workload**: The description of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. 
The document corpus contains any indexes, data files, and operations invoked when the workflow runs. You can list the available workloads by using `opensearch-benchmark list workloads` or view any included workloads in the [OpenSearch Benchmark Workloads repository](https://github.com/opensearch-project/opensearch-benchmark-workloads/). For more information about the elements of a workload, see [Anatomy of a workload](#anatomy-of-a-workload). For information about building a custom workload, see [Creating custom workloads]({{site.url}}{{site.baseurl}}/benchmark/creating-custom-workloads/). +- **Workload**: The description of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. The document corpus contains any indexes, data files, and operations invoked when the workload runs. You can list the available workloads by using `opensearch-benchmark list workloads` or view any included workloads in the [OpenSearch Benchmark Workloads repository](https://github.com/opensearch-project/opensearch-benchmark-workloads/). For more information about the elements of a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/anatomy-of-a-workload/). For information about building a custom workload, see [Creating custom workloads]({{site.url}}{{site.baseurl}}/benchmark/creating-custom-workloads/). - **Pipeline**: A series of steps occurring before and after a workload is run that determines benchmark results. OpenSearch Benchmark supports three pipelines: - `from-sources`: Builds and provisions OpenSearch, runs a benchmark, and then publishes the results. @@ -110,149 +110,5 @@ This latency cascade continues, increasing latency by 100ms for each subsequent ### Recommendation -As shown by the preceding examples, you should be aware of the average service time of each task and provide a `target-throughput` that accounts for the service time. 
The OpenSearch Benchmark latency is calculated based on the `target-throughput` set by the user. In other words, the OpenSearch Benchmark latency could be redefined as "throughput-based latency". - -## Anatomy of a workload - -The following example workload shows all of the essential elements needed to create a `workload.json` file. You can run this workload in your own benchmark configuration to understand how all of the elements work together: - -```json -{ - "description": "Tutorial benchmark for OpenSearch Benchmark", - "indices": [ - { - "name": "movies", - "body": "index.json" - } - ], - "corpora": [ - { - "name": "movies", - "documents": [ - { - "source-file": "movies-documents.json", - "document-count": 11658903, # Fetch document count from command line - "uncompressed-bytes": 1544799789 # Fetch uncompressed bytes from command line - } - ] - } - ], - "schedule": [ - { - "operation": { - "operation-type": "create-index" - } - }, - { - "operation": { - "operation-type": "cluster-health", - "request-params": { - "wait_for_status": "green" - }, - "retry-until-success": true - } - }, - { - "operation": { - "operation-type": "bulk", - "bulk-size": 5000 - }, - "warmup-time-period": 120, - "clients": 8 - }, - { - "operation": { - "name": "query-match-all", - "operation-type": "search", - "body": { - "query": { - "match_all": {} - } - } - }, - "iterations": 1000, - "target-throughput": 100 - } - ] -} -``` - -A workload usually includes the following elements: - -- [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/): Defines the relevant indexes and index templates used for the workload. -- [corpora]({{site.url}}{{site.baseurl}}/benchmark/workloads/corpora/): Defines all document corpora used for the workload. -- `schedule`: Defines operations and the order in which the operations run inline. Alternatively, you can use `operations` to group operations and the `test_procedures` parameter to specify the order of operations. 
-- `operations`: **Optional**. Describes which operations are available for the workload and how they are parameterized. - -### Indices - -To create an index, specify its `name`. To add definitions to your index, use the `body` option and point it to the JSON file containing the index definitions. For more information, see [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/). - -### Corpora - -The `corpora` element requires the name of the index containing the document corpus, for example, `movies`, and a list of parameters that define the document corpora. This list includes the following parameters: - -- `source-file`: The file name that contains the workload's corresponding documents. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base_url`, use a compressed file format: `.zip`, `.bz2`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must have one JSON file containing the name. -- `document-count`: The number of documents in the `source-file`, which determines which client indexes correlate to which parts of the document corpus. Each N client receives an Nth of the document corpus. When using a source that contains a document with a parent-child relationship, specify the number of parent documents. -- `uncompressed-bytes`: The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs. -- `compressed-bytes`: The size, in bytes, of the source file before decompression. This can help you assess the amount of time needed for the cluster to ingest documents. - -### Operations - -The `operations` element lists the OpenSearch API operations performed by the workload. For example, you can set an operation to `create-index`, an index in the test cluster to which OpenSearch Benchmark can write documents. Operations are usually listed inside of `schedule`. 
- -### Schedule - -The `schedule` element contains a list of actions and operations that are run by the workload. Operations run according to the order in which they appear in the `schedule`. The following example illustrates a `schedule` with multiple operations, each defined by its `operation-type`: - -```json - "schedule": [ - { - "operation": { - "operation-type": "create-index" - } - }, - { - "operation": { - "operation-type": "cluster-health", - "request-params": { - "wait_for_status": "green" - }, - "retry-until-success": true - } - }, - { - "operation": { - "operation-type": "bulk", - "bulk-size": 5000 - }, - "warmup-time-period": 120, - "clients": 8 - }, - { - "operation": { - "name": "query-match-all", - "operation-type": "search", - "body": { - "query": { - "match_all": {} - } - } - }, - "iterations": 1000, - "target-throughput": 100 - } - ] -} -``` - -According to this schedule, the actions will run in the following order: - -1. The `create-index` operation creates an index. The index remains empty until the `bulk` operation adds documents with benchmarked data. -2. The `cluster-health` operation assesses the health of the cluster before running the workload. In this example, the workload waits until the status of the cluster's health is `green`. - - The `bulk` operation runs the `bulk` API to index `5000` documents simultaneously. - - Before benchmarking, the workload waits until the specified `warmup-time-period` passes. In this example, the warmup period is `120` seconds. -5. The `clients` field defines the number of clients that will run the remaining actions in the schedule concurrently. -6. The `search` runs a `match_all` query to match all documents after they have been indexed by the `bulk` API using the 8 clients specified. - - The `iterations` field indicates the number of times each client runs the `search` operation. The report generated by the benchmark automatically adjusts the percentile numbers based on this number. 
To generate a precise percentile, the benchmark needs to run at least 1,000 iterations. - - Lastly, the `target-throughput` field defines the number of requests per second each client performs, which, when set, can help reduce the latency of the benchmark. For example, a `target-throughput` of 100 requests divided by 8 clients means that each client will issue 12 requests per second. +As shown by the preceding examples, you should be aware of the average service time of each task and provide a `target-throughput` that accounts for the service time. The OpenSearch Benchmark latency is calculated based on the `target-throughput` set by the user, that is, the latency could be redefined as "throughput-based latency." + diff --git a/_benchmark/user-guide/understanding-workloads/anatomy-of-a-workload.md b/_benchmark/user-guide/understanding-workloads/anatomy-of-a-workload.md new file mode 100644 index 0000000000..2d2328b40d --- /dev/null +++ b/_benchmark/user-guide/understanding-workloads/anatomy-of-a-workload.md @@ -0,0 +1,742 @@ +--- +layout: default +title: Anatomy of a workload +nav_order: 15 +grand_parent: User guide +parent: Understanding workloads +--- + +# Anatomy of a workload + +All workloads contain the following files and directories: + +- [workload.json](#workloadjson): Contains all of the workload settings. +- [index.json](#indexjson): Contains the document mappings and parameters as well as index settings. +- [files.txt](#filestxt): Contains the data corpora file names. +- [_test-procedures](#_operations-and-_test-procedures): Most workloads contain only one default test procedure, which is configured in `default.json`. +- [_operations](#_operations-and-_test-procedures): Contains all of the operations used in test procedures. +- workload.py: Adds more dynamic functionality to the test. + +## workload.json + +The following example workload shows all of the essential elements needed to create a `workload.json` file. 
You can run this workload in your own benchmark configuration to understand how all of the elements work together. Replace the `document-count` and `uncompressed-bytes` values with the values for your own corpus, which you can fetch from the command line: + +```json +{ + "description": "Tutorial benchmark for OpenSearch Benchmark", + "indices": [ + { + "name": "movies", + "body": "index.json" + } + ], + "corpora": [ + { + "name": "movies", + "documents": [ + { + "source-file": "movies-documents.json", + "document-count": 11658903, + "uncompressed-bytes": 1544799789 + } + ] + } + ], + "schedule": [ + { + "operation": { + "operation-type": "create-index" + } + }, + { + "operation": { + "operation-type": "cluster-health", + "request-params": { + "wait_for_status": "green" + }, + "retry-until-success": true + } + }, + { + "operation": { + "operation-type": "bulk", + "bulk-size": 5000 + }, + "warmup-time-period": 120, + "clients": 8 + }, + { + "operation": { + "name": "query-match-all", + "operation-type": "search", + "body": { + "query": { + "match_all": {} + } + } + }, + "iterations": 1000, + "target-throughput": 100 + } + ] +} +``` + +A workload usually includes the following elements: + +- [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/): Defines the relevant indexes and index templates used for the workload. +- [corpora]({{site.url}}{{site.baseurl}}/benchmark/workloads/corpora/): Defines all document corpora used for the workload. +- `schedule`: Defines the operations inline and the order in which they run. Alternatively, you can use `operations` to group operations and the `test_procedures` parameter to specify the order of operations. +- `operations`: **Optional**. Describes which operations are available for the workload and how they are parameterized. + +### Indices + +To create an index, specify its `name`. To add definitions to your index, use the `body` option and point it to the JSON file containing the index definitions. 
For more information, see [Indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/). + +### Corpora + +The `corpora` element requires the name of the index containing the document corpus, for example, `movies`, and a list of parameters that define the document corpora. This list includes the following parameters: + +- `source-file`: The name of the file that contains the workload's corresponding documents. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base_url`, use a compressed file format: `.zip`, `.bz2`, `.zst`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must contain exactly one JSON file with the same name. +- `document-count`: The number of documents in the `source-file`, which determines which client indexes which part of the document corpus. Each of the N clients is assigned 1/N of the document corpus to ingest into the test cluster. When using a source that contains a document with a parent-child relationship, specify the number of parent documents. +- `uncompressed-bytes`: The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs. +- `compressed-bytes`: The size, in bytes, of the source file before decompression. This can help you assess the amount of time needed for the cluster to ingest documents. + +### Operations + +The `operations` element lists the OpenSearch API operations performed by the workload. For example, you can list an operation named `create-index` that creates an index in the benchmark cluster to which OpenSearch Benchmark can write documents. Operations are usually listed inside of the `schedule` element. 
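The `document-count` and `uncompressed-bytes` values described under Corpora can be derived from the corpus file itself. The following Python sketch is one possible way to do so and is not part of OpenSearch Benchmark. It assumes that the uncompressed source file stores one JSON document per line, and the helper name `corpus_stats` is hypothetical. It counts every non-empty line, so it does not handle the parent-child case in which only parent documents should be counted.

```python
import os


def corpus_stats(path):
    """Return (document_count, uncompressed_bytes) for an uncompressed corpus file.

    Assumption: the file holds one JSON document per line.
    """
    document_count = 0
    with open(path, "rb") as corpus:
        for line in corpus:
            if line.strip():  # Skip blank lines so they are not counted as documents.
                document_count += 1
    return document_count, os.path.getsize(path)


# "movies-documents.json" is the source file name from the workload.json example.
corpus_file = "movies-documents.json"
if os.path.exists(corpus_file):
    count, size = corpus_stats(corpus_file)
    print(f'"document-count": {count},')
    print(f'"uncompressed-bytes": {size}')
```

The printed values can be copied into the `documents` entry of the `corpora` element.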
+ +### Schedule + +The `schedule` element contains a list of operations that are run in a specified order, as shown in the following JSON example: + +```json + "schedule": [ + { + "operation": { + "operation-type": "create-index" + } + }, + { + "operation": { + "operation-type": "cluster-health", + "request-params": { + "wait_for_status": "green" + }, + "retry-until-success": true + } + }, + { + "operation": { + "operation-type": "bulk", + "bulk-size": 5000 + }, + "warmup-time-period": 120, + "clients": 8 + }, + { + "operation": { + "name": "query-match-all", + "operation-type": "search", + "body": { + "query": { + "match_all": {} + } + } + }, + "iterations": 1000, + "target-throughput": 100 + } + ] +} +``` + +According to this `schedule`, the actions will run in the following order: + +1. The `create-index` operation creates an index. The index remains empty until the `bulk` operation adds documents with benchmarked data. +2. The `cluster-health` operation assesses the cluster's health before running the workload. In the JSON example, the workload waits until the cluster's health status is `green`. +3. The `bulk` operation runs the `bulk` API, indexing `5000` documents in each bulk request. + - Before benchmarking, the workload waits until the specified `warmup-time-period` passes. In the JSON example, the warmup period is `120` seconds. + - The `clients` field defines the number of clients, in this example, eight, that will run the bulk indexing operation concurrently. +4. The `search` operation runs a `match_all` query to match all documents after the `bulk` operation has indexed them. + - The `iterations` field defines the number of times each client runs the `search` operation. The benchmark report automatically adjusts the percentile numbers based on this number. To generate a precise percentile, the benchmark needs to run at least 1,000 iterations. 
+ - The `target-throughput` field defines the total number of requests per second that all clients together perform. When set, the setting can help reduce benchmark latency. For example, a `target-throughput` of 100 requests per second divided among 8 clients means that each client will issue 12.5 requests per second. For more information about how target throughput is defined in OpenSearch Benchmark, see [Throughput and latency](https://opensearch.org/docs/latest/benchmark/user-guide/concepts/#throughput-and-latency). + +## index.json + +The `index.json` file defines the data mappings, indexing parameters, and index settings for workload documents during `create-index` operations. + +When OpenSearch Benchmark creates an index for the workload, it uses the index settings and mappings template in the `index.json` file. Mappings in the `index.json` file are based on the mappings of a single document from the workload's corpus, whose data files are listed in the `files.txt` file. The following is an example of the `index.json` file for the `nyc_taxis` workload. You can customize the templated settings, such as `number_of_shards`, `number_of_replicas`, `query_cache_enabled`, and `requests_cache_enabled`. 
+ +```json +{ + "settings": { + "index.number_of_shards": {{number_of_shards | default(1)}}, + "index.number_of_replicas": {{number_of_replicas | default(0)}}, + "index.queries.cache.enabled": {{query_cache_enabled | default(false) | tojson}}, + "index.requests.cache.enable": {{requests_cache_enabled | default(false) | tojson}} + }, + "mappings": { + "_source": { + "enabled": {{ source_enabled | default(true) | tojson }} + }, + "properties": { + "surcharge": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "dropoff_datetime": { + "type": "date", + "format": "yyyy-MM-dd HH:mm:ss" + }, + "trip_type": { + "type": "keyword" + }, + "mta_tax": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "rate_code_id": { + "type": "keyword" + }, + "passenger_count": { + "type": "integer" + }, + "pickup_datetime": { + "type": "date", + "format": "yyyy-MM-dd HH:mm:ss" + }, + "tolls_amount": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "tip_amount": { + "type": "half_float" + }, + "payment_type": { + "type": "keyword" + }, + "extra": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "vendor_id": { + "type": "keyword" + }, + "store_and_fwd_flag": { + "type": "keyword" + }, + "improvement_surcharge": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "fare_amount": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "ehail_fee": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "cab_color": { + "type": "keyword" + }, + "dropoff_location": { + "type": "geo_point" + }, + "vendor_name": { + "type": "text" + }, + "total_amount": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "trip_distance": { + "scaling_factor": 100, + "type": "scaled_float" + }, + "pickup_location": { + "type": "geo_point" + } + }, + "dynamic": "strict" + } +} +``` + +## files.txt + +The `files.txt` file lists the files that store the workload data, which are typically stored in a zipped JSON file. 
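The `{{ name | default(value) }}` placeholders in the `index.json` example are template expressions: a workload parameter with that name replaces the placeholder, and the default applies otherwise. The following Python sketch only imitates that behavior with a regular expression to show the effect. It is not the template engine that OpenSearch Benchmark uses, the function name `render_defaults` is hypothetical, and it handles only the simple `default(...)` and `tojson` forms shown in the example.

```python
import json
import re

# Matches {{ name | default(value) }} with an optional trailing "| tojson".
PLACEHOLDER = re.compile(
    r"\{\{\s*(\w+)\s*\|\s*default\(([^)]*)\)\s*(?:\|\s*tojson\s*)?\}\}"
)


def render_defaults(template, params):
    """Replace each placeholder with a supplied parameter or its default."""

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in params:
            # Serialize the supplied value as JSON (3 -> 3, True -> true).
            return json.dumps(params[name])
        # Keep the default literal exactly as written in the template.
        return default

    return PLACEHOLDER.sub(substitute, template)


template = """{
  "settings": {
    "index.number_of_shards": {{number_of_shards | default(1)}},
    "index.number_of_replicas": {{number_of_replicas | default(0)}},
    "index.queries.cache.enabled": {{query_cache_enabled | default(false) | tojson}}
  }
}"""

# Override one parameter and let the other two fall back to their defaults.
rendered = render_defaults(template, {"number_of_shards": 3})
settings = json.loads(rendered)["settings"]
print(settings)
```

In this sketch, only `number_of_shards` is overridden, so the rendered settings keep zero replicas and a disabled query cache.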
+ +## _operations and _test-procedures + +To make the workload more human-readable, `_operations` and `_test-procedures` are separated into two directories. + +The `_operations` directory contains a `default.json` file that lists all of the supported operations that the test procedure can use. Some workloads, such as `nyc_taxis`, contain an additional `.json` file that lists feature-specific operations, such as `snapshot` operations. The following JSON example shows a list of operations from the `nyc_taxis` workload: + +```json + { + "name": "index", + "operation-type": "bulk", + "bulk-size": {{bulk_size | default(10000)}}, + "ingest-percentage": {{ingest_percentage | default(100)}} + }, + { + "name": "update", + "operation-type": "bulk", + "bulk-size": {{bulk_size | default(10000)}}, + "ingest-percentage": {{ingest_percentage | default(100)}}, + "conflicts": "{{conflicts | default('random')}}", + "on-conflict": "{{on_conflict | default('update')}}", + "conflict-probability": {{conflict_probability | default(25)}}, + "recency": {{recency | default(0)}} + }, + { + "name": "wait-until-merges-finish", + "operation-type": "index-stats", + "index": "_all", + "condition": { + "path": "_all.total.merges.current", + "expected-value": 0 + }, + "retry-until-success": true, + "include-in-reporting": false + }, + { + "name": "default", + "operation-type": "search", + "body": { + "query": { + "match_all": {} + } + } + }, + { + "name": "range", + "operation-type": "search", + "body": { + "query": { + "range": { + "total_amount": { + "gte": 5, + "lt": 15 + } + } + } + } + }, + { + "name": "distance_amount_agg", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "bool": { + "filter": { + "range": { + "trip_distance": { + "lt": 50, + "gte": 0 + } + } + } + } + }, + "aggs": { + "distance_histo": { + "histogram": { + "field": "trip_distance", + "interval": 1 + }, + "aggs": { + "total_amount_stats": { + "stats": { + "field": "total_amount" + } + } + } + } + } + } + 
}, + { + "name": "autohisto_agg", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "01/01/2015", + "lte": "21/01/2015", + "format": "dd/MM/yyyy" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "auto_date_histogram": { + "field": "dropoff_datetime", + "buckets": 20 + } + } + } + } + }, + { + "name": "date_histogram_agg", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "01/01/2015", + "lte": "21/01/2015", + "format": "dd/MM/yyyy" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "calendar_interval": "day" + } + } + } + } + }, + { + "name": "date_histogram_calendar_interval", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "calendar_interval": "month" + } + } + } + } + }, + { + "name": "date_histogram_calendar_interval_with_tz", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "calendar_interval": "month", + "time_zone": "America/New_York" + } + } + } + } + }, + { + "name": "date_histogram_fixed_interval", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "fixed_interval": "60d" + } + } + } + } + }, + { + "name": "date_histogram_fixed_interval_with_tz", + "operation-type": "search", + "body": { + "size": 0, + "query": { 
+ "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "fixed_interval": "60d", + "time_zone": "America/New_York" + } + } + } + } + }, + { + "name": "date_histogram_fixed_interval_with_metrics", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "date_histogram": { + "field": "dropoff_datetime", + "fixed_interval": "60d" + }, + "aggs": { + "total_amount": { "stats": { "field": "total_amount" } }, + "tip_amount": { "stats": { "field": "tip_amount" } }, + "trip_distance": { "stats": { "field": "trip_distance" } } + } + } + } + } + }, + { + "name": "auto_date_histogram", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "auto_date_histogram": { + "field": "dropoff_datetime", + "buckets": "12" + } + } + } + } + }, + { + "name": "auto_date_histogram_with_tz", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "auto_date_histogram": { + "field": "dropoff_datetime", + "buckets": "13", + "time_zone": "America/New_York" + } + } + } + } + }, + { + "name": "auto_date_histogram_with_metrics", + "operation-type": "search", + "body": { + "size": 0, + "query": { + "range": { + "dropoff_datetime": { + "gte": "2015-01-01 00:00:00", + "lt": "2016-01-01 00:00:00" + } + } + }, + "aggs": { + "dropoffs_over_time": { + "auto_date_histogram": { + "field": "dropoff_datetime", + "buckets": "12" + }, + "aggs": { + "total_amount": { "stats": { 
"field": "total_amount" } }, + "tip_amount": { "stats": { "field": "tip_amount" } }, + "trip_distance": { "stats": { "field": "trip_distance" } } + } + } + } + } + }, + { + "name": "desc_sort_tip_amount", + "operation-type": "search", + "index": "nyc_taxis", + "body": { + "query": { + "match_all": {} + }, + "sort" : [ + {"tip_amount" : "desc"} + ] + } + }, + { + "name": "asc_sort_tip_amount", + "operation-type": "search", + "index": "nyc_taxis", + "body": { + "query": { + "match_all": {} + }, + "sort" : [ + {"tip_amount" : "asc"} + ] + } + } +``` + +The `_test-procedures` directory contains a `default.json` file that sets the order of operations performed by the workload. Similar to the `_operations` directory, the `_test-procedures` directory can also contain feature-specific test procedures, such as `searchable_snapshots.json` for `nyc_taxis`. The following examples show the searchable snapshots test procedures for `nyc_taxis`: + +```json + { + "name": "searchable-snapshot", + "description": "Measuring performance for Searchable Snapshot feature. 
Based on the default test procedure 'append-no-conflicts'.", + "schedule": [ + { + "operation": "delete-index" + }, + { + "operation": { + "operation-type": "create-index", + "settings": { + "index.codec": "best_compression", + "index.refresh_interval": "30s", + "index.translog.flush_threshold_size": "4g" + } + } + }, + { + "name": "check-cluster-health", + "operation": { + "operation-type": "cluster-health", + "index": "nyc_taxis", + "request-params": { + "wait_for_status": "{{ cluster_health | default('green') }}", + "wait_for_no_relocating_shards": "true" + }, + "retry-until-success": true + } + }, + { + "operation": "index", + "warmup-time-period": 240, + "clients": {{ bulk_indexing_clients | default(8) }}, + "ignore-response-error-level": "{{ error_level | default('non-fatal') }}" + }, + { + "name": "refresh-after-index", + "operation": "refresh" + }, + { + "operation": { + "operation-type": "force-merge", + "request-timeout": 7200 + } + }, + { + "name": "refresh-after-force-merge", + "operation": "refresh" + }, + { + "operation": "wait-until-merges-finish" + }, + { + "operation": "create-snapshot-repository" + }, + { + "operation": "delete-snapshot" + }, + { + "operation": "create-snapshot" + }, + { + "operation": "wait-for-snapshot-creation" + }, + { + "operation": { + "name": "delete-local-index", + "operation-type": "delete-index" + } + }, + { + "operation": "restore-snapshot" + }, + { + "operation": "default", + "warmup-iterations": 50, + "iterations": 100 + }, + { + "operation": "range", + "warmup-iterations": 50, + "iterations": 100 + }, + { + "operation": "distance_amount_agg", + "warmup-iterations": 50, + "iterations": 50 + }, + { + "operation": "autohisto_agg", + "warmup-iterations": 50, + "iterations": 100 + }, + { + "operation": "date_histogram_agg", + "warmup-iterations": 50, + "iterations": 100 + } + ] + } +``` + +## Next steps + +Now that you have familiarized yourself with the anatomy of a workload, see the criteria for [Choosing a 
workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/choosing-a-workload/). diff --git a/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md b/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md new file mode 100644 index 0000000000..d7ae48ad0a --- /dev/null +++ b/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md @@ -0,0 +1,29 @@ +--- +layout: default +title: Choosing a workload +nav_order: 20 +grand_parent: User guide +parent: Understanding workloads +--- + +# Choosing a workload + +The [opensearch-benchmark-workloads](https://github.com/opensearch-project/opensearch-benchmark-workloads) repository contains a list of workloads that you can use to run your benchmarks. Using a workload similar to your cluster's use cases can save you time and effort when assessing your cluster's performance. + +For example, say you're a system architect at a rideshare company. As a rideshare company, you collect and store data based on trip times, locations, and other data related to each rideshare. Instead of building a custom workload and using your own data, which requires additional time, effort, and cost, you can use the [nyc_taxis](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) workload to benchmark your cluster because the data inside the workload is similar to the data that you collect. + +## Criteria for choosing a workload + +Consider the following criteria when deciding which workload would work best for benchmarking your cluster: + +- The cluster's use case. +- The data types that your cluster uses compared to the data structure of the documents contained in the workload. Each workload contains an example document so that you can compare data types, or you can view the index mappings and data types in the `index.json` file. +- The query types most commonly used inside your cluster. 
The `_operations/default.json` file contains information about the query types and workload operations. + +## General search clusters + +For benchmarking clusters built for general search use cases, start with the [`nyc_taxis`](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) workload. This workload contains data about the rides taken in yellow taxis in New York City in 2015. + +## Log data + +For benchmarking clusters built for indexing and search with log data, use the [`http_logs`](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/http_logs) workload. This workload contains HTTP server log data from the 1998 World Cup website. \ No newline at end of file diff --git a/_benchmark/user-guide/understanding-workloads/index.md b/_benchmark/user-guide/understanding-workloads/index.md new file mode 100644 index 0000000000..6e6d2aa9c1 --- /dev/null +++ b/_benchmark/user-guide/understanding-workloads/index.md @@ -0,0 +1,14 @@ +--- +layout: default +title: Understanding workloads +nav_order: 10 +parent: User guide +has_children: true +--- + +# Understanding workloads + +OpenSearch Benchmark includes a set of [workloads](https://github.com/opensearch-project/opensearch-benchmark-workloads) that you can use to benchmark data from your cluster. Workloads contain descriptions of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. The document corpus contains any indexes, data files, and operations invoked when the workload runs. + + +