Add understanding workloads section (#6164) (#6298)

* Add understanding workloads section.
* Add additional anatomy sections
* Add section headers
* Fix link
* Fix typos
* Change example to fix build error.
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
* Apply suggestions from code review
* Update anatomy-of-a-workload.md
* Fix build errors
* Update anatomy-of-a-workload.md
* Update anatomy-of-a-workload.md
* Update concepts.md
* Update index.md

---------

(cherry picked from commit bf4ae72)

Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
opensearch-project · Jan 30, 2024 · 7c0785a
1 parent daf949c
Showing 5 changed files with 789 additions and 148 deletions.
diff --git a/_benchmark/reference/workloads/index.md b/_benchmark/reference/workloads/index.md
@@ -16,7 +16,7 @@ A workload is a specification of one or more benchmarking scenarios. A workload
 
 This section provides a list of options and examples you can use when customizing or using a workload.
 
-For more information about what comprises a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/concepts#anatomy-of-a-workload). 
+For more information about what comprises a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/understanding-workloads/anatomy-of-a-workload/).
 
 
 ## Workload examples

diff --git a/_benchmark/user-guide/concepts.md b/_benchmark/user-guide/concepts.md
@@ -11,7 +11,7 @@ Before using OpenSearch Benchmark, familiarize yourself with the following conce
 
 ## Core concepts and definitions
 
-- **Workload**: The description of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. The document corpus contains any indexes, data files, and operations invoked when the workflow runs. You can list the available workloads by using `opensearch-benchmark list workloads` or view any included workloads in the [OpenSearch Benchmark Workloads repository](https://github.com/opensearch-project/opensearch-benchmark-workloads/). For more information about the elements of a workload, see [Anatomy of a workload](#anatomy-of-a-workload). For information about building a custom workload, see [Creating custom workloads]({{site.url}}{{site.baseurl}}/benchmark/creating-custom-workloads/).
+- **Workload**: The description of one or more benchmarking scenarios that use a specific document corpus to perform a benchmark against your cluster. The document corpus contains any indexes, data files, and operations invoked when the workflow runs. You can list the available workloads by using `opensearch-benchmark list workloads` or view any included workloads in the [OpenSearch Benchmark Workloads repository](https://github.com/opensearch-project/opensearch-benchmark-workloads/). For more information about the elements of a workload, see [Anatomy of a workload]({{site.url}}{{site.baseurl}}/benchmark/understanding-workloads/anatomy-of-a-workload/). For information about building a custom workload, see [Creating custom workloads]({{site.url}}{{site.baseurl}}/benchmark/creating-custom-workloads/).
 
 - **Pipeline**: A series of steps occurring before and after a workload is run that determines benchmark results. OpenSearch Benchmark supports three pipelines:
   - `from-sources`: Builds and provisions OpenSearch, runs a benchmark, and then publishes the results.
@@ -110,149 +110,5 @@ This latency cascade continues, increasing latency by 100ms for each subsequent
 
 ### Recommendation
 
-As shown by the preceding examples, you should be aware of the average service time of each task and provide a `target-throughput` that accounts for the service time. The OpenSearch Benchmark latency is calculated based on the `target-throughput` set by the user. In other words, the OpenSearch Benchmark latency could be redefined as "throughput-based latency".
-
-## Anatomy of a workload
-
-The following example workload shows all of the essential elements needed to create a `workload.json` file. You can run this workload in your own benchmark configuration to understand how all of the elements work together:
-
-```json
-{
-  "description": "Tutorial benchmark for OpenSearch Benchmark",
-  "indices": [
-    {
-      "name": "movies",
-      "body": "index.json"
-    }
-  ],
-  "corpora": [
-    {
-      "name": "movies",
-      "documents": [
-        {
-          "source-file": "movies-documents.json",
-          "document-count": 11658903, # Fetch document count from command line
-          "uncompressed-bytes": 1544799789 # Fetch uncompressed bytes from command line
-        }
-      ]
-    }
-  ],
-  "schedule": [
-    {
-      "operation": {
-        "operation-type": "create-index"
-      }
-    },
-    {
-      "operation": {
-        "operation-type": "cluster-health",
-        "request-params": {
-          "wait_for_status": "green"
-        },
-        "retry-until-success": true
-      }
-    },
-    {
-      "operation": {
-        "operation-type": "bulk",
-        "bulk-size": 5000
-      },
-      "warmup-time-period": 120,
-      "clients": 8
-    },
-    {
-      "operation": {
-        "name": "query-match-all",
-        "operation-type": "search",
-        "body": {
-          "query": {
-            "match_all": {}
-          }
-        }
-      },
-      "iterations": 1000,
-      "target-throughput": 100
-    }
-  ]
-}
-```
-
-A workload usually includes the following elements:
-
-- [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/): Defines the relevant indexes and index templates used for the workload.
-- [corpora]({{site.url}}{{site.baseurl}}/benchmark/workloads/corpora/): Defines all document corpora used for the workload.
-- `schedule`: Defines operations and the order in which the operations run inline. Alternatively, you can use `operations` to group operations and the `test_procedures` parameter to specify the order of operations.
-- `operations`: **Optional**. Describes which operations are available for the workload and how they are parameterized.
-
-### Indices
-
-To create an index, specify its `name`. To add definitions to your index, use the `body` option and point it to the JSON file containing the index definitions. For more information, see [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/).
-
-### Corpora
-
-The `corpora` element requires the name of the index containing the document corpus, for example, `movies`, and a list of parameters that define the document corpora. This list includes the following parameters:
-
--  `source-file`: The file name that contains the workload's corresponding documents. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base_url`, use a compressed file format: `.zip`, `.bz2`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must have one JSON file containing the name.
--  `document-count`: The number of documents in the `source-file`, which determines which client indexes correlate to which parts of the document corpus. Each N client receives an Nth of the document corpus. When using a source that contains a document with a parent-child relationship, specify the number of parent documents.
-- `uncompressed-bytes`: The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs.
-- `compressed-bytes`: The size, in bytes, of the source file before decompression. This can help you assess the amount of time needed for the cluster to ingest documents.
-
-### Operations
-
-The `operations` element lists the OpenSearch API operations performed by the workload. For example, you can set an operation to `create-index`, an index in the test cluster to which OpenSearch Benchmark can write documents. Operations are usually listed inside of `schedule`.
-
-### Schedule
-
-The `schedule` element contains a list of actions and operations that are run by the workload. Operations run according to the order in which they appear in the `schedule`. The following example illustrates a `schedule` with multiple operations, each defined by its `operation-type`:
-
-```json
-  "schedule": [
-    {
-      "operation": {
-        "operation-type": "create-index"
-      }
-    },
-    {
-      "operation": {
-        "operation-type": "cluster-health",
-        "request-params": {
-          "wait_for_status": "green"
-        },
-        "retry-until-success": true
-      }
-    },
-    {
-      "operation": {
-        "operation-type": "bulk",
-        "bulk-size": 5000
-      },
-      "warmup-time-period": 120,
-      "clients": 8
-    },
-    {
-      "operation": {
-        "name": "query-match-all",
-        "operation-type": "search",
-        "body": {
-          "query": {
-            "match_all": {}
-          }
-        }
-      },
-      "iterations": 1000,
-      "target-throughput": 100
-    }
-  ]
-}
-```
-
-According to this schedule, the actions will run in the following order:
-
-1. The `create-index` operation creates an index. The index remains empty until the `bulk` operation adds documents with benchmarked data.
-2. The `cluster-health` operation assesses the health of the cluster before running the workload. In this example, the workload waits until the status of the cluster's health is `green`.
-   - The `bulk` operation runs the `bulk` API to index `5000` documents simultaneously.
-   - Before benchmarking, the workload waits until the specified `warmup-time-period` passes. In this example, the warmup period is `120` seconds.
-5. The `clients` field defines the number of clients that will run the remaining actions in the schedule concurrently.
-6. The `search` runs a `match_all` query to match all documents after they have been indexed by the `bulk` API using the 8 clients specified.
-   - The `iterations` field indicates the number of times each client runs the `search` operation. The report generated by the benchmark automatically adjusts the percentile numbers based on this number. To generate a precise percentile, the benchmark needs to run at least 1,000 iterations.
-   - Lastly, the `target-throughput` field defines the number of requests per second each client performs, which, when set, can help reduce the latency of the benchmark. For example, a `target-throughput` of 100 requests divided by 8 clients means that each client will issue 12 requests per second.
+As shown by the preceding examples, you should be aware of the average service time of each task and provide a `target-throughput` that accounts for the service time. The OpenSearch Benchmark latency is calculated based on the `target-throughput` set by the user, that is, the latency could be redefined as "throughput-based latency."
+