Auto-generated / custom datasets for workloads #99

achitojha · 2022-01-06T16:23:48Z

OpenSearch Benchmark workloads currently use an existing dataset to ingest records into OpenSearch/Elasticsearch. The goal of this task is to build support for automatically generated datasets for a workload. This would enable workloads to have large auto-generated datasets without requiring any specific pre-existing data.

Acceptance Criteria

We have generated 3 to 5 datasets for custom workloads based on real-world use cases
The generated workloads can be used by OpenSearch Benchmark to measure specific performance stats for an OpenSearch cluster
The custom workloads are specific to the needs (e.g., smaller workloads for executing against each PR, medium workloads for more frequent runs, complex workloads to capture specific issues/regressions for longevity tests, etc.)

dblock · 2022-01-06T16:51:17Z

Hopefully those datasets aren't "random", but maybe "configurable or auto-generated data sets"?

achitojha · 2022-01-10T16:16:20Z

@dblock : Agreed - updated the issue.

bbarani · 2023-02-08T22:37:15Z

@dblock @achitojha Is the plan here to create automated workloads (e.g., using create-workload --workload) from an existing active cluster and use those workloads for performance testing?

achitojha · 2023-02-09T01:52:26Z

We don’t have a finalized approach for this. The above suggestion could be a possible step forward. Another option could be to specify certain attributes and have OpenSearch Benchmark generate a dataset based on those attributes. There may be other approaches as well.
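A minimal sketch of what the attribute-based option could look like, assuming a small field spec that drives document generation. The spec format, the field names, and the functions `generate_doc` and `generate_corpus` are hypothetical and are not part of OpenSearch Benchmark.

```python
import json
import random

# Hypothetical attribute spec: field names, types, and ranges are
# illustrative only and do not correspond to an OpenSearch Benchmark format.
SPEC = {
    "status": {"type": "choice", "values": [200, 404, 500]},
    "bytes": {"type": "int", "min": 0, "max": 10_000},
    "latency_ms": {"type": "float", "min": 0.0, "max": 500.0},
}


def generate_doc(spec, rng):
    """Build one document by drawing a value for each field in the spec."""
    doc = {}
    for name, attr in spec.items():
        if attr["type"] == "choice":
            doc[name] = rng.choice(attr["values"])
        elif attr["type"] == "int":
            doc[name] = rng.randint(attr["min"], attr["max"])
        elif attr["type"] == "float":
            doc[name] = rng.uniform(attr["min"], attr["max"])
        else:
            raise ValueError(f"unsupported type: {attr['type']}")
    return doc


def generate_corpus(spec, count, seed=0):
    """Yield `count` JSON lines; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    for _ in range(count):
        yield json.dumps(generate_doc(spec, rng))
```

Writing the output of `generate_corpus(SPEC, n)` one line per document would give a newline-delimited JSON file of arbitrary size. Seeding the generator matters for benchmarking, since two runs can then be compared on identical data.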

gkamat · 2023-04-04T17:01:33Z

Initial focus is on providing a capability to increase the data corpus size for a workload.

dtaivpp · 2023-04-04T17:08:23Z

One thing I wanted to add: it would be handy if, in this or a future version, we could add some randomness that could be used to demonstrate anomaly detection.
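As a rough illustration of that idea, the sketch below emits a noisy metric around a baseline and occasionally adds a labeled spike. The function name `series_with_anomalies`, its parameters, and the document fields are all assumptions made for this example.

```python
import random


def series_with_anomalies(count, baseline=100.0, noise=5.0,
                          anomaly_rate=0.01, spike=10.0, seed=0):
    """Yield documents with Gaussian noise and occasional labeled spikes.

    All parameter names and defaults are hypothetical. A spike adds
    `spike * noise` to the value, so it sits well outside normal variation.
    """
    rng = random.Random(seed)
    for i in range(count):
        value = rng.gauss(baseline, noise)
        is_anomaly = rng.random() < anomaly_rate
        if is_anomaly:
            value += spike * noise
        # The label is kept so detection results can be checked afterwards.
        yield {"seq": i, "value": value, "is_anomaly": is_anomaly}
```

Keeping the `is_anomaly` label in the generated data makes it possible to score a detector against known ground truth.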

dblock · 2023-04-25T16:24:25Z

I'd like to be able to generate data such as "1B IP addresses skewed towards US-based IPs" (or other data that follows some statistical distribution). Maybe there are existing tools that can do that well?
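For illustration, a weighted draw over address blocks is one simple way to get that kind of skew. In the sketch below the weights, the region names, and the helper functions are hypothetical, and the CIDR blocks are the RFC 5737 documentation ranges used as placeholders, not real geo-IP assignments.

```python
import ipaddress
import random

# Hypothetical weights and blocks. The CIDRs are RFC 5737 documentation
# ranges standing in for real per-country allocations.
REGIONS = {
    "US": {"weight": 0.7, "cidrs": ["192.0.2.0/24"]},
    "OTHER": {"weight": 0.3, "cidrs": ["198.51.100.0/24", "203.0.113.0/24"]},
}


def random_ip(cidr, rng):
    """Pick a uniformly random address inside one CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[rng.randrange(net.num_addresses)])


def skewed_ips(regions, count, seed=0):
    """Yield (region, ip) pairs, choosing regions in proportion to weight."""
    rng = random.Random(seed)
    names = list(regions)
    weights = [regions[name]["weight"] for name in names]
    for _ in range(count):
        region = rng.choices(names, weights=weights, k=1)[0]
        cidr = rng.choice(regions[region]["cidrs"])
        yield region, random_ip(cidr, rng)
```

With these weights roughly 70% of the generated addresses fall in the "US" placeholder block. A realistic version would need an actual mapping of address ranges to countries, which this sketch does not provide.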

dblock · 2023-04-25T16:39:45Z

Is this a dup/subset of #253?

IanHoang · 2023-04-25T16:43:36Z

@dblock Yes, this is technically a duplicate/subset of the RFC; this issue was created before it. The RFC dives deeper.

dblock · 2023-04-25T18:03:01Z

Let's close!

achitojha changed the title ~~Random workload generation~~ Auto-generated datasets for workloads Jan 10, 2022

bbarani changed the title ~~Auto-generated datasets for workloads~~ [META] Auto-generated datasets for workloads Feb 10, 2023

bbarani added this to OpenSearch Engineering Effectiveness Feb 10, 2023

bbarani moved this to Not started in OpenSearch Engineering Effectiveness Feb 10, 2023

bbarani added the enhancement New feature or request label Feb 10, 2023

bbarani changed the title ~~[META] Auto-generated datasets for workloads~~ [META] Auto-generated / custom datasets for workloads Feb 15, 2023

bbarani moved this from Not started to In Progress in OpenSearch Engineering Effectiveness Feb 28, 2023

bbarani assigned gkamat Feb 28, 2023

gkamat changed the title ~~[META] Auto-generated / custom datasets for workloads~~ Auto-generated / custom datasets for workloads Apr 4, 2023

gkamat moved this from In Progress to Backlog in OpenSearch Engineering Effectiveness Apr 4, 2023

IanHoang self-assigned this Apr 18, 2023

dblock closed this as completed Apr 25, 2023

github-project-automation bot moved this from Backlog to Done in OpenSearch Engineering Effectiveness Apr 25, 2023

github-project-automation bot added this to OpenSearch Benchmark Roadmap Aug 30, 2024

github-project-automation bot moved this to Completed in OpenSearch Benchmark Roadmap Aug 30, 2024
