
Auto-generated / custom datasets for workloads #99

Closed
3 tasks
achitojha opened this issue Jan 6, 2022 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@achitojha
Contributor

achitojha commented Jan 6, 2022

OpenSearch Benchmark workloads currently ingest records into OpenSearch/Elasticsearch from an existing, fixed dataset. The goal of this task is to add support for automatically generated datasets, so that workloads can use large generated corpora without depending on any specific source data.

Acceptance Criteria

  • We have generated 3 to 5 datasets for custom workloads based on real-world use cases
  • The generated workloads can be used by OpenSearch Benchmark to measure specific performance stats for an OpenSearch cluster
  • The custom workloads are tailored to specific needs (e.g., smaller workloads to run against each PR, medium workloads for more frequent runs, complex workloads to capture specific issues or regressions in longevity tests)
@dblock
Member

dblock commented Jan 6, 2022

Hopefully those datasets aren't "random", but maybe "configurable or auto-generated data sets"?

@achitojha achitojha changed the title Random workload generation Auto-generated datasets for workloads Jan 10, 2022
@achitojha
Contributor Author

achitojha commented Jan 10, 2022

@dblock : Agreed - updated the issue.

@bbarani
Member

bbarani commented Feb 8, 2023

@dblock @achitojha Is the plan here to create automated workloads (e.g., using create-workload --workload) from an existing active cluster and use those workloads for performance testing?

@achitojha
Contributor Author

We don’t have a finalized approach for this. The suggestion above could be a possible step forward. Another option would be to specify certain attributes and have OpenSearch Benchmark generate a dataset based on them. There may be other approaches as well.
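
As a rough illustration of the attribute-driven option, here is a minimal sketch in Python. The `FIELD_SPEC` layout, the field types, and the function names are all hypothetical, invented for this example; they are not an existing OpenSearch Benchmark format or API. The output is newline-delimited JSON, one document per line.

```python
import json
import random
from datetime import datetime, timedelta

# Hypothetical field spec: field name -> (type, options).
# This layout is an assumption for illustration, not an OpenSearch Benchmark format.
FIELD_SPEC = {
    "status": ("choice", {"values": [200, 404, 500], "weights": [90, 8, 2]}),
    "bytes": ("int", {"min": 100, "max": 50_000}),
    "@timestamp": ("timestamp", {"start": "2023-01-01T00:00:00+00:00", "step_seconds": 1}),
}


def generate_docs(spec, count, seed=0):
    """Yield `count` documents built from `spec`, reproducibly for a given seed."""
    rng = random.Random(seed)
    for i in range(count):
        doc = {}
        for name, (kind, opts) in spec.items():
            if kind == "choice":
                # Weighted pick from a fixed set of values.
                doc[name] = rng.choices(opts["values"], weights=opts["weights"])[0]
            elif kind == "int":
                # Uniform integer, both bounds inclusive.
                doc[name] = rng.randint(opts["min"], opts["max"])
            elif kind == "timestamp":
                # Monotonically increasing timestamps, one step per document.
                start = datetime.fromisoformat(opts["start"])
                doc[name] = (start + timedelta(seconds=i * opts["step_seconds"])).isoformat()
            else:
                raise ValueError(f"unsupported field type: {kind}")
        yield doc


def write_corpus(path, spec, count, seed=0):
    """Write one JSON document per line to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in generate_docs(spec, count, seed):
            f.write(json.dumps(doc) + "\n")
```

Seeding the generator keeps runs comparable: the same spec, count, and seed always produce the same corpus.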

@bbarani bbarani changed the title Auto-generated datasets for workloads [META] Auto-generated datasets for workloads Feb 10, 2023
@bbarani bbarani added the enhancement New feature or request label Feb 10, 2023
@bbarani bbarani changed the title [META] Auto-generated datasets for workloads [META] Auto-generated / custom datasets for workloads Feb 15, 2023
@bbarani bbarani moved this from Not started to In Progress in OpenSearch Engineering Effectiveness Feb 28, 2023
@gkamat gkamat changed the title [META] Auto-generated / custom datasets for workloads Auto-generated / custom datasets for workloads Apr 4, 2023
@gkamat gkamat moved this from In Progress to Backlog in OpenSearch Engineering Effectiveness Apr 4, 2023
@gkamat
Collaborator

gkamat commented Apr 4, 2023

Initial focus is on providing a capability to increase the data corpus size for a workload.

@dtaivpp

dtaivpp commented Apr 4, 2023

One thing I wanted to add: it would be handy if, in this or a future version, we could add some randomness that could be used to demonstrate anomaly detection.
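
A minimal sketch of what that could look like, assuming a simple latency-style metric: mostly Gaussian values with occasional injected spikes. The field names, the distribution parameters, and the `is_anomaly` ground-truth flag are assumptions made for this example, not part of any existing workload.

```python
import random


def series_with_anomalies(count, anomaly_rate=0.01, seed=0):
    """Yield metric documents where a small fraction are injected spikes."""
    rng = random.Random(seed)
    for i in range(count):
        # Baseline behaviour: values clustered around 100 ms.
        value = rng.gauss(100.0, 5.0)
        is_anomaly = rng.random() < anomaly_rate
        if is_anomaly:
            # Injected anomaly: scale the value up by 5x to 10x.
            value *= rng.uniform(5.0, 10.0)
        # The flag is ground truth, useful for checking what a detector finds.
        yield {"seq": i, "latency_ms": round(value, 2), "is_anomaly": is_anomaly}
```

Keeping the flag in the generated documents (or in a side file) makes it possible to compare detected anomalies against the injected ones.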

@IanHoang IanHoang self-assigned this Apr 18, 2023
@dblock
Member

dblock commented Apr 25, 2023

I'd like to be able to generate data such as "1B IP addresses skewed towards US-based IPs" (or other data that follows some statistical distribution). Maybe there are existing tools that can do that well?
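
For illustration only, a skew like this can be sketched with weighted sampling over per-region address pools. The CIDR blocks below are reserved documentation ranges used as placeholders, not real country allocations, and the 80/20 split and all names are assumptions; a real generator would load ranges from a GeoIP dataset.

```python
import ipaddress
import random

# Placeholder pools: documentation-only ranges standing in for real
# per-country allocations (assumption for this sketch).
REGION_POOLS = {
    "US": [ipaddress.ip_network("203.0.113.0/24")],
    "OTHER": [ipaddress.ip_network("198.51.100.0/24")],
}
# Assumed skew: roughly 80% of generated addresses come from the "US" pool.
REGION_WEIGHTS = {"US": 0.8, "OTHER": 0.2}


def skewed_ips(count, seed=0):
    """Yield (region, ip) pairs with regions drawn according to REGION_WEIGHTS."""
    rng = random.Random(seed)
    regions = list(REGION_WEIGHTS)
    weights = [REGION_WEIGHTS[r] for r in regions]
    for _ in range(count):
        region = rng.choices(regions, weights=weights)[0]
        net = rng.choice(REGION_POOLS[region])
        # Pick a uniformly random address inside the chosen network.
        offset = rng.randrange(net.num_addresses)
        yield region, str(net[offset])
```

Because this is a generator, it can stream very large counts without holding them in memory; other distributions can be substituted by changing the weights or the per-region pools.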

@dblock
Member

dblock commented Apr 25, 2023

Is this a dup/subset of #253?

@IanHoang
Collaborator

@dblock yes, this is technically a duplicate/subset of the RFC in #253; this issue was created before the RFC, and the RFC goes into more depth.

@dblock
Member

dblock commented Apr 25, 2023

Let's close!

Projects
Archived in project
6 participants