Add DataGenerator tool #1059

Merged
merged 2 commits into from
Oct 13, 2024

@peternied peternied commented Oct 10, 2024

Description

New programmatic and CLI way to generate test data mirroring geonames, http_logs, nyc_taxis, and nested workloads from OSB.

In the past this repo has relied on OpenSearch Benchmark (OSB), which has been valuable for real-world data. However, its dependencies on external internet sources and Python libraries have caused issues. This tool gives the project similar functionality without those issues.
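To illustrate what generating workload-like data without external downloads can look like, here is a minimal sketch of a seeded generator for geonames-style documents. It is not the code from this pull request: the class name `SampleDocGenerator`, its methods, and the chosen fields and value ranges are all hypothetical and only meant to show the idea of deterministic, offline test data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: names and fields are illustrative, not this PR's actual API.
final class SampleDocGenerator {
    // Small fixed pool of values so no external data source is needed.
    private static final List<String> COUNTRY_CODES = List.of("US", "DE", "FR", "JP", "BR");

    // Builds one geonames-like document; the same seed and index always yield the same document.
    static Map<String, Object> geonameDoc(long seed, int index) {
        Random random = new Random(seed + index);
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("geonameid", index);
        doc.put("name", "place-" + index);
        doc.put("country_code", COUNTRY_CODES.get(random.nextInt(COUNTRY_CODES.size())));
        doc.put("latitude", -90 + 180 * random.nextDouble());
        doc.put("longitude", -180 + 360 * random.nextDouble());
        doc.put("population", random.nextInt(1_000_000));
        return doc;
    }

    // Generates `count` documents for a given seed.
    static List<Map<String, Object>> generate(long seed, int count) {
        return IntStream.range(0, count)
            .mapToObj(i -> geonameDoc(seed, i))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        generate(42L, 3).forEach(System.out::println);
    }
}
```

Seeding the random source per document keeps test runs reproducible, which is one plausible reason to prefer generated data over downloaded corpora in integration tests.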

Issues Resolved

Testing

  • Ran ./gradlew TrafficCapture:dockerSolution:composeUp
  • ES 7.10 cluster check ./gradlew DataGenerator:run --args=' --target-host https://172.18.0.1:19200 --target-insecure --target-username admin --target-password admin --docs-per-workload 10'
  • Verified output of curl -k -u admin:admin https://172.18.0.1:19200/_cat/indices?v
  • OS 2.15 cluster check ./gradlew DataGenerator:run --args=' --target-host https://172.18.0.1:29200 --target-insecure --target-username admin --target-password myStrongPassword123!'
  • Verified output of curl -k -u 'admin:myStrongPassword123!' https://172.18.0.1:29200/_cat/indices?v

Note: the tool comes with its own end-to-end test case and is used by other test cases.
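The manual `curl` checks above could also be scripted. The following is a minimal sketch, assuming the response is the plain-text `_cat/indices?v` table with `index` and `docs.count` header columns. The class `CatIndicesParser` is hypothetical, and the sample row in `main` is made up for illustration; it is not output captured from the clusters in this pull request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses the text table returned by GET _cat/indices?v.
final class CatIndicesParser {
    // Returns index name -> docs.count, locating both columns by their header names.
    static Map<String, Long> docCounts(String catOutput) {
        String[] lines = catOutput.strip().split("\\R");
        String[] header = lines[0].trim().split("\\s+");
        int indexCol = -1;
        int countCol = -1;
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals("index")) indexCol = i;
            if (header[i].equals("docs.count")) countCol = i;
        }
        if (indexCol < 0 || countCol < 0) {
            throw new IllegalArgumentException("Expected 'index' and 'docs.count' header columns");
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].trim().split("\\s+");
            // Rows with blank cells (for example closed indices) do not align; skip short rows.
            if (cells.length <= Math.max(indexCol, countCol)) continue;
            counts.put(cells[indexCol], Long.parseLong(cells[countCol]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample only, not real cluster output.
        String sample = "health status index uuid pri rep docs.count docs.deleted store.size pri.store.size\n"
            + "green open geonames abc123 1 0 10 0 5kb 5kb\n";
        System.out.println(docCounts(sample));
    }
}
```

Looking columns up by header name rather than by fixed position keeps the sketch tolerant of column-order differences between versions, though splitting on whitespace remains a simplification.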

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 95.68966% with 15 lines in your changes missing coverage. Please review.

Project coverage is 80.52%. Comparing base (bee316d) to head (fe01c79).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../java/org/opensearch/migrations/DataGenerator.java | 51.85% | 13 Missing ⚠️ |
| .../org/opensearch/migrations/data/FieldBuilders.java | 90.90% | 1 Missing ⚠️ |
| ...opensearch/migrations/data/RandomDataBuilders.java | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1059      +/-   ##
============================================
+ Coverage     80.16%   80.52%   +0.36%     
- Complexity     2744     2818      +74     
============================================
  Files           367      379      +12     
  Lines         13743    14088     +345     
  Branches        949      956       +7     
============================================
+ Hits          11017    11345     +328     
- Misses         2149     2165      +16     
- Partials        577      578       +1     
| Flag | Coverage Δ |
| --- | --- |
| gradle-test | 78.57% <95.68%> (+0.50%) ⬆️ |
| python-test | 90.24% <ø> (ø) |


@@ -39,24 +43,19 @@ public class ParallelDocumentMigrationsTest extends SourceTestBase {
static final List<SearchClusterContainer.ContainerVersion> TARGET_IMAGES = List.of(SearchClusterContainer.OS_V2_14_0);

public static Stream<Arguments> makeDocumentMigrationArgs() {
List<Object[]> sourceImageArgs = SOURCE_IMAGES.stream()

I ended up not using the preloaded images; uploading all of the data took only ~4 seconds. While we will repeat this cost for each test, the runtime appears to be nearly the same as in the previous run.

I've left all the existing code in place so we are ready to jump to much larger data volumes in future test cases.

New programmatic and CLI way to generate test data mirroring geonames,
http_logs, nyc_taxis, and nested workloads from OSB.

In the past this repo has been making use of OpenSearch
Benchmark which has been valuable for real world data. However, there
are dependencies on external internet sources and python libraries that
have caused issues in the past.  This tool gives this project similar
functionality without those issues.

Signed-off-by: Peter Nied <[email protected]>

peternied commented Oct 10, 2024

Now that the test runs have completed, I have some data points on the performance difference. Two test classes were impacted: ProcessLifecycleTest and ParallelDocumentMigrationsTest. Comparing the results from a recent merge that executed these test classes [1] against this pull request's results [2], we see the following:

ProcessLifecycleTest

| Test Method | Previous Duration | w/ Data Generator Duration |
| --- | --- | --- |
| testExitsZeroThenThreeForSimpleSetup() | 46.498s | 47.516s |
| [1] NEVER, 0 | 2m48.86s | 36.077s |
| [2] AT_STARTUP, 1 | 54.133s | 32.270s |
| [3] WITH_DELAYS, 2 | 54.912s | 34.258s |

ParallelDocumentMigrationsTest

| Test Method | Previous Duration | w/ Data Generator Duration |
| --- | --- | --- |
| [1] 1, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, true | 33.510s | 36.988s |
| [2] 1, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, false | 32.063s | 36.452s |
| [3] 3, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, true | 28.801s | 32.277s |
| [4] 3, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, false | 27.703s | 32.587s |
| [5] 40, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, true | 35.489s | 39.517s |
| [6] 40, OPENSEARCH 2.14.0, ELASTICSEARCH 7.10.2, false | 34.058s | 36.908s |

While it looks like the cached version is ~4 seconds faster per test, avoiding the large 2 minute cost to run OSB for the first time seems to make this overall change a net positive. FYI @gregschohn
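For anyone who wants to check the arithmetic behind that summary, the durations listed above can be tallied with a short sketch like the one below. The class name `DurationComparison` is hypothetical; the numbers are copied from the two comparisons in this comment, with 2m48.86s converted to 168.86 seconds.

```java
import java.util.stream.IntStream;

// Hypothetical helper for tallying the durations reported in this comment.
final class DurationComparison {
    // Durations in seconds, in the order listed above (2m48.86s = 168.86s).
    static final double[] LIFECYCLE_PREVIOUS = {46.498, 168.86, 54.133, 54.912};
    static final double[] LIFECYCLE_GENERATED = {47.516, 36.077, 32.270, 34.258};
    static final double[] PARALLEL_PREVIOUS = {33.510, 32.063, 28.801, 27.703, 35.489, 34.058};
    static final double[] PARALLEL_GENERATED = {36.988, 36.452, 32.277, 32.587, 39.517, 36.908};

    // Sum of (generated - previous); a negative value means the new approach is faster overall.
    static double totalDelta(double[] previous, double[] generated) {
        return IntStream.range(0, previous.length)
            .mapToDouble(i -> generated[i] - previous[i])
            .sum();
    }

    // Average per-test change in seconds.
    static double meanDelta(double[] previous, double[] generated) {
        return totalDelta(previous, generated) / previous.length;
    }

    public static void main(String[] args) {
        double lifecycle = totalDelta(LIFECYCLE_PREVIOUS, LIFECYCLE_GENERATED);
        double parallel = totalDelta(PARALLEL_PREVIOUS, PARALLEL_GENERATED);
        System.out.printf("ProcessLifecycleTest total change: %.3fs%n", lifecycle);
        System.out.printf("ParallelDocumentMigrationsTest total change: %.3fs%n", parallel);
        System.out.printf("ParallelDocumentMigrationsTest mean change: %.3fs%n",
            meanDelta(PARALLEL_PREVIOUS, PARALLEL_GENERATED));
        System.out.printf("Net change across both classes: %.3fs%n", lifecycle + parallel);
    }
}
```

Under these figures the ParallelDocumentMigrationsTest cases slow down by roughly 3.85 seconds each (about 23 seconds in total), while ProcessLifecycleTest gains about 174 seconds, for a net saving of roughly 151 seconds across the two classes.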

Signed-off-by: Peter Nied <[email protected]>

@chelma chelma left a comment

Thanks for the change, Peter!

@peternied peternied merged commit b50a596 into opensearch-project:main Oct 13, 2024
14 checks passed