Add DataGenerator tool #1059
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1059 +/- ##
============================================
+ Coverage 80.16% 80.52% +0.36%
- Complexity 2744 2818 +74
============================================
Files 367 379 +12
Lines 13743 14088 +345
Branches 949 956 +7
============================================
+ Hits 11017 11345 +328
- Misses 2149 2165 +16
- Partials 577 578 +1
@@ -39,24 +43,19 @@ public class ParallelDocumentMigrationsTest extends SourceTestBase {
    static final List<SearchClusterContainer.ContainerVersion> TARGET_IMAGES = List.of(SearchClusterContainer.OS_V2_14_0);

    public static Stream<Arguments> makeDocumentMigrationArgs() {
        List<Object[]> sourceImageArgs = SOURCE_IMAGES.stream()
I ended up not using the preloaded images; uploading all of the data took only ~4 seconds. While we will repeat this cost for each test, the runtime appears nearly the same as in the previous run.
I've left all the existing code in place so we are ready to jump to much larger data volumes in future test cases.
New programmatic and CLI way to generate test data mirroring geonames, http_logs, nyc_taxis, and nested workloads from OSB.
In the past this repo has been making use of OpenSearch Benchmark which has been valuable for real world data. However, there are dependencies on external internet sources and python libraries that have caused issues in the past. This tool gives this project similar functionality without those issues.
Signed-off-by: Peter Nied <[email protected]>
With the test runs complete, I have some data points on the performance difference. Two test classes were impacted: ProcessLifecycleTest and ParallelDocumentMigrationsTest. Comparing the results from a recent merge that executed these test classes [1] against this pull request's results [2], we see the following:
ProcessLifecycleTest
ParallelDocumentMigrationsTest
While the cached version appears to be ~4 seconds faster per test, avoiding the large 2-minute cost of running OSB for the first time seems to make this change a net positive overall. FYI @gregschohn
Signed-off-by: Peter Nied <[email protected]>
Thanks for the change, Peter!
Description
Adds a new programmatic and CLI way to generate test data mirroring the geonames, http_logs, nyc_taxis, and nested workloads from OpenSearch Benchmark (OSB).
In the past this repo has made use of OSB, which has been valuable for real-world data. However, its dependencies on external internet sources and Python libraries have caused issues. This tool gives the project similar functionality without those issues.
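To illustrate the general idea of generating workload-like test data locally instead of downloading it, here is a minimal sketch. It is not the DataGenerator module's actual code: the class name `SyntheticDocsSketch`, its methods, and the handful of geonames-style fields are all hypothetical and chosen only for illustration. It builds a seeded, reproducible NDJSON body in the shape the `_bulk` API expects (one action line followed by one source line per document).

```java
import java.util.Random;

/** Hypothetical sketch only; names and fields are assumptions, not the PR's implementation. */
final class SyntheticDocsSketch {

    // Illustrative values; the real workloads define much richer documents.
    private static final String[] COUNTRY_CODES = {"US", "DE", "JP", "BR", "IN"};

    /** Builds one small geonames-like JSON document from a seeded random source. */
    public static String geonamesLikeDoc(Random random, int id) {
        String country = COUNTRY_CODES[random.nextInt(COUNTRY_CODES.length)];
        int population = random.nextInt(1_000_000);
        return "{\"geonameid\":" + id
            + ",\"name\":\"place-" + id + "\""
            + ",\"country_code\":\"" + country + "\""
            + ",\"population\":" + population + "}";
    }

    /** Renders an NDJSON bulk body: an action line, then a source line, for each document. */
    public static String bulkBody(String indexName, int docCount, long seed) {
        // A fixed seed makes the generated data identical across runs and machines.
        Random random = new Random(seed);
        StringBuilder body = new StringBuilder();
        for (int id = 0; id < docCount; id++) {
            body.append("{\"index\":{\"_index\":\"").append(indexName)
                .append("\",\"_id\":\"").append(id).append("\"}}\n");
            body.append(geonamesLikeDoc(random, id)).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Prints a three-document bulk body for a hypothetical "geonames" index.
        System.out.print(bulkBody("geonames", 3, 42L));
    }
}
```

Because the output depends only on the seed and document count, a test can regenerate the same data on every run without network access, which is the property the description above is after.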
Issues Resolved
Testing
./gradlew TrafficCapture:dockerSolution:composeUp
./gradlew DataGenerator:run --args=' --target-host https://172.18.0.1:19200 --target-insecure --target-username admin --target-password admin --docs-per-workload 10'
curl -k -u admin:admin 'https://172.18.0.1:19200/_cat/indices?v'
./gradlew DataGenerator:run --args=' --target-host https://172.18.0.1:29200 --target-insecure --target-username admin --target-password myStrongPassword123!'
curl -k -u 'admin:myStrongPassword123!' 'https://172.18.0.1:29200/_cat/indices?v'
Note: the tool comes with its own end-to-end test case and is used by other test cases.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.