
Fix traffic limiting #526

Merged: 21 commits merged into main from FixTrafficLimiting on Mar 21, 2024
Conversation

gregschohn (Collaborator) commented on Feb 25, 2024

Description

Divide traffic source accumulation into a producer and consumer.

Previously, a single thread acquired TrafficStreams from the TrafficSource and sent them through the CapturedTrafficToHttpTransactionAccumulator. That accumulator calls a callback object when requests and responses have been rematerialized. The callbacks that the TrafficReplayer supplied, however, blocked when the number of requests was high and needed to be throttled.

Now, the TrafficStreamLimiter takes a callback task to execute instead of only providing blocking functions. The new queueWork() function never blocks; instead, it moves the work onto a transfer queue. That queue is read by a new worker thread, also owned by the TrafficStreamLimiter, which consumes items (throttling consumption as needed) in the order they were received.

We still throttle the number of requests, but we allow them to accumulate in memory without processing them or sending them to the target server, which is what the original design had always intended. We're now able to do this without impacting the traffic source objects, which can expect to run on a dedicated thread where only the traffic source itself is responsible for any blocking. A minimal sketch of the new hand-off appears below.
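The sketch below illustrates the hand-off pattern described above. It is a minimal approximation, assuming a LinkedTransferQueue plus a Semaphore; the class and method names other than queueWork() are illustrative, and the actual TrafficStreamLimiter may be structured differently.

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TransferQueue;

// Minimal sketch only; not the actual TrafficStreamLimiter implementation.
class WorkLimiterSketch implements AutoCloseable {
    private final TransferQueue<Runnable> queue = new LinkedTransferQueue<>();
    private final Semaphore inFlightPermits; // caps concurrent requests (e.g. --max-concurrent-requests)
    private final Thread worker;

    WorkLimiterSketch(int maxConcurrentRequests) {
        inFlightPermits = new Semaphore(maxConcurrentRequests);
        worker = new Thread(this::consume, "trafficLimiterWorker");
        worker.start();
    }

    /** Never blocks the caller; the work item is handed to the dedicated worker thread. */
    void queueWork(Runnable task) {
        queue.add(task);
    }

    /** Called when a request completes so the next queued item may proceed. */
    void release() {
        inFlightPermits.release();
    }

    private void consume() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Runnable task = queue.take();
                inFlightPermits.acquire(); // throttling happens here, on the worker thread, in FIFO order
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

With this shape, the thread that reads from the traffic source only ever calls queueWork() and is never blocked by request throttling.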

  • Category - Bug fix
  • Why are these changes required? When more requests are outstanding than --max-concurrent-requests (default 1024), the TrafficStreamLimiter kicks in and throttles; if some of those throttled messages take longer than the Kafka poll timeout to be released, the Kafka consumer falls out of the group.
  • What is the old behavior before changes and new behavior after changes? The old behavior was that the replayer would fall out of the group, causing it to do, and then drop, a significant amount of work. The new behavior will still buffer a lot in memory, but it keeps the Kafka thread active so it can repeatedly call poll() and stay in the client group (see the sketch right after this list).
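The following is a hedged illustration of why this matters for Kafka group membership: as long as the thread that owns the consumer keeps calling poll() on a cadence under max.poll.interval.ms, the consumer stays in the group even while downstream processing is throttled. Topic name, deserializers, and the hand-off executor are assumptions for the example; the replayer's KafkaTrafficCaptureSource is wired differently.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replayer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        ExecutorService downstream = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("captured-traffic"));
            while (true) {
                // poll() keeps running regardless of how backed up downstream work is,
                // which keeps the consumer in the group.
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    downstream.submit(() -> process(record)); // hand off; never block this thread
                }
            }
        }
    }

    static void process(ConsumerRecord<byte[], byte[]> record) {
        // accumulate / throttle / replay elsewhere
    }
}
```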

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1515

Testing

A new test was added specifically to mimic the up-front burst pattern. Some other tests failed because of subtle design changes, which were then reverted. A number of other tests had been failing due to concurrency issues and general flakiness; those were fixed or improved, but given the novelty of concurrent test runs (introduced in an earlier PR), there are likely others that still need to be cleaned up.

Check List

  • New functionality includes testing
    • All tests pass, including unit tests, integration tests, and doctests
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…impler and support trying to recreate channels that are not active for subsequent request setups.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/dockerSolution/src/main/docker/docker-compose.yml
… into their own package in test fixtures.

That should make it easier to see how to write new functional tests with the TrafficReplayer.

Signed-off-by: Greg Schohn <[email protected]>
…tion is now in test fixtures rather than inner classes.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn changed the base branch from main to api_poc March 8, 2024 18:48
@gregschohn gregschohn changed the base branch from api_poc to main March 8, 2024 18:48
systemProperty 'junit.jupiter.execution.parallel.mode.default', "concurrent"
systemProperty 'junit.jupiter.execution.parallel.mode.classes.default', 'concurrent'
}
// useJUnitPlatform {
Member:

instead of just commenting, can you set junit.jupiter.execution.parallel.enabled = false for individual projects with a comment

Collaborator Author:

I've uncommented these. We'll see how they do for the next couple builds.

Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/captureKafkaOffloader/src/main/java/org/opensearch/migrations/trafficcapture/kafkaoffloader/KafkaCaptureFactory.java
…ng with a fix for it.

The fix is to add a nonce close that pertains to the replayer's starting point (in traffic stream index) and pass that all the way down to the session keys.
That lets us work on a target session basis to close out streams rather than on the source, which might get split any number of times.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn marked this pull request as ready for review March 10, 2024 05:08
@gregschohn gregschohn changed the base branch from main to build-complete-local-from-base March 10, 2024 05:09
@gregschohn gregschohn changed the base branch from build-complete-local-from-base to main March 10, 2024 05:09
…ed. Cap the number of retries to 16 and don't do it for things that aren't IOExceptions.

There's a lot of work that still needs to happen for retries; the code in place now isn't meant to finish that, but rather to set up the possibility that, when a connection has been broken and a new request comes in, it can acquire a new connection if things are working well.
Without this code, FullTrafficReplayerTests were not completing because, during TrafficReplayer teardowns, IllegalStateExceptions were being thrown since the netty eventloops weren't accepting work.
I've also re-enabled the slowTests, which had been disabled.

Signed-off-by: Greg Schohn <[email protected]>
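A hedged sketch of the retry guard described above, assuming the cap of 16 and the IOException-only condition; the real logic in the replayer's connection handling is more involved and structured differently.

```java
import java.io.IOException;

final class RetrySketch {
    private static final int MAX_RETRIES = 16;

    interface ConnectionAttempt<T> {
        T run() throws Exception;
    }

    // Only IOExceptions (e.g. a broken connection) are retried, and only up to the cap;
    // anything else propagates immediately because it is never caught here.
    static <T> T withRetries(ConnectionAttempt<T> attempt) throws Exception {
        for (int tries = 0; ; ++tries) {
            try {
                return attempt.run();
            } catch (IOException e) {
                if (tries >= MAX_RETRIES) {
                    throw e;
                }
            }
        }
    }
}
```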
void onConnectionClose(int channelInteractionNumber,
@NonNull IReplayContexts.IChannelKeyContext ctx,
void onConnectionClose(int channelInteractionNum,
IReplayContexts.@NonNull IChannelKeyContext ctx, int channelSessionNumber,
Member:

Did you intend to keep this ReplayContexts.@NonNull IChannelKeyContext ctx?

Collaborator Author:

That's a very strange way to annotate. I've changed it to @NonNull ReplayContexts.IChannelKeyContext.
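For readers unfamiliar with the distinction, the original placement is a type-use annotation on the nested type, while the replacement annotates the parameter declaration. The self-contained snippet below shows both forms; the NonNull annotation here is a stand-in declared with both PARAMETER and TYPE_USE targets, which recent versions of Lombok's @NonNull also declare.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

class AnnotationPlacementDemo {
    @Target({ElementType.PARAMETER, ElementType.TYPE_USE})
    @interface NonNull {}

    interface Contexts {
        interface ChannelKeyContext {}
    }

    // Type-use placement: the annotation attaches to the nested type itself.
    void before(Contexts.@NonNull ChannelKeyContext ctx) {}

    // Declaration placement on the parameter: the form the author switched to.
    void after(@NonNull Contexts.ChannelKeyContext ctx) {}
}
```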

… be more careful about polling for metrics from the InMemory verifications.

Using specific name prefixes for threads allows us to wait for the specific threads started by the current test to die, rather than waiting on threads that weren't started by that test.

Signed-off-by: Greg Schohn <[email protected]>
codecov bot commented Mar 14, 2024

Codecov Report

Attention: Patch coverage is 79.25697%, with 67 lines in your changes missing coverage. Please review.

Project coverage is 76.32%. Comparing base (01b62dc) to head (32b4db6).
Report is 11 commits behind head on main.

Files Patch % Lines
...search/migrations/replay/ClientConnectionPool.java 68.62% 12 Missing and 4 partials ⚠️
...replay/datahandlers/NettyPacketToHttpConsumer.java 71.92% 11 Missing and 5 partials ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 72.50% 4 Missing and 7 partials ⚠️
...ions/replay/datatypes/ConnectionReplaySession.java 77.41% 2 Missing and 5 partials ⚠️
...org/opensearch/migrations/replay/Accumulation.java 76.19% 1 Missing and 4 partials ⚠️
...lay/util/DiagnosticTrackableCompletableFuture.java 50.00% 0 Missing and 3 partials ⚠️
...arch/migrations/replay/tracing/ReplayContexts.java 33.33% 2 Missing ⚠️
...ficcapture/kafkaoffloader/KafkaCaptureFactory.java 0.00% 1 Missing ⚠️
...netty/ConditionallyReliableLoggingHttpHandler.java 0.00% 1 Missing ⚠️
...tions/trafficcapture/netty/LoggingHttpHandler.java 0.00% 1 Missing ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #526      +/-   ##
============================================
- Coverage     76.54%   76.32%   -0.22%     
- Complexity     1373     1387      +14     
============================================
  Files           154      154              
  Lines          5930     5986      +56     
  Branches        532      537       +5     
============================================
+ Hits           4539     4569      +30     
- Misses         1038     1054      +16     
- Partials        353      363      +10     
Flag Coverage Δ
unittests 76.32% <79.25%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

AndreKurait and others added 4 commits March 14, 2024 15:02
Signed-off-by: Andre Kurait <[email protected]>
…8x increase), by reducing the level of some logs for BlockingTrafficSource, and by using a dedicated thread to run the blocking async activities.

Signed-off-by: Greg Schohn <[email protected]>
…assembled.

As expected, that created more latency, which was enough to break all of the e2e tests.  They are admittedly brittle, but if a response takes 5 minutes, holding up the request for that long is unacceptable.  We'll need another way to tackle the selective auth-transformation problem (which is easier if you know the source status code of the response).
Anyway, I didn't bring this back the same way it was when it went out. Now the callback for the request returns a Consumer that's called once the full request comes in; that eliminates the need for another Map and a lot more bookkeeping (a sketch of the callback shape follows this commit message).
Now all of the e2e tests pass locally for me.  Same for all of the unit tests and slowTests.

Signed-off-by: Greg Schohn <[email protected]>
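A minimal sketch of the callback shape described in the commit message above: the per-request callback returns a Consumer that is invoked later with the fully accumulated data, so no extra Map is needed to find the matching continuation. Names and generics are illustrative, not the replayer's actual interfaces.

```java
import java.util.function.Consumer;

interface RequestCallbackSketch<REQUEST, ACCUMULATED> {
    /**
     * Invoked as soon as the request is seen; the returned Consumer is kept with the
     * in-flight accumulation and invoked once the rest of the exchange has been reassembled.
     */
    Consumer<ACCUMULATED> onRequestReceived(REQUEST request);
}

class CallbackUsageSketch {
    void demo(RequestCallbackSketch<String, String> callbacks) {
        Consumer<String> onFullyAccumulated = callbacks.onRequestReceived("GET /index");
        // ... later, once the remaining bytes have arrived:
        onFullyAccumulated.accept("fully reassembled exchange");
    }
}
```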
@gregschohn gregschohn force-pushed the FixTrafficLimiting branch 2 times, most recently from 50f4a6c to 9de5c74 Compare March 19, 2024 02:08
@@ -64,7 +64,7 @@ class CommonUtils {
// constructed in the configuration phase and the classpath won't be realized until the
// execution phase. Therefore, we need to have docker run the command to resolve the classpath
// and it's simplest to pack that up into a helper script.
runCommand("printf \"#!/bin/sh\\njava -cp `echo /jars/*.jar | tr \\ :` \\\"\\\$@\\\" \" > /runJavaWithClasspath.sh");
runCommand("printf \"#!/bin/sh\\njava -XX:MaxRAMPercentage=80.0 -cp `echo /jars/*.jar | tr \\ :` \\\"\\\$@\\\" \" > /runJavaWithClasspath.sh");
Member:

Thanks for this, should we document this somewhere?
Do we need to define RAM limits in the docker solution to limit each container?

@AndreKurait (Member) left a comment:

I read through all the code in this PR. I don't understand it all completely, but I didn't see anything concerning; each file seems to be an improvement over the previous version.

We have an open question regarding whether connection-close behavior is impacted by this CR. Overall, it seems this CR is an improvement to replayer scaling, but I'd like to make sure we don't hamper the existing lower-throughput use case with it. At a minimum, I'd like to see that as a fast follow.

I'm good if you force the TrafficReplayer slowTests to run in series to get them to pass for this PR.

…urce cleanup too.

* There was a surprising and minor bug in ReplayEngine::isWorkOutstanding - it was simply off by one.
* Consider this response to PR-Feedback from my own review...
* General refactoring - on RequestResponsePacketPair.  I wasn't happy that the lambdas were attached to the RRPair for the lifetime of that pair (through tuple outputting) so I introduced a holder class RequestResponsePacketPairWithCallback, which is held directly by the Accumulation and recycled for every new request.  When the object is passed downstream, only the RequestResponsePacketPair is persisted, allowing the callbacks to evaporate to the GC.
* A unit test was failing with a ConcurrentModificationException because close() wasn't being called on the same thread as polls.  That was remedied via the kafkaExecutor in KafkaTrafficCaptureSource.
* More threads/executors are shutdown, including within BlockingTrafficSource::close (executorForBlockingActivity)
* I've added a timestamp to progress lines so that it's easier follow the progress.
* For testing, the TimeShifter class now takes an additive value to shift the real-time value. That allows a new test, testLongRequestEndingAfterEOFStillCountsCorrectly, to schedule a close far out into the future so that we can confirm that an EOF/teardown doesn't destroy pending work (a sketch of the idea follows this commit message).
* Many unit tests now wrap closeable resources within try-with-finally blocks.  I was tired of looking at large numbers of threads.
* I had to refactor the TrafficReplayer a bit to allow others to override the wait-for-work mechanisms within teardown so that post-confirmation can complete for some tests.

Signed-off-by: Greg Schohn <[email protected]>
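A sketch of the additive time-shift idea from the bullet above, assuming a simple mapping from capture time to replay wall-clock time; the actual TimeShifter API may look different.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; not the replayer's TimeShifter.
class ShiftedClockSketch {
    private final Duration sourceToRealOffset; // fixed mapping from capture time to real time
    private final Duration additiveShift;      // extra shift injected by tests

    ShiftedClockSketch(Duration sourceToRealOffset, Duration additiveShift) {
        this.sourceToRealOffset = sourceToRealOffset;
        this.additiveShift = additiveShift;
    }

    /** Maps a captured timestamp to the wall-clock time at which it should be replayed. */
    Instant toRealTime(Instant sourceTime) {
        return sourceTime.plus(sourceToRealOffset).plus(additiveShift);
    }
}
```

A test can pass a very large additiveShift so that a scheduled connection close lands far in the future, which lets it verify that an EOF or teardown doesn't cancel that pending work.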
@gregschohn gregschohn merged commit 5e097c5 into opensearch-project:main Mar 21, 2024
6 of 7 checks passed
@gregschohn gregschohn deleted the FixTrafficLimiting branch April 3, 2024 12:23