-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replayer instrumentation #475
Replayer instrumentation #475
Conversation
…thod for others to leverage. Signed-off-by: Greg Schohn <[email protected]>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for. Signed-off-by: Greg Schohn <[email protected]>
…to the collector to prometheus, zipkin, etc Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…e config hierarchy. This was broken from the merge https://github.com/opensearch-project/opensearch-migrations/pull/376/files#diff-430f89dc33402ecf692b9a8372f66e585bb2f9215596433216580efc2a56795c. Signed-off-by: Greg Schohn <[email protected]>
…lotted within the same graph in prometheus. Signed-off-by: Greg Schohn <[email protected]>
…ics into an optional. Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK. Signed-off-by: Greg Schohn <[email protected]>
…troduce some more typesafe wrappers for contexts. Lots more to come. Signed-off-by: Greg Schohn <[email protected]>
…explicitly passing strongly typed context objects. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context. Signed-off-by: Greg Schohn <[email protected]>
…ayerRequestContext to the replayer Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Don't bother showing the Kakfa offloader just buffering (was called recordStream). Now the offloader span is a child span of the connection span from the handler, so we can see the handler gathering the request/response (or waiting for the response). Signed-off-by: Greg Schohn <[email protected]>
That makes it a separate state for the logging handler superclass. Signed-off-by: Greg Schohn <[email protected]>
…rocessor. Prometheus metrics already have an export_name that is unique, the processors weren't doing anything useful, & the namespace was appending EVERYTHING from one of the two services. Signed-off-by: Greg Schohn <[email protected]>
…d (less so for now) metrics can be exported across more of the lifetime of a request/connection. Signed-off-by: Greg Schohn <[email protected]>
… a test bug. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandler.java # TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/LoggingHttpRequestHandler.java # TrafficCapture/nettyWireLogging/src/test/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandlerTest.java # TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ProxyChannelInitializer.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/Accumulation.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/CapturedTrafficToHttpTransactionAccumulator.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestResponsePacketPair.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestSenderOrchestrator.java # TrafficCapture/trafficReplayer/src/test/java/org/opensearch/migrations/replay/SimpleCapturedTrafficToHttpTransactionAccumulatorTest.java
Signed-off-by: Greg Schohn <[email protected]>
…when the request was ignored and remove the now-unused file for responses. I'd like to revisit this eventually to make sure that it's as efficient as possible and to organize it better. However, it does get the job done for now and tests were updated to confirm. Signed-off-by: Greg Schohn <[email protected]>
…uests and responses. Signed-off-by: Greg Schohn <[email protected]>
…ifferent levels of them. This has the minimal amount of work to get those relationships to simply compile. Nearly every unit test fails and the code is more clunky than it needs to be, but getting to this point was alone a major lift. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…in, but the replayer isn't crashing. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…scovered by trace inspection. 1) close was called AFTER the RRPair was rotated, so there were no traffic streams being committed, resulting in a perpetual hole in the commit log. 2) close was being scheduled immediately, before all requests (in most cases) because the channelInteractionNumber was being miscalculated. Signed-off-by: Greg Schohn <[email protected]>
…using the same ConnectionReplaySession to be returned for every connection, which resulted in serious corruption of ordering and tests never completing. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # TrafficCapture/captureOffloader/src/test/java/org/opensearch/migrations/trafficcapture/StreamChannelConnectionCaptureSerializerTest.java # TrafficCapture/dockerSolution/src/main/docker/migrationConsole/runTestBenchmarks.sh # deployment/cdk/opensearch-service-migration/lib/service-stacks/migration-analytics-stack.ts
Signed-off-by: Greg Schohn <[email protected]>
82b71cc
to
9a31210
Compare
return getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0); | ||
} | ||
|
||
private static List<Double> getExponentialBucketsBetween(double firstBucketSize, double lastBucketCeiling, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend this to work with negative numbers? Say bucket boundaries of [-2, 2]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
...icCapture/coreUtilities/src/main/java/org/opensearch/migrations/tracing/RootOtelContext.java
Show resolved
Hide resolved
...icCapture/coreUtilities/src/main/java/org/opensearch/migrations/tracing/RootOtelContext.java
Show resolved
Hide resolved
double firstBucketSize, double lastBucketCeiling) { | ||
var buckets = getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0); | ||
log.atInfo().setMessage(() -> "Setting buckets for " + activityName + " to " + | ||
buckets.stream().map(x -> "" + x).collect(Collectors.joining(",", "[", "]"))).log(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this actually printing? Are we accurately showing what's a < end vs <=?
…t constructor override, throw an IllegalArgumentException. Signed-off-by: Greg Schohn <[email protected]>
…losed. Signed-off-by: Greg Schohn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address CI/CD build failure
1a830f3
to
66dcad2
Compare
Bugfix is in the capture proxy's channel context's `sendMeterEventsForEnd()` override to call super so that we'll pickup duration, etc metrics too. The lint fix is in a new python script to facilitate testing from the migration console. The test fix is to give each TrafficReplayer run a fresh TestContext. That context includes channel contexts, which should not be reused across process boundaries and likewise shouldn't be getting reused if we're trying to simulate that for repeated runs. Signed-off-by: Greg Schohn <[email protected]>
…en though the channel context was. Also make minor fixes to other tests to make them more resilient. Signed-off-by: Greg Schohn <[email protected]>
66dcad2
to
16a9010
Compare
…mplicitly manage them within a ChannelContext. I've also made a couple more test and production data structures more threadsafe. Signed-off-by: Greg Schohn <[email protected]>
…ks in test code. The SocketContext is now closed in the callback for the channel close rather than before we call close(). That should make a test failure due to duplicate close() calls much less likely and should also give us a better idea of when the socket was actually closed. There were some OutOfMemoryErrors coming back from the github action after 10 minutes of testing. I believe that this was due to having a number of InMemoryMetricExporters that periodic metric exporters were pumping to (in perpetuity, even after the test was complete). We were also potentially tracking lots of backtraces in the ContextTrackers. Both of these now have close() calls that clears all of that logged data. That's now called by TestContext.close(), which is wired for each InstrumentationTest. The next commit will tie off a lot more loose ends, but this commit was tested more extensively, hence the reason that I'm keeping them separate. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…nstants, even if an Instant is what's passed in to the constructor Signed-off-by: Greg Schohn <[email protected]>
…tributes end up as annotations instead of metadata so that they can be searched. To search in x-ray, use `annotationId.ATTRIBUTE_NAME="..."`. Signed-off-by: Greg Schohn <[email protected]>
Description
See the README included with this PR.
TLDR - lots of proxy and replayer code now passes contexts throughout the call graph. Those Instumented Contexts are invoked to signal different pieces of progress, which are then turned into metrics and traces that are sent to an otel-collector. The otel collector is similar to what it was before this PR, but the configurations and its deployment, both in docker and for AWS, have been slightly adjusted.
For AWS Cloud deployments, we now configure CloudWatch and X-Ray to received metrics and traces (via OTEL's Collector). For the docker solution, we use Prometheus and Jaeger ("all-in-one") to store metric and trace data. A Grafana container is also started alongside the other containers to allow for discovery and (eventually) dashboards.
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-1438
Is this a backport? If so, please add backport PR # and/or commits #
Testing
Unit testing + E2E testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.