Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replayer instrumentation #475

Merged
merged 100 commits into from
Feb 9, 2024

Conversation

gregschohn
Copy link
Collaborator

@gregschohn gregschohn commented Dec 15, 2023

Description

See the README included with this PR.

TLDR - lots of proxy and replayer code now passes contexts throughout the call graph. Those Instumented Contexts are invoked to signal different pieces of progress, which are then turned into metrics and traces that are sent to an otel-collector. The otel collector is similar to what it was before this PR, but the configurations and its deployment, both in docker and for AWS, have been slightly adjusted.

For AWS Cloud deployments, we now configure CloudWatch and X-Ray to received metrics and traces (via OTEL's Collector). For the docker solution, we use Prometheus and Jaeger ("all-in-one") to store metric and trace data. A Grafana container is also started alongside the other containers to allow for discovery and (eventually) dashboards.

  • Category Enhancement
  • Why these changes are required? Observability
  • What is the old behavior before changes and new behavior after changes?

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1438

Is this a backport? If so, please add backport PR # and/or commits #

Testing

Unit testing + E2E testing

Check List

  • [partial] New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…thod for others to leverage.

Signed-off-by: Greg Schohn <[email protected]>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for.

Signed-off-by: Greg Schohn <[email protected]>
…to the collector to prometheus, zipkin, etc

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…lotted within the same graph in prometheus.

Signed-off-by: Greg Schohn <[email protected]>
…ics into an optional.

Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK.

Signed-off-by: Greg Schohn <[email protected]>
…troduce some more typesafe wrappers for contexts.

Lots more to come.

Signed-off-by: Greg Schohn <[email protected]>
…explicitly passing strongly typed context objects.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context.

Signed-off-by: Greg Schohn <[email protected]>
…ayerRequestContext to the replayer

Signed-off-by: Greg Schohn <[email protected]>
Don't bother showing the Kakfa offloader just buffering (was called recordStream).  Now the offloader span is a child span of the connection span from the handler, so we can see the handler gathering the request/response (or waiting for the response).

Signed-off-by: Greg Schohn <[email protected]>
That makes it a separate state for the logging handler superclass.

Signed-off-by: Greg Schohn <[email protected]>
…rocessor.

Prometheus metrics already have an export_name that is unique, the processors weren't doing anything useful, & the namespace was appending EVERYTHING from one of the two services.

Signed-off-by: Greg Schohn <[email protected]>
…d (less so for now) metrics can be exported across more of the lifetime of a request/connection.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandler.java
#	TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/LoggingHttpRequestHandler.java
#	TrafficCapture/nettyWireLogging/src/test/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandlerTest.java
#	TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ProxyChannelInitializer.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/Accumulation.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/CapturedTrafficToHttpTransactionAccumulator.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestResponsePacketPair.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestSenderOrchestrator.java
#	TrafficCapture/trafficReplayer/src/test/java/org/opensearch/migrations/replay/SimpleCapturedTrafficToHttpTransactionAccumulatorTest.java
…when the request was ignored and remove the now-unused file for responses.

I'd like to revisit this eventually to make sure that it's as efficient as possible and to organize it better.  However, it does get the job done for now and tests were updated to confirm.

Signed-off-by: Greg Schohn <[email protected]>
…ifferent levels of them.

This has the minimal amount of work to get those relationships to simply compile.  Nearly every unit test fails and the code is more clunky than it needs to be, but getting to this point was alone a major lift.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…in, but the replayer isn't crashing.

Signed-off-by: Greg Schohn <[email protected]>
…scovered by trace inspection.

1) close was called AFTER the RRPair was rotated, so there were no traffic streams being committed, resulting in a perpetual hole in the commit log.
2) close was being scheduled immediately, before all requests (in most cases) because the channelInteractionNumber was being miscalculated.

Signed-off-by: Greg Schohn <[email protected]>
…using the same ConnectionReplaySession to be returned for every connection, which resulted in serious corruption of ordering and tests never completing.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/captureOffloader/src/test/java/org/opensearch/migrations/trafficcapture/StreamChannelConnectionCaptureSerializerTest.java
#	TrafficCapture/dockerSolution/src/main/docker/migrationConsole/runTestBenchmarks.sh
#	deployment/cdk/opensearch-service-migration/lib/service-stacks/migration-analytics-stack.ts
@gregschohn gregschohn force-pushed the ReplayerInstrumentation branch from 82b71cc to 9a31210 Compare February 6, 2024 04:35
@gregschohn gregschohn marked this pull request as ready for review February 6, 2024 04:44
return getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0);
}

private static List<Double> getExponentialBucketsBetween(double firstBucketSize, double lastBucketCeiling,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend this to work with negative numbers? Say bucket boundaries of [-2, 2]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

double firstBucketSize, double lastBucketCeiling) {
var buckets = getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0);
log.atInfo().setMessage(() -> "Setting buckets for " + activityName + " to " +
buckets.stream().map(x -> "" + x).collect(Collectors.joining(",", "[", "]"))).log();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this actually printing? Are we accurately showing what's a < end vs <=?

…t constructor override, throw an IllegalArgumentException.

Signed-off-by: Greg Schohn <[email protected]>
Copy link
Member

@AndreKurait AndreKurait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address CI/CD build failure

@gregschohn gregschohn force-pushed the ReplayerInstrumentation branch from 1a830f3 to 66dcad2 Compare February 7, 2024 17:42
Bugfix is in the capture proxy's channel context's `sendMeterEventsForEnd()` override to call super so that we'll pickup duration, etc metrics too.
The lint fix is in a new python script to facilitate testing from the migration console.
The test fix is to give each TrafficReplayer run a fresh TestContext.   That context includes channel contexts, which should not be reused across process boundaries and likewise shouldn't be getting reused if we're trying to simulate that for repeated runs.

Signed-off-by: Greg Schohn <[email protected]>
…en though the channel context was.

Also make minor fixes to other tests to make them more resilient.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the ReplayerInstrumentation branch from 66dcad2 to 16a9010 Compare February 7, 2024 23:31
…mplicitly manage them within a ChannelContext.

I've also made a couple more test and production data structures more threadsafe.

Signed-off-by: Greg Schohn <[email protected]>
…ks in test code.

The SocketContext is now closed in the callback for the channel close rather than before we call close().  That should make a test failure due to duplicate close() calls much less likely and should also give us a better idea of when the socket was actually closed.
There were some OutOfMemoryErrors coming back from the github action after 10 minutes of testing.  I believe that this was due to having a number of InMemoryMetricExporters that periodic metric exporters were pumping to (in perpetuity, even after the test was complete).  We were also potentially tracking lots of backtraces in the ContextTrackers.  Both of these now have close() calls that clears all of that logged data.  That's now called by TestContext.close(), which is wired for each InstrumentationTest.
The next commit will tie off a lot more loose ends, but this commit was tested more extensively, hence the reason that I'm keeping them separate.

Signed-off-by: Greg Schohn <[email protected]>
…nstants, even if an Instant is what's passed in to the constructor

Signed-off-by: Greg Schohn <[email protected]>
…tributes end up as annotations instead of metadata so that they can be searched.

To search in x-ray, use `annotationId.ATTRIBUTE_NAME="..."`.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn merged commit ffbb46e into opensearch-project:main Feb 9, 2024
8 checks passed
@gregschohn gregschohn deleted the ReplayerInstrumentation branch April 3, 2024 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants