Otel metrics and traces #460
…thod for others to leverage. Signed-off-by: Greg Schohn <[email protected]>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for. Signed-off-by: Greg Schohn <[email protected]>
…to the collector to prometheus, zipkin, etc Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…e config hierarchy. This was broken from the merge https://github.com/opensearch-project/opensearch-migrations/pull/376/files#diff-430f89dc33402ecf692b9a8372f66e585bb2f9215596433216580efc2a56795c. Signed-off-by: Greg Schohn <[email protected]>
…lotted within the same graph in prometheus. Signed-off-by: Greg Schohn <[email protected]>
…ics into an optional. Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK. Signed-off-by: Greg Schohn <[email protected]>
…troduce some more typesafe wrappers for contexts. Lots more to come. Signed-off-by: Greg Schohn <[email protected]>
…explicitly passing strongly typed context objects. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context. Signed-off-by: Greg Schohn <[email protected]>
…ayerRequestContext to the replayer Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Don't bother showing the Kafka offloader just buffering (was called recordStream). Now the offloader span is a child span of the connection span from the handler, so we can see the handler gathering the request/response (or waiting for the response). Signed-off-by: Greg Schohn <[email protected]>
That makes it a separate state for the logging handler superclass. Signed-off-by: Greg Schohn <[email protected]>
…rocessor. Prometheus metrics already have an export_name that is unique, the processors weren't doing anything useful, & the namespace was appending EVERYTHING from one of the two services. Signed-off-by: Greg Schohn <[email protected]>
…d (less so for now) metrics can be exported across more of the lifetime of a request/connection. Signed-off-by: Greg Schohn <[email protected]>
… a test bug. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
# TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandler.java
# TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/LoggingHttpRequestHandler.java
# TrafficCapture/nettyWireLogging/src/test/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandlerTest.java
# TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ProxyChannelInitializer.java
# TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/Accumulation.java
# TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/CapturedTrafficToHttpTransactionAccumulator.java
# TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestResponsePacketPair.java
# TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestSenderOrchestrator.java
# TrafficCapture/trafficReplayer/src/test/java/org/opensearch/migrations/replay/SimpleCapturedTrafficToHttpTransactionAccumulatorTest.java
Signed-off-by: Greg Schohn <[email protected]>
Added some comments, but had a general question:
Wasn't sure what the status of MetricsLogger is? Are we in a transition state with it currently, but plan to remove it in future?
- migrations
depends_on:
- opensearchanalytics
# otel-collector:
Can we remove this commented out block?
@@ -70,6 +88,7 @@ services:
- migrations
volumes:
- sharedReplayerOutput:/shared-replayer-output
- /Users/schohn/dev/opensearch-migrations/TrafficCapture/containerLogs:/logs
Reference to a local path needs removal
@@ -132,6 +153,7 @@ services:
- sharedReplayerOutput:/shared-replayer-output
environment:
- MIGRATION_KAFKA_BROKER_ENDPOINTS=kafka:9092
# command: ./runTestBenchmarks.sh
nit: remove
capture-proxy-es:
image: 'migrations/capture_proxy:latest'

prometheus:
Having Prometheus, Jaeger, and Zipkin containers as a default seems like a heavy tax. It seems like we should have the option to include these or not. We also need a general cleanup of unneeded comments in this file.
api group: 'io.netty', name: 'netty-codec-http'
api group: 'io.netty', name: 'netty-handler'

implementation group: 'io.opentelemetry', name:'opentelemetry-api'
Why do we need this or line 26 when coreUtilities already has these dependencies?
ch.pipeline().addLast(new ConditionallyReliableLoggingHttpRequestHandler<T>(offloader,
requestCapturePredicate, this::shouldGuaranteeMessageOffloading));
var connectionId = ch.id().asLongText();
ch.pipeline().addLast(new ConditionallyReliableLoggingHttpRequestHandler<T>("n", "c",
This seems like it was hardcoded for testing and not changed back
@@ -1,4 +1,4 @@
status = info
status = debug
Shouldn't default be info?
logger.MetricsLogger.name = MetricsLogger
logger.MetricsLogger.level = info
logger.MetricsLogger.additivity = false
logger.MetricsLogger.appenderRef.METRICS.ref = METRICS
Do we do any SLF4J logging for the MetricsLogger?
@@ -0,0 +1,9 @@
# Set the global logging level for all loggers
What does this file control that log4j2.properties file cannot control?
@@ -18,3 +18,8 @@ logger.OutputTupleJsonLogger.level = OFF
logger.KPC.name = org.opensearch.migrations.replay.kafka.KafkaProtobufConsumer
logger.KPC.level = DEBUG
logger.KPC.appenderRef.stdout.ref = Console

logger.RSO.name = org.opensearch.migrations.replay.RequestSenderOrchestrator
logger.RSO.level = TRACE
Should this be at trace as a default?
Description
This change will break the flow of messages emitted by the existing MetricsLogger class into OpenSearch. The biggest reason is that I'm now using the otel/opentelemetry-collector:latest image rather than a bespoke one.
Aside from the breakage, there's a lot being added in the form of Metrics and Traces. We're using OpenTelemetry to send both, currently to a collector (just as we were doing with logs). The collector sends data to two new trace collection containers (Jaeger & Zipkin) and has a Prometheus container pulling metrics from it.
The Java application code itself eschews some of the typical OpenTelemetry techniques for instrumentation. Instead of using ThreadLocals to pass maybe-present values around within contexts, leaving each instrumentation point to determine how to use them, custom context classes for Connections, Requests, KafkaRecords, etc. are constructed and explicitly passed into functions and callbacks. Those classes implement IWithAttributes and its fillAttributes() function to select which fields should be included in the instruments that are being emitted.
The contexts themselves are tightly related to Spans. Usually a new context will have a new span, and a new span will always require a new context. The context classes can chain back to a parent scope. When a context is converted into an Attributes object for the instruments, the attributes from parent contexts are also included (with key-values from subclasses overwriting their parent's values in cases of conflict).
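As a rough illustration of the chaining described above, here is a minimal sketch in plain Java. It is not the code from this PR: the interface name WithAttributes, the getParent() and getPopulatedAttributes() methods, the two context classes, and the attribute keys are all hypothetical, and a Map stands in for OpenTelemetry's Attributes so that the example has no dependencies. It also assumes that "overwriting" means the more deeply nested context wins on a key conflict.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the PR's IWithAttributes; the real signatures may differ.
interface WithAttributes {
    WithAttributes getParent();                       // null for a root context
    void fillAttributes(Map<String, Object> target);  // adds only this context's own fields

    // Parent attributes are written first, so a nested context overwrites them on conflict.
    default Map<String, Object> getPopulatedAttributes() {
        WithAttributes parent = getParent();
        Map<String, Object> merged = (parent == null)
                ? new LinkedHashMap<String, Object>()
                : parent.getPopulatedAttributes();
        fillAttributes(merged);
        return merged;
    }
}

// Hypothetical root context for a connection.
final class ConnectionContext implements WithAttributes {
    private final String connectionId;

    ConnectionContext(String connectionId) {
        this.connectionId = connectionId;
    }

    public WithAttributes getParent() {
        return null;
    }

    public void fillAttributes(Map<String, Object> target) {
        target.put("connectionId", connectionId);
        target.put("scope", "connection");
    }
}

// Hypothetical nested context; it is passed explicitly rather than looked up in a ThreadLocal.
final class RequestContext implements WithAttributes {
    private final ConnectionContext parent;
    private final int requestIndex;

    RequestContext(ConnectionContext parent, int requestIndex) {
        this.parent = parent;
        this.requestIndex = requestIndex;
    }

    public WithAttributes getParent() {
        return parent;
    }

    public void fillAttributes(Map<String, Object> target) {
        target.put("requestIndex", requestIndex);
        target.put("scope", "request"); // conflicts with the parent's "scope" and replaces it
    }
}

class ContextAttributesDemo {
    public static void main(String[] args) {
        RequestContext request = new RequestContext(new ConnectionContext("c1"), 3);
        System.out.println(request.getPopulatedAttributes());
    }
}
```

Because each context only knows about its own fields plus a parent reference, an instrumentation point can ask whatever context it was handed for a full attribute set without knowing how deep the chain is.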
There are some judicious uses of generic wildcard constraints to make it quicker and more foolproof to create spans so that they're appropriately associated as children with the containing context's span. There's also support to make it easier to store start timestamps to emit duration metrics.
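One way the two ideas in the paragraph above might fit together is sketched below in plain Java. Everything here is an assumption for illustration: SpanContext, ConnectionSpanContext, RequestSpanContext, and Spans.startChild() are hypothetical names, and the classes only record a parent and a start time instead of wrapping a real OpenTelemetry span.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BiFunction;

// Hypothetical context type; a real one would also own the span it represents.
class SpanContext {
    final String spanName;
    final SpanContext parent;  // null for a root span
    final Instant startTime;   // stored at creation so a duration metric can be emitted later

    SpanContext(String spanName, SpanContext parent, Instant startTime) {
        this.spanName = spanName;
        this.parent = parent;
        this.startTime = startTime;
    }

    Duration durationUntil(Instant endTime) {
        return Duration.between(startTime, endTime);
    }
}

class ConnectionSpanContext extends SpanContext {
    ConnectionSpanContext(Instant startTime) {
        super("connection", null, startTime);
    }
}

class RequestSpanContext extends SpanContext {
    // The constructor demands a connection context, so a request span
    // cannot be created without its enclosing span.
    RequestSpanContext(ConnectionSpanContext parent, Instant startTime) {
        super("request", parent, startTime);
    }
}

final class Spans {
    private Spans() {}

    // The wildcards accept any factory that takes the parent type (or a supertype)
    // and returns the child type (or a subtype), so constructor references fit directly.
    static <P extends SpanContext, C extends SpanContext> C startChild(
            P parent, Instant now, BiFunction<? super P, Instant, ? extends C> factory) {
        return factory.apply(parent, now);
    }
}

class SpanContextDemo {
    public static void main(String[] args) {
        Instant start = Instant.now();
        ConnectionSpanContext connection = new ConnectionSpanContext(start);
        RequestSpanContext request =
                Spans.startChild(connection, start.plusMillis(5), RequestSpanContext::new);
        System.out.println(request.spanName + " is a child of " + request.parent.spanName);
        System.out.println("request duration: " + request.durationUntil(start.plusMillis(25)));
    }
}
```

In this sketch the compiler, not a runtime lookup, enforces that a child context is built from the right kind of parent, and the stored start time means the caller only has to supply an end time when recording a duration.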
There's a LOT left to do here, but there's a lot that's done and I'd like to get feedback on the patterns that are emerging. Some of the top remaining items:
More metrics, more traces
Getting separate namespaces for capture and replayer working for metrics
Figuring out our cloud story. Should we deploy some more ECS containers for the AWS CDK or should we use AWS cloud native stuff, AWS hosted stuff?
Tests, Tests, Tests - Otel has some easy to use facilities to simplify checking within tests.
Hardening the interfaces more. There are lots of inline strings. These should be managed from more centralized places and we should have tests to do double-entry book-keeping so that they don't change without warnings.
Moving the existing MetricsLogger code out.
Category - Enhancement
Why these changes are required? Visibility into what our services are doing.
Issues Resolved
Part of Improve Metrics Explanations
Testing
Lots of manual testing for now
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.