Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Otel metrics and traces #460

Merged
merged 21 commits into from
Feb 9, 2024

Conversation

gregschohn
Copy link
Collaborator

Description

This change will break the flow of messages emitted by the existing MetricsLogger class into OpenSearch. The biggest reason for that is that the change is that I'm now using the otel/opentelemetry-collector:latest image rather than a bespoke one.

Aside from the breakage, there's a lot being added in the form of Metrics and Traces. We're using OpenTelemetry to send both, currently to a collector (just as we were doing with logs). The collector sends data to two new trace collection containers (Jaeger & Zipkin) and has a Prometheus container pulling metrics from it.

The Java application code itself eschews some of the typical OpenTelemetry techniques for instrumentation. Instead of using ThreadLocals to pass maybe-present values around within contexts, which each instrumentation point needs to determine how to use them, custom context classes for Connections, Requests, KafkaRecords, etc are constructed and explicitly passed into functions and into callbacks. Those classes implement IWithAttributes and the fillAttributes() function to select which fields should be included within the instruments that are being emitted.

The contexts themselves are tightly related to Spans. Usually a new context will have a new span, a new span will always require a new context. The context classes themselves have the ability to chain back to a parent scope. When the context is converted into an Attributes object for the instruments, the attributes from parent contexts will also be included (with key-values from subclasses overwriting their parent's values in cases of conflict).

There are some judicious uses of generic wildcard constraints to make it quicker and more foolproof to create spans so that they're appropriately associated as children with the containing context's span. There's also support to make it easier to store start timestamps to emit duration metrics.

There's a LOT left to do here, but there's a lot that's done and I'd like to get feedback on the patterns that are emerging. Some of the top remaining items.

  • More metrics, more traces

  • Getting separate namespaces for capture and replayer working for metrics

  • Figuring out our cloud story. Should we deploy some more ECS containers for the AWS CDK or should we use AWS cloud native stuff, AWS hosted stuff?

  • Tests, Tests, Tests - Otel has some easy to use facilities to simplify checking within tests.

  • Hardening the interfaces more. There are lots of inline strings. These should be managed from more centralized places and we should have tests to do double-entry book-keeping so that they don't change without warnings.

  • Moving the existing MetricsLogger code out.

  • Category - Enhancement

  • Why these changes are required? Visibility into what our services are doing.

Issues Resolved

Part of Improve Metrics Explanations

Testing

Lots of manual testing for now

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…thod for others to leverage.

Signed-off-by: Greg Schohn <[email protected]>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for.

Signed-off-by: Greg Schohn <[email protected]>
…to the collector to prometheus, zipkin, etc

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…lotted within the same graph in prometheus.

Signed-off-by: Greg Schohn <[email protected]>
…ics into an optional.

Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK.

Signed-off-by: Greg Schohn <[email protected]>
…troduce some more typesafe wrappers for contexts.

Lots more to come.

Signed-off-by: Greg Schohn <[email protected]>
…explicitly passing strongly typed context objects.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context.

Signed-off-by: Greg Schohn <[email protected]>
Copy link

codecov bot commented Nov 30, 2023

Codecov Report

Attention: 109 lines in your changes are missing coverage. Please review.

Comparison is base (4ee54ef) 73.57% compared to head (9cf2540) 74.26%.

Files Patch % Lines
...arch/migrations/tracing/SimpleMeteringClosure.java 40.47% 47 Missing and 3 partials ⚠️
...atahandlers/http/HttpJsonTransformingConsumer.java 72.41% 7 Missing and 1 partial ⚠️
...ions/trafficcapture/tracing/ConnectionContext.java 56.25% 7 Missing ⚠️
...ficcapture/kafkaoffloader/KafkaCaptureFactory.java 84.21% 2 Missing and 4 partials ⚠️
...onditionallyReliableLoggingHttpRequestHandler.java 57.14% 1 Missing and 5 partials ⚠️
...afficcapture/netty/RequestContextStateMachine.java 0.00% 6 Missing ⚠️
...rafficcapture/netty/LoggingHttpRequestHandler.java 89.36% 2 Missing and 3 partials ⚠️
...afficcapture/netty/LoggingHttpResponseHandler.java 0.00% 3 Missing ⚠️
...tions/trafficcapture/proxyserver/CaptureProxy.java 0.00% 3 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 66.66% 1 Missing and 2 partials ⚠️
... and 9 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #460      +/-   ##
============================================
+ Coverage     73.57%   74.26%   +0.68%     
- Complexity     1181     1272      +91     
============================================
  Files           124      136      +12     
  Lines          4886     5152     +266     
  Branches        439      453      +14     
============================================
+ Hits           3595     3826     +231     
- Misses          996     1020      +24     
- Partials        295      306      +11     
Flag Coverage Δ
unittests 74.26% <76.30%> (+0.68%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…ayerRequestContext to the replayer

Signed-off-by: Greg Schohn <[email protected]>
Don't bother showing the Kakfa offloader just buffering (was called recordStream).  Now the offloader span is a child span of the connection span from the handler, so we can see the handler gathering the request/response (or waiting for the response).

Signed-off-by: Greg Schohn <[email protected]>
That makes it a separate state for the logging handler superclass.

Signed-off-by: Greg Schohn <[email protected]>
…rocessor.

Prometheus metrics already have an export_name that is unique, the processors weren't doing anything useful, & the namespace was appending EVERYTHING from one of the two services.

Signed-off-by: Greg Schohn <[email protected]>
…d (less so for now) metrics can be exported across more of the lifetime of a request/connection.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandler.java
#	TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/LoggingHttpRequestHandler.java
#	TrafficCapture/nettyWireLogging/src/test/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandlerTest.java
#	TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ProxyChannelInitializer.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/Accumulation.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/CapturedTrafficToHttpTransactionAccumulator.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestResponsePacketPair.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestSenderOrchestrator.java
#	TrafficCapture/trafficReplayer/src/test/java/org/opensearch/migrations/replay/SimpleCapturedTrafficToHttpTransactionAccumulatorTest.java
@gregschohn gregschohn changed the base branch from main to metrics-in-cdk December 12, 2023 00:06
@gregschohn gregschohn changed the base branch from metrics-in-cdk to main December 12, 2023 00:06
@gregschohn gregschohn mentioned this pull request Dec 12, 2023
4 tasks
Copy link
Collaborator

@lewijacn lewijacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, but had a general question:

Wasn't sure what the status of MetricsLogger is? Are we in a transition state with it currently, but plan to remove it in future?

- migrations
depends_on:
- opensearchanalytics
# otel-collector:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this commented out block?

@@ -70,6 +88,7 @@ services:
- migrations
volumes:
- sharedReplayerOutput:/shared-replayer-output
- /Users/schohn/dev/opensearch-migrations/TrafficCapture/containerLogs:/logs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference to local needs removal

@@ -132,6 +153,7 @@ services:
- sharedReplayerOutput:/shared-replayer-output
environment:
- MIGRATION_KAFKA_BROKER_ENDPOINTS=kafka:9092
# command: ./runTestBenchmarks.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove

capture-proxy-es:
image: 'migrations/capture_proxy:latest'

prometheus:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having prometheus,jaeger,and zipkin containers as a default seems like a heavy tax. Seems like we should have the option to have these or not. Seems like we also need a general cleanup of this file of unneeded comments

api group: 'io.netty', name: 'netty-codec-http'
api group: 'io.netty', name: 'netty-handler'

implementation group: 'io.opentelemetry', name:'opentelemetry-api'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this or line 26 when coreUtilities already has these dependencies?

ch.pipeline().addLast(new ConditionallyReliableLoggingHttpRequestHandler<T>(offloader,
requestCapturePredicate, this::shouldGuaranteeMessageOffloading));
var connectionId = ch.id().asLongText();
ch.pipeline().addLast(new ConditionallyReliableLoggingHttpRequestHandler<T>("n", "c",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it was hardcoded for testing and not changed back

@@ -1,4 +1,4 @@
status = info
status = debug
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't default be info?

Comment on lines -22 to -25
logger.MetricsLogger.name = MetricsLogger
logger.MetricsLogger.level = info
logger.MetricsLogger.additivity = false
logger.MetricsLogger.appenderRef.METRICS.ref = METRICS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we do any SLF4J logging for the MetricsLogger?

@@ -0,0 +1,9 @@
# Set the global logging level for all loggers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this file control that log4j2.properties file cannot control?

@@ -18,3 +18,8 @@ logger.OutputTupleJsonLogger.level = OFF
logger.KPC.name = org.opensearch.migrations.replay.kafka.KafkaProtobufConsumer
logger.KPC.level = DEBUG
logger.KPC.appenderRef.stdout.ref = Console

logger.RSO.name = org.opensearch.migrations.replay.RequestSenderOrchestrator
logger.RSO.level = TRACE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be at trace as a default?

@gregschohn gregschohn merged commit 9cf2540 into opensearch-project:main Feb 9, 2024
6 of 7 checks passed
@gregschohn gregschohn deleted the OtelMetricsAndTraces branch April 3, 2024 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants