Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not merge more metrics #459

Closed

Conversation

gregschohn
Copy link
Collaborator

@gregschohn gregschohn commented Nov 28, 2023

Description

This change will break the flow of messages emitted by the existing MetricsLogger class into OpenSearch. The biggest reason for that is that the change is that I'm now using the otel/opentelemetry-collector:latest image rather than a bespoke one.

Aside from the breakage, there's a lot being added in the form of Metrics and Traces. We're using OpenTelemetry to send both, currently to a collector (just as we were doing with logs). The collector sends data to two new trace collection containers (Jaeger & Zipkin) and has a Prometheus container pulling metrics from it.

The Java application code itself eschews some of the typical OpenTelemetry techniques for instrumentation. Instead of using ThreadLocals to pass maybe-present values around within contexts, which each instrumentation point needs to determine how to use them, custom context classes for Connections, Requests, KafkaRecords, etc are constructed and explicitly passed into functions and into callbacks. Those classes implement IWithAttributes and the fillAttributes() function to select which fields should be included within the instruments that are being emitted.

The contexts themselves are tightly related to Spans. Usually a new context will have a new span, a new span will always require a new context. The context classes themselves have the ability to chain back to a parent scope. When the context is converted into an Attributes object for the instruments, the attributes from parent contexts will also be included (with key-values from subclasses overwriting their parent's values in cases of conflict).

There are some judicious uses of generic wildcard constraints to make it quicker and more foolproof to create spans so that they're appropriately associated as children with the containing context's span. There's also support to make it easier to store start timestamps to emit duration metrics.

There's a LOT left to do here, but there's a lot that's done and I'd like to get feedback on the patterns that are emerging. Some of the top remaining items.

  • More metrics, more traces

  • Getting separate namespaces for capture and replayer working for metrics

  • Figuring out our cloud story. Should we deploy some more ECS containers for the AWS CDK or should we use AWS cloud native stuff, AWS hosted stuff?

  • Tests, Tests, Tests - Otel has some easy to use facilities to simplify checking within tests.

  • Hardening the interfaces more. There are lots of inline strings. These should be managed from more centralized places and we should have tests to do double-entry book-keeping so that they don't change without warnings.

  • Moving the existing MetricsLogger code out.

  • Category - Enhancement

  • Why these changes are required? Visibility into what our services are doing.

Issues Resolved

Part of Improve Metrics Explanations

Testing

Lots of manual testing for now

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…thod for others to leverage.

Signed-off-by: Greg Schohn <[email protected]>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for.

Signed-off-by: Greg Schohn <[email protected]>
Copy link

codecov bot commented Nov 28, 2023

Codecov Report

Attention: 97 lines in your changes are missing coverage. Please review.

Comparison is base (f70e914) 73.16% compared to head (4b43262) 73.50%.

Files Patch % Lines
...ch/migrations/coreutils/SimpleMeteringClosure.java 40.47% 47 Missing and 3 partials ⚠️
...afficcapture/netty/LoggingHttpResponseHandler.java 36.36% 14 Missing ⚠️
...atahandlers/http/HttpJsonTransformingConsumer.java 74.19% 7 Missing and 1 partial ⚠️
...ficcapture/kafkaoffloader/KafkaCaptureFactory.java 86.36% 2 Missing and 4 partials ⚠️
...onditionallyReliableLoggingHttpRequestHandler.java 76.92% 1 Missing and 2 partials ⚠️
...tions/trafficcapture/proxyserver/CaptureProxy.java 0.00% 3 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 62.50% 1 Missing and 2 partials ⚠️
...ure/kafkaoffloader/tracing/KafkaRecordContext.java 90.47% 2 Missing ⚠️
...arch/migrations/replay/tracing/RequestContext.java 88.23% 2 Missing ⚠️
...s/trafficcapture/FileConnectionCaptureFactory.java 0.00% 1 Missing ⚠️
... and 5 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #459      +/-   ##
============================================
+ Coverage     73.16%   73.50%   +0.34%     
- Complexity     1149     1213      +64     
============================================
  Files           120      131      +11     
  Lines          4732     4993     +261     
  Branches        420      431      +11     
============================================
+ Hits           3462     3670     +208     
- Misses         1002     1047      +45     
- Partials        268      276       +8     
Flag Coverage Δ
unittests 73.50% <75.81%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…to the collector to prometheus, zipkin, etc

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…lotted within the same graph in prometheus.

Signed-off-by: Greg Schohn <[email protected]>
…ics into an optional.

Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK.

Signed-off-by: Greg Schohn <[email protected]>
…troduce some more typesafe wrappers for contexts.

Lots more to come.

Signed-off-by: Greg Schohn <[email protected]>
…explicitly passing strongly typed context objects.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the DoNotMerge_MoreMetrics branch from ed6abcd to 3746a8e Compare November 30, 2023 04:51
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn marked this pull request as ready for review November 30, 2023 15:58
@gregschohn
Copy link
Collaborator Author

See #460

@gregschohn gregschohn closed this Nov 30, 2023
@gregschohn gregschohn deleted the DoNotMerge_MoreMetrics branch April 3, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant