
[Telemetry] Collector telemetry not including opencensus metrics along with internal metrics when feature flag useOtelForInternalMetrics is true #5687

Closed
timannguyen opened this issue Jul 14, 2022 · 11 comments
Labels: bug, help wanted, priority:p2

Comments

@timannguyen (Contributor) commented Jul 14, 2022

Describe the bug
When the feature gate is enabled with the flag --feature-gates=telemetry.useOtelForInternalMetrics (for example in the demo) or in code via

featuregate.Apply(map[string]bool{
	"telemetry.useOtelForInternalMetrics": true,
})

calling http://localhost:8888/metrics does not return the metrics described in the monitoring documentation.

Steps to reproduce

  1. Set the telemetry address in the config YAML below (or the equivalent config-map entry service::telemetry::metrics::address: ":8888"):
service:
  telemetry:
    metrics:
      address: ":8888"
  2. Enable the gate with --feature-gates=telemetry.useOtelForInternalMetrics, or in code set telemetry.useOtelForInternalMetrics to true via featuregate.
  3. Go to http://localhost:8888/metrics.
  4. It returns an empty 200, or, if any custom metrics were added while the gate was enabled, only those are shown; the metrics from the monitoring documentation do not appear.

What did you expect to see?

I expected the metrics from the monitoring documentation plus any internal metrics added using the OTel metrics SDK.

EXAMPLE:

# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="jaeger",service_instance_id="71249946-a6bd-4e88-86e3-6c310b093c54",service_version="latest"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="71249946-a6bd-4e88-86e3-6c310b093c54",service_version="latest"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="prometheus",service_instance_id="71249946-a6bd-4e88-86e3-6c310b093c54",service_version="latest"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="zipkin",service_instance_id="71249946-a6bd-4e88-86e3-6c310b093c54",service_version="latest"} 0
...
...
# TYPE my_metrics counter
my_metrics{tag1="foo",tag2="bar"} 0
...
...

What did you see instead?

empty 200 response

OR

# TYPE my_metrics counter
my_metrics{tag1="foo",tag2="bar"} 0
...

What version did you use?
demo: docker image fb056cba11cd
code: v0.48.0

What config did you use?

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1

  logging:

  zipkin:
    endpoint: "http://zipkin-all-in-one:9411/api/v2/spans"
    format: proto

  jaeger:
    endpoint: jaeger-all-in-one:14250
    tls:
      insecure: true

processors:
  batch:

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  telemetry:
    metrics:
      address: ":8888"
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, zipkin, jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]

Environment
OS: M1 Mac
Docker

Edit: define featuregate calls
Edit2: formatting

timannguyen added the bug label Jul 14, 2022
@tigrannajaryan (Member)

I can reproduce this. http://localhost:8888/metrics is empty with the --feature-gates=telemetry.useOtelForInternalMetrics flag.

@open-telemetry/collector-approvers does anyone know how OTel-based metrics are supposed to work? With the feature flag enabled, are the metrics still supposed to be exposed on 8888?

@timannguyen (Contributor, author)

I have talked to Tigran and we will move to using OpenCensus.

dmitryax reopened this Aug 10, 2022
@dmitryax (Member)

Reopening since the issue is still not resolved.

dmitryax added the help wanted and priority:p2 labels Aug 10, 2022
@fatsheep9146 (Contributor) commented Sep 26, 2022

I think this bug is due to #3816 and #2204.

In #3816, @kirbyquerby started exporting internal metrics via OpenTelemetry and made an excellent first step:

  • adding a MeterProvider with a Prometheus exporter to service.telemetryInitializer.

But several pieces of work remain: all internal metrics are still instrumented with OC (OpenCensus), not OTel (OpenTelemetry), so when you start with the feature gate telemetry.useOtelForInternalMetrics you still cannot see any metrics exported.

To migrate the internal metrics from OC to OTel, @dashpole has already proposed a mature approach: the OpenCensus bridge for OpenTelemetry.
#3816 (review)

As an alternative, I would propose that we use both opencensus and opentelemetry client libraries at the same time while we migrate. We can use the OpenCensus bridge for OpenTelemetry to make both opencensus and opentelemetry clients serve metrics on the same prometheus endpoint.
We could roll this out with the following steps:

  • Switch from the OpenCensus prometheus exporter to a "wrapped" OpenTelemetry prometheus exporter. Verify that self obs metrics on the prometheus endpoint remain the same.
  • Add the OpenTelemetry API/SDK and use the same prometheus exporter that OC uses.
  • Migrate each metric from OC to OTel individually, and verify that the metric doesn't change.

But this proposal is blocked by issue #2204: the Prometheus exporter does not satisfy the requirements of the OpenCensus metric bridge for OpenTelemetry, which are described in https://pkg.go.dev/go.opentelemetry.io/otel/bridge/opencensus#readme-metrics

@dmitryax

@MadVikingGod

Howdy from the Go SIG.

Has anyone attempted to register both the OC SDK and the OTel SDK with the same Prometheus Registerer, and use the connected Gatherer as the temporary bridge?

I ask because the Go SIG is exploring a few options to unblock your development in this area:

  1. Provide the MetricProducer API, targeted at this problem. The delay here is that we, the Go SIG, would like to finalize this at the spec level before we release something usable.
  2. Provide a temporary API to enable this. The drawback is that we would remove it once option 1 is finished, so we want to work out a plan with the collector to synchronize both the use and the eventual removal of this API.
  3. What I proposed above. If this hasn't already been tested, it may give you an immediate path forward (a sketch follows below), and you can migrate to option 1 at your leisure.
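
For reference, a minimal sketch in Go of option 3, assuming recent versions of contrib.go.opencensus.io/exporter/prometheus and go.opentelemetry.io/otel/exporters/prometheus (the latter's API has changed across releases); the port and wiring are illustrative, not the collector's actual telemetry initializer:

package main

import (
	"log"
	"net/http"

	ocprom "contrib.go.opencensus.io/exporter/prometheus"
	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// One Prometheus registry shared by both SDKs.
	reg := prom.NewRegistry()

	// OpenCensus: the OC Prometheus exporter registers its collector on reg
	// and gathers whatever the OC metric producers have recorded.
	if _, err := ocprom.NewExporter(ocprom.Options{Registry: reg}); err != nil {
		log.Fatal(err)
	}

	// OpenTelemetry: the otel-go Prometheus exporter is a metric reader that
	// registers its collector on the same reg.
	otelExp, err := otelprom.New(otelprom.WithRegisterer(reg))
	if err != nil {
		log.Fatal(err)
	}
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(otelExp))
	otel.SetMeterProvider(mp)

	// A single handler backed by the shared registry serves metrics recorded
	// through either SDK on the same endpoint.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8888", nil))
}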

@paivagustavo (Member)

I like that idea (3), and it is totally possible; I just tried it.

One thing that must be noted is that OTel and OpenCensus have different ways of dealing with resource attributes:
the OpenCensus Prometheus exporter appends them as "constant labels", while OTel emits the resource attributes as a target_info metric; see this for more context.

If we go down that path, I think we would need the OTel Prometheus exporter to emit the target_info metric (open-telemetry/opentelemetry-go#3166) and remove the constant labels from the OpenCensus exporter. The Prometheus receiver is already capable of transforming a target_info metric into an OTel resource, so this shouldn't be a problem.

Or we can ignore this for now and keep adding these as constant labels to the unified prometheus.Registerer; either way should unblock the OC -> OTel migration.
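
For illustration (label values made up; otelcol_process_uptime is just an example series), the difference on the scrape endpoint would look roughly like this:

# OpenCensus exporter: resource attributes as constant labels on every series
otelcol_process_uptime{service_instance_id="abc123",service_version="latest"} 42

# Spec-compliant OTel exporter: resource attributes on a separate target_info metric
# TYPE target_info gauge
target_info{service_instance_id="abc123",service_version="latest"} 1
otelcol_process_uptime 42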

c.c. @bogdandrutu since this is related to our latest discussion.

@dashpole (Contributor) commented Oct 5, 2022

To clarify what I was trying to get across at the SIG meeting today:

  • Combining prometheus metrics from both OC and OTel requires a given metric to only be collected by either OpenCensus or OpenTelemetry. Otherwise, the metrics will collide, and be rejected by the prometheus client or will produce invalid prometheus data.
  • Given that, we would have two choices for how to ensure this:
    • Side-by-side: Have two different implementations of all instrumentation. One using OpenCensus, and one using OpenTelemetry. The one applied is determined by a feature-gate.
    • Replacement: For each line of instrumentation, remove the OpenCensus, and add OpenTelemetry instrumentation.

I have a few concerns/thoughts either way:

Side-by-side:

  • If we are maintaining side-by-side instrumentation, I don't see much of a benefit in the "combine OC and OTel into the same registry" idea. The main benefit would be that new instrumentation could only use OpenTelemetry, and coexist with OpenCensus telemetry. Maybe I'm missing the use-case, but it seems like we might as well keep both entirely separate.

Replacement:

  • The OTel prometheus exporter still needs a number of changes to be spec-compliant. Those include adding target_info, handling instrumentation scope, and updating metric naming. Those may be breaking for the collector's uses. If there are breaking differences between the exporters, each time a metric is changed from OC -> OTel, it will break users.
  • No feature-gate means users can't revert back to using OC if there is an issue with the OTel SDK or prometheus exporter.

@paivagustavo (Member)

Combining prometheus metrics from both OC and OTel requires a given metric to only be collected by either OpenCensus or OpenTelemetry. Otherwise, the metrics will collide, and be rejected by the prometheus client or will produce invalid prometheus data.

That is a great point, @dashpole; we need to make sure we have everything protected by the feature gate as you described.

Maybe I'm missing the use-case, but it seems like we might as well keep both entirely separate.

The idea behind side-by-side instrumentation is that we can start the migration without early adopters losing any metrics. It doesn't make much sense for users to enable the OTel feature gate if they only get, for example, the obsreceiver metrics.

And with that we keep the guarantee that if something is not working as expected, users can simply disable the feature gate and fall back to OC only, which is basically the reason for not going with replacement from the beginning; a rough sketch of gate-protected instrumentation follows below.
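
A rough sketch in Go of gate-protected, side-by-side instrumentation, using the current otel-go metric API; the type, field names, and gate handling are hypothetical and not the collector's actual obsreport code:

import (
	"context"

	"go.opencensus.io/stats"
	"go.opentelemetry.io/otel/metric"
)

// acceptedSpansRecorder is a hypothetical holder for one instrument per SDK;
// only one of the two is ever recorded to, depending on the feature gate.
type acceptedSpansRecorder struct {
	useOtel     bool                // e.g. resolved from telemetry.useOtelForInternalMetrics
	ocMeasure   *stats.Int64Measure // OpenCensus instrument
	otelCounter metric.Int64Counter // OpenTelemetry instrument
}

func (r *acceptedSpansRecorder) Record(ctx context.Context, n int64) {
	if r.useOtel {
		r.otelCounter.Add(ctx, n) // recorded through the OTel SDK only
	} else {
		stats.Record(ctx, r.ocMeasure.M(n)) // recorded through OpenCensus only
	}
}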

@paivagustavo (Member)

Based on Aaron's suggestion to use the prometheus.Registry as a bridge between OC and OTel Go, I've tried to outline a migration plan. I'd like to get some feedback on it and see if we can start working on this. I've made a draft to show it is possible: #6297.

A caveat brought up during the SIG meeting: this blocks us from exporting the collector's telemetry in other formats for the moment, but that is OK since Prometheus is currently the only option for exporting metrics.

Migration Plan

  1. Share a prometheus.Registry between both SDKs
    1. The OC SDK always registers itself
    2. The OTel SDK only registers if the feature gate is enabled
  2. Instrument the Collector's core with OTel
    1. Instrumentation is added side-by-side with OC, guarded by a feature gate
    2. No metric should be recorded by both SDKs at the same time
  3. After two releases (open to suggestion about the timing), mark instrumentation with OTel as stable
  4. Clean up the feature gate
    1. Remove the feature gate and the OC instrumentation from core
    2. Both the OC SDK and the OTel SDK should always be registered on the registry
  5. Migrate the contrib repo
    1. OTel instrumentation must replace the OC metrics at this point
  6. Clean up the OC SDK from core and contrib

Notes:

  • As soon as the OTel bridge is ready we can start using it, even in the middle of the migration. Using the OTel bridge instead of the prometheus.Registry bridge allows us to add other exporters to the otel-go SDK; a rough sketch follows below.
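
To make the note above concrete, a rough sketch of plugging the bridge into the otel-go SDK, written against a later bridge/SDK API (opencensus.NewMetricProducer plus a reader WithProducer option) than existed when this was discussed; the OTLP/gRPC exporter is only there to illustrate "other exporters":

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/bridge/opencensus"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func newMeterProviderWithOCBridge(ctx context.Context) *sdkmetric.MeterProvider {
	// Any otel-go exporter works here; OTLP/gRPC just shows that we are no
	// longer tied to the shared Prometheus registry.
	exp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// The bridge producer pulls metrics recorded through the OpenCensus
	// libraries and hands them to the otel-go reader alongside metrics from
	// native OTel instruments.
	bridge := opencensus.NewMetricProducer()

	reader := sdkmetric.NewPeriodicReader(exp,
		sdkmetric.WithProducer(bridge),
		sdkmetric.WithInterval(10*time.Second),
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
}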

Diagrams

During the SIG meeting where this was discussed, people asked for some diagrams to help visualize the path and the data flow for the migration. @smithclay has made these to help:
[diagram omitted: migration path overview]

Another important piece of the migration is how data gets generated during this migration.

For core's migration, where a feature gate enables or disables the OTel Go SDK, instruments for the same metrics will exist in both OC and OTel, but only one should be used at any time.

For contrib's migration, each OC metric should be replaced with an equivalent OTel metric.

[diagram omitted: data flow during the migration]

@bogdandrutu (Member)

+1 on this

After two releases (open to suggestion about the timing), mark instrumentation with OTel as stable

I think we should flip the feature gate as soon as a significant number of metrics in core are switched to OTel, then wait in this state (while continuing to migrate everything) for 2-3 releases.

@codeboten (Contributor)

The original issue was fixed in #8716, and the feature gate is now marked as beta.
