Upgrade OpenTelemetry Collector dependency #3858

Merged: 4 commits merged into main from upgrade-otel on Jun 30, 2023

Conversation

@ptodev (Contributor) commented May 11, 2023:

WIP PR so that we can track the changes going into the feature branch, and whether the CI completes successfully.

Fixes #3660

Notes to the reviewer

  • Some metrics from Otel components will be missing. This is because we are enabling the telemetry.useOtelForInternalMetrics feature gate, and not all Otel components support native Otel metrics. There are also new metrics on the Agent's /metrics endpoint which were not previously available. The new and missing metrics are listed here:
  • In pkg/flow/tracing/tracing.go we update go.opentelemetry.io/otel/semconv from version 1.12.0 to 1.17.0. I believe this should be fine due to backwards compatibility guarantees. For more information on what this package contains, see Otel's wiki page about Semantic Conventions.
  • In the more recent version of the Collector's BearerTokenAuth extension, there is now a filename field. I did not port it over. We can recommend that people use local.file instead. Thus we only expose the scheme and token attributes.
  • otelcol.exporter.loki now also adds instrumentation scope like this:
    "instrumentation_scope":{"name":"name","version":"version"}}
  • It is no longer possible to sort Otel pcommon attributes. For comparisons in tests we now use AsRaw() (see the sketch after this list):
    [pdata] Deprecate pcommon.Map.Sort() open-telemetry/opentelemetry-collector#6989
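
For illustration (this is my own sketch, not code from the PR), an AsRaw()-based comparison in a test could look roughly like this; the attribute keys and values below are made up:

package example

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"go.opentelemetry.io/collector/pdata/pcommon"
)

func TestAttributesEqual(t *testing.T) {
	expected := pcommon.NewMap()
	expected.PutStr("log.file.name", "app.log")
	expected.PutInt("retries", 3)

	actual := pcommon.NewMap()
	actual.PutInt("retries", 3) // insertion order differs from `expected`
	actual.PutStr("log.file.name", "app.log")

	// AsRaw() converts each pcommon.Map into a plain map[string]any, and Go map
	// comparison does not depend on insertion order, so no Sort() call is needed.
	assert.Equal(t, expected.AsRaw(), actual.AsRaw())
}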

Things to do before merging

  • Use the Grafana fork of Collector instead of Paulin's own fork
  • Contribute native Otel metric support for some Collector components, so that metrics don't go missing from the Agent's /metrics endpoint. IMO this is a nice to have, and not a must?
  • Review all the Collector components we ported so far to see if their configuration or documentation changed?

@ptodev force-pushed the upgrade-otel branch 5 times, most recently from d9aace2 to 7e13bbe on May 15, 2023
@ptodev requested review from a team and clayton-cornell as code owners on June 8, 2023
@ptodev force-pushed the upgrade-otel branch 2 times, most recently from 045b0c0 to 7202edc on June 8, 2023
@rfratto (Member) commented Jun 8, 2023:

@ptodev should this PR be in draft since it's still a WIP?

@ptodev marked this pull request as draft on June 8, 2023

@ptodev (Contributor, Author) commented Jun 8, 2023:

> @ptodev should this PR be in draft since it's still a WIP?

Yes... Apologies. I didn't know about this feature back when I created it 😄 I converted it to Draft now, although it's almost time to open it for reviews.

Comment on lines 115 to 129
if filename, exists := lokiEntry.Labels["filename"]; exists {
filenameStr := string(filename)
// The `promtailreceiver` from the opentelemetry-collector-contrib
// repo adds these two labels based on these "semantic conventions
// for log media".
// https://opentelemetry.io/docs/reference/specification/logs/semantic_conventions/media/
// We're keeping them as well, but we're also adding the `filename`
// attribute so that it can be used from the
// `loki.attribute.labels` hint for when the opposite OTel -> Loki
// transformation happens.
lr.Attributes().PutStr("log.file.path", filenameStr)
lr.Attributes().PutStr("log.file.name", path.Base(filenameStr))
}
Member:

The promtailreceiver no longer exists, the lokireceiver replacement does not follow the same logic (maybe because it's considered more of a network-based source?), and these conventions are experimental anyway.

I'd recommend we remove this to keep us more in line with upstream and add a comment like this:

// TODO: We should consult with upstream to add the `log.file.path` and `log.file.name` attributes once the semantic conventions for log media are mature 
// https://opentelemetry.io/docs/reference/specification/logs/semantic_conventions/media/

@ptodev (Contributor, Author):

I'm hesitant to do this in this PR because this would be a breaking change. Would you mind if we do this separately, on another PR?

Member:

Ok, I'll make the change separately. If you'd put a TODO(@tpaschalis) comment I'd appreciate it

Member:

@ptodev Can you add the TODO comment here prior to merging?

@ptodev (Contributor, Author):

Yes, just added it 👍

@ptodev (Contributor, Author) commented Jun 16, 2023:

> @ptodev Can you let me know what you've tested manually, so I can see if there are any gaps?

@rfratto I start a local Tempo using the instructions in the Tempo repo. I make a few minor modifications to make this run with a locally running Agent - my config file is docker-compose copy.yaml.txt

Then I start an Agent with one of these configs:

I can't run two Agents side by side, so I first test an "old" Agent, stop it, then start a "new" one.

These are the things I look for:

  • I run a {} query on my local Tempo to make sure it is receiving traces.
  • I compare the metrics from /metrics from both agents. I check that the metric names are the same and that all metrics from the old agent are also present in the new agent (see the sketch at the end of this comment).
  • I make sure that there are metrics for the components which I started (e.g. service graph metrics).
  • I make sure that the Collector's Service didn't start its own /metrics endpoint on another port.
  • I check the logs for errors or anything suspicious.
  • When running on static mode I also check:
    • loki, to see if it got the logs generated from traces.
    • localhost:8899/metrics, to see if the spanmetrics component is working

A few gaps in my testing:

  • I only test traces flowing through otelcol.* components. Not metrics or logs. I think it's safe to assume that if traces work, the other two should work too.
  • I haven't tried testing with the attributes processor. I think it'll add more metrics, and I'll need to mention this in the docs, in the debugging info section.
  • I can't test Agents side by side. This means that some of the service graph metrics can be a little different, because one agent didn't pick up a service that the other agent picked up. I don't think that's a problem; there's enough data to confirm that everything works.
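
For the metric-name comparison mentioned above, a rough helper along these lines could be used (my own sketch for illustration, not code from the PR; it assumes each Agent's /metrics output was first saved to a text file, e.g. old.txt and new.txt):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// metricNames reads a saved Prometheus /metrics dump and returns the set of
// metric names (the token before any '{' or space on non-comment lines).
func metricNames(path string) (map[string]struct{}, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	names := map[string]struct{}{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		names[name] = struct{}{}
	}
	return names, scanner.Err()
}

func main() {
	// Usage: go run diffmetrics.go old.txt new.txt
	oldNames, err := metricNames(os.Args[1])
	if err != nil {
		panic(err)
	}
	newNames, err := metricNames(os.Args[2])
	if err != nil {
		panic(err)
	}
	// Report every metric name that exists in the old dump but not the new one.
	for name := range oldNames {
		if _, ok := newNames[name]; !ok {
			fmt.Println("missing in new agent:", name)
		}
	}
}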

@clayton-cornell (Contributor) left a comment:

A couple small suggestions for docs

@ptodev (Contributor, Author) commented Jun 20, 2023:

> If metrics go missing, that sounds like a big deal. I'd like to not break observability for anyone.
> Can you explain what it would mean to "contribute native Otel metric support for some Collector components" exactly?

@rfratto I am referring to functionality similar to that in the batch processor. There, for example, the batch_size_trigger_send counter metric is available in both OpenCensus and Otel formats. In order not to lose metrics in static mode, we would have to add support for Otel metrics to these components:

I'll ask the Otel team if such work is already being planned, and in general if they think the useOtelForInternalMetrics feature gate will be enabled by default anytime soon. But I have a feeling that it'll come down to:

  • not setting the useOtelForInternalMetrics feature gate (see the feature-gate sketch after this list)
  • implementing OpenCensus metrics in addition to the Otel ones in the service graph processor
  • integrating the Agent with the OpenCensus functionality in the Collector again
  • potentially leaving the existing work on Otel instrumentation in place, so that we are ready whenever useOtelForInternalMetrics becomes the default (when it goes from "alpha" to "beta")
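
For reference, this is roughly how a Collector feature gate is registered and toggled via go.opentelemetry.io/collector/featuregate (my own illustration, not code from the PR or the Agent; the gate ID below is a stand-in, since the real telemetry.useOtelForInternalMetrics gate is registered inside the Collector itself):

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/featuregate"
)

// demoGate is a stand-in gate registered only for this sketch.
// Alpha-stage gates default to disabled.
var demoGate = featuregate.GlobalRegistry().MustRegister(
	"demo.useOtelForInternalMetrics",
	featuregate.StageAlpha,
)

func main() {
	fmt.Println("before:", demoGate.IsEnabled()) // false

	// Flipping a gate by ID is what the Collector's --feature-gates flag
	// (and embedding code such as the Agent) ultimately does.
	if err := featuregate.GlobalRegistry().Set("demo.useOtelForInternalMetrics", true); err != nil {
		panic(err)
	}

	fmt.Println("after:", demoGate.IsEnabled()) // true
}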

@ptodev requested a review from rfratto on June 20, 2023
@rfratto (Member) commented Jun 29, 2023:

> @rfratto I am referring to functionality similar to that in the batch processor. There, for example, the batch_size_trigger_send counter metric is available in both OpenCensus and Otel formats.

@ptodev I see. I'm in favor of always using OTel SDK metrics, and waiting for the upstream components to switch off of OpenCensus. I think I misunderstood in my earlier comment how many metrics were affected (i.e., not literally all of them). The metrics I'm particularly interested in keeping are the high-level ones for exporters/receivers/processors, which we should still have since those are all OTel SDK-compatible now.

@rfratto (Member) left a comment:

LGTM minus the final set of comments, let's get this merged!

- cloud.google.com/go/pubsub v1.28.0
- contrib.go.opencensus.io/exporter/prometheus v0.4.2
+ cloud.google.com/go/pubsub v1.30.0
+ contrib.go.opencensus.io/exporter/prometheus v0.4.2 // indirect
@rfratto (Member) commented Jun 29, 2023:

(Bump on this conversation)

return fmt.Errorf("error while setting up the remote sampling source: %w", err)
}
- conn, err := grpc.Dial(jrse.cfg.Source.Remote.Endpoint, opts...)
+ conn, err := jrse.cfg.Source.Remote.ToClientConn(context.Background(), host, jrse.telemetry)
Member:

This package is a fork from upstream, and upstream also passes the context here, so we should too IMO.

go.mod Outdated
Comment on lines 581 to 610
require (
go.opentelemetry.io/collector/component v0.79.0
go.opentelemetry.io/collector/confmap v0.79.0
go.opentelemetry.io/collector/consumer v0.79.0
go.opentelemetry.io/collector/exporter v0.79.0
go.opentelemetry.io/collector/featuregate v1.0.0-rcv0012
go.opentelemetry.io/collector/processor/batchprocessor v0.79.0
go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.79.0
go.opentelemetry.io/collector/receiver v0.79.0
go.opentelemetry.io/collector/receiver/otlpreceiver v0.79.0
)

require (
github.com/alecthomas/participle/v2 v2.0.0 // indirect
github.com/gobwas/glob v0.2.3 // indirect
github.com/google/s2a-go v0.1.4 // indirect
github.com/iancoleman/strcase v0.2.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/internal/filter v0.79.0 // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/ottl v0.79.0 // indirect
github.com/shoenig/go-m1cpu v0.1.6 // indirect
github.com/tilinna/clock v1.1.0 // indirect
go.opentelemetry.io/contrib/propagators/b3 v1.17.0 // indirect
go.opentelemetry.io/otel/bridge/opencensus v0.39.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20230530153820-e85fd2cbaebc // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20230530153820-e85fd2cbaebc // indirect
)
@rfratto (Member) commented Jun 29, 2023:

(Bump on this conversation)

Comment on lines +674 to +688
go.opentelemetry.io/collector => github.com/ptodev/opentelemetry-collector v0.0.0-20230608105454-0c02b84d03d4
go.opentelemetry.io/collector/exporter/otlpexporter => github.com/ptodev/opentelemetry-collector/exporter/otlpexporter v0.0.0-20230608105454-0c02b84d03d4
go.opentelemetry.io/collector/exporter/otlphttpexporter => github.com/ptodev/opentelemetry-collector/exporter/otlphttpexporter v0.0.0-20230608105454-0c02b84d03d4
go.opentelemetry.io/collector/featuregate => github.com/ptodev/opentelemetry-collector/featuregate v0.0.0-20230608105454-0c02b84d03d4
go.opentelemetry.io/collector/pdata => github.com/ptodev/opentelemetry-collector/pdata v0.0.0-20230608105454-0c02b84d03d4
@rfratto (Member) commented Jun 29, 2023:

(bump on this conversation; your PR is approved so it can be merged and we can switch to using the Grafana org's fork)

@clayton-cornell (Contributor) left a comment:

Doc input is fine as-is.

@ptodev merged commit b1a3fb0 into main on Jun 30, 2023
@ptodev deleted the upgrade-otel branch on June 30, 2023
@ptodev (Contributor, Author) commented Jun 30, 2023:

There are some issues which need fixing, but they will be fixed in a follow-up PR.

The most important issue revolves around OpenCensus metrics coming from the Collector. Static mode allows multiple tracing configs to be created for the same agent, and those configs can contain components with the same names. Unfortunately, OpenCensus metrics are registered globally, and it is very hard to isolate them so that each metric is specific to a component instance. The current code in the Agent is also buggy: if there is more than one traces_config, the OpenCensus metrics from the Collector components display inaccurate values.

Also, my comment that we need to contribute Otel instrumentation in order to see OpenCensus metrics was wrong. Apparently the way to see OpenCensus metrics is through a line like this. However, due to the issue mentioned above, we are not going to incorporate OpenCensus metrics.

If there is a need to incorporate the metrics which go missing due to this upgrade, we will contribute Otel instrumentation to the relevant Collector components.

clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
@github-actions bot added the frozen-due-to-age label (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) on Feb 23, 2024
@github-actions bot locked as resolved and limited conversation to collaborators on Feb 23, 2024
Labels: frozen-due-to-age
Development

Successfully merging this pull request may close these issues.

Upgrade Agent to the latest version of Otel Collector
4 participants