feat: open telemetry #4664

Merged
merged 2 commits into from
Jun 28, 2023

Conversation

TimBeyer
Contributor

What this PR does / why we need it:
Adds OTEL tracing support to get deep insights into the build performance.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

TimBeyer requested review from eysi09, edvald and 10ko and removed request for edvald June 20, 2023 10:13
TimBeyer force-pushed the feat/open-telemetry branch from 2cb7f75 to 1536994 June 21, 2023 09:37
Collaborator

eysi09 left a comment

Truly great work!

I mostly had questions since I'm not super familiar with the OTel SDK.

I would also very much appreciate getting @edvald's eyes on this, since it touches on some critical parts of the code, in particular around the initialisation.

@@ -122,9 +123,9 @@ Use ${chalk.bold("up/down")} arrow keys to scroll through your command history.
},
})

- useInput((input, key) => {
+ useInput(bindActiveContext((input, key) => {
Collaborator

Not sure I follow here. Why is this needed?

Contributor Author

The SDK tracks the current context using AsyncLocalStorage.

Because those callbacks are executed later by an event emitter, outside the async call chain in which they were registered, they lose the automatic binding that would normally be in place. Event handlers therefore need to be bound explicitly so that the context keeps propagating into the code triggered by the handler.
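
The context loss described above can be reproduced with Node's built-in `AsyncLocalStorage` alone. The sketch below is an illustration under assumptions, not the PR's `bindActiveContext`: it uses `AsyncResource.bind` from `node:async_hooks` as a stand-in for the binding step, and the store value and event name are made up.

```typescript
import { AsyncLocalStorage, AsyncResource } from "node:async_hooks"
import { EventEmitter } from "node:events"

// Stand-in for the SDK's context storage (assumption: a plain string "span").
const activeSpanStore = new AsyncLocalStorage<string>()
const emitter = new EventEmitter()
const seen: Array<string | undefined> = []

activeSpanStore.run("dev-command-span", () => {
  // Plain handler: runs in whatever context is active when emit() is called.
  emitter.on("input", () => {
    seen.push(activeSpanStore.getStore())
  })
  // Bound handler: captures the context that is active at registration time.
  emitter.on(
    "input",
    AsyncResource.bind(() => {
      seen.push(activeSpanStore.getStore())
    })
  )
})

// Emitted outside of run(), the way a keypress event would arrive later.
emitter.emit("input")
```

After `emit`, `seen` holds `undefined` for the plain handler and `"dev-command-span"` for the bound one.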

core/src/commands/dev.tsx (outdated, resolved)
@@ -6,4 +6,7 @@
* file, You can obtain one at http://mozilla.org/MPL/2.0/.
*/

import { initTracing } from "./util/tracing/tracing"
Collaborator

Why do we need this in multiple places?

Contributor Author

I added this here so that when garden is loaded, we immediately install all the hooks.
It might be redundant here, but since the other initialization happened only in /cli, I wanted to ensure that there's at least one entry point in /core itself as well.

return otelSDK
}

if (!gardenEnv.GARDEN_ENABLE_TRACING) {
Collaborator

Not super familiar with this flow, but shouldn't we just exit here, as opposed to overwriting the env and continuing the function execution?

Contributor Author

We're still going to call OTEL SDK functions even if we don't initialize the SDK, and we're not entirely sure how they behave in that case.

Thus we went with OTEL_SDK_DISABLED, which ensures that all operations are no-ops at the OTEL SDK level.
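
A minimal sketch of that choice, assuming the toggle is enabled only by the literal string `"true"`. `applyTracingToggle` is a hypothetical helper, not the function in this PR, and how `gardenEnv` actually parses `GARDEN_ENABLE_TRACING` is not shown in the diff. `OTEL_SDK_DISABLED` is the standard OpenTelemetry environment variable.

```typescript
type EnvVars = Record<string, string | undefined>

// Hypothetical helper: instead of returning early when tracing is off,
// mark the SDK as disabled so that, per the reply above, later OTEL calls
// become no-ops at the SDK level.
function applyTracingToggle(env: EnvVars): EnvVars {
  // Assumption: the flag counts as enabled only when it is exactly "true".
  const tracingEnabled = env.GARDEN_ENABLE_TRACING === "true"
  if (tracingEnabled) {
    return { ...env }
  }
  return { ...env, OTEL_SDK_DISABLED: "true" }
}

const tracingOffEnv = applyTracingToggle({})
const tracingOnEnv = applyTracingToggle({ GARDEN_ENABLE_TRACING: "true" })
```

Initialization then continues in both cases, which keeps a single code path regardless of the toggle.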

@@ -1108,6 +1119,9 @@ export class Garden {
})
}

@OtelTraced({
Collaborator

Some of this is quite granular.

I think the end user will mostly care about:

  • Overall command execution
  • Config resolution
  • Individual actions

That being said, this could be very useful for our own internal debugging. But I'm wondering if we should distinguish between that and the data that users are interested in.

Contributor Author

I think we can build upon this and maybe have different detail levels.
I would do that as a follow up though.

@@ -353,6 +354,7 @@ function runPersistent({
env: {
...getDefaultEnvVars(action),
...(env ? mapValues(env, (v) => v + "") : {}),
...getTracePropagationEnvVars(),
Collaborator

Is this needed here?

If so, are there other places we could be missing?

Contributor Author

This ensures that in the (admittedly very specific) case of someone calling a garden command from within the exec command without using gardenCommand, we still trace things through.

Also, in theory, any other piece of software that we spawn there and that follows the environment variable spec will be able to pick up the traces.

I didn't find any other obvious spots to add the propagation vars, but I may have missed some.
I looked for all execa calls to see if they do something that might require the headers.
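
To illustrate what propagating a trace through environment variables can look like, here is a sketch that is not the PR's `getTracePropagationEnvVars`. It assumes the W3C `traceparent` format carried in a `TRACEPARENT` variable; both helper names are hypothetical and the IDs are sample values.

```typescript
interface SpanIds {
  traceId: string // 32 lowercase hex characters
  spanId: string // 16 lowercase hex characters
  sampled: boolean
}

// Hypothetical stand-in: serialise the active span's IDs into a W3C
// `traceparent` value (version-traceId-spanId-flags).
function toTracePropagationEnvVars(ids: SpanIds): Record<string, string> {
  const flags = ids.sampled ? "01" : "00"
  return { TRACEPARENT: `00-${ids.traceId}-${ids.spanId}-${flags}` }
}

// What a spawned child process could do to pick the parent context back up.
function fromTracePropagationEnvVars(env: Record<string, string | undefined>): SpanIds | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(env.TRACEPARENT ?? "")
  if (!match) {
    return undefined
  }
  const flags = parseInt(match[3], 16)
  return { traceId: match[1], spanId: match[2], sampled: (flags & 1) === 1 }
}

const parentSpanIds: SpanIds = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
}

// Merged into the child's env, the same way the diff spreads the vars in.
const childEnvVars = { ...toTracePropagationEnvVars(parentSpanIds) }
const restoredSpanIds = fromTracePropagationEnvVars(childEnvVars)
```

Any spawned process that reads the same variable can continue the trace, which is why the vars are merged into the child environment last.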

TimBeyer force-pushed the feat/open-telemetry branch from b79d4ad to d300a06 June 22, 2023 15:10
Collaborator

eysi09 commented Jun 23, 2023

Thanks for the response! All of that makes sense to me.

I'd still very much like to get @edvald's take before we merge. In particular since we're planning on a release on Monday.

If Jón is super busy, I'm happy to merge on Monday after the release and just dog food this some.

TimBeyer requested a review from edvald June 26, 2023 08:36
wip: trace commands

wip: working tracing decorator

wip: more tracing for build and deploy tasks

chore: update TS

refactor: remove explicitly typed `this` argument

feat: add more traces

feat: automatically propagate session context

feat: make HTTP requests also add the session ID attributes

feat: namespace garden attributes

feat: OTLP exporter and other stuff

chore: clearer naming of internal tasks

feat: trace more things

feat: more detailed tracing

chore: minor fix to OTEL shutdown message

feat: use finally for tracing wrapper

fix: remove redundant tracing

chore: support custom commands

refactor: ensure OTEL SDK is always initialized first before anything else

refactor: some renaming

build: pin otel dependencies correctly

chore: added an env. variable to control tracing, default false

feat: propagate traces to exec commands via environment variables

feat: trace dev mode

chore: adding a timeout to the OTEL SDK shutdown

refactor: decompose into smaller modules

fix: lint error

fix: rebase mistake

fix: licenses

fix: redundant async
TimBeyer force-pushed the feat/open-telemetry branch from d300a06 to fc0c544 June 26, 2023 08:46
- result = await cli.run({ args, exitOnError })
+ // initialize the tracing to capture the full cli execution
+ result = await withContextFromEnv(() =>
+ wrapActiveSpan("garden", async (span) => {
Collaborator

Not a major thing, but stylistically I'd tend to prefer wrapping these functions in an object/namespace, e.g. tracing.withContextFromEnv and tracing.wrapActiveSpan. That's what I've been doing with the new plugin SDK for example, and it makes it a bit more obvious what these functions are about when reading code.

Contributor Author

In that case we'd have to import and re-export all the different functions from the different submodules since we have for example withContextFromEnv in one module and wrapActiveSpan in another.

I feel like then it would be preferable if we used an index.ts that groups all the things together so they lie under util/tracing directly. Then in there we have to group those all into an object since we can't really enforce the import * as tracing from "tracing" syntax we could use if we simply just re-export everything.

To me that seems to introduce different redundant ways of importing those modules which can't really be enforced, or discourage modularization in order to prevent that redundancy.

I'm not sure that's worth it. Maybe what would be better is to use clear naming everywhere so it's always obvious it's withTracingContextFromEnv or something like that.

Collaborator

One quite simple way of doing that is a good ol' underscore prefix on the "private" functions re-exported in one place. I do also tend to prefer a named object over import * from... though.

This is not crucial btw, just a thought for better readability and discoverability going forward.
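
For reference, the named-object idea from this thread could look roughly like the sketch below. The function bodies are trivial stand-ins, the submodule paths in the comments are hypothetical, and everything lives in one file here only to keep the example self-contained.

```typescript
// Stand-ins for functions that would live in separate submodules, e.g.
// util/tracing/propagation.ts and util/tracing/spans.ts (hypothetical paths).
// The underscore prefix marks them as "private" to the tracing package.
function _withContextFromEnv<T>(fn: () => T): T {
  return fn()
}

function _wrapActiveSpan<T>(spanName: string, fn: (spanName: string) => T): T {
  return fn(spanName)
}

// In a real util/tracing/index.ts this object would be exported, so call
// sites read as tracing.withContextFromEnv(...) and tracing.wrapActiveSpan(...).
const tracing = {
  withContextFromEnv: _withContextFromEnv,
  wrapActiveSpan: _wrapActiveSpan,
} as const

const tracedOutcome = tracing.withContextFromEnv(() =>
  tracing.wrapActiveSpan("garden", (spanName) => `ran inside ${spanName}`)
)
```

The trade-off raised earlier still applies: the underscore-prefixed functions remain importable directly, so the grouping is a convention rather than something the compiler enforces.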

TimBeyer merged commit 10aee8b into main Jun 28, 2023
TimBeyer deleted the feat/open-telemetry branch June 28, 2023 12:08
ShankyJS pushed a commit that referenced this pull request Jul 10, 2023
* feat: OTEL tracing
* chore: document tracing functions
3 participants