feat: otel collector integration #4769

TimBeyer · 2023-07-07T12:46:46Z

What this PR does / why we need it:
This PR lets us deploy an OpenTelemetry (otel) collector locally by configuring it as a provider.

Once configured, the provider finds an open port, generates a configuration file on the fly, and spawns the collector on the local machine. It then reconfigures the existing otel exporter to target the new collector, which in turn exports the data to the configured destinations.
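Finding an open port is typically done by binding to port 0 and letting the OS pick a free one. A minimal sketch using Node's built-in net module; getFreePort is a hypothetical helper, not necessarily how the PR does it:

```typescript
import { createServer } from "net"
import type { AddressInfo } from "net"

// Hypothetical helper: bind to port 0 so the OS assigns a free ephemeral port,
// read the assigned port back, then release it so the collector can bind to it.
function getFreePort(host = "127.0.0.1"): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer()
    server.on("error", reject)
    server.listen(0, host, () => {
      const { port } = server.address() as AddressInfo
      // Close the probe server before handing the port to the collector.
      server.close(() => resolve(port))
    })
  })
}
```

There is a small race window: another process could claim the port between close() and the collector binding to it, so retrying on a bind failure is a reasonable addition.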

The collector gracefully shuts down at the end of the garden session after attempting to flush all remaining data.
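A flush-then-exit sequence like this can be sketched as a list of async shutdown hooks that are awaited, bounded by a timeout, before the process exits. The names registerShutdownHook and runShutdownHooks and the timeout semantics are hypothetical, for illustration only:

```typescript
type ShutdownHook = () => Promise<void>

// Hypothetical registry of async cleanup tasks, e.g. "flush spans, then stop the collector".
const shutdownHooks: ShutdownHook[] = []

function registerShutdownHook(hook: ShutdownHook): void {
  shutdownHooks.push(hook)
}

// Starts every hook and waits for all of them, but never longer than timeoutMs in total.
// Resolves to true if all hooks settled before the deadline, false on timeout.
async function runShutdownHooks(timeoutMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs)
  })
  const allHooks = Promise.allSettled(shutdownHooks.map((hook) => hook())).then(() => true)
  const finished = await Promise.race([allHooks, deadline])
  clearTimeout(timer)
  return finished
}
```

The timeout matters because an exporter that cannot reach its destination should not keep the CLI from exiting; after runShutdownHooks resolves, the caller can exit with the original status code.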

An example project config could look like this:

```yaml
apiVersion: garden.io/v0
kind: Project
name: otel-collector-test
environments:
  - name: local
    defaultNamespace: ${var.dev-env-name}
providers:
  - name: local-kubernetes
    environments: [local]
    namespace: ${environment.namespace}
  - name: otel-collector
    exporters:
      - name: otlphttp
        enabled: true
        endpoint: http://localhost:4318
      - name: newrelic
        enabled: true
        apiKey: ${secrets.NR_API_KEY}
      - name: honeycomb
        enabled: true
        apiKey: ${secrets.HONEYCOMB_API_KEY}
        dataset: ${secrets.HONEYCOMB_DATASET}
      - name: datadog
        enabled: true
        site: ${secrets.DD_EXPORTER_SITE}
        apiKey: ${secrets.DD_EXPORTER_API_KEY}
variables:
  dev-env-name: ${project.name}-testing-${local.username}
```

Currently supported exporters are a generic otlphttp type for any OTLP-over-HTTP endpoint, plus dedicated configs for newrelic, honeycomb and datadog.
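The exporter entries in the example config map naturally onto a discriminated union keyed on name. A hedged sketch: the field names mirror the example YAML, but the type names and the activeExporters and describeExporter helpers are hypothetical, not the PR's actual definitions:

```typescript
// Hypothetical types mirroring the example project config.
interface OtlpHttpExporter {
  name: "otlphttp"
  enabled: boolean
  endpoint: string
}
interface NewRelicExporter {
  name: "newrelic"
  enabled: boolean
  apiKey: string
}
interface HoneycombExporter {
  name: "honeycomb"
  enabled: boolean
  apiKey: string
  dataset: string
}
interface DatadogExporter {
  name: "datadog"
  enabled: boolean
  site: string
  apiKey: string
}
type ExporterConfig = OtlpHttpExporter | NewRelicExporter | HoneycombExporter | DatadogExporter

// Keep only the exporters the user switched on.
function activeExporters(exporters: ExporterConfig[]): ExporterConfig[] {
  return exporters.filter((e) => e.enabled)
}

// The `name` tag narrows each entry, so type-specific fields are checked at compile time.
function describeExporter(exporter: ExporterConfig): string {
  switch (exporter.name) {
    case "otlphttp":
      return `OTLP/HTTP -> ${exporter.endpoint}`
    case "newrelic":
      return "New Relic"
    case "honeycomb":
      return `Honeycomb dataset ${exporter.dataset}`
    case "datadog":
      return `Datadog site ${exporter.site}`
  }
}
```

Adding a new exporter type then means adding one interface to the union; the exhaustive switch flags every place that still needs to handle it.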

The environment variable GARDEN_ENABLE_TRACING now defaults to true, so it only needs to be set to disable tracing.
When no exporter is configured, however, a "no op" exporter is set up so that we don't leak memory.
No data is sent anywhere without explicit configuration; the variable remains only as an override for disabling tracing, and there is no need to set it manually for the collector provider to work.
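The default-on behaviour reduces to a small decision. A sketch assuming a simplified exporter interface; SpanExporterLike, NoOpExporter and chooseTracingMode are hypothetical names, and treating "false" or "0" as the disabling values is an assumption:

```typescript
// Hypothetical, simplified exporter interface (the real OpenTelemetry SpanExporter differs).
interface SpanExporterLike {
  export(spans: unknown[]): void
  shutdown(): Promise<void>
}

// Drops every span immediately, so nothing accumulates in memory and nothing is sent.
class NoOpExporter implements SpanExporterLike {
  export(_spans: unknown[]): void {}
  async shutdown(): Promise<void> {}
}

type TracingMode = "disabled" | "no-op" | "collector"

// Tracing is on unless explicitly disabled; without active exporters it falls back to no-op.
function chooseTracingMode(env: Record<string, string | undefined>, activeExporterCount: number): TracingMode {
  const flag = (env.GARDEN_ENABLE_TRACING ?? "true").toLowerCase()
  if (flag === "false" || flag === "0") {
    return "disabled"
  }
  return activeExporterCount > 0 ? "collector" : "no-op"
}
```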

If a user wants to override the exporter at the lowest level, skipping the provider-based setup and using only environment variables, that is still possible by setting OTEL_TRACES_EXPORTER to one of the options documented in https://github.com/open-telemetry/opentelemetry-js/blob/main/experimental/packages/opentelemetry-sdk-node/README.md#configure-trace-exporter-from-environment

For example, to send traces to a JSON-over-HTTP OTLP endpoint on localhost, one could run:

```sh
OTEL_TRACES_EXPORTER=otlp OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/ OTEL_EXPORTER_OTLP_PROTOCOL=http/json garden deploy
```

In most cases this should not be necessary, but the option remains for users who already have their own OTel setup and do not want to deploy an additional collector via the provider.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

core/src/util/process.ts

Orzelius

I think the formatter might have been misconfigured. There are many styling changes that seem unrelated and make the PR hard to review.

Orzelius

Thank you! Incredibly clean and easy-to-read code! I've posted some comments and am wondering about the lack of tests and docs, but other than that it looks sweet. Sorry for the long wait for the review; I just forgot about the PR.

core/src/garden.ts

core/src/plugins/otel-collector/config/logging.ts

core/src/plugins/otel-collector/otel-collector.ts

Orzelius · 2023-07-13T08:44:05Z

core/src/plugins/otel-collector/otel-collector.ts

```diff
+export const provider = gardenPlugin.createProvider({ configSchema: providerConfigSchema, outputsSchema: s.object({}) })
+
+provider.addHandler("getEnvironmentStatus", async ({ ctx }) => {
+  return { ready: false, outputs: {} }
```


Is the hardcoded ready: false intentional?

I believe it is. If I recall correctly, Garden first checks whether the environment is ready and, if it is, does not call prepareEnvironment again.
We need prepareEnvironment to be called every time, so we always report that the environment is not ready.

Worse, if this handler is left out even once, the status is assumed to be ready with caching enabled, and the cached state is written to the filesystem, so the provider will never initialize again unless you run with --force.

We also found that returning disableCache: true from getEnvironmentStatus does not actually disable the cache; the state is still cached.

So we have to return ready: false here and then return disableCache: true from the prepareEnvironment handler.
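To make the caching pitfall concrete, here is a toy model of the flow described in this thread. It is a simplified simulation written from the discussion, not Garden's actual resolver, and the types only approximate the real handler signatures:

```typescript
// Simplified approximation of a provider's environment status.
interface EnvironmentStatus {
  ready: boolean
  outputs: Record<string, unknown>
  disableCache?: boolean
}

interface ProviderHandlers {
  getEnvironmentStatus?: () => Promise<EnvironmentStatus>
  prepareEnvironment: () => Promise<{ status: EnvironmentStatus }>
}

// Stands in for the provider status that gets persisted to disk between runs.
const statusCache = new Map<string, EnvironmentStatus>()

async function resolveProvider(name: string, handlers: ProviderHandlers, force = false): Promise<EnvironmentStatus> {
  const cached = statusCache.get(name)
  if (cached && !force) {
    return cached // prepareEnvironment is skipped entirely
  }
  // A missing status handler is treated as "ready", and that result gets cached.
  const current: EnvironmentStatus = handlers.getEnvironmentStatus
    ? await handlers.getEnvironmentStatus()
    : { ready: true, outputs: {} }
  if (current.ready) {
    statusCache.set(name, current) // disableCache here is ignored, as reported in the thread
    return current
  }
  const { status } = await handlers.prepareEnvironment()
  if (!status.disableCache) {
    statusCache.set(name, status)
  }
  return status
}

let collectorStarts = 0

// The pattern from the PR: always "not ready" here, and disableCache on the prepared status.
const otelCollectorHandlers: ProviderHandlers = {
  getEnvironmentStatus: async () => ({ ready: false, outputs: {} }),
  prepareEnvironment: async () => {
    collectorStarts++ // stand-in for spawning the collector process
    return { status: { ready: true, outputs: {}, disableCache: true } }
  },
}
```

In this model, resolving the otel-collector provider twice spawns the collector twice, while a provider without a status handler never reaches prepareEnvironment at all.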

core/src/plugins/otel-collector/otel-collector.ts

Orzelius · 2023-07-13T08:56:22Z

docs/reference/providers/otel-collector.md

```diff
+
+    exporters:
+      - name:
+
+        enabled:
+
+        verbosity: normal
```


Will the docs come later?

The problem here is that oneOf and anyOf schemas aren't correctly converted into docs yet.

I think we should fix this soon, since it leaves the docs incomplete. But it isn't strictly related to this PR, and once it's fixed the docs for this provider will show up as well, so I decided not to include it here.

shumailxyz · 2023-07-13T10:30:49Z

Great work @TimBeyer 💯

Just read through the PR. Like @Orzelius, I'm wondering about the lack of tests and docs.

TimBeyer · 2023-07-13T14:41:45Z

@shumailxyz @Orzelius I thought about how to test this but couldn't come up with a good way.
If we unit test this, we need to mock almost everything that makes it work in the first place, which leads to weak tests.
Probably e2e tests would be best, but then we'd have to create an OTLP HTTP endpoint and verify that we receive all the data. That is difficult because we send quite a lot of data, and it can change at any time when new spans are added.
The schemas for the config generation could be tested, but they are very simple and the type system already ensures that inputs and outputs are correct.
That leaves the ReconfigurableExporter and the function that finds a string in a child process's output.
I can add tests for those if you prefer.
Also, if you have a good idea for how to test the overall feature without mocking a lot of things, in a way that keeps working even when we change exactly what we trace, I'm happy to take a look at that too.
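The ReconfigurableExporter idea can be sketched as a proxy that buffers spans until its real target is known (once the collector is running) and then forwards everything. This uses a simplified, hypothetical exporter interface rather than the OpenTelemetry SpanExporter API, and the PR's class may differ:

```typescript
// Simplified, hypothetical exporter interface.
interface SimpleExporter<T> {
  export(items: T[]): void
}

// Buffers items until a target exporter is set, then flushes the buffer and forwards directly.
class ReconfigurableExporter<T> implements SimpleExporter<T> {
  private buffer: T[] = []
  private target: SimpleExporter<T> | undefined

  export(items: T[]): void {
    if (this.target) {
      this.target.export(items)
    } else {
      this.buffer.push(...items)
    }
  }

  // Called once the collector is up, e.g. with an OTLP exporter pointed at its port.
  setTarget(target: SimpleExporter<T>): void {
    this.target = target
    if (this.buffer.length > 0) {
      const pending = this.buffer
      this.buffer = []
      target.export(pending)
    }
  }
}
```

A production version would also cap the buffer, since a target may never be set.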

Orzelius · 2023-07-13T14:45:58Z

I think an e2e test is the way to go here. It would serve as a smoke test that catches fatal mistakes which break the functionality completely, and it would double as an example configuration for when we need to do further work on the feature.

We already have some clumsy e2e tests that do a lot, but that's the business we're in.

stefreak · 2023-07-13T16:32:30Z

core/src/util/util.ts

```diff
@@ -90,13 +90,18 @@ export async function shutdown(code?: number) {
      // eslint-disable-next-line no-console
      console.log(getDefaultProfiler().report())
    }
-    process.exit(code)
+    gracefulExit(code)
```


🚀 Nice, that's much better

TimBeyer · 2023-07-14T12:03:46Z

@shumailxyz @Orzelius Added some e2e smoke tests for the feature.
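A smoke test in this spirit only needs to check that traces arrive, not what they contain, which keeps it stable as spans change. A hypothetical sketch of a minimal receiver using Node's http module; /v1/traces is the standard OTLP/HTTP traces path, while startOtlpReceiver and how it is wired into a test run are assumptions, not the PR's actual test code:

```typescript
import { createServer } from "http"
import type { IncomingMessage, ServerResponse } from "http"
import type { AddressInfo } from "net"

interface Receiver {
  port: number
  tracePosts: () => number
  close: () => Promise<void>
}

// Starts a local HTTP server that counts POSTs to the OTLP/HTTP traces path and answers 200.
function startOtlpReceiver(): Promise<Receiver> {
  let count = 0
  const server = createServer((req: IncomingMessage, res: ServerResponse) => {
    req.resume() // drain the body; its contents are deliberately not inspected
    req.on("end", () => {
      if (req.method === "POST" && req.url === "/v1/traces") {
        count++
      }
      // Close each connection so server.close() returns promptly at the end of the test.
      res.writeHead(200, { "Content-Type": "application/json", Connection: "close" })
      res.end("{}")
    })
  })
  return new Promise((resolve) => {
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as AddressInfo
      resolve({
        port,
        tracePosts: () => count,
        close: () => new Promise<void>((done) => server.close(() => done())),
      })
    })
  })
}
```

In a test, an otlphttp exporter would point at http://127.0.0.1:<port>, a Garden command would run, and the assertion would simply be tracePosts() > 0.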

Orzelius

Looks good, but what about docs 👉👈?

Orzelius

@TimBeyer and I decided that the provider reference docs should suffice. This is good to merge from my side. cc @garden-io/core

TimBeyer force-pushed the cloud-otel-integration branch 2 times, most recently from 24dd4ca to 9f26eac on July 7, 2023 13:31

TimBeyer requested a review from a team July 7, 2023 13:32

TimBeyer commented Jul 7, 2023

core/src/util/process.ts (outdated)

Orzelius suggested changes Jul 10, 2023

TimBeyer force-pushed the cloud-otel-integration branch from 9f26eac to 894d1d4 on July 10, 2023 10:44

TimBeyer requested review from Orzelius and a team July 12, 2023 08:18

mkhq and others added 22 commits July 12, 2023 11:48

chore: WIP for an otel-collector provider

082a29d

feat: create exporter that can be configured at a later point and the…

bca0636

…n exports spans

feat: send traces to OTLP exporter once collector is spawned

1abeef4

feat: wait for initialization of collector

478932c

chore: remove redundant createModuleTypes

68261fe

feat: create temporary config if it does not exist

79fa871

wip: start creating config for datadog

7be8710

feat: ensure that async hooks are run when the process is shut down

3ae034b

feat: log logged processes output with silly loglevel

e6f650f

feat: ensure collector has enough time to terminate

03c04ff

feat: wait for datadog export log line explicitly

88ff550

refactor: simplify config generation and get actual hostname

4ec753e

feat: make datadog configurable from the API secrets

7ce0b9d

feat: support multiple concurrent http exporters, refactor config

2ac7013

refactor: extract out base config

30bcc10

feat: trace the otel collector startup times

51eb179

feat: configure exporters in provider config

8722658

feat: timeout for final datadog sync

9c9691c

feat: wait for collector process to terminate

a171e71

feat: add honeycomb support

b0ad45f

fix: bind to localhost, handle empty exporters

d917f1f

fix: otel provider example and added the logging exporter

f0702b3

TimBeyer added 5 commits July 12, 2023 11:48

chore: code formatting

bbd6ffa

docs: update docs

0a5e483

chore: remove two todos

0d1136e

chore: remove projectId from plugin context

a1c3d1d

refactor: move validators into exporter modules and use those for the…

875467b

… config types

TimBeyer force-pushed the cloud-otel-integration branch from acc6b0c to 875467b on July 12, 2023 09:48

TimBeyer added 2 commits July 12, 2023 16:45

feat: always enable tracing by default but no-op when it's not config…

7234d73

…ured

fix: remove console.log statements

366bf0f

Orzelius reviewed Jul 13, 2023

TimBeyer added 2 commits July 13, 2023 11:34

feat: address review comments

de50fcc

fix: make sure when there are no active otel exporters the no-op expo…

e6b5cff

…rter is used

stefreak reviewed Jul 13, 2023

TimBeyer force-pushed the cloud-otel-integration branch 4 times, most recently from 90bd1f7 to 225666f on July 14, 2023 10:33

test: e2e smoke test for otel exporter

72fedd4

TimBeyer force-pushed the cloud-otel-integration branch from 225666f to 72fedd4 on July 14, 2023 10:45

fix: correct hostname for test

5b57c87

TimBeyer requested a review from Orzelius July 14, 2023 13:01

fix: explicitly forward log events from provider instead of having `s…

e90d92d

…treamLogs` do it implicitly

TimBeyer force-pushed the cloud-otel-integration branch from 744bc03 to e90d92d on July 14, 2023 14:26

Orzelius reviewed Jul 17, 2023

Orzelius approved these changes Jul 17, 2023

TimBeyer merged commit 9c44055 into main Jul 17, 2023

TimBeyer deleted the cloud-otel-integration branch July 17, 2023 11:51
