Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: otel collector integration #4769

Merged
merged 37 commits into from
Jul 17, 2023
Merged

feat: otel collector integration #4769

merged 37 commits into from
Jul 17, 2023

Conversation

TimBeyer
Copy link
Contributor

@TimBeyer TimBeyer commented Jul 7, 2023

What this PR does / why we need it:
This PR enables us to locally deploy an otel collector by configuring it as a provider.

The provider once configured finds an open port, creates a configuration file on the fly and then spawns the collector on the local machine. It then reconfigures the existing otel exporter to target the new collector, which then proceeds to export the data to the desired destination.

The collector gracefully shuts down at the end of the garden session after attempting to flush all remaining data.

An example project config could look like this:

apiVersion: garden.io/v0
kind: Project
name: otel-collector-test
environments:
  - name: local
    defaultNamespace: ${var.dev-env-name}
providers:
  - name: local-kubernetes
    environments: [local]
    namespace: ${environment.namespace}
  - name: otel-collector
    exporters:
      - name: otlphttp
        enabled: true
        endpoint: http://localhost:4318
      - name: newrelic
        enabled: true
        apiKey: ${secrets.NR_API_KEY}
      - name: honeycomb
        enabled: true
        apiKey: ${secrets.HONEYCOMB_API_KEY}
        dataset: ${secrets.HONEYCOMB_DATASET}
      - name: datadog
        enabled: true
        site: ${secrets.DD_EXPORTER_SITE}
        apiKey: ${secrets.DD_EXPORTER_API_KEY}
variables:
  dev-env-name: ${project.name}-testing-${local.username}

Currently supported are a generic otlphttp type for any HTTP OTLP endpoint, and configs for newrelic, honeycomb and datadog.

The environment variable GARDEN_ENABLE_TRACING is now true by default, allowing just the override to disable it.
When not configured however, it sets up a "no op" exporter so that we're not leaking memory.
By default no data is sent anywhere without explicit configuration, this is just so that there's still an override for disabling it but there's no need to manually set it to enable the collector provider to work.

If a user would like to override the exporter at the lowest level, skipping the provider based setup and just using an environment variable, that is still possible via setting OTEL_TRACES_EXPORTER based on the options documented in https://github.com/open-telemetry/opentelemetry-js/blob/main/experimental/packages/opentelemetry-sdk-node/README.md#configure-trace-exporter-from-environment

For example to trace to a JSON HTTP OTLP endpoint on localhost, one could use

OTEL_TRACES_EXPORTER=otlp OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/ OTEL_EXPORTER_OTLP_PROTOCOL=http/json garden deploy

In the majority of cases that should not be necessary, but it still does exist in case a user has their own OTEL setup and does not want to deploy an additional collector via the provider.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch 2 times, most recently from 24dd4ca to 9f26eac Compare July 7, 2023 13:31
@TimBeyer TimBeyer requested a review from a team July 7, 2023 13:32
core/src/util/process.ts Outdated Show resolved Hide resolved
Copy link
Contributor

@Orzelius Orzelius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the formatter might have been misconfigured. There are many styling changes which seem to be unrelated and make it hard to review

@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch from 9f26eac to 894d1d4 Compare July 10, 2023 10:44
@TimBeyer TimBeyer requested review from Orzelius and a team July 12, 2023 08:18
@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch from acc6b0c to 875467b Compare July 12, 2023 09:48
Copy link
Contributor

@Orzelius Orzelius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Incredibly clean and easy to read code! I've posted some comments and am wondering about the lack of tests and docs, but other than that it looks sweet. Sorry for the long wait for the review, I just forgot about the pr.

core/src/garden.ts Outdated Show resolved Hide resolved
core/src/plugins/otel-collector/otel-collector.ts Outdated Show resolved Hide resolved
export const provider = gardenPlugin.createProvider({ configSchema: providerConfigSchema, outputsSchema: s.object({}) })

provider.addHandler("getEnvironmentStatus", async ({ ctx }) => {
return { ready: false, outputs: {} }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the hardcoded ready: false intentional?

Copy link
Contributor Author

@TimBeyer TimBeyer Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is. If I recall this correctly, garden would check first if the environment is ready, and if so it would not call prepareEnvironment again.
We need to call prepareEnvironment every time, so we always say it's not ready.

Even worse is that if you just once leave out this hook, the status is assumed to be complete with cache enabled, and the cached state is written to the filesystem making it that without a --force the provider will never initialize again.

We also found that returning disableCache: true in getEnvironmentStatus does not actually disable the cache and will still proceed caching the state.

Thus it is required to return ready: false here and then later on return disableCache: true from the prepareEnvironment handler.

core/src/plugins/otel-collector/otel-collector.ts Outdated Show resolved Hide resolved
Comment on lines +35 to +41

exporters:
- name:

enabled:

verbosity: normal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the docs come later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that oneOf and anyOf schemas aren't correctly converted into docs yet.

I think we should fix this soon since it makes the docs incomplete, but since it's not strictly related and once fixed the docs will show up for this as well, I decided to not make it part of this PR.

@shumailxyz
Copy link
Contributor

Great work @TimBeyer 💯

Just read through the PR. Wondering the same what @Orzelius mentioned about lack of tests and docs.

@TimBeyer
Copy link
Contributor Author

@shumailxyz @Orzelius I thought about how to test this but couldn't come up with a good way.
If we unit test this we need to mock almost everything that makes this work in the first place leading to not very solid tests.
Probably e2e tests would be best, but then we'd have to create an OTLP HTTP endpoint and verify that we get all the data, which is difficult since we're sending quite a lot of data that can also change at any time if new spans are added.
The schemas for the config generation could be tested, but they are very simple and the typesystem already makes sure that inputs and outputs are correct.
That just leaves the ReconfigurableExporter and the function to find a string in a childprocess output.
I can add tests for those if you prefer.
Also if you have a good idea for how to tests the overall feature without having to mock a lot of things and that will always work even if we change the exact things we trace, I'm happy to take a look at that too.

@Orzelius
Copy link
Contributor

I think a e2e test is the way to go here. It would serve as a smoke test for if we make a fatal mistake that would kill the functionality completely and would double as an example configuration for when we need to do further work on the functionality.

We already have some clumsy e2e tests that do a lot, but that's the business we're in

@@ -90,13 +90,18 @@ export async function shutdown(code?: number) {
// eslint-disable-next-line no-console
console.log(getDefaultProfiler().report())
}
process.exit(code)
gracefulExit(code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Nice that's much better

@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch 4 times, most recently from 90bd1f7 to 225666f Compare July 14, 2023 10:33
@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch from 225666f to 72fedd4 Compare July 14, 2023 10:45
@TimBeyer
Copy link
Contributor Author

@shumailxyz @Orzelius Added some e2e smoke tests for the feature.

@TimBeyer TimBeyer requested a review from Orzelius July 14, 2023 13:01
@TimBeyer TimBeyer force-pushed the cloud-otel-integration branch from 744bc03 to e90d92d Compare July 14, 2023 14:26
Copy link
Contributor

@Orzelius Orzelius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but what about docs 👉👈?

Copy link
Contributor

@Orzelius Orzelius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimBeyer and I decided that the provider reference docs should suffice. This is good to merge from my side. cc @garden-io/core

@TimBeyer TimBeyer merged commit 9c44055 into main Jul 17, 2023
@TimBeyer TimBeyer deleted the cloud-otel-integration branch July 17, 2023 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants