
Datadog and ingress-nginx #10082

Closed
dgoffredo opened this issue Jun 13, 2023 · 35 comments
Assignees
Labels
kind/feature (Categorizes issue or PR as related to a new feature) · needs-priority · needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one)

Comments

@dgoffredo (Contributor)

dgoffredo commented Jun 13, 2023

In a recent pull request to update the version of the Datadog tracer used
within ingress-nginx, @esigo pointed out that the project plans to move
away from OpenTracing and towards OpenTelemetry. As a result, OpenTracing-only
tracing libraries, such as Datadog, will no longer be supported.

Datadog would like to continue integrating with ingress-nginx. I can think of
a few ways to do this, and would like to discuss our potential options here.

@dgoffredo added the kind/feature label Jun 13, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and will provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage and needs-priority labels Jun 13, 2023
@longwuyuan (Contributor)

AFAIK, Datadog can consume traces via an OTLP-gRPC receiving collector: https://kubernetes.github.io/ingress-nginx/user-guide/third-party-addons/opentelemetry/

NOTE: While the option is called otlp-collector-host, you will need to point it at any backend that receives OTLP-gRPC.
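For example, the controller ConfigMap might look roughly like this (a sketch only: the Service hostname and namespace are placeholders for your own Datadog Agent deployment, and the Agent must have its OTLP gRPC receiver enabled):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  enable-opentelemetry: "true"
  # Any OTLP-gRPC receiver works here; this hostname assumes a Datadog
  # Agent Service named "datadog-agent" in the "datadog" namespace.
  otlp-collector-host: "datadog-agent.datadog.svc.cluster.local"
  otlp-collector-port: "4317"
```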

/assign @esigo

@dgoffredo (Contributor, Author)

@longwuyuan Yes, that's one option. It even works today.

One shortcoming of that option, though, as I explain in a related issue in the OpenTelemetry project, is that it prevents us from providing some features that are difficult or impossible to express in OpenTelemetry as it currently is. One such feature is trace sampling.

Here are some ways we could go about continuing support for Datadog in the ingress controller:

  1. Do nothing. Use OpenTelemetry to send traces to the Datadog Agent, which is an OTLP collector.
    • This is a degradation in support from a Datadog customer's point of view, as it precludes existing and future Datadog-specific tracing features.
  2. Modify ingress-nginx to have a Datadog-specific tracing mode that uses the Datadog nginx module under the hood.
    • This is a departure from the "just OpenTracing" or "just OpenTelemetry" design.
    • The Datadog nginx module uses different nginx configuration directives than does otel_ngx_module.
    • The benefit is that nginx-datadog is already available and will support all Datadog-specific features.
  3. Integrate Datadog into opentelemetry-cpp in such a way that Datadog-specific features are supported. This idea is being explored in the issue mentioned above.
    • It is likely that this would require modifications to both otel_ngx_module and the ingress controller, as the Datadog tracer would likely involve a different "tracer provider" and possibly an additional source repository.

@longwuyuan (Contributor)

longwuyuan commented Jun 13, 2023

We'll need to wait for @esigo's comments, I guess; and I'm not even a developer.

Option 1 does not seem right, as it is a degradation.
Option 2 does not seem feasible, due to lack of resources and the need to stay lean & mean.
Option 3 seems practical, but I wonder why the controller needs to change, when the module ought to provide what's needed.

Once again, these are just placeholder comments and my opinion; I'm not even a developer.

@longwuyuan (Contributor)

cc @rikatz news on datadog here

@esigo (Member)

esigo commented Jun 13, 2023

> @longwuyuan Yes, that's one option. It even works today. […]

@dgoffredo thanks for the info. Could you please be specific about the Datadog features that OTel doesn't support right now?
I had a glance at the sampler you mentioned. It seems like a similar sampler could be implemented for otel-nginx as well. AFAIK only a tail-based sampler is tricky with otel-cpp, as it requires implementing an exporter as well, but I didn't see such a sampler in the link.

The OpenTracing project is already archived; I can't see why we would need to support an archived project.
Regarding option 3, this has to be discussed with the otel-cpp community. I'd be happy to discuss it if you join one of the otel-cpp community meetings. Generally speaking, though, features that are not in the OTel specification can live only in the contrib repo.

@rikatz (Contributor)

rikatz commented Jun 13, 2023

Removing my ingress-nginx maintainer hat and putting on my "user" hat: I think it would be great if DD integration with OTEL were 100% covered. This would keep me out of a "lock-in" where I must use solution A because my Ingress (or whatever) can only talk to A or B, and migrating would be a breaking change.

Putting back my maintainer hat:
I think that company-specific features are not healthy for the project, because (and please don't take this as criticism; I'm not singling out DD!!):

  • It makes the feature hard for us to test. We would now need a DD environment to verify that changes are compliant with DD. I'm not saying DD cannot offer us one; I know they can. But it is one more thing to be tested.
  • It adds complexity that we may not want in the project. Today we already suffer with Openresty versioning, and we are looking into how to get rid of that problem. Adding a DD-specific module means another thing to be compiled, another possible source of core dumps in Nginx, etc.
  • Usually, once the feature is in, we don't get traction from the company to triage issues related to the feature.
  • We don't know how many users this specific feature covers, and it ends up shadowing the overall project roadmap.
  • We set a precedent of "if you did it for that company, you should do it for us too" with other companies.
  • There is no further contribution from the company to the overall project, which ends up being bad for us (not in general, but when it happens it is really bad).

We could "kind of" isolate this the way we do for cloud providers, where we generate manifests specific to one provider or another based on customizations, but without having a Cloud A or Cloud B specific feature.

I think DD has a great opportunity here, already pointed out by @esigo above: getting the missing feature/integration into OTEL, and IIUC this would be a gain for the whole OTEL community :)

Sorry for being harsh; I don't mean to be at all :) I'm just pointing out, with reasons, that we need to focus the project more on generic features.

@dgoffredo (Contributor, Author)

dgoffredo commented Jun 13, 2023

> thanks for the info. Could you please be specific about the Datadog features that OTel doesn't support right now?
> I had a glance at the sampler you mentioned. It seems like a similar sampler could be implemented for otel-nginx as well. AFAIK only a tail-based sampler is tricky with otel-cpp, as it requires implementing an exporter as well, but I didn't see such a sampler in the link.

@esigo Sampling is a sticky example, but the one with which I'm the most familiar. I can think of three sampling-related features that Datadog provides and for which I am not aware of an OpenTelemetry equivalent. There's a description in the OpenTracing-based version of the Datadog tracer. Briefly:

  • Sampling rates (probability of sending a trace to Datadog) vary dynamically, controlled by the Datadog Agent (collector).
  • "Sampling rules" can be defined that enforce different sampling rates and limits based on the properties of the trace root span.
  • "Span sampling rules" can be defined that select individual spans to be sent to Datadog even when the trace is to be dropped.

There are other features, too, currently not implemented in C++, that we might like to add in the future. Examples include automatic profiling and application security monitoring.
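For concreteness, the second and third of those are configured in dd-trace-cpp via environment variables roughly like the following (the service name, span name, and rates are made up for illustration):

```shell
# Rule-based sampling: keep ~10% of traces whose root span's service
# matches "ingress-nginx" (overrides the Agent-controlled dynamic rates).
export DD_TRACE_SAMPLING_RULES='[{"service": "ingress-nginx", "sample_rate": 0.1}]'

# Span sampling: keep matching spans even when their trace is dropped.
export DD_SPAN_SAMPLING_RULES='[{"service": "ingress-nginx", "name": "http.request", "sample_rate": 1.0}]'
```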

> The OpenTracing project is already archived; I can't see why we would need to support an archived project.

I agree. We wrote dd-trace-cpp precisely so that Datadog tracing could exist without OpenTracing.

@dgoffredo (Contributor, Author)

> I think DD has a great opportunity here, already pointed out by @esigo above: getting the missing feature/integration into OTEL, and IIUC this would be a gain for the whole OTEL community :)
>
> Sorry for being harsh; I don't mean to be at all :) I'm just pointing out, with reasons, that we need to focus the project more on generic features.

@rikatz Perhaps contributing to opentelemetry-cpp is a great opportunity.

One example of a tracing implementation that integrates with OpenTelemetry while adding a lot of its own code is the Event Tracing for Windows (ETW) tracer. It lives in the exporters/ directory, but it is not merely an exporter. It implements Span, Tracer, TracerProvider, and its own sampler.

OpenTelemetry C++ code (i.e. code written in terms of opentelemetry-cpp's public API) can interface with this ETW tracer, but a client program must specify the implementation on startup by installing the specific TracerProvider.

Datadog might be able to integrate with opentelemetry-cpp in a similar way. All code that uses opentelemetry::Span, opentelemetry::Tracer, etc., would not need to change. However, the "owner of main()," such as otel_ngx_module, must allow a way for the particular TracerProvider to be chosen, which is a code change.

In this way, "company-specific features" can be provided entirely within the OpenTelemetry API, but a client must choose that implementation when first setting up tracing.

Pending the relevant discussion with the opentelemetry-cpp maintainers, does adding the "OpenTelemetry provider" degree of freedom to otel_ngx_module and to ingress-nginx sound reasonable to you?

@dgoffredo (Contributor, Author)

dgoffredo commented Jun 22, 2023

Per open-telemetry/opentelemetry-cpp#2196, the opentelemetry-cpp project is on board with the idea of Datadog creating a library that implements the OpenTelemetry API using the OpenTelemetry SDK.

otel_ngx_module could then be modified to have the new library as a dependency, and configuration would determine whether to use the default "tracer provider" or the Datadog provider.

Then ingress-nginx could be modified to have a "tracer provider" configuration option (or whatever we might call it) that gets passed into the nginx module's configuration.

Is this course of action acceptable?

@rikatz (Contributor)

rikatz commented Jun 29, 2023

I agree with the approach :) If this is just a simple OTEL module configuration option after the change, then let's do it!

Thanks for sticking with us @dgoffredo :)

@rikatz (Contributor)

rikatz commented Jun 29, 2023

(and please keep me posted with it)

@dgoffredo (Contributor, Author)

Great, thanks @rikatz. I'll reach out to the otel_ngx_module maintainers and revisit all of this internally too.

@dgoffredo (Contributor, Author)

To summarize:

Datadog can integrate with opentelemetry-cpp by providing a separate library that implements opentelemetry-cpp's API in terms of the Datadog-specific dd-trace-cpp.

Then opentelemetry-cpp-contrib/instrumentation/nginx can be modified to depend on the new library, and optionally use it at runtime if so configured.

Then ingress-nginx can be modified to expose the configuration option that was added to opentelemetry-cpp-contrib/instrumentation/nginx.

You'll hear more from me when this work begins.

Thanks for weighing in!

@rikatz (Contributor)

rikatz commented Aug 24, 2023

@dgoffredo are you folks following https://github.com/nginxinc/nginx-otel ? What are your thoughts on it?

From a project perspective, it would be great to use official nginx pre-compiled modules. Let me know how DD integration is going on it, and whether you folks are in some conversation with F5 as well :)

@dgoffredo (Contributor, Author)

I wasn't aware of nginx-otel until you mentioned it, thanks for the heads up.

My initial observations are:

  1. It's nice and small, which is refreshing.
  2. It doesn't really use opentelemetry-cpp. It depends on opentelemetry-cpp's gRPC proto definitions and some utilities for generating trace IDs, but otherwise the module looks like it tracks span information manually and then sends it directly over gRPC. This means that my plan of implementing an opentelemetry::trace::TracerProvider would be of no use in nginx-otel, because nginx-otel does not use a Tracer at all.

> From a project perspective, it would be great to use official nginx pre-compiled modules

Do you plan to move to using nginx-otel instead of otel_ngx_module?

> let me know how DD integration is going

I'm working on it, though recently we've been onboarding a new team member, so I've taken a short break from it. To reiterate the tentative overall plan:

  • Create a dd-trace-cpp based TracerProvider implementation in a new library "dd-opentelemetry-cpp".
  • Modify otel_ngx_module to have a "provider" configuration argument, e.g. with values "default" and "datadog".
  • Modify ingress-nginx to allow the specification of the nginx module's "datadog" provider configuration option.
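In nginx configuration terms, the end state might look roughly like this (a sketch only: the `opentelemetry_provider` directive is hypothetical and does not exist in otel_ngx_module today):

```nginx
# Existing otel_ngx_module directive (loads exporter/processor settings):
opentelemetry_config /etc/nginx/opentelemetry.toml;

# Hypothetical new directive proposed in this thread; the name and its
# values are illustrative only:
opentelemetry_provider datadog;  # or "default"
```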

> [let me know if] you folks are in some conversation with F5 as well

Nope. We met with F5 a number of months ago to get a feel for what a partnership might hold. Both sides seemed open to the idea, but there was nothing to work on together at the time.

@dgoffredo reopened this Aug 24, 2023
@rikatz (Contributor)

rikatz commented Oct 10, 2023

@dgoffredo sorry for the delay, missed the notification of this.

Yes, the plan would be to use nginx-otel, as this would reduce our compilation scope a lot, including the fact that, if in the future we decide to move back to packaged NGINX, this will probably already be a dynamic module that we can just use.

Additionally, this module does not use curl or other libraries that we've been wanting to remove from the controller. Is it possible that this module could be extended with the F5 folks to support Datadog? I know they may have a lot of customer cases for DD + NGINX Plus as well.

Thanks!!

@rikatz (Contributor)

rikatz commented Oct 12, 2023

I was testing nginx-otel compilation here, but it also needs opentelemetry-cpp @esigo :)

So I think for the experimental image I will drop everything else and leave the opentelemetry image as Ehsan has built it; then we can discuss what's next.

This means that as soon as the Datadog features are also supported by OTEL contrib, we can use it.

@dgoffredo (Contributor, Author)

> I was testing nginx-otel compilation here, but it also needs opentelemetry-cpp

Ah, I missed that! Here I was trying to figure out our other options. I'll have another look at the source of nginx-otel to see whether the TracerProvider scheme could be massaged into it.

@dgoffredo (Contributor, Author)

dgoffredo commented Oct 12, 2023

As far as I can see, nginx-otel does not use any of the types opentelemetry::trace::TracerProvider, opentelemetry::trace::Tracer, or opentelemetry::trace::Span.

Rather than use opentelemetry-cpp directly, nginx-otel dips into opentelemetry-cpp for some concrete details (generating an ID, proto definitions) but otherwise contains a complete tracer implementation separate from anything in opentelemetry-cpp.

> leave the opentelemetry image as Ehsan has built it; then we can discuss what's next

My main reason for implementing the interfaces from opentelemetry-cpp is to contribute the implementation to opentelemetry-cpp-contrib's otel_ngx_module, which ingress-nginx can pass configuration to (e.g. provider = "datadog"). I'm still working on the implementation.

If/when ingress-nginx moves to nginx-otel (for the reasons @rikatz described in previous comments), I'd have to find another approach. At that point one option would be to give up on Datadog-specific code built into ingress-nginx. For example, nginx-otel has an interesting sampling feature that Datadog customers might be able to use in lieu of dd-trace-cpp's DD_TRACE_SAMPLING_RULES or nginx-datadog's datadog_sample_rate.

I'm wondering, if I continue on the opentelemetry-cpp based path, how soon (if ever) we'll have to move to something else.

@dgoffredo (Contributor, Author)

Hi again, @rikatz. I see that OpenTracing (and thus Datadog) was removed from ingress-nginx in #10615.

In order to determine how to support Datadog customers using ingress-nginx, I'd like to know which OpenTelemetry nginx module this project plans to use: opentelemetry-cpp-contrib's otel_ngx_module, or F5's nginx-otel.

@rikatz (Contributor)

rikatz commented Nov 6, 2023

As there is no advantage in using the F5 one, my vote is to keep using otel-contrib.

@rikatz (Contributor)

rikatz commented Feb 28, 2024

/close

Closing it for now, we are sticking with OTEL only as the solution for v1.10

@dgoffredo let me know once we have some movement on otel side to support the specifics of Datadog :)

@k8s-ci-robot (Contributor)

@rikatz: Closing this issue.

In response to this:

> /close
>
> Closing it for now, we are sticking with OTEL only as the solution for v1.10
>
> @dgoffredo let me know once we have some movement on otel side to support the specifics of Datadog :)


@dgoffredo (Contributor, Author)

@rikatz Will do. We have some shifting priorities but will revisit this soon.

@michael-mader

michael-mader commented Feb 29, 2024

Do I get it right that, when updating to helm chart 4.10.0 and ingress-nginx 1.10.0, the Datadog APM (tracing) integration currently does not work like before and lacks a lot of features?
@dgoffredo do you have any migration hint for coming from open-tracing: "true" in 4.9.1? Is it enough to switch to open-telemetry: "true"? After reading the docs, I am not sure whether datadog-collector-host is still used or whether it only applies to OpenTracing 😕

@dgoffredo (Contributor, Author)

@michael-mader We (Datadog) need to update our public documentation so that it either advises use of OpenTelemetry or describes some other interim solution, until a Datadog-specific OpenTelemetry configuration option is available.

It would have been better for us to do this before the most recent release, but here we are.

My teammate or I will update this GitHub issue when the new documentation is available. For now, you can use ingress-nginx's OpenTelemetry feature and point it at the Datadog Agent's OTLP collector interface.
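In the meantime, the Agent's OTLP gRPC interface can be enabled with configuration along these lines (a sketch of the datadog.yaml shape; check Datadog's documentation for your Agent version):

```yaml
# datadog.yaml (or equivalent Helm values): enable the Agent's
# OTLP gRPC receiver on the conventional port 4317.
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
```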

@flavioelawi

@dgoffredo will this conflict with trace sampling rules?
In this piece of Datadog documentation, you mention that the sampling rate is not applied to non-Datadog tracing libraries. I think you have to call that out explicitly in the documentation you linked as well.

@dgoffredo (Contributor, Author)

Datadog trace sampling rules, i.e. those configured by the DD_TRACE_SAMPLING_RULES environment variable, are not supported by OpenTelemetry. Until Datadog has a programmatic integration within ingress-nginx, trace sampling rules will not work. I'll think of how best to convey that in Datadog's documentation.

At least for the most common use case, though, OpenTelemetry's otel-sampler-ratio is adequate to configure sampling. It's equivalent to datadog-sample-rate, which is equivalent to having one sampling rule that matches every trace.
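For the common case, a controller ConfigMap sketch (assuming ingress-nginx's documented otel-sampler* keys) might look like:

```yaml
# Keep ~10% of traces; roughly analogous to datadog-sample-rate: "0.10".
otel-sampler: "TraceIdRatioBased"
otel-sampler-ratio: "0.10"
otel-sampler-parent-based: "true"
```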

@dmehala, my teammate, for visibility.

@mscrivo

mscrivo commented Jun 18, 2024

For lack of a better place to put this, and to help others who might run into the same issues, I thought I'd call out a few problems we hit while trying to upgrade from 1.9.x to 1.10.1:

  • This documentation is not quite right. You actually need to set the OTEL_EXPORTER_OTLP_ENDPOINT env var with the URL, as per this comment.
  • Setting opentelemetry-operation-name seems to have no effect. Our operation names in the traces only have the default request_method.
  • Cannot find a way to set the environment. Previously, you could set datadog-environment to whatever you wanted. Now there is no such var, so I thought I could set DD_ENV on the actual controller deployment to see if it would pick that up, but no luck. All our ingress-nginx traces have no env set, which is a deal breaker.

@dmehala (Contributor)

dmehala commented Jun 18, 2024

Hi @mscrivo

Thank you for sharing your feedback and experiences.

I am taking responsibility for updating the public documentation and for raising the opentelemetry-operation-name issue with our team responsible for OTel interoperability.

We are currently investigating the possibility of reintroducing the Datadog tracer to ingress-nginx to address these concerns. In the meantime, I recommend submitting a feature request here for visibility.

@kdeyko

kdeyko commented Sep 2, 2024

@mscrivo

> • Cannot find a way to set the environment

I managed to pass the environment by setting the OTEL_RESOURCE_ATTRIBUTES env variable (mentioned here). Here's an excerpt from my Helm values:

---
controller:
  config:
    main-snippet: |
      env OTEL_RESOURCE_ATTRIBUTES;
  extraEnvs:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(HOST_IP):4317"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"

@kdeyko

kdeyko commented Oct 11, 2024

@dmehala any update on the opentelemetry-operation-name option?

@kdeyko

kdeyko commented Oct 25, 2024

An update to my previous comment: it turns out it works even this way:

controller:
  config:
    main-snippet: |
      env OTEL_RESOURCE_ATTRIBUTES=env=production;
