Environment-based sampling rules are not being honored vs. the OpenTracing implementation #92
Hi, @Smeb. The bug where Envoy's Datadog tracer does not set the "error" property on spans that Envoy considers errors was fixed in a recent set of PRs to Envoy (main, v1.27, v1.28). The fix has not yet found its way into Istio.

None of the Envoy/Istio Datadog bugs that customers have been encountering since v1.27 are due to dd-trace-cpp. Instead, they are bugs that I introduced to Envoy itself while rewriting Envoy's Datadog extension to use dd-trace-cpp (as opposed to the older OpenTracing-based dd-opentracing-cpp). I don't think that it's appropriate to backport the latest release of dd-trace-cpp onto Envoy's release branches; those branches can continue to use whichever versions of dd-trace-cpp they already use. The bugs are in Envoy's source tree, and their fixes have all been backported already.

What's new to me is how your trace sampling went to 100%. The only sampling-related bug that I'm aware of in Envoy v1.27 could cause fewer traces to be sampled, but never more. This sampling-related change that you've encountered is best debugged with your system's complete Istio configuration in hand (whatever that might mean). You can open a Support ticket, which will likely get escalated to me, and/or we can investigate the sampling issue here on GitHub.

My initial goal will be to get an Istio configuration that allows me to reproduce the unexpected 100% sampling behavior. For this, the following from you would be helpful: complete YAML files, Helm charts, etc. If any of this information is sensitive, then we can instead carry on in Datadog's Support system (linked above).
I can gather that information tomorrow; if it helps, here are some relevant details in the meantime.
With proxy debug logging enabled for tracing, this is the configuration:
Some notes on the above: I also tried setting span sampling, but that didn't help, unfortunately.
With the nightly, the error issue appears to be fixed, which is great. Tomorrow I will try to get this running/reproduced in kind, and then I can get you some charts.
Update: the behaviour when setting traces per second is a bit odd. In the ingestion dashboard, Datadog reports 180 traces ingested per second (it should be 3). In APM, there are around 5,000 spans over 30 minutes, which is roughly what you would expect (3 per second over 1,800 seconds is about 5,400). In this case I suppose I should trust the ingestion dashboard, since I believe that's what we get billed for.

Update to the update: I expect the indexed span count is just lagging. It now shows 750,000 spans.
Thanks for the info and explanation, @Smeb. I'll be unavailable until Monday next week. Here are some thoughts until then. There are multiple layers of sampling configuration involved.
The question is: why are you now seeing 100% of traces in the Datadog UI, when previously you saw only 10%? It might be helpful to pursue this in Datadog's Support system, as our Support representatives can see what you are seeing in the Datadog UI, which would help to diagnose the issue.

For now, it would be helpful to have the complete configuration of Istio in your system. For example, are you using an

Again, the goal is for me to have a reproduction of the issue on my test Kubernetes cluster.
Small follow-up: we're working on getting a minimal configuration we can share. In parallel we're reaching out to Support, so hopefully we'll have something to send you soon. We can't share the whole current configuration here as-is, and I think it makes sense to try to find a minimal reproduction anyway, since that should help pinpoint the issue.
Bringing some news (but no configs). I believe this should help narrow down the problem, though.
I'm continuing to trim our config, but this seems to narrow it down. I'll try with a no-op Lua filter (sketched below) and see if that shows the same behaviour.
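For clarity, by a "no-op Lua filter" I mean handlers that are present but do nothing, roughly like the following sketch (illustrative only, not our actual filter):

```lua
-- Deliberately empty Lua filter: both handlers are defined but take no
-- action, so any change in sampling behaviour would come from the mere
-- presence of the filter rather than from anything it does.
function envoy_on_request(request_handle)
  -- intentionally empty
end

function envoy_on_response(response_handle)
  -- intentionally empty
end
```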
Thanks for the update, @Smeb. I'm unfamiliar with Envoy's Lua extension and how it might interact with tracing and with Istio; that will have to be part of my attempted reproduction once you have a configuration for me. I appreciate your putting in the effort to produce a minimal example; that's very helpful.
@Smeb I think that you have indeed found the root cause. It's another difference in behavior in Envoy's Datadog tracing extension between Envoy <= 1.26 and Envoy >= 1.27. In the old OpenTracing-based code, "overriding" the sampling decision looked like this:
In the two scenarios above, Lua setting the sampling decision would do nothing.

In the new dd-trace-cpp-based code, overriding the sampling decision looks like this:

The new code performs an unconditional override of the sampling decision, even if one has been made previously. In the two scenarios above, Lua setting the sampling decision could override the configured sampling behavior.

I'm not sure what is best to do about this. More information about your setup would still be helpful. One thing that I could do is alter setSampled so that it ignores its argument whenever a sampling decision has already been made. I could then backport that change onto Envoy v1.27, v1.28, and v1.29, and you'd have to either wait for Istio to pick up the change or use a non-release version of Istio's Envoy-based proxy. Or, perhaps the new behavior is preferable, and there's a way that you can prevent the Lua extension from modifying the sampling decision. What do you think?
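To make the suspected path concrete, here is a minimal sketch of the kind of Lua filter involved; it is not your filter, and the cluster name, path, and authority are invented. The relevant point, as I understand it, is that httpCall is where the Lua extension ends up setting a sampling decision, which the new Datadog extension treats as an unconditional override:

```lua
-- Hypothetical filter that makes an outbound HTTP call from Lua. Under the
-- new dd-trace-cpp-based Datadog extension, the sampling decision set by the
-- Lua extension for this call overrides whatever decision the configured
-- sampling rules had already made.
function envoy_on_request(request_handle)
  local response_headers, response_body = request_handle:httpCall(
    "internal_lookup_cluster",            -- invented cluster name
    {
      [":method"] = "GET",
      [":path"] = "/lookup",
      [":authority"] = "lookup-service"   -- invented authority
    },
    "",                                   -- empty request body
    1000)                                 -- timeout in milliseconds
  request_handle:logInfo("lookup returned status " ..
                         (response_headers[":status"] or "unknown"))
end
```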
If this is the root cause, awesome! Re: our setup, I'm currently experimenting with changing the Envoy Lua example to also use Datadog tracing; this seemed easier than trying to trim down our config, which uses the Istio operator, sidecars, et al. Really we just want the Envoy bit. In our configuration, though, I did find the following:
This is the Istio patch logic:
The

Regarding the behaviour, we would prefer the old behaviour, since I don't see an obvious way to configure the tracing behaviour itself in these docs.
I haven't tested this yet, but perhaps this is what's happening:
So, it looks like the documented contract for
The question is still whether I should change Envoy's Datadog tracing extension back to the old OpenTracing behavior, or whether I should take this up with the maintainers of Envoy. Perhaps either

Maybe it's best for me to submit a PR to update the contract of Span::setSampled and see what the Envoy maintainers think.

In the meantime, there's the question of what to do about the sampling in your system. Even if you explicitly specify a sampling decision in Lua, there does not appear to be a way to say "use whatever decision has already been made." If there were a way to retrieve the current sampling decision in Lua, then you could use that as a workaround until the Datadog/Envoy/sampling question is decided, but I don't see a way to do that.

Perhaps it's best to restore the old behavior in Envoy's Datadog tracing extension and leave the larger question for another time. Any ideas?
I just talked through this with my teammate, and we have an idea. Instead of changing the Datadog extension's setSampled behavior, we could add an option to the Lua extension's httpCall that controls the sampling of the call, for example:
```lua
resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", {timeout_ms=1000, trace_sampled=nil})
```

I still would need to address the underlying contractual issue with the Envoy maintainers, but this is a workaround that you and others could use, and it minimizes any possible unintended side effects.
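To be explicit about how that would look in a full filter: the trace_sampled option is only a proposal at this point (it does not exist in any Envoy release), so treat the name and semantics as tentative. In Lua, assigning nil to a table key is the same as omitting it, so the intent is that an absent trace_sampled would mean "leave the sampling decision alone":

```lua
-- Sketch of the proposed workaround. The options table replaces the plain
-- timeout argument; trace_sampled = nil (i.e. absent) would tell the Lua
-- extension not to force a sampling decision for this call. Cluster name,
-- path, and authority are invented for illustration.
function envoy_on_request(request_handle)
  local resp_headers, _ = request_handle:httpCall(
    "internal_lookup_cluster",
    {
      [":method"] = "GET",
      [":path"] = "/lookup",
      [":authority"] = "lookup-service"
    },
    "",
    { timeout_ms = 1000, trace_sampled = nil })
end
```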
Sorry for the delay in replying (timezones). My suggestion would be to restore the old behaviour, and then discuss with the Envoy maintainers the possibility of updating the Lua tracing behaviour in the Lua filter in the future. I think the workaround comes with a few issues:
That said, I do agree that the current usage is problematic and worth addressing in the long term. Also, thanks for the investigative work; all the code links really help clarify the source of the issue.
Yes, I think I'll restore the old behavior in Envoy's Datadog extension, and open a separate issue in Envoy about the intended behavior of Span::setSampled.
Hi, just to follow up: is there a rough ETA on reverting the behaviour? Happy to help expedite this if there's any way we can support.
I'll throw some code at it today. Will let you know if I expect a delay after that.
Hi @Smeb, sorry for the delay. I'll take care of @dgoffredo's work. I'm on PTO now, but I'll make it my top priority when I'm back the week of April 8th. Thank you for your patience.
Thanks for the follow-up 👍
Hi, we upgraded to Istio 1.19.5, and thereby to dd-trace-cpp 0.1.8. We noticed the following changes:
This is for an Istio gateway service running Istio 1.19.5, Envoy 1.27, dd-trace-cpp 0.1.8.
We're trying to mitigate the second issue, but so far the environment variables don't seem to have an effect.
Originally, we had the following environment variable set:
This setting comes from here. With the OpenTracing-based tracer, this meant the agent was controlling the sample rate.
With the new library, this does not seem to happen.
We've also tried the following configurations (without success):
Unfortunately, none of them had an effect.
The one thing that did work was setting `DD_TRACE_ENABLED=false`, which, as expected, disabled tracing (showing that at least that environment variable is being used).

Other (maybe) relevant configuration: we have Istio tracing set to 100%, since the Datadog configuration is expected to handle dropping traces.