Only count time spent in queue as part of the `http_server.queue` span #2591

agrobbin · 2023-01-31T19:50:50Z

What does this PR do?

This is an attempt at resolving the inconsistency raised in #2064 between http_server.queue's duration and real world server queuing time by introducing an outer http_server.request span, which contains http_server.queue and rack.request as sibling spans, rather than the current parent/child relationship that exists between the 2 of them.

Motivation

If you have a request that spends 10ms in queue, and 5s being executed, you currently end up with these 2 spans in APM:

http_server.queue (duration: 5.01s)
- rack.request (duration: 5s)

However, that isn't really semantically correct. The queuing itself did not take 5.01s, the entire lifecycle of the request took 5.01s, with queueing making up 10ms of that time.

After this change, you should end up with something like this:

http_server.request (duration: 5.01s)
- http_server.queue (duration: 10ms)
- rack.request (duration: 5s)

I'd love to get some general feedback on whether this kind of change is reasonable, as it is a breaking change (the definition of http_server.queue would change for anyone who depends on it). I feel that it is correcting a misleading and confusing APM span, but I'm not the one who maintains this library, or fields inbound customer support requests!

delner · 2023-02-02T21:10:07Z

Hey @agrobbin! Thanks for sharing this!

Generally speaking I really like the idea here. The trace it produces seems more realistic, and it addresses a practical issue.

The only a few (but not insurmountable) reservations that I have are:

The artificiality of the new http_server.request span. What service does it belong to? It bridges two actual processes (a proxy server and the Rack process). It seems like the span is basically analogous to "the trace", as it's just a container for all the spans generated within the trace. (Or maybe it's more than that, but that's just my perception).
Changing root spans is more or less a breaking change. It will affect the monitors/metrics/dashboards of users who use this queuing feature.
Following from (2), 2.0 isn't on the docket yet. How do we incorporate such a breaking change, if we were to merge it tomorrow? Feature flags are an option, but if there's a better idea, I'd like to consider that before adding in another configuration option that would eventually be deprecated/removed.
Additionally, on the Datadog side (where we tend to treat root spans as special) we will want to double check this won't have side effects that need to be handled elsewhere, and that it fits with the general pattern for how we want instrumentation to work.

My overall disposition is that I'm inclined to adopt this new behavior; it's just a matter of addressing the logistics stated above. We'll do our part to start a discussion internally, but I'm interested in hearing your thoughts to those points.

agrobbin · 2023-02-02T21:29:59Z

Thanks for the thoughts @delner! I'm excited at the prospect of getting this included! I've responded to each of your questions below.

The artificiality of the new http_server.request span. What service does it belong to? It bridges two actual processes (a proxy server and the Rack process). It seems like the span is basically analogous to "the trace", as it's just a container for all the spans generated within the trace. (Or maybe it's more than that, but that's just my perception).

I definitely had a similar concern originally, but what led me here is that we need some root span, and conceptually, the overarching HTTP request is that "root thing". I didn't consider it totally surprising to be within the same service as the existing http_server.queue, as it mapped in my head to the same high-level system (even if technically the queuing time can be spent at the HTTP server-level and also potentially at the load balancer-level). I'm open to an alternative, though!

Changing root spans is more or less a breaking change. It will affect the monitors/metrics/dashboards of users who use this queuing feature.

Following from (2), 2.0 isn't on the docket yet. How do we incorporate such a breaking change, if we were to merge it tomorrow? Feature flags are an option, but if there's a better idea, I'd like to consider that before adding in another configuration option that would eventually be deprecated/removed.

I completely agree this is breaking, and I imagine this is something currently depended on by a bunch of users.

I do lean toward a configuration option, but maybe request_queuing switches from a boolean to an enumerated config of :inclusive and :exclusive), and the current true maps to :inclusive until 2.0? What are your thoughts? Maintaining backwards compatibility here is not a massive lift, even past 2.0 (where just the meaning of true could change, but inclusive is still a valid option). Let me know what you're thinking!

Additionally, on the Datadog side (where we tend to treat root spans as special) we will want to double check this won't have side effects that need to be handled elsewhere, and that it fits with the general pattern for how we want instrumentation to work.

I'll defer to you on this for sure, but if there is anything I can do in terms of experimentation or testing here, let me know and I can set that up within my company's non-production systems.

delner · 2023-02-02T22:13:15Z

I think the model proposed is reasonable. I just need to reconcile with our plans for consistency with other Datadog instrumentation in other Datadog tracers.

I do lean toward a configuration option, but maybe request_queuing switches from a boolean to an enumerated config of :inclusive and :exclusive, and the current true maps to :inclusive until 2.0?

Maybe slightly different names, but I think this is a clever solution. Satisfies my concerns. Probably the way to go!

Give me a little time to try to shake the tree for some internal feedback, but if I don't get back to you quickly, please feel free to check back in and give me a nudge. In the case of non-decision, I think we adopt this under feature flag (non-default behavior). In the interim, if you can update the PR with the above configuration behavior (and tests/documentation to match) then that will help expedite things.

agrobbin · 2023-02-03T01:38:28Z

@delner I think I've got tests passing (excluding a seemingly unrelated failure on Ruby 2.7) with the new configuration option functionality.

I went with :include_request and :exclude_request as the potential configurations, though I'm not at all attached to those if you prefer something else. I've made sure that true keeps the existing behavior (the same as :include_request). The documentation has also been updated. It seems reasonable to me that in the next major release, :exclude_request becomes the default, but that's definitely not my call!

I look forward to hearing what you find on your internal-to-DataDog side of things!

delner

Looking really good thus far. Added some suggestions for documentation and tagging these spans.

delner · 2023-02-03T14:19:27Z

integration/apps/rails-seven/config/initializers/datadog.rb

@@ -11,7 +11,7 @@
  if Datadog::DemoEnv.feature?('tracing')
    c.tracing.analytics.enabled = true if Datadog::DemoEnv.feature?('analytics')

-    c.tracing.instrument :rails, request_queuing: true
+    c.tracing.instrument :rails, request_queuing: :exclude_request


I wonder if this should be the default for the Rails integration app? Somewhat indifferent.

Need to think more about what these settings mean to an integration app: its not really meant to test all the different flags, as much as it is meant to test holistic stability of Rails 7 running typical ddtrace deployments.

I'll happily defer to you here @delner, I did a grep for request_queuing to make sure I didn't miss anything, but will adjust as you see fit.

docs/GettingStarted.md

lib/datadog/tracing/contrib/rack/middlewares.rb

delner · 2023-02-03T15:00:12Z

So far it doesn't look like we have a convention for HTTP queue spans. The existing convention defines {protocol}.{client|server}.request. This to me looks like it's {protocol}.{kind}.{operation}. Based on this, my inclination is span name should be:

"proxy http server request" --> http.proxy.request (new)
"proxy http server queue" --> http.proxy.queue (instead of http_server.queue)
"application http server request" --> http.server.request (instead of rack.request)

However, considering that we're trying to limit breaking changes (even under this feature flag), perhaps we can defer the rename of the Rack span until the next major version (2.0). This would give us:

"proxy http server request" --> http.proxy.request (new)
"proxy http server queue" --> http.proxy.queue (instead of http_server.queue)
"application http server request" --> rack.request (existing)

Will double check internally to see if there's any objections. Let me know what you think as well.

agrobbin · 2023-02-03T16:04:38Z

@delner just to be clear, renaming the existing http_server.queue span to http.proxy.queue will likely have similar breaking effects for existing users who depend on this in monitors, metrics, and other places. Are you suggesting we make that change pre-2.0, or are you suggesting that when request_queuing is set to :include_request, we continue to use http_server.queue, and only use http.proxy.queue when configured as :exclude_request?

delner · 2023-02-06T18:50:52Z

Breaking change renaming span duly noted. If it doesn't complicate things too much, then ideally :exclude_request reflects the desired span names, whereas default behavior reflects the existing span names. That way if you enable the new behavior, you just make all the breaking change updates accordingly, if you don't you don't need to make any.

If possible, I would add comments (such as a // DEV) that notes that the existing naming/style (:include_request) is deprecated and should be removed in 2.0, whenever that comes in the future. This is both for users and for maintainers, so we don't forget to change this later.

agrobbin · 2023-02-07T02:27:27Z

@delner I just updated this based on the above discussion!

1 last question... would you like there to a user-facing deprecation warning (via Datadog.logger.warn) when this is set to true, to highlight this for end-users? I included a code comment, but right now don't have any actual log output.

delner · 2023-02-08T03:30:40Z

would you like there to be a user-facing deprecation warning

Not something that we've really put into practice but seems reasonable. Only case where I see this problematic is when an application is reconfigured more than once... might spam more than once then. Still, worth trying I think.

ivoanjo · 2023-02-08T08:32:29Z

Not something that we've really put into practice but seems reasonable. Only case where I see this problematic is when an application is reconfigured more than once... might spam more than once then. Still, worth trying I think.

We actually have a helper to solve exactly that issue! Example of it being used in such a manner:

dd-trace-rb/lib/datadog/core/metrics/client.rb

Lines 185 to 194 in eadd5e1

    
           IGNORED_STATSD_ONLY_ONCE = Utils::OnlyOnce.new 
        
           private_constant :IGNORED_STATSD_ONLY_ONCE 
        
           def ignored_statsd_warning 
        
             IGNORED_STATSD_ONLY_ONCE.run do 
        
               Datadog.logger.warn( 
        
                 'Ignoring user-supplied statsd instance as currently-installed version of dogstastd-ruby is incompatible. ' \ 
        
                 "To fix this, ensure that you have `gem 'dogstatsd-ruby', '~> 5.3'` on your Gemfile or gems.rb file." 
        
               ) 
        
             end

agrobbin · 2023-02-08T13:50:29Z

@delner @ivoanjo I just added a deprecation warning for true and :include_request, and noted it in docs/GettingStarted.md! I did it in the Patcher, which I don't think can be called multiple times, so I didn't include the Utils::OnlyOnce logic, but if I am missing something, let me know and I can add that.

delner · 2023-02-08T15:59:50Z

@ivoanjo Right, forgot about OnlyOnce. Thanks for that.

Yeah, Patcher has its own OnlyOnce so you shouldn't have to worry about that. I just want to wrap up some discussion on my end, and do a final pass, but otherwise should be good to go.

delner · 2023-02-10T18:55:44Z

@agrobbin I finished my internal discussion; sounds like we're good to go on this.

Only caveat is that we want to mark this as "experimental" for the time being. Internally, we're interested in creating consistency across HTTP instrumentation as much as possible, and we may need to rename these spans in the future to reflect that consistent naming. Might happen, might never happen. But the trace structure itself is something we intend to keep for the foreseeable future, so nothing to worry about there.

Can you make a note in the documentation (under the "request queuing" section) that this is "experimental"? Once that's in place, I'd be very happy to submit my approval on this. :)

agrobbin · 2023-02-11T14:12:12Z

@delner if this is experimental, should the user-facing deprecation be removed? It seems a little confusing to end-users to introduce an experimental feature that simultaneously deprecates the old behavior.

delner · 2023-02-11T21:43:02Z

@agrobbin Yeah, I think that makes sense, given this change in direction. We can always re-add the deprecation later if we know its what we will break in a future major version. Sorry for the confusion!

agrobbin · 2023-02-12T01:24:18Z

No need to apologize! I've just made the updates. Let me know if there is anything else you see that you'd like to change.

lib/datadog/tracing/contrib/rack/middlewares.rb

marcotc · 2023-02-13T23:22:10Z

@agrobbin, it all looks great, thank you so much!
I'm just back from an extended leave, so I'm pick up where our team left off here.

I only left one small comment; save that this PR is ready to go.

agrobbin · 2023-02-14T00:13:19Z

@marcotc updated to mark the http.proxy.queue span as measured!

marcotc · 2023-02-15T21:21:09Z

@agrobbin, that looks good, could I trouble you to add a simple test asserting that the span is marked as "measured"? I promise that's the last of it! :)

If you have a request that spends 10ms in queue, and 5s being executed, you currently end up with these 2 spans in APM: * `http_server.queue` (duration: 5.01s) * `rack.request` (duration: 5s) However, that isn't really semantically correct. The queuing itself did not take 5.01s, the entire lifecycle of the request took 5.01s, with queueing making up 10ms of that time. This is an attempt at resolving that inconsistency by introducing an outer `http.proxy.request` span, which contains `http.proxy.queue` and `rack.request` as sibling spans, rather than the current parent/child relationship that exists between the 2 of them. It has been introduced as an opt-in feature, by setting `request_queueing: :exclude_request`.

agrobbin · 2023-02-15T21:42:47Z

@marcotc whoops, sorry about that! I could've sworn I included one. Should be all set now.

marcotc

Thank you so much, @agrobbin!

TonyCTHsu · 2024-05-29T08:40:54Z

👋 @agrobbin , we are on the verge of release 2.0 !

This release we have made this as the default behavior when enabling request_queueing. Would you like to give it a try by upgrading to 2.0.0.rc1 and give us feedback?

gem 'datadog', '>= 2.0.0.rc1'

agrobbin · 2024-05-29T12:58:59Z

That's great! We'll try and test it before you get to the official v2.0 release.

github-actions bot added integrations Involves tracing integrations tracing labels Jan 31, 2023

agrobbin force-pushed the rack-http-server-queue-duration branch from cc5e411 to b9f3860 Compare January 31, 2023 20:10

agrobbin marked this pull request as ready for review January 31, 2023 20:11

agrobbin requested a review from a team January 31, 2023 20:11

agrobbin force-pushed the rack-http-server-queue-duration branch 3 times, most recently from 15b548b to 55f6fb8 Compare January 31, 2023 20:32

delner added the community Was opened by a community member label Feb 2, 2023

delner assigned agrobbin Feb 2, 2023

delner self-requested a review February 2, 2023 21:13

agrobbin force-pushed the rack-http-server-queue-duration branch 5 times, most recently from 1a72d48 to 1e9f26c Compare February 3, 2023 01:23

delner reviewed Feb 3, 2023

View reviewed changes

agrobbin force-pushed the rack-http-server-queue-duration branch 4 times, most recently from 7731de5 to 2998aa8 Compare February 7, 2023 02:13

agrobbin force-pushed the rack-http-server-queue-duration branch 2 times, most recently from 4332705 to f0d8a19 Compare February 8, 2023 13:31

agrobbin force-pushed the rack-http-server-queue-duration branch from f0d8a19 to 27af4a4 Compare February 12, 2023 01:06

marcotc reviewed Feb 13, 2023

View reviewed changes

lib/datadog/tracing/contrib/rack/middlewares.rb Outdated Show resolved Hide resolved

agrobbin force-pushed the rack-http-server-queue-duration branch from 27af4a4 to 0a57883 Compare February 14, 2023 00:09

agrobbin force-pushed the rack-http-server-queue-duration branch from 0a57883 to 2e27f76 Compare February 15, 2023 21:32

marcotc approved these changes Feb 15, 2023

View reviewed changes

marcotc merged commit 3545138 into DataDog:master Feb 16, 2023

github-actions bot added this to the 1.10.0 milestone Feb 16, 2023

agrobbin mentioned this pull request Feb 16, 2023

HTTP request queuing duration #2064

Closed

TonyCTHsu mentioned this pull request Mar 6, 2023

Bump to version 1.10.0 #2667

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only count time spent in queue as part of the `http_server.queue` span #2591

Only count time spent in queue as part of the `http_server.queue` span #2591

agrobbin commented Jan 31, 2023 •

edited

Loading

delner commented Feb 2, 2023 •

edited

Loading

agrobbin commented Feb 2, 2023

delner commented Feb 2, 2023

agrobbin commented Feb 3, 2023

delner left a comment

delner Feb 3, 2023

agrobbin Feb 3, 2023

delner commented Feb 3, 2023 •

edited

Loading

agrobbin commented Feb 3, 2023 •

edited

Loading

delner commented Feb 6, 2023 •

edited

Loading

agrobbin commented Feb 7, 2023

delner commented Feb 8, 2023

ivoanjo commented Feb 8, 2023

agrobbin commented Feb 8, 2023

delner commented Feb 8, 2023

delner commented Feb 10, 2023

agrobbin commented Feb 11, 2023

delner commented Feb 11, 2023

agrobbin commented Feb 12, 2023

marcotc commented Feb 13, 2023

agrobbin commented Feb 14, 2023 •

edited

Loading

marcotc commented Feb 15, 2023

agrobbin commented Feb 15, 2023 •

edited

Loading

marcotc left a comment

TonyCTHsu commented May 29, 2024

agrobbin commented May 29, 2024

Only count time spent in queue as part of the http_server.queue span #2591

Only count time spent in queue as part of the http_server.queue span #2591

Conversation

agrobbin commented Jan 31, 2023 • edited Loading

delner commented Feb 2, 2023 • edited Loading

agrobbin commented Feb 2, 2023

delner commented Feb 2, 2023

agrobbin commented Feb 3, 2023

delner left a comment

Choose a reason for hiding this comment

delner Feb 3, 2023

Choose a reason for hiding this comment

agrobbin Feb 3, 2023

Choose a reason for hiding this comment

delner commented Feb 3, 2023 • edited Loading

agrobbin commented Feb 3, 2023 • edited Loading

delner commented Feb 6, 2023 • edited Loading

agrobbin commented Feb 7, 2023

delner commented Feb 8, 2023

ivoanjo commented Feb 8, 2023

agrobbin commented Feb 8, 2023

delner commented Feb 8, 2023

delner commented Feb 10, 2023

agrobbin commented Feb 11, 2023

delner commented Feb 11, 2023

agrobbin commented Feb 12, 2023

marcotc commented Feb 13, 2023

agrobbin commented Feb 14, 2023 • edited Loading

marcotc commented Feb 15, 2023

agrobbin commented Feb 15, 2023 • edited Loading

marcotc left a comment

Choose a reason for hiding this comment

TonyCTHsu commented May 29, 2024

agrobbin commented May 29, 2024

Only count time spent in queue as part of the `http_server.queue` span #2591

Only count time spent in queue as part of the `http_server.queue` span #2591

agrobbin commented Jan 31, 2023 •

edited

Loading

delner commented Feb 2, 2023 •

edited

Loading

delner commented Feb 3, 2023 •

edited

Loading

agrobbin commented Feb 3, 2023 •

edited

Loading

delner commented Feb 6, 2023 •

edited

Loading

agrobbin commented Feb 14, 2023 •

edited

Loading

agrobbin commented Feb 15, 2023 •

edited

Loading