Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only count time spent in queue as part of the http_server.queue span #2591

Merged
merged 1 commit into from
Feb 16, 2023

Conversation

agrobbin
Copy link
Contributor

@agrobbin agrobbin commented Jan 31, 2023

What does this PR do?

This is an attempt at resolving the inconsistency raised in #2064 between http_server.queue's duration and real world server queuing time by introducing an outer http_server.request span, which contains http_server.queue and rack.request as sibling spans, rather than the current parent/child relationship that exists between the 2 of them.

Motivation

If you have a request that spends 10ms in queue, and 5s being executed, you currently end up with these 2 spans in APM:

  • http_server.queue (duration: 5.01s)
    • rack.request (duration: 5s)

However, that isn't really semantically correct. The queuing itself did not take 5.01s, the entire lifecycle of the request took 5.01s, with queueing making up 10ms of that time.

After this change, you should end up with something like this:

  • http_server.request (duration: 5.01s)
    • http_server.queue (duration: 10ms)
    • rack.request (duration: 5s)

I'd love to get some general feedback on whether this kind of change is reasonable, as it is a breaking change (the definition of http_server.queue would change for anyone who depends on it). I feel that it is correcting a misleading and confusing APM span, but I'm not the one who maintains this library, or fields inbound customer support requests!

@github-actions github-actions bot added integrations Involves tracing integrations tracing labels Jan 31, 2023
@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch from cc5e411 to b9f3860 Compare January 31, 2023 20:10
@agrobbin agrobbin marked this pull request as ready for review January 31, 2023 20:11
@agrobbin agrobbin requested a review from a team January 31, 2023 20:11
@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch 3 times, most recently from 15b548b to 55f6fb8 Compare January 31, 2023 20:32
@delner
Copy link
Contributor

delner commented Feb 2, 2023

Hey @agrobbin! Thanks for sharing this!

Generally speaking I really like the idea here. The trace it produces seems more realistic, and it addresses a practical issue.

The only a few (but not insurmountable) reservations that I have are:

  1. The artificiality of the new http_server.request span. What service does it belong to? It bridges two actual processes (a proxy server and the Rack process). It seems like the span is basically analogous to "the trace", as it's just a container for all the spans generated within the trace. (Or maybe it's more than that, but that's just my perception).
  2. Changing root spans is more or less a breaking change. It will affect the monitors/metrics/dashboards of users who use this queuing feature.
  3. Following from (2), 2.0 isn't on the docket yet. How do we incorporate such a breaking change, if we were to merge it tomorrow? Feature flags are an option, but if there's a better idea, I'd like to consider that before adding in another configuration option that would eventually be deprecated/removed.
  4. Additionally, on the Datadog side (where we tend to treat root spans as special) we will want to double check this won't have side effects that need to be handled elsewhere, and that it fits with the general pattern for how we want instrumentation to work.

My overall disposition is that I'm inclined to adopt this new behavior; it's just a matter of addressing the logistics stated above. We'll do our part to start a discussion internally, but I'm interested in hearing your thoughts to those points.

@delner delner added the community Was opened by a community member label Feb 2, 2023
@delner delner self-requested a review February 2, 2023 21:13
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 2, 2023

Thanks for the thoughts @delner! I'm excited at the prospect of getting this included! I've responded to each of your questions below.

The artificiality of the new http_server.request span. What service does it belong to? It bridges two actual processes (a proxy server and the Rack process). It seems like the span is basically analogous to "the trace", as it's just a container for all the spans generated within the trace. (Or maybe it's more than that, but that's just my perception).

I definitely had a similar concern originally, but what led me here is that we need some root span, and conceptually, the overarching HTTP request is that "root thing". I didn't consider it totally surprising to be within the same service as the existing http_server.queue, as it mapped in my head to the same high-level system (even if technically the queuing time can be spent at the HTTP server-level and also potentially at the load balancer-level). I'm open to an alternative, though!

Changing root spans is more or less a breaking change. It will affect the monitors/metrics/dashboards of users who use this queuing feature.

Following from (2), 2.0 isn't on the docket yet. How do we incorporate such a breaking change, if we were to merge it tomorrow? Feature flags are an option, but if there's a better idea, I'd like to consider that before adding in another configuration option that would eventually be deprecated/removed.

I completely agree this is breaking, and I imagine this is something currently depended on by a bunch of users.

I do lean toward a configuration option, but maybe request_queuing switches from a boolean to an enumerated config of :inclusive and :exclusive), and the current true maps to :inclusive until 2.0? What are your thoughts? Maintaining backwards compatibility here is not a massive lift, even past 2.0 (where just the meaning of true could change, but inclusive is still a valid option). Let me know what you're thinking!

Additionally, on the Datadog side (where we tend to treat root spans as special) we will want to double check this won't have side effects that need to be handled elsewhere, and that it fits with the general pattern for how we want instrumentation to work.

I'll defer to you on this for sure, but if there is anything I can do in terms of experimentation or testing here, let me know and I can set that up within my company's non-production systems.

@delner
Copy link
Contributor

delner commented Feb 2, 2023

I think the model proposed is reasonable. I just need to reconcile with our plans for consistency with other Datadog instrumentation in other Datadog tracers.

I do lean toward a configuration option, but maybe request_queuing switches from a boolean to an enumerated config of :inclusive and :exclusive, and the current true maps to :inclusive until 2.0?

Maybe slightly different names, but I think this is a clever solution. Satisfies my concerns. Probably the way to go!

Give me a little time to try to shake the tree for some internal feedback, but if I don't get back to you quickly, please feel free to check back in and give me a nudge. In the case of non-decision, I think we adopt this under feature flag (non-default behavior). In the interim, if you can update the PR with the above configuration behavior (and tests/documentation to match) then that will help expedite things.

@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch 5 times, most recently from 1a72d48 to 1e9f26c Compare February 3, 2023 01:23
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 3, 2023

@delner I think I've got tests passing (excluding a seemingly unrelated failure on Ruby 2.7) with the new configuration option functionality.

I went with :include_request and :exclude_request as the potential configurations, though I'm not at all attached to those if you prefer something else. I've made sure that true keeps the existing behavior (the same as :include_request). The documentation has also been updated. It seems reasonable to me that in the next major release, :exclude_request becomes the default, but that's definitely not my call!

I look forward to hearing what you find on your internal-to-DataDog side of things!

Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good thus far. Added some suggestions for documentation and tagging these spans.

@@ -11,7 +11,7 @@
if Datadog::DemoEnv.feature?('tracing')
c.tracing.analytics.enabled = true if Datadog::DemoEnv.feature?('analytics')

c.tracing.instrument :rails, request_queuing: true
c.tracing.instrument :rails, request_queuing: :exclude_request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be the default for the Rails integration app? Somewhat indifferent.

Need to think more about what these settings mean to an integration app: its not really meant to test all the different flags, as much as it is meant to test holistic stability of Rails 7 running typical ddtrace deployments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll happily defer to you here @delner, I did a grep for request_queuing to make sure I didn't miss anything, but will adjust as you see fit.

docs/GettingStarted.md Outdated Show resolved Hide resolved
@delner
Copy link
Contributor

delner commented Feb 3, 2023

So far it doesn't look like we have a convention for HTTP queue spans. The existing convention defines {protocol}.{client|server}.request. This to me looks like it's {protocol}.{kind}.{operation}. Based on this, my inclination is span name should be:

  • "proxy http server request" --> http.proxy.request (new)
  • "proxy http server queue" --> http.proxy.queue (instead of http_server.queue)
  • "application http server request" --> http.server.request (instead of rack.request)

However, considering that we're trying to limit breaking changes (even under this feature flag), perhaps we can defer the rename of the Rack span until the next major version (2.0). This would give us:

  • "proxy http server request" --> http.proxy.request (new)
  • "proxy http server queue" --> http.proxy.queue (instead of http_server.queue)
  • "application http server request" --> rack.request (existing)

Will double check internally to see if there's any objections. Let me know what you think as well.

@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 3, 2023

@delner just to be clear, renaming the existing http_server.queue span to http.proxy.queue will likely have similar breaking effects for existing users who depend on this in monitors, metrics, and other places. Are you suggesting we make that change pre-2.0, or are you suggesting that when request_queuing is set to :include_request, we continue to use http_server.queue, and only use http.proxy.queue when configured as :exclude_request?

@delner
Copy link
Contributor

delner commented Feb 6, 2023

Breaking change renaming span duly noted. If it doesn't complicate things too much, then ideally :exclude_request reflects the desired span names, whereas default behavior reflects the existing span names. That way if you enable the new behavior, you just make all the breaking change updates accordingly, if you don't you don't need to make any.

If possible, I would add comments (such as a // DEV) that notes that the existing naming/style (:include_request) is deprecated and should be removed in 2.0, whenever that comes in the future. This is both for users and for maintainers, so we don't forget to change this later.

@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch 4 times, most recently from 7731de5 to 2998aa8 Compare February 7, 2023 02:13
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 7, 2023

@delner I just updated this based on the above discussion!

1 last question... would you like there to a user-facing deprecation warning (via Datadog.logger.warn) when this is set to true, to highlight this for end-users? I included a code comment, but right now don't have any actual log output.

@delner
Copy link
Contributor

delner commented Feb 8, 2023

would you like there to be a user-facing deprecation warning

Not something that we've really put into practice but seems reasonable. Only case where I see this problematic is when an application is reconfigured more than once... might spam more than once then. Still, worth trying I think.

@ivoanjo
Copy link
Member

ivoanjo commented Feb 8, 2023

Not something that we've really put into practice but seems reasonable. Only case where I see this problematic is when an application is reconfigured more than once... might spam more than once then. Still, worth trying I think.

We actually have a helper to solve exactly that issue! Example of it being used in such a manner:

IGNORED_STATSD_ONLY_ONCE = Utils::OnlyOnce.new
private_constant :IGNORED_STATSD_ONLY_ONCE
def ignored_statsd_warning
IGNORED_STATSD_ONLY_ONCE.run do
Datadog.logger.warn(
'Ignoring user-supplied statsd instance as currently-installed version of dogstastd-ruby is incompatible. ' \
"To fix this, ensure that you have `gem 'dogstatsd-ruby', '~> 5.3'` on your Gemfile or gems.rb file."
)
end

@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch 2 times, most recently from 4332705 to f0d8a19 Compare February 8, 2023 13:31
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 8, 2023

@delner @ivoanjo I just added a deprecation warning for true and :include_request, and noted it in docs/GettingStarted.md! I did it in the Patcher, which I don't think can be called multiple times, so I didn't include the Utils::OnlyOnce logic, but if I am missing something, let me know and I can add that.

@delner
Copy link
Contributor

delner commented Feb 8, 2023

@ivoanjo Right, forgot about OnlyOnce. Thanks for that.

Yeah, Patcher has its own OnlyOnce so you shouldn't have to worry about that. I just want to wrap up some discussion on my end, and do a final pass, but otherwise should be good to go.

@delner
Copy link
Contributor

delner commented Feb 10, 2023

@agrobbin I finished my internal discussion; sounds like we're good to go on this.

Only caveat is that we want to mark this as "experimental" for the time being. Internally, we're interested in creating consistency across HTTP instrumentation as much as possible, and we may need to rename these spans in the future to reflect that consistent naming. Might happen, might never happen. But the trace structure itself is something we intend to keep for the foreseeable future, so nothing to worry about there.

Can you make a note in the documentation (under the "request queuing" section) that this is "experimental"? Once that's in place, I'd be very happy to submit my approval on this. :)

@agrobbin
Copy link
Contributor Author

@delner if this is experimental, should the user-facing deprecation be removed? It seems a little confusing to end-users to introduce an experimental feature that simultaneously deprecates the old behavior.

@delner
Copy link
Contributor

delner commented Feb 11, 2023

@agrobbin Yeah, I think that makes sense, given this change in direction. We can always re-add the deprecation later if we know its what we will break in a future major version. Sorry for the confusion!

@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch from f0d8a19 to 27af4a4 Compare February 12, 2023 01:06
@agrobbin
Copy link
Contributor Author

No need to apologize! I've just made the updates. Let me know if there is anything else you see that you'd like to change.

@marcotc
Copy link
Member

marcotc commented Feb 13, 2023

@agrobbin, it all looks great, thank you so much!
I'm just back from an extended leave, so I'm pick up where our team left off here.

I only left one small comment; save that this PR is ready to go.

@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch from 27af4a4 to 0a57883 Compare February 14, 2023 00:09
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 14, 2023

@marcotc updated to mark the http.proxy.queue span as measured!

@marcotc
Copy link
Member

marcotc commented Feb 15, 2023

@agrobbin, that looks good, could I trouble you to add a simple test asserting that the span is marked as "measured"? I promise that's the last of it! :)

If you have a request that spends 10ms in queue, and 5s being executed, you currently end up with these 2 spans in APM:

* `http_server.queue` (duration: 5.01s)
  * `rack.request` (duration: 5s)

However, that isn't really semantically correct. The queuing itself did not take 5.01s, the entire lifecycle of the request took 5.01s, with queueing making up 10ms of that time.

This is an attempt at resolving that inconsistency by introducing an outer `http.proxy.request` span, which contains `http.proxy.queue` and `rack.request` as sibling spans, rather than the current
parent/child relationship that exists between the 2 of them. It has been introduced as an opt-in feature, by setting `request_queueing: :exclude_request`.
@agrobbin agrobbin force-pushed the rack-http-server-queue-duration branch from 0a57883 to 2e27f76 Compare February 15, 2023 21:32
@agrobbin
Copy link
Contributor Author

agrobbin commented Feb 15, 2023

@marcotc whoops, sorry about that! I could've sworn I included one. Should be all set now.

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @agrobbin!

@marcotc marcotc merged commit 3545138 into DataDog:master Feb 16, 2023
@github-actions github-actions bot added this to the 1.10.0 milestone Feb 16, 2023
@TonyCTHsu TonyCTHsu mentioned this pull request Mar 6, 2023
@TonyCTHsu
Copy link
Contributor

👋 @agrobbin , we are on the verge of release 2.0 !

This release we have made this as the default behavior when enabling request_queueing. Would you like to give it a try by upgrading to 2.0.0.rc1 and give us feedback?

gem 'datadog', '>= 2.0.0.rc1'

@agrobbin
Copy link
Contributor Author

That's great! We'll try and test it before you get to the official v2.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Was opened by a community member integrations Involves tracing integrations tracing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants