
[fix] rack middleware close trace via body proxy (Rack::CommonLogger compatibility) #1746

Open

skcc321 wants to merge 1 commit into master from rack-middleware-body-proxy
Conversation

skcc321
Contributor

@skcc321 skcc321 commented Nov 1, 2021

What is the issue?

Hanami is a Rack-based framework. It uses the Rack::CommonLogger middleware to print request/response logs. The issue is that the current span/trace is not associated with these logs, because the Datadog Rack middleware closes the trace too early. If we take a look at Rack::CommonLogger, we can observe that it uses Rack::BodyProxy to guarantee that the log is printed only when everything else is done (the Datadog trace included); as a result, we don't have an active correlation in the log formatter.
So I implemented the same approach for closing the current span/trace in the Datadog Rack middleware.
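The approach can be sketched with a minimal stand-in for Rack::BodyProxy (`MiniBodyProxy` and the lambda middleware below are hypothetical names for illustration; this is not the actual patch):

```ruby
# A minimal stand-in for Rack::BodyProxy: it delegates #each to the
# wrapped body and runs a callback once #close is called.
class MiniBodyProxy
  def initialize(body, &on_close)
    @body = body
    @on_close = on_close
  end

  def each(&blk)
    @body.each(&blk)
  end

  def close
    @body.close if @body.respond_to?(:close)
    @on_close.call
  end
end

EVENTS = [] # records ordering to show the deferral

app = ->(env) { [200, {}, ['ok']] }

# Instead of finishing the trace before returning from #call, the
# middleware wraps the response body so the trace finishes only when a
# downstream consumer (e.g. Rack::CommonLogger) closes the body.
middleware = lambda do |env|
  EVENTS << :span_started
  status, headers, body = app.call(env)
  [status, headers, MiniBodyProxy.new(body) { EVENTS << :span_finished }]
end

status, _headers, body = middleware.call({})
body.each { |_chunk| EVENTS << :body_served }
body.close # a CommonLogger-style consumer would log here, then close

# EVENTS is now [:span_started, :body_served, :span_finished]
```

Because the finish callback runs on `close`, any logging that a downstream BodyProxy performs before closing the wrapped body still sees an unfinished trace.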

@skcc321 skcc321 requested a review from a team November 1, 2021 11:38
@skcc321 skcc321 changed the title [fix] rack middleware close trace via body proxy [fix] rack middleware close trace via body proxy (Rack::CommonLogger compatibility) Nov 1, 2021
@skcc321 skcc321 force-pushed the rack-middleware-body-proxy branch 4 times, most recently from 5332b8d to 14c7616 Compare November 1, 2021 22:18
Member

@marcotc marcotc left a comment

👋 @skcc321, thank you for the work you've done in this PR.

I understand the issue at hand, which is that log records created by Rack::CommonLogger won't receive trace->logs correlation, as ddtrace's trace will have already been finished by the time Rack::CommonLogger gets a chance to output its log line.

One concern I have with the proposed solution is that it creates an unpredictable lifespan and "lifecycle end event" for the rack.request span (the span created by Datadog::Contrib::Rack::TraceMiddleware).

[Diagram: ddtrace #1746, middleware stack with Datadog Start/End markers]

The way it works today, the rack.request span is created at the Datadog Start arrow in the diagram above, and finished at Datadog End. The Rack::CommonLogger log event will fire whenever a middleware chooses to close the response body, which is likely to happen below the Datadog End in the diagram. It could happen at End of Middleware 2, at End of Last Middleware, or at the web server handler.
It could also happen at different points at different times (sometimes at Middleware 2, sometimes at Middleware 1).

The most concerning issue that this unpredictable rack.request end event can cause is the mangling of Datadog spans in cases where "Middleware 1" or "Middleware 2" create Datadog spans. If, for example, "Middleware 2" creates a Datadog span, that span will be the parent of rack.request, created at Datadog::Contrib::Rack::TraceMiddleware. But if Rack::BodyProxy#close is called at End of Last Middleware, this will happen after the span created by "Middleware 2" was closed, meaning the parent span is closed before the child span rack.request is closed. This produces confusing traces that have dubious value.

A predictable span lifecycle is very important because:

  • A span should measure the same operation every time, to allow for time-based comparisons (has my endpoint become slower recently?).
  • Spans that are not closed leak memory: their memory will never be released by the host application.

For these reasons we have strong lifecycle constraints on span creation and closure.

So I don't think I'm comfortable with the solution as presented.

This is not to say that the issue of lack of trace->logs correlation in Rack::CommonLogger log events is not important: it's very much so. But the compromise, as it is implemented, introduces too much risk to the Rack instrumentation.

I'm open to other suggestions around propagating trace correlation in environments where the root span (rack.request in this case) has already closed, but we'd like to still correlate closely related log lines to it.

@delner and @ivoanjo, could you guys take a look at the problem and solution at hand here and see if you have any options or suggestions for solving the Rack::CommonLogger problem at hand?

And just for completeness, there's also the case where Rack::BodyProxy#close is not called at all, but that contradicts Rack's spec, and such an application would not behave correctly in many other ways. I don't think this specific case is a concern.

@delner
Contributor

delner commented Nov 2, 2021

Is there any way to push the Datadog middleware to the top of the stack? Or does BodyProxy truly have to run first/last? (Before/after ddtrace middleware?) If there's a way to do this, I think that would be the simplest solution.

You mentioned this is a Hanami app... one possibility: if necessary, we could introduce a Hanami integration that is aware of its middleware stack and can manage this. (Similar to what we do for Rails already.)

Let me know if any of that would work, or if I'm missing some detail.

@skcc321
Contributor Author

skcc321 commented Nov 2, 2021

@delner Hanami is just a Rack-based framework, so the issue really concerns Rack::CommonLogger, which has existed for many years.
Rack::CommonLogger is part of the rack library itself, so technically Datadog's "rack" instrumentation is not complete without handling CommonLogger properly.

@skcc321
Contributor Author

skcc321 commented Nov 2, 2021

@marcotc so, as I understand it, the solution for the Rack middleware is:
insert the Datadog Rack middleware on top of all other middlewares, so the trace/span is guaranteed to start at the very beginning of request processing and close at the very end. Given the behaviour of Rack::BodyProxy, I think the implementation is OK (if we are sure the Datadog Rack middleware was inserted at the very top).

If the Body responds to +close+, it will be called after iteration. If
the original Body is replaced by a new Body, the new Body
must close the original Body after iteration, if it responds to +close+.
If the Body responds to both +to_ary+ and +close+, its
implementation of +to_ary+ must call +close+ after iteration.

Doesn't that guarantee that body will be closed always?
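The SPEC contract quoted above can be sketched with toy bodies (illustrative objects only, not rack's implementation): a body that replaces another must close the original after iteration.

```ruby
# Toy bodies illustrating the SPEC contract quoted above (not rack code).
closed = []

original = Object.new
original.define_singleton_method(:each) { |&b| b.call('payload') }
original.define_singleton_method(:close) { closed << :original }

# A middleware that replaces the body must close the original after
# iteration, per the SPEC:
replacement = Object.new
replacement.define_singleton_method(:each) { |&b| original.each(&b) }
replacement.define_singleton_method(:close) do
  original.close # the new body closes the body it replaced
  closed << :replacement
end

replacement.each { |_| }
replacement.close

# closed is now [:original, :replacement]
```

As long as every layer honors this contract, a close callback attached at any level of wrapping will eventually run.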

@ivoanjo
Member

ivoanjo commented Nov 2, 2021

could you guys take a look at the problem and solution at hand here and see if you have any options or suggestions for solving the Rack::CommonLogger problem at hand?

I haven't dug into this that much, so take my opinions/suggestions with a big grain of salt, but looking at Rack::CommonLogger it seems to gather a number of pieces of information from the app when it executes as part of the Rack stack, and then uses the BodyProxy as a way of printing that information some time later:

https://github.com/rack/rack/blob/e0993d47096d36535dbe725091963a7e0f5aab52/lib/rack/common_logger.rb#L36-L42

But that gets me thinking: isn't what we're trying to do here similar to what Rack::CommonLogger is doing? E.g. could we gather the current span information inside CommonLogger#call, and store it for later, so that the logger could access it after the fact?
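A rough sketch of that idea, with hypothetical names throughout (`ActiveSpan` and `log_line` are not ddtrace or Rack APIs): capture the identifiers while the span is still active, and hand them to the deferred logger.

```ruby
# Hypothetical names throughout (ActiveSpan, log_line); not ddtrace APIs.
ActiveSpan = Struct.new(:trace_id, :span_id)

def log_line(correlation, message)
  "[dd.trace_id=#{correlation[:trace_id]} dd.span_id=#{correlation[:span_id]}] #{message}"
end

span = ActiveSpan.new(123, 456)

# Inside a CommonLogger-like #call, while the span is still active,
# capture the identifiers needed for correlation:
captured = { trace_id: span.trace_id, span_id: span.span_id }

# Much later, after the trace has finished, the deferred logger can
# still emit a correlated line from the captured values:
line = log_line(captured, 'GET /users 200')
# line == "[dd.trace_id=123 dd.span_id=456] GET /users 200"
```

This keeps the span lifecycle untouched: only plain identifier values outlive the span, not the span itself.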

@delner
Contributor

delner commented Nov 2, 2021

so technically datadog "rack" instrumentation is not complete without handling that CommonLogger properly.

@skcc321 Right, so why couldn't we push the Datadog middleware to be the last middleware?

@skcc321
Contributor Author

skcc321 commented Nov 3, 2021

so technically datadog "rack" instrumentation is not complete without handling that CommonLogger properly.

@skcc321 Right, so why couldn't we push the Datadog middleware to be the last middleware?

@delner the reason is that Rack::BodyProxy writes logs as a "callback", so it always runs last regardless of whether the Datadog middleware sits on top of the other middlewares. As a result, you have no active span/trace in Rack::CommonLogger#call.

@skcc321 skcc321 force-pushed the rack-middleware-body-proxy branch from 14c7616 to 513c157 Compare November 19, 2021 11:56
@codecov-commenter

Codecov Report

Merging #1746 (513c157) into master (8b97105) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1746   +/-   ##
=======================================
  Coverage   98.17%   98.17%           
=======================================
  Files         934      934           
  Lines       45048    45056    +8     
=======================================
+ Hits        44227    44235    +8     
  Misses        821      821           
Impacted Files Coverage Δ
lib/ddtrace/contrib/delayed_job/plugin.rb 100.00% <100.00%> (ø)
lib/ddtrace/contrib/rack/middlewares.rb 85.71% <100.00%> (+0.89%) ⬆️
lib/ddtrace/contrib/rack/patcher.rb 93.75% <100.00%> (+0.13%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b97105...513c157.

@skcc321
Contributor Author

skcc321 commented Nov 19, 2021

@delner @ivoanjo any updates on this one?

@ivoanjo
Member

ivoanjo commented Nov 19, 2021

Thanks for the ping, I'll raise this with the team!

arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this pull request Feb 18, 2023
This is an alternate implementation of the Rack instrumentation that leverages the [`Rack::Events` API](https://www.rubydoc.info/gems/rack/Rack/Events) instead of a custom `Rack::Middleware`.

Why am I suggesting we change this instrumentation? At GitHub we leverage `Rack::BodyProxy` to write Rack request logs after the request is complete; however, the Rack span is already finished and its related `Context` has already been detached by then. This means we are not able to correlate request logs to our traces.

The advantage of using `Rack::Events` is that handlers are triggered during different stages of a request, including for deferred operations like [`Rack::BodyProxy`](https://www.rubydoc.info/gems/rack/Rack/BodyProxy), as opposed to middlewares, which are _only_ invoked inline.

The disadvantage of this API is that it makes managing the request more difficult, and we have to track the Context positions to detach in the `Rack::Env`.

This implementation will be released alongside the existing instrumentation to give users the option to use the middleware instead of the `Rack::Events` handler until we are able to run this in some of our heavy production workloads.

Fixes open-telemetry#341
Related DataDog/dd-trace-rb#1746
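The event-handler shape that `Rack::Events` exposes (`on_start`/`on_finish` handlers) can be sketched with a toy dispatcher; this is an illustration of the idea only, not the real `Rack::Events` implementation or its exact handler signatures:

```ruby
# Toy sketch: an events-style handler can receive on_finish from a
# deferred body close, where a plain middleware has already returned.
# (LoggingHandler and its signatures are illustrative, not Rack::Events.)
class LoggingHandler
  attr_reader :events

  def initialize
    @events = []
  end

  def on_start(request)
    @events << :start
  end

  def on_finish(request)
    @events << :finish # a safe place to write logs with trace correlation
  end
end

handler = LoggingHandler.new
request = {}

# The request phase ends long before the body is consumed...
handler.on_start(request)
deferred_close = -> { handler.on_finish(request) } # BodyProxy-style deferral

# ...and only when the body is finally closed does on_finish run:
deferred_close.call

# handler.events is now [:start, :finish]
```

Because the finish notification is delivered as an event rather than as a middleware return value, instrumentation can keep its context alive until the deferred close actually happens.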
arielvalentin added a commit to arielvalentin/opentelemetry-ruby-contrib that referenced this pull request Mar 2, 2023
arielvalentin added a commit to open-telemetry/opentelemetry-ruby-contrib that referenced this pull request Apr 5, 2023
* feat: Use Rack::Events for instrumentation


* squash: additional feature parity

* squash: add allowed response headers

* squash: url quantization

* squash: Now with new Lemon Scented response headers

* squash: we are now at parity

* squash: use instrumentation config

* squash: Use declarative config options

* squash: fix bad refactoring

* convert proxy span to an event

* refactor: move configurations to instrumentation install

* squash: add test coverage

* squash: make response headers a little more resilient

* squash: Ensures event middleware will not cause the application to crash when it encounters errors

* squash: fix linter error

* feat: Add middleware args helper for ActionPack and Sinatra

* fix: test case

* fix: Rack Events is autoloaded so if the parent module is present so are submodules

* fix: More precise error handling

* fix: Ensure config is cleared/setup during installation

* fix: Sinatra 1.4 compatibility

* fix: bad merge

* fix: invalid responses in test case

* squash: Added a few more test cases