
Detect and log long running http(s) requests #17842

Merged
merged 3 commits into ManageIQ:master from jrafanie:detect_long_running_requests
Sep 26, 2018

Conversation

@jrafanie
Member
commented Aug 10, 2018

When a request is made through apache, we have 2 minutes to complete it before apache times out, resulting in a browser 502 error. This pull request provides a mechanism for people who encounter these 502 errors to figure out which request and parameters are causing the problem. Typically the request is simply slow and needs to be optimized, but we have also had deadlocks occur, so it's very helpful to minimize the work required to narrow the problem down.

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

  • Add a middleware to detect long running requests

This middleware tracks each request and when it started using thread
local variables. long_running_requests traverses these thread local
variables to find requests beyond the 2 minute timeout. This
information can be retrieved by another thread, such as the heartbeating
thread of a puma worker, which can then respond to a timed out request.

  • Log long running requests via RequestStartedOnMiddleware

For puma-based workers, we implement do_heartbeat_work to call
log_long_running_requests. This method runs many times per minute,
so we ensure the check happens at most once every 30 seconds.
The log message includes the requested URI path, the thread (in the same
PID:TID format that we log elsewhere), and the request duration so far.
Using the PID:TID information, users can then find the request
parameters by searching the logs for that PID:TID. A sketch of the
middleware is included below.
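
For reference, a minimal sketch of the middleware described above, pieced together from the code snippets quoted later in this conversation. It assumes Rails/ActiveSupport is loaded (for present?, 1.minute, and .ago); the body of complete_request and the exact return shape of long_running_requests are reasonable guesses, not the merged code.

class RequestStartedOnMiddleware
  # Threshold for what counts as a "long" request (originally 2.minutes;
  # see the naming/threshold discussion further down).
  LONG_REQUEST = 1.minute

  def initialize(app)
    @app = app
  end

  def call(env)
    start_request(env['PATH_INFO'], Time.now.utc)
    @app.call(env)
  ensure
    complete_request
  end

  def start_request(path, started_on)
    Thread.current[:current_request] = path
    Thread.current[:current_request_started_on] = started_on
  end

  def complete_request
    # Assumed body: clear the thread locals so a finished request is no longer reported.
    Thread.current[:current_request] = nil
    Thread.current[:current_request_started_on] = nil
  end

  # Called from another thread (e.g. the worker's heartbeat thread) to find
  # requests that have been running longer than the threshold.
  def self.long_running_requests
    requests = []
    # Compute the cutoff once rather than per thread.
    timed_out_request_started_on = request_timeout.ago.utc
    relevant_thread_list.each do |thread|
      request    = thread[:current_request]
      started_on = thread[:current_request_started_on]
      if request.present? && started_on.kind_of?(Time) && started_on < timed_out_request_started_on
        requests << [request, Time.now.utc - started_on, thread]
      end
    end
    requests
  end

  def self.request_timeout
    LONG_REQUEST
  end

  # For testing: mocking Thread.list feels dangerous
  private_class_method def self.relevant_thread_list
    Thread.list
  end
end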

Example logging:

[----] I, [2018-08-10T14:57:10.587334 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:10 -0400
[----] I, [2018-08-10T14:57:11.231264 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.232892 #2788:357b5f4]  INFO -- : Completed 200 OK in 1ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.235108 #2788:357b8d8]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.237800 #2788:357b8d8]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.245526 #2788:357b324]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.247653 #2788:357b324]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.259623 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.261375 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.261793 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.265736 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.267544 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.268136 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.278956 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.280813 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.281137 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.285687 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.287741 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.288111 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.293100 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.295267 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.295688 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] W, [2018-08-10T14:59:35.108849 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b8d8, running for 143.87 seconds
[----] W, [2018-08-10T14:59:35.109085 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b324, running for 143.86 seconds
[----] W, [2018-08-10T15:00:11.132154 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b8d8, running for 179.9 seconds
[----] W, [2018-08-10T15:00:11.132362 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b324, running for 179.89 seconds
[----] I, [2018-08-10T15:00:11.239419 #2788:357b8d8]  INFO -- : Completed 200 OK in 180001ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T15:00:11.248699 #2788:357b324]  INFO -- : Completed 200 OK in 180001ms (Views: 0.2ms | ActiveRecord: 0.0ms)

Note that, above, there are 2 requests that take 3 minutes to finish. Each one is logged twice (once at ~144 seconds and again at ~180 seconds).

@jrafanie
Member Author

@Fryguy @carbonin @bdunne @NickLaMuro please review with brutal honesty... This middleware is run on every request so any and all calculations/slow operations need to be minimized.

@NickLaMuro
Member

@jrafanie I will have to take some time to look at this later, but I have been downloading rbspy onto appliances to figure this out myself. Usually works pretty well.

Might be a decent alternative to consider, but I will take a look later at this as well.

@jrafanie
Member Author

@NickLaMuro well, as coded (currently), this PR enables this by default since you might not be able to get to the UI to enable it if your puma threads are hung up.

I'm willing to make this optional if the overhead is too much but ideally, we should just log this information whenever it happens.

@NickLaMuro
Member

@jrafanie no enabling necessary, you can run it against an existing pid:

$ rbspy report -p 12345

@jrafanie
Member Author

ah, @kbrock @dmetzger57 probably have opinions on this too

@jrafanie
Member Author

@jrafanie no enabling necessary, you can run it against an existing pid:

$ rbspy report -p 12345

@NickLaMuro My bad, I was talking about this PR. It's on by default. rbspy is nice for us :neckbeard: but I'm hoping the overhead of this PR is low enough that we can bake this into the application and not need external tools.

@NickLaMuro
Member

Nope, totally get that, just trying to suggest alternatives that might avoid more maintenance for us.

# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && REQUEST_TIMEOUT.ago > started_on
Member

Was this supposed to use the class method rather than the constant?

Member Author

👀 YES! Thanks!

Member Author

Refactoring for testing and forgetting to call the new method... It's a constant problem.

@jrafanie
Member Author

I updated the description with a logging example.

@jrafanie force-pushed the detect_long_running_requests branch from 8df16a6 to 3d9cf2e on August 10, 2018 19:10
log_long_running_requests
end

CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds
Member Author

who knows if every 30 seconds is too frequent or not

return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Member Author

I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉

Member

Does seem like this would be part of Rails itself. Or part of New Relic / Scout / Skylight.

This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?

Member Author

Yes, I can see making this more generic (it assumes a threaded web server like puma).

We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout the way it does and notes that it's more for debugging than for production. This PR is not meant to take action other than notifying us of a long running request: we don't care whether the threads are deadlocked or just doing too many things slowly, we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people.

So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.

@kbrock (Member) left a comment

Will this let us know about hung requests too?

This looks nice


# For testing: mocking Thread.list feels dangerous
private_class_method def self.relevant_thread_list
Thread.list
Member

Think this represents your race condition


@jrafanie
Member Author

Will this let us know about hung requests too?

Yes. I actually called this "hung requests" at first but since this detects all long running requests, including deadlocked ones and ones that just take really long to complete, I thought the wording was clearer as "long running requests."

This looks nice

Thanks, I'm hoping this will give us more clarity when people encounter 502 errors.

@NickLaMuro (Member) left a comment

My two cents on the topic.

For what it is worth, I have done some of this in ManagePerformance as well already.

start_request(env['PATH_INFO'], Time.now.utc)
response = @app.call(env)
ensure
complete_request
Member

Seems like you could avoid a lot of complexity with this by just logging as part of the middleware.

def call(env)
  start_time = Time.now
  begin
    @app.call(env)
  ensure
    end_time = Time.now
    log_to_request_time_dot_log(start_time, end_time) if too_long?(start_time, end_time)
  end
end

The extra Thread seems like a significant over-complication of what you are trying to do here, and I am unsure why it is needed in the first place.

@jrafanie (Member Author) Aug 13, 2018

Agreed, the thread part is the complicated part but is not without good reason. The problem is that the thread is hung and sometimes doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later than the user gets the 502 error. In all three of these cases, we won't get to the log_to_request_time_dot_log method until after (if ever) the user has captured or investigated the logs.

Member

doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later

Fair enough. I have some points below, but this is mostly just followup, and my real status on the review is here:


So for these three:

  • "deadlocked": valid
  • "killed before finishing": true
  • "finishes much later than the user gets the 502": Well, this would still get logged, the data from the request would just end up being sent into the void.

That said, with the first two (where the first, a deadlock, would probably lead to the second happening eventually), I want to say that I have seen, when I have killed a process (at least with a Ctrl-C), that the "Completed" message still ends up in the logs. I have to do some digging into the Rails source to see how ActiveSupport::LogSubscriber handles control signals and allows this to happen, since that seems to be what is handling this here.

Curious if some form of a rescue/at_exit would allow this to work, but since you are trying to be proactive long term, this might just be more of a curiosity/musing for myself than anything.

Member Author

While I agree that the eventually-logged Completed message is something we can parse the logs for after the fact, I've seen too many logs where requests never Completed because of a deadlock, because the process was killed before finishing, or because the user captured the logs before it was logged.

The ONLY reason to do this "clever" in-memory thread-local-variable thing is that we can't trust we'll get that log message, and therefore can't trust that the request thread will ever reach the rack completion point, so we need another thread to track long requests. If there is another way, I'm open to suggestions.

def call(env)
begin
start_request(env['PATH_INFO'], Time.now.utc)
response = @app.call(env)
Member

This can just be @app.call(env) (without the response var). ensure doesn't mess with the result of the block/method:

$ ruby -e "puts begin; 'foo'; ensure; puts 'hi'; end"
hi
foo

Member Author

Good point, I'll fix this.

@NickLaMuro
Member

NickLaMuro commented Aug 13, 2018

Also worth noting, I recently uploaded some code for parsing this information from the existing logs:

NickLaMuro/miq_tools#8

(in gist form which is a single file and probably a bit easier to grok).

Might not hurt to add something that simply prints out dangling requests that don't have a Completed message accompanying them. Also, some simple filtering might not hurt either. I have it tossed up to miq_tools for my own usage (junk drawer), but I have zero issue with it being integrated somewhere in core if you think it is useful.

Specifically, I was using this when debugging the BZ for ManageIQ/manageiq-ui-classic#4447, since the app was also 502'ing and even getting reaped by the MiqServer because of memory thresholds. That is also where I was using rbspy to track down the point of failure.

@jrafanie
Member Author

jrafanie commented Aug 13, 2018

Also worth noting, I recently uploaded some code for parsing this information from the existing logs:

NickLaMuro/miq_tools#8

I think this is valuable to have regardless. I'm not sure where it should live but if it can parse PID:TID combinations spanning multiple compressed files (which it looks like it can do) to find each request that never completed or took too long, that would be extremely valuable.

That said, I was hoping to be more proactive in detecting this in process unless we believe this is problematic or too much overhead. Perhaps not now, but it would be nice to be able to react to a hung request at some point in the future.

# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && request_timeout.ago > started_on
Member

Will this loop take a long time?

Do we want to keep calculating request_timeout.ago, or precalculate it?

Member Author

yeah, good idea... there's no reason to keep calculating this. Time calculation is slow, we should do it once at the top 👍

@NickLaMuro
Member

(in reply to the log parser) I think this is valuable to have regardless. I'm not sure where it should live ...

Yup, which is why I have it where it is at for now just so there is somewhere people can pull it from and make use of it. And if we find a place for it to live cool, happy to move it.

That said, I was hoping to be more proactive in detecting this in process...

I see, didn't realize that was the case, so that changes my perspective a bit.

I will need to do some thinking and further digging into what you have here to form a better opinion. Admittedly, I have only glanced at this so far, so I probably will need some more time with this to give you a better answer.

For what it is worth, my ❌ is not meant to be a blocker, so if I don't get back to my review of this, then don't let that hold you up.

@jrafanie
Member Author

I will need to do some thinking and further digging into what you have here to form a better opinion. Admittedly, I have only glanced at this so far, so I probably will need some more time with this to give you a better answer.

For what it is worth, my ❌ is not meant to be a blocker, so if I don't get back to my review of this, then don't let that hold you up.

Yeah, this isn't something I want to force through.. I'd prefer we all have time to review it and are happy with it. Take your time.

@Fryguy
Member

Fryguy commented Aug 14, 2018

I'm fine with this conceptually, though I haven't fully reviewed the code. I am surprised there isn't a gem that just already does this. I understand that rack-timeout takes it a step further, but it still feels like a wheel that's already been invented.

@jrafanie
Member Author

I'm fine with this conceptually, though I haven't fully reviewed the code. I am surprised there isn't a gem that just already does this. I understand that rack-timeout takes it a step further, but it still feels like a wheel that's already been invented.

I've googled for rack long request, rack timeout request, rails long request...

Perhaps a middleware with an ensure block, as @NickLaMuro suggested, would help us (without the thread local variables) in all cases except when the threads are deadlocked or the process gets SIGKILL'd.

@jrafanie
Member Author

I'll make some of the suggested changes and we can decide what we want to do here.

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

This middleware tracks each request and when it started using thread
local variables.  long_running_requests traverses these thread local
variables to find requests beyond the 2 minute timeout.  This
information can be retrieved by another thread, such as the heartbeating
thread of a puma worker, which can then respond to a timed out request.
@jrafanie force-pushed the detect_long_running_requests branch from 3d9cf2e to 2f4bcb5 on August 17, 2018 01:59

def self.long_running_requests
requests = []
timed_out_request_started_on = request_timeout.ago.utc
Member Author

thanks Keenan for this idea ^


def call(env)
start_request(env['PATH_INFO'], Time.now.utc)
@app.call(env)
Member Author

NickL, thanks for this simplification

@jrafanie
Member Author

Ok, I pushed a few changes suggested by @NickLaMuro and @kbrock... I didn't actually run this again to verify I didn't break something... at this point, I'd like an idea of whether this is a direction we want to go.

I'm concerned that if we don't track this in process and instead rely on the rack request to complete, we'll miss hung threads or threads brutally killed. Maybe we can improve both?

Do we think this overhead is too great to run in production?

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

For puma based workers, we implement do_heartbeat_work to
log_long_running_requests. This method is run very frequently per minute,
so we ensure this check happens at most every 30 seconds.

The log message includes the requested URI path, the thread (in the same
PID:TID format that we log elsewhere) and the request duration so far.

Using the PID:TID information, users can then find the request
parameters by searching the logs for this PID:TID.
@jrafanie force-pushed the detect_long_running_requests branch from 2f4bcb5 to 61680c1 on August 17, 2018 02:04
@jrafanie
Member Author

❗️ - Line 112, Col 5 - Style/MethodCallWithArgsParentheses - Use parentheses for method calls with arguments.

I'm ok with this; the config.middleware stuff is a DSL and looks better without parens.
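
For context, the offense is about the middleware registration in config/application.rb, which looks roughly like the following (the surrounding module/class names are assumptions about where this lands in the app, not quoted from the diff):

module Vmdb
  class Application < Rails::Application
    # Register the middleware so every request gets tracked.
    # Written without parentheses, which is what the rubocop rule flags.
    config.middleware.use RequestStartedOnMiddleware
  end
end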

@jrafanie
Member Author

jrafanie commented Sep 5, 2018

@NickLaMuro what do you think... I've summarized the two approaches:

Rack middleware that does an ensure block:

  • Pros:
    • simpler
    • in thread, no thread local variables
    • minimal overhead
  • Cons:
    • Can't handle SIGKILL'd hung processes/threads
    • Can't handle deadlocked processes
    • We don't log this until the thread completes the work: could be many minutes after the UI returns a 502/proxy error

Rack middleware that uses thread local variables to track long running requests (THIS PR)

  • Pros:
    • Detects any length request, not just ones that will timeout/502.
    • Should log hung processes/threads the moment they exceed the request threshold, with minimal waiting (the logs will almost certainly have this shortly before or after the browser returns a proxy error).
    • Handles hung/deadlocked/SIGKILL'd processes (doesn't require the request to complete; we should detect that a request is hung before it's SIGKILL'd, and if a request is deadlocked we should still be able to log it since the monitoring thread isn't waiting on the same lock). A sketch of the worker-side check follows this list.
  • Cons:
    • More overhead
    • Thread local variable complexity
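
A rough sketch of the worker-side check for this PR's approach, assembled from the snippets quoted in the review comments (the enclosing worker class and the exact logger call are assumptions):

# Inside the puma-based worker class (assumed context).
CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds

def do_heartbeat_work
  log_long_running_requests
end

def log_long_running_requests
  # do_heartbeat_work runs many times per minute; only do this check every 30 seconds.
  return if @last_checked_hung_requests && @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

  RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
    message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
    Rails.logger.warn(message) # the actual logger used is an assumption
  end

  @last_checked_hung_requests = Time.now.utc
end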


def start_request(path, started_on)
Thread.current[:current_request] = path
Thread.current[:current_request_started_on] = started_on
Member

I like the thread local vars

@NickLaMuro (Member) left a comment

Okay, I admit part of this review is just to remove the X from my original review (EDIT: Bah... didn't work). Mostly minor suggestions and tweaks.


So, from what I can tell, if you switch to using env for the :current_request_started_on, you basically have no object allocations on a per-request basis. And since you are hooking into do_heartbeat_work, there really is no extra overhead that wasn't already there from how we run our workers. The .log_long_running_requests and RequestStartedOnMiddleware.long_running_requests calls are as good as you can get, and are basically what you would want to log anyway, regardless of the technique used (this versus an ensure-based approach).

Compared to what I suggested (ensure), this has the advantage of logging multiple times for the same request, and of being able to make use of the thread.backtrace if we wanted (see specific comments below regarding that).


def self.long_running_requests
requests = []
timed_out_request_started_on = request_timeout.ago.utc
Member

This variable name seems a bit wordy and confusing when read in the if below.

Maybe something like timeout_threshold or allowable_request_start_time, and then the if reads like:

if ... && started_on < allowable_request_start_time
  # ...
end

# There's a race condition where the complete_request method runs in another
# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
@NickLaMuro (Member) Sep 5, 2018

Just a note for the future: this becomes a bigger issue if we were to "react to a hung request" as you suggested here, but for now, this really isn't a big deal.

Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).

Member Author

@NickLaMuro 😆 🤣 @kbrock

end

def call(env)
start_request(env['PATH_INFO'], Time.now.utc)
Member

Pedantic, but can we extract this Time.now value from the env instead of looking it up ourselves? Does it make sense to make another Time.now allocation, or should we use what is there?

Honestly asking here though, because I am curious if you think there is an advantage or disadvantage to one or the other.

Member Author

Good question. I was thinking you're right... maybe we should use the one in the headers but then I was looking at how newrelic pulls the times out of the header and it looks faster to do it the way I already have it. Check it out here

Member Author

So, yeah, I think what I have is fine.

Member

Huh, I was under the impression that rack/puma would have added this info for us, but I guess not.

That said, Rails does do this, which led me down a rabbit hole of investigation into "how", and I am going to spike on something real quick as a POC, just so you can let me know what you think. This "idea" probably won't be superior in features, and is digging into the Rails private API a bit, but might be less code for us to maintain overall. We'll see...

# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && timed_out_request_started_on > started_on
duration = (Time.now.utc - started_on).to_f
Member

I think we can "cache" Time.now in a local variable above with request and started_on instead of looking it up every iteration of the loop.

Member

Also, something I noticed just now: are the .utc calls done throughout these changes even necessary? We never log the raw time, only the difference between the start and current time when logging long requests. I only mention this because we are trying to keep it as minimal/lightweight as possible, and calls to .utc seem pointless when dealing with time diffs.

Might be something that rubocop complained about though, so just something I noticed.

Member Author

👍 Yeah, since we're writing and reading the time, there's no need to store it as utc as long as it's always local

Member Author

Note, we're only running Time.now once for each long request, which shouldn't be that frequent. The alternative is to do it once for EVERY request.

return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Member

Would it make sense to log the thread.backtrace here as well? It would help engineers who are debugging find out where the bad parts of the codebase are. If this is really a hung process, multiple checks at the 30 second intervals would let us know whether we are looping in the same section of code on each check.

Member Author

Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much

Member Author

😆

Member

For the record: we can (and should) only do this in a followup; this was just something I thought was nice about this approach that we are able to do.

Maybe? I don't know if it would be too noisy though.

So my thought with this, and others, was:

  1. Ideally, no requests go over 10 seconds ever.
  2. When they do, and we have a reproducible environment, with zero changes (or maybe a settings tweak plus restart) we can hit a slow request and then hit the UI workers on a regular interval with this signal (SIGHUP possibly, or something unused by puma).

So going back to the "stacktrace" portion, this could be something that is only turned on when we are using a control signal, and the "light mode" is what is on all the time.

This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise).

Anyway, just some ideas.

Member Author

Ideally, no requests go over 10 seconds ever.

🤣

yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned.
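
To make the signal idea concrete, here is a hypothetical sketch (not part of this PR) of dumping backtraces for long running request threads on demand; the signal name is an assumption and would have to be one puma does not already trap:

# Hypothetical follow-up, not part of this PR.
Signal.trap("SIGALRM") do
  # Logger calls take a mutex, which is not allowed directly in trap context,
  # so hand the work off to a short-lived thread.
  Thread.new do
    RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
      Rails.logger.warn("Long running http(s) request: '#{request}' running for #{duration.round(2)} seconds")
      Rails.logger.warn(thread.backtrace.join("\n")) if thread.backtrace
    end
  end
end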

@NickLaMuro
Member

@jrafanie So I am taking some time to re-look over this and mull it over in my head, but to start with your original question:

@NickLaMuro what do you think... I've summarized the two approaches:

I have just a few corrections/additions while I continue to do some thinking and looking over what you have here:


ensure method

  • in thread, no thread local variables

Maybe worth noting that it is then "thread safe".

  • Can't handle SIGKILL'd hung processes/threads
  • Can't handle deadlocked processes

You combine these in the below method, and they probably could be combined here as well.

Might also be worth noting as a "Con" that it is reactive versus proactive. When looking at it proactively, we have the advantage of being able to log the current stacktrace, which can give us better insight into what is causing the deadlock/slowness.

"this PR" method

  • Should log hung process/threads the moment they exceed the request threshold...

Up to a 30 second(ish) delay here since we are only checking on a half minute basis. Your following comments do correct this a bit, but being more direct might be good when looking back on this.

  • Handles hung/deadlocked/SIGKILL'd processes (doesn't require the request to complete...

Maybe re-word this one. It doesn't "handle" anything; it can log prior to being killed and can report on long requests as they continue to be an issue (multiple times per request if needed).

Also, currently there is nothing in place for an at_exit, so it is possible that a 2 minute request doesn't get logged at all if memory balloons quickly and causes the MiqServer process to reap the worker before it is able to log (say a 90 second request uses 1.5 GB of memory, and then doesn't get logged prior to being reaped).

Adding an at_exit might cause additional complexity that we might not want, but it is worth mentioning (a rough sketch follows this comment).

  • Thread local variable complexity

On that same note, it might not hurt to add "general maintenance complexity" as well, plus "MIQ-specific" tooling that would be hard to extract into another gem. Compared to the ensure method, this is much harder to make generic, since we are hooking into assumed internals of our app (.do_heartbeat_work), versus just making another rack middleware.
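
A hypothetical sketch of the at_exit idea raised above (not part of this PR). It only fires on orderly exits, not on SIGKILL, and as written it reuses long_running_requests, so only requests already past the threshold would be reported; a real version would likely want every in-flight request regardless of duration.

# Hypothetical at_exit hook, e.g. installed in the worker's after_fork callback.
at_exit do
  RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
    Rails.logger.warn("Request '#{request}' did not complete before shutdown, " \
                      "handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, " \
                      "running for #{duration.round(2)} seconds")
  end
end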

@NickLaMuro dismissed their stale review September 5, 2018 23:54

Removing this 'X' for now. Don't think it is sending the right message at the moment.

@NickLaMuro (Member) left a comment

Something I forgot in my last review.

requests
end

REQUEST_TIMEOUT = 2.minutes
Member

Two quick suggestions about this (which I didn't think to add into my last review):

  • Maybe rename this to LONG_REQUEST, or something, since we really aren't "timing out" anything here. It is based on the apache timeout, but this simply logs a long request, and this is the value for what constitutes that.
  • Possibly make this only 1.minute? On the performance team, we were told to try and make anything over 10 seconds in the UI faster, so waiting until 2.minutes to say something is "long" seems like it is a bit extreme. When paired with my suggestion to possibly include the .backtrace with it, it seems like this would allow us to more quickly see when something is long and causing issues, and if the issue is moving from one code path to the next. And most requests (ideally) shouldn't get to this point where it is a problem.

Member Author

I agree on both. I like the LONG_REQUEST name... it makes more sense. I was thinking we could expose the 2.minutes value because I also see value in changing it. Maybe 1.minute across the board will be better, as it will pick up long requests that don't currently time out apache as well as requests that do. I'll change it to 1.minute and we'll go from there.
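
The change being agreed to here amounts to roughly the following (the request_timeout accessor name comes from the diff snippets above; sketch only):

class RequestStartedOnMiddleware
  # Renamed from REQUEST_TIMEOUT: nothing is actually timed out here; this is
  # just the threshold at which a request is considered "long".
  LONG_REQUEST = 1.minute

  # The class method the detection code calls (per the earlier review comment
  # about using the method rather than the constant).
  def self.request_timeout
    LONG_REQUEST
  end
end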

@NickLaMuro
Member

NickLaMuro commented Sep 6, 2018

Sorry @jrafanie, you are getting 📖ed by me today...

So I was thinking of a few things we could do with this as alternatives or "in addition to" what we have here. Most are later TODO's, and really shouldn't be considered for this PR, but I think are worth discussing:

  • This isn't really an addition to this code change, but would it be possible for you to dig up some previous BZs/PRs where the deadlocking/SIGKILL'ing/hung/etc. requests happened so we have something to test this and other approaches against? I assume (know) you have had a lot of experience with these kinds of issues over the years, and if there are some particular issues that you have faced and are aware of, it seems like this PR would be a good place to archive them for reference.
  • Would creating a control signal that one could fire to the MiqUIWorker process that would trigger this logging make sense, ala https://tenderlovemaking.com/2016/02/05/i-am-a-puts-debuggerer.html#i-have-a-deadlock-but-i-dont-know-where ???
  • Would doing something similar on SIGKILL also make sense? An at_exit in the MiqUIWorker possibly in the after_fork callback?

I bring up the first point in particular because I don't know what you have been doing "to verify this" [ref] in the past (edit: looks like you did describe how in the BZ), but it would be nice if it wasn't just adding a sleep 250 in some random request, but actual real life issues that could then further determine if my other suggestions are worthwhile (in the above and in general).

Things like bloating memory, long unoptimized SQL queries, actual deadlocks, etc., that were reproducible and could help validate this code better. Here are a few examples of some I have done (mostly SQL related, so no deadlock examples):

@jrafanie
Member Author

jrafanie commented Sep 6, 2018

This isn't really an addition to this code change, but would it be possible for you to dig up some previous BZs/PRs where the deadlocking/SIGKILL'ing/hung/etc. requests happened so we have something to test this and other approaches against? I assume (know) you have had a lot of experience with these kinds of issues over the years, and if there are some particular issues that you have faced and are aware of, it seems like this PR would be a good place to archive them for reference.

I'll look. I'm not sure how to reliably do this without first rigging the system by either giving puma less time to do the request or making the request long via sleep.

Would creating a control signal that one could fire to the MiqUIWorker process that would trigger this logging make sense, ala https://tenderlovemaking.com/2016/02/05/i-am-a-puts-debuggerer.html#i-have-a-deadlock-but-i-dont-know-where ???

Maybe? Log all threads or just the slow requests?

Would doing something similar on SIGKILL also make sense? An at_exit in the MiqUIWorker possibly in the after_fork callback?

We can't catch SIGKILL via at_exit. We can do other signals though. Is this a problem for at_exit? Normally, our problem is that the silly UI worker stays up too long while all 5 of its threads are hung/busy, so premature killing of the worker hasn't been a problem in my experience.

@miq-bot
Member

miq-bot commented Sep 6, 2018

Checked commits jrafanie/manageiq@cf34c1f~...b53091e with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 1 offense detected

config/application.rb

@gtanzillo added this to the Sprint 96 Ending Oct 8, 2018 milestone on Sep 26, 2018
@gtanzillo merged commit cc6e263 into ManageIQ:master on Sep 26, 2018
@jrafanie deleted the detect_long_running_requests branch on September 26, 2018 22:05
simaishi pushed a commit that referenced this pull request Oct 1, 2018
@simaishi
Contributor

simaishi commented Oct 1, 2018

Hammer backport details:

$ git log -1
commit 6a809cbd3dfbaccc89ff53f384bea797e3166508
Author: Gregg Tanzillo <[email protected]>
Date:   Wed Sep 26 17:51:58 2018 -0400

    Merge pull request #17842 from jrafanie/detect_long_running_requests
    
    Detect and log long running http(s) requests
    
    (cherry picked from commit cc6e263931c3b6bffc60af64168f9ba26207f2a9)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1614918


8 participants