
Detect and log long running http(s) requests #17842

Merged
merged 3 commits into ManageIQ:master from jrafanie:detect_long_running_requests
Sep 26, 2018

Conversation

@jrafanie
Member
commented Aug 10, 2018

When a request is made through apache, we have 2 minutes to complete it before apache times out, resulting in a browser 502 error. This pull request provides a mechanism for people who encounter these 502 errors to figure out which request and parameters are causing the problem. Typically the request is simply slow and needs to be optimized, but we have also had deadlocks occur, so it's very helpful to minimize the work required to narrow the problem down.

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

  • Add a middleware to detect long running requests

This middleware tracks each request and when it started using thread
local variables. long_running_requests traverses these thread local
variables to find requests beyond the 2 minute timeout. This
information can be retrieved by another thread, such as the heartbeating
thread of a puma worker, which can then respond to a timed out request.

  • Log long running requests via RequestStartedOnMiddleware

For puma-based workers, we implement do_heartbeat_work to call
log_long_running_requests. This method runs many times per minute,
so we ensure the check happens at most once every 30 seconds.
The log message includes the requested URI path, the thread (in the same
PID:TID format that we log elsewhere), and the request duration so far.
Using the PID:TID information, users can then find the request
parameters by searching the logs for that PID:TID. A sketch of the
middleware is included below.
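
For reference, a minimal sketch of the middleware described above, pieced together from the code snippets quoted later in this conversation. It assumes Rails/ActiveSupport is loaded (for present?, 1.minute, and .ago); the body of complete_request and the exact return shape of long_running_requests are reasonable guesses, not the merged code.

class RequestStartedOnMiddleware
  # Threshold for what counts as a "long" request (originally 2.minutes;
  # see the naming/threshold discussion further down).
  LONG_REQUEST = 1.minute

  def initialize(app)
    @app = app
  end

  def call(env)
    start_request(env['PATH_INFO'], Time.now.utc)
    @app.call(env)
  ensure
    complete_request
  end

  def start_request(path, started_on)
    Thread.current[:current_request] = path
    Thread.current[:current_request_started_on] = started_on
  end

  def complete_request
    # Assumed body: clear the thread locals so a finished request is no longer reported.
    Thread.current[:current_request] = nil
    Thread.current[:current_request_started_on] = nil
  end

  # Called from another thread (e.g. the worker's heartbeat thread) to find
  # requests that have been running longer than the threshold.
  def self.long_running_requests
    requests = []
    # Compute the cutoff once rather than per thread.
    timed_out_request_started_on = request_timeout.ago.utc
    relevant_thread_list.each do |thread|
      request    = thread[:current_request]
      started_on = thread[:current_request_started_on]
      if request.present? && started_on.kind_of?(Time) && started_on < timed_out_request_started_on
        requests << [request, Time.now.utc - started_on, thread]
      end
    end
    requests
  end

  def self.request_timeout
    LONG_REQUEST
  end

  # For testing: mocking Thread.list feels dangerous
  private_class_method def self.relevant_thread_list
    Thread.list
  end
end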

Example logging:

[----] I, [2018-08-10T14:57:10.587334 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:10 -0400
[----] I, [2018-08-10T14:57:11.231264 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.232892 #2788:357b5f4]  INFO -- : Completed 200 OK in 1ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.235108 #2788:357b8d8]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.237800 #2788:357b8d8]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.245526 #2788:357b324]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.247653 #2788:357b324]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.259623 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.261375 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.261793 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.265736 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.267544 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.268136 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.278956 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.280813 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.281137 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.285687 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.287741 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.288111 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T14:57:11.293100 #2788:357b5f4]  INFO -- : Started GET "/api/ping" for 127.0.0.1 at 2018-08-10 14:57:11 -0400
[----] I, [2018-08-10T14:57:11.295267 #2788:357b5f4]  INFO -- : Processing by Api::PingController#index as JSON
[----] I, [2018-08-10T14:57:11.295688 #2788:357b5f4]  INFO -- : Completed 200 OK in 0ms (Views: 0.1ms | ActiveRecord: 0.0ms)
[----] W, [2018-08-10T14:59:35.108849 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b8d8, running for 143.87 seconds
[----] W, [2018-08-10T14:59:35.109085 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b324, running for 143.86 seconds
[----] W, [2018-08-10T15:00:11.132154 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b8d8, running for 179.9 seconds
[----] W, [2018-08-10T15:00:11.132362 #2788:eb7c4c]  WARN -- : Long running http(s) request: '/api/ping' handled by #2788:357b324, running for 179.89 seconds
[----] I, [2018-08-10T15:00:11.239419 #2788:357b8d8]  INFO -- : Completed 200 OK in 180001ms (Views: 0.2ms | ActiveRecord: 0.0ms)
[----] I, [2018-08-10T15:00:11.248699 #2788:357b324]  INFO -- : Completed 200 OK in 180001ms (Views: 0.2ms | ActiveRecord: 0.0ms)

Note that, above, there are 2 requests that take 3 minutes to finish. Each one is logged twice (once at ~144 seconds and again at ~180 seconds).

@jrafanie
Member Author

@Fryguy @carbonin @bdunne @NickLaMuro please review with brutal honesty... This middleware is run on every request so any and all calculations/slow operations need to be minimized.

@NickLaMuro
Member

@jrafanie I will have to take some time to look at this later, but I have been downloading rbspy onto appliances to figure this out myself. Usually works pretty well.

Might be a decent alternative to consider, but I will take a look later at this as well.

@jrafanie
Member Author

@NickLaMuro well, as coded (currently), this PR enables this by default since you might not be able to get to the UI to enable it if your puma threads are hung up.

I'm willing to make this optional if the overhead is too much but ideally, we should just log this information whenever it happens.

@NickLaMuro
Member

@jrafanie no enabling necessary, you can run it against an existing pid:

$ rbspy report -p 12345

@jrafanie
Member Author

ah, @kbrock @dmetzger57 probably have opinions on this too

@jrafanie
Member Author

@jrafanie no enabling necessary, you can run it against an existing pid:

$ rbspy report -p 12345

@NickLaMuro My bad, I was talking about this PR. It's on by default. rbspy is nice for us :neckbeard: but I'm hoping the overhead of this PR is low enough that we can bake this into the application and not need external tools.

@NickLaMuro
Member

Nope, totally get that, just trying to suggest alternatives that might avoid more maintenance for us.

# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && REQUEST_TIMEOUT.ago > started_on
Member

Was this supposed to use the class method rather than the constant?

Member Author

👀 YES! Thanks!

Member Author

Refactoring for testing and forgetting to call the new method... It's a constant problem.

@jrafanie
Member Author

I updated the description with a logging example.

@jrafanie force-pushed the detect_long_running_requests branch from 8df16a6 to 3d9cf2e on August 10, 2018 19:10
log_long_running_requests
end

CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds
Member Author

who knows if every 30 seconds is too frequent or not

return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Member Author

I debated having the middleware method return the PID:TID format and rounded duration but chose to keep raw data there. I can be convinced to change it. 😉

Member

Does seem like this would be part of Rails itself. Or part of New Relic / Scout / Skylight.

This will be nice to proactively track stuff down.
Is there a way to make this chartable rather than dumping into another log?

Member Author

Yes, I can see making this more generic (it assumes a threaded web server like puma).

We had looked at an alternative, https://github.com/heroku/rack-timeout, which warns about the implications of raising a timeout the way it does and notes that it's more for debugging than for production. This PR is not meant to take action other than notifying us of a long running request: we don't care whether the threads are deadlocked or just doing too many things slowly, we just log the long running request. I honestly don't know that we ever want to take action other than log/notify people.

So, yes, if this works out for us, I can see making this a small standalone rack middleware with a generic interface.

@kbrock (Member) left a comment

Will this let us know about hung requests too?

This looks nice


# For testing: mocking Thread.list feels dangerous
private_class_method def self.relevant_thread_list
Thread.list
Member

Think this represents your race condition


@jrafanie
Member Author

Will this let us know about hung requests too?

Yes. I actually called this "hung requests" at first but since this detects all long running requests, including deadlocked ones and ones that just take really long to complete, I thought the wording was clearer as "long running requests."

This looks nice

Thanks, I'm hoping this will give us more clarity when people encounter 502 errors.

@NickLaMuro (Member) left a comment

My two cents on the topic.

For what it is worth, I have done some of this in ManagePerformance as well already.

start_request(env['PATH_INFO'], Time.now.utc)
response = @app.call(env)
ensure
complete_request
Member

Seems like you could avoid a lot of complexity with this by just logging as part of the middleware.

def call(env)
  start_time = Time.now
  begin
    @app.call(env)
  ensure
    end_time = Time.now
    log_to_request_time_dot_log(start_time, end_time) if too_long?(start_time, end_time)
  end
end

The extra Thread seems like a significant over-complication of what you are trying to do here, and I am unsure why it is needed in the first place.

@jrafanie (Member Author) Aug 13, 2018

Agreed, the thread part is the complicated part but is not without good reason. The problem is that the thread is hung and sometimes doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later than the user gets the 502 error. In all three of these cases, we won't get to the log_to_request_time_dot_log method until after (if ever) the user has captured or investigated the logs.

Member

doesn't ever finish because it's deadlocked, killed before finishing, or finishes much later

Fair enough. I have some points below, but this is mostly just followup, and my real status on the review is here:


So for these three:

  • "deadlocked": valid
  • "killed before finishing": true
  • "finishes much later than the user gets the 502": Well, this would still get logged, the data from the request would just end up being sent into the void.

That said, with the first two (where the first, a deadlock, would probably lead to the second happening eventually), I want to say that I have seen, when I have killed a process (at least with a Ctrl-C), that the "Completed" message still ends up in the logs. I have to do some digging into the Rails source to see how ActiveSupport::LogSubscriber handles control signals and allows this to happen, since that seems to be what is handling this here.

Curious if some form of a rescue/at_exit would allow this to work, but since you are trying to be proactive long term, this might just be more of a curiosity/musing for myself than anything.

Member Author

While I agree that the eventually-logged Completed message is something we can parse the logs for after the fact, I've seen too many logs where requests never Completed because of a deadlock, because the process was killed before finishing, or because the user captured the logs before it was logged.

The ONLY reason to do this "clever" in-memory thread-local-variable thing is that we can't trust we'll get that log message, and therefore can't trust that the request thread will ever reach the rack completion point, so we need another thread to track long requests. If there is another way, I'm open to suggestions.

def call(env)
begin
start_request(env['PATH_INFO'], Time.now.utc)
response = @app.call(env)
Member

This can just be @app.call(env) (without the response var). ensure doesn't mess with the result of the block/method:

$ ruby -e "puts begin; 'foo'; ensure; puts 'hi'; end"
hi
foo

Member Author

Good point, I'll fix this.

@NickLaMuro
Member

NickLaMuro commented Aug 13, 2018

Also worth noting, I recently uploaded some code for parsing this information from the existing logs:

NickLaMuro/miq_tools#8

(in gist form which is a single file and probably a bit easier to grok).

Might not hurt to add something that simply prints out dangling requests that don't have a Completed message accompanying them. Also, some simple filtering might not hurt either. I have it tossed up to miq_tools for my own usage (junk drawer), but I have zero issue with it being integrated somewhere in core if you think it is useful.

Specifically, I was using this when debugging the BZ for ManageIQ/manageiq-ui-classic#4447, since the app was also 502'ing and even getting reaped by the MiqServer because of memory thresholds. That is also where I was using rbspy to track down the point of failure.

@jrafanie
Member Author

jrafanie commented Aug 13, 2018

Also worth noting, I recently uploaded some code for parsing this information from the existing logs:

NickLaMuro/miq_tools#8

I think this is valuable to have regardless. I'm not sure where it should live but if it can parse PID:TID combinations spanning multiple compressed files (which it looks like it can do) to find each request that never completed or took too long, that would be extremely valuable.

That said, I was hoping to be more proactive in detecting this in process unless we believe this is problematic or too much overhead. Perhaps not now, but it would be nice to be able to react to a hung request at some point in the future.

# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && request_timeout.ago > started_on
Member

Will this loop take a long time?

Do we want to keep calculating request_timeout.ago, or precalculate it?

Member Author

yeah, good idea... there's no reason to keep calculating this. Time calculation is slow, we should do it once at the top 👍

@NickLaMuro
Member

(in reply to the log parser) I think this is valuable to have regardless. I'm not sure where it should live ...

Yup, which is why I have it where it is at for now just so there is somewhere people can pull it from and make use of it. And if we find a place for it to live cool, happy to move it.

That said, I was hoping to be more proactive in detecting this in process...

I see, didn't realize that was the case, so that changes my perspective a bit.

I will need to do some thinking and further digging into what you have here to form a better opinion. Admittedly, I have only glanced at this so far, so I probably will need some more time with this to give you a better answer.

For what it is worth, my ❌ is not meant to be a blocker, so if I don't get back to my review of this, then don't let that hold you up.

@jrafanie
Member Author

I will need to do some thinking and further digging into what you have here to form a better opinion. Admittedly, I have only glanced at this so far, so I probably will need some more time with this to give you a better answer.

For what it is worth, my ❌ is not meant to be a blocker, so if I don't get back to my review of this, then don't let that hold you up.

Yeah, this isn't something I want to force through.. I'd prefer we all have time to review it and are happy with it. Take your time.

@Fryguy
Member

Fryguy commented Aug 14, 2018

I'm fine with this conceptually, though I haven't fully reviewed the code. I am surprised there isn't a gem that just already does this. I understand that rack-timeout takes it a step further, but it still feels like a wheel that's already been invented.

@jrafanie
Member Author

I'm fine with this conceptually, though I haven't fully reviewed the code. I am surprised there isn't a gem that just already does this. I understand that rack-timeout takes it a step further, but it still feels like a wheel that's already been invented.

I've googled for rack long request, rack timeout request, rails long request...

Perhaps a middleware with an ensure block, as @NickLaMuro suggested, would help us (without the thread local variables) in all cases except when the threads are deadlocked or the process gets SIGKILL'd.

@jrafanie
Member Author

I'll make some of the suggested changes and we can decide what we want to do here.

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

This middleware tracks each request and when it started using thread
local variables.  long_running_requests traverses these thread local
variables to find requests beyond the 2 minute timeout.  This
information can be retrieved by another thread, such as the heartbeating
thread of a puma worker, which can then respond to a timed out request.
@jrafanie force-pushed the detect_long_running_requests branch from 3d9cf2e to 2f4bcb5 on August 17, 2018 01:59

def self.long_running_requests
requests = []
timed_out_request_started_on = request_timeout.ago.utc
Member Author

thanks Keenan for this idea ^


def call(env)
start_request(env['PATH_INFO'], Time.now.utc)
@app.call(env)
Member Author

NickL, thanks for this simplification

@jrafanie
Member Author

Ok, I pushed a few changes suggested by @NickLaMuro and @kbrock... I didn't actually run this again to verify I didn't break something... at this point, I'd like an idea of whether this is a direction we want to go.

I'm concerned that if we don't track this in process and instead rely on the rack request to complete, we'll miss hung threads or threads brutally killed. Maybe we can improve both?

Do we think this overhead is too great to run in production?

https://bugzilla.redhat.com/show_bug.cgi?id=1614918

For puma based workers, we implement do_heartbeat_work to
log_long_running_requests. This method is run very frequently per minute,
so we ensure this check happens at most every 30 seconds.

The log message includes the requested URI path, the thread (in the same
PID:TID format that we log elsewhere) and the request duration so far.

Using the PID:TID information, users can then find the request
parameters by searching the logs for this PID:TID.
@jrafanie force-pushed the detect_long_running_requests branch from 2f4bcb5 to 61680c1 on August 17, 2018 02:04
@jrafanie
Member Author

❗️ - Line 112, Col 5 - Style/MethodCallWithArgsParentheses - Use parentheses for method calls with arguments.

I'm ok with this; the config.middleware stuff is a DSL and looks better without parens.
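
For context, the offense is about the middleware registration in config/application.rb, which looks roughly like the following (the surrounding module/class names are assumptions about where this lands in the app, not quoted from the diff):

module Vmdb
  class Application < Rails::Application
    # Register the middleware so every request gets tracked.
    # Written without parentheses, which is what the rubocop rule flags.
    config.middleware.use RequestStartedOnMiddleware
  end
end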

@jrafanie
Member Author

jrafanie commented Sep 5, 2018

@NickLaMuro what do you think... I've summarized the two approaches:

Rack middleware that does an ensure block:

  • Pros:
    • simpler
    • in thread, no thread local variables
    • minimal overhead
  • Cons:
    • Can't handle SIGKILL'd hung processes/threads
    • Can't handle deadlocked processes
    • We don't log this until the thread completes the work: could be many minutes after the UI returns a 502/proxy error

Rack middleware that uses thread local variables to track long running requests (THIS PR)

  • Pros:
    • Detects any length request, not just ones that will timeout/502.
    • Should log hung processes/threads the moment they exceed the request threshold, with minimal waiting (the logs will almost certainly have this shortly before or after the browser returns a proxy error).
    • Handles hung/deadlocked/SIGKILL'd processes (doesn't require the request to complete; we should detect that a request is hung before it's SIGKILL'd, and if a request is deadlocked we should still be able to log it since the monitoring thread isn't waiting on the same lock). A sketch of the worker-side check follows this list.
  • Cons:
    • More overhead
    • Thread local variable complexity
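
A rough sketch of the worker-side check for this PR's approach, assembled from the snippets quoted in the review comments (the enclosing worker class and the exact logger call are assumptions):

# Inside the puma-based worker class (assumed context).
CHECK_LONG_RUNNING_REQUESTS_INTERVAL = 30.seconds

def do_heartbeat_work
  log_long_running_requests
end

def log_long_running_requests
  # do_heartbeat_work runs many times per minute; only do this check every 30 seconds.
  return if @last_checked_hung_requests && @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

  RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
    message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
    Rails.logger.warn(message) # the actual logger used is an assumption
  end

  @last_checked_hung_requests = Time.now.utc
end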


def start_request(path, started_on)
Thread.current[:current_request] = path
Thread.current[:current_request_started_on] = started_on
Member

I like the thread local vars

@NickLaMuro (Member) left a comment

Okay, I admit part of this review is just to remove the X from my original review (EDIT: Bah... didn't work). Mostly minor suggestions and tweaks.


So, from what I can tell, if you switch to using env for the :current_request_started_on, you basically have no object allocations on a per-request basis. And since you are hooking into do_heartbeat_work, there really is no extra overhead that wasn't already there from how we run our workers. The .log_long_running_requests and RequestStartedOnMiddleware.long_running_requests calls are as good as you can get, and are basically what you would want to log anyway, regardless of the technique used (this versus an ensure-based approach).

Compared to what I suggested (ensure), this has the advantage of logging multiple times for the same request, and of being able to make use of the thread.backtrace if we wanted (see specific comments below regarding that).


def self.long_running_requests
requests = []
timed_out_request_started_on = request_timeout.ago.utc
Member

This variable name seems a bit wordy and confusing when read in the if below.

Maybe something like timeout_threshold or allowable_request_start_time, and then the if reads like:

if ... && started_on < allowable_request_start_time
  # ...
end

# There's a race condition where the complete_request method runs in another
# thread after we set one or more of the above local variables. The fallout
# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
@NickLaMuro (Member) Sep 5, 2018

Just a note for the future: this becomes a bigger issue if we were to "react to a hung request" as you suggested here, but for now, this really isn't a big deal.

Logging something that was a "close to a 2 min request" is going to anger just about no one (except @kbrock, who probably thinks we log too much as it is... which is fair, but not relevant to my point).

Member Author

@NickLaMuro 😆 🤣 @kbrock

end

def call(env)
start_request(env['PATH_INFO'], Time.now.utc)
Member

Pedantic, but can we extract this Time.now value from the env instead of looking it up ourselves? Does it make sense to make another Time.now allocation, or should we use what is there?

Honestly asking here though, because I am curious if you think there is an advantage or disadvantage to one or the other.

Member Author

Good question. I was thinking you're right... maybe we should use the one in the headers but then I was looking at how newrelic pulls the times out of the header and it looks faster to do it the way I already have it. Check it out here

Member Author

So, yeah, I think what I have is fine.

Member

Huh, I was under the impression that rack/puma would have added this info for us, but I guess not.

That said, Rails does do this, which led me down a rabbit hole of investigation into "how", and I am going to spike on something real quick as a POC, just so you can let me know what you think. This "idea" probably won't be superior in features, and is digging into the Rails private API a bit, but might be less code for us to maintain overall. We'll see...

# of this is we return a false positive for a request that finished very close
# to the 2 minute timeout.
if request.present? && started_on.kind_of?(Time) && timed_out_request_started_on > started_on
duration = (Time.now.utc - started_on).to_f
Member

I think we can "cache" Time.now in a local variable above with request and started_on instead of looking it up every iteration of the loop.

Member

Also, something I noticed just now: are the .utc calls done throughout these changes even necessary? We never log the raw time, only the difference between the start and current time when logging long requests. I only mention this because we are trying to keep it as minimal/lightweight as possible, and calls to .utc seem pointless when dealing with time diffs.

Might be something that rubocop complained about though, so just something I noticed.

Member Author

👍 Yeah, since we're writing and reading the time, there's no need to store it as utc as long as it's always local

Member Author

Note, we're only running Time.now once for each long request, which shouldn't be that frequent. The alternative is to do it once for EVERY request.

return if @last_checked_hung_requests > CHECK_LONG_RUNNING_REQUESTS_INTERVAL.ago

RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
message = "Long running http(s) request: '#{request}' handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, running for #{duration.round(2)} seconds"
Member

Would it make sense to log the thread.backtrace here as well? It would help engineers who are debugging find out where the bad parts of the codebase are. If this is really a hung process, multiple checks at the 30 second intervals would let us know whether we are looping in the same section of code on each check.

Member Author

Maybe? I don't know if it would be too noisy though. I hear @kbrock thinks we log too much

Member Author

😆

Member

For the record: we can (and should) only do this in a followup; this was just something I thought was nice about this approach that we are able to do.

Maybe? I don't know if it would be too noisy though.

So my thought with this, and others, was:

  1. Ideally, no requests go over 10 seconds ever.
  2. When they do, and we have a reproducible environment, with zero changes (or maybe a settings tweak plus restart) we can hit a slow request and then hit the UI workers on a regular interval with this signal (SIGHUP possibly, or something unused by puma).

So going back to the "stacktrace" portion, this could be something that is only turned on when we are using a control signal, and the "light mode" is what is on all the time.

This could even be scripted so that once the "bad url" is determined, it can be hit, and another thread waits 1 minute, and then starts hitting the running UI workers with this signal. The whole script would run until some form of a result is returned from the server (502 or otherwise).

Anyway, just some ideas.

Member Author

Ideally, no requests go over 10 seconds ever.

🤣

yeah, we can do the backtrace with some signal but we have to be careful of puma's signal handling like you mentioned.
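
To make the signal idea concrete, here is a hypothetical sketch (not part of this PR) of dumping backtraces for long running request threads on demand; the signal name is an assumption and would have to be one puma does not already trap:

# Hypothetical follow-up, not part of this PR.
Signal.trap("SIGALRM") do
  # Logger calls take a mutex, which is not allowed directly in trap context,
  # so hand the work off to a short-lived thread.
  Thread.new do
    RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
      Rails.logger.warn("Long running http(s) request: '#{request}' running for #{duration.round(2)} seconds")
      Rails.logger.warn(thread.backtrace.join("\n")) if thread.backtrace
    end
  end
end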

@NickLaMuro
Member

@jrafanie So I am taking some time to re-look over this and mull it over in my head, but to start with your original question:

@NickLaMuro what do you think... I've summarized the two approaches:

I have just a few corrections/additions while I continue to do some thinking and looking over what you have here:


ensure method

  • in thread, no thread local variables

Maybe worth noting that it is then "thread safe".

  • Can't handle SIGKILL'd hung processes/threads
  • Can't handle deadlocked processes

You combine these in the below method, and they probably could be combined here as well.

Might also be worth noting as a "Con" that it is reactive versus proactive. When looking at it proactively, we have the advantage of being able to log the current stacktrace, which can give us better insight into what is causing the deadlock/slowness.

"this PR" method

  • Should log hung process/threads the moment they exceed the request threshold...

Up to a 30 second(ish) delay here since we are only checking on a half minute basis. Your following comments do correct this a bit, but being more direct might be good when looking back on this.

  • Handles hung/deadlocked/SIGKILL'd processes (doesn't require the request to complete...

Maybe re-word this one. It doesn't "handle" anything; it can log prior to being killed and can report on long requests as they continue to be an issue (multiple times per request if needed).

Also, currently there is nothing in place for an at_exit, so it is possible that a 2 minute request doesn't get logged at all if memory balloons quickly and causes the MiqServer process to reap the worker before it is able to log (say a 90 second request uses 1.5 GB of memory, and then doesn't get logged prior to being reaped).

Adding an at_exit might cause additional complexity that we might not want, but it is worth mentioning (a rough sketch follows this comment).

  • Thread local variable complexity

On that same note, it might not hurt to add "general maintenance complexity" as well, plus "MIQ-specific" tooling that would be hard to extract into another gem. Compared to the ensure method, this is much harder to make generic, since we are hooking into assumed internals of our app (.do_heartbeat_work), versus just making another rack middleware.
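
A hypothetical sketch of the at_exit idea raised above (not part of this PR). It only fires on orderly exits, not on SIGKILL, and as written it reuses long_running_requests, so only requests already past the threshold would be reported; a real version would likely want every in-flight request regardless of duration.

# Hypothetical at_exit hook, e.g. installed in the worker's after_fork callback.
at_exit do
  RequestStartedOnMiddleware.long_running_requests.each do |request, duration, thread|
    Rails.logger.warn("Request '#{request}' did not complete before shutdown, " \
                      "handled by ##{Process.pid}:#{thread.object_id.to_s(16)}, " \
                      "running for #{duration.round(2)} seconds")
  end
end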

@NickLaMuro dismissed their stale review September 5, 2018 23:54

Removing this 'X' for now. Don't think it is sending the right message at the moment.

@NickLaMuro (Member) left a comment

Something I forgot in my last review.

requests
end

REQUEST_TIMEOUT = 2.minutes
Member

Two quick suggestions about this (which I didn't think to add into my last review):

  • Maybe rename this to LONG_REQUEST, or something, since we really aren't "timing out" anything here. It is based on the apache timeout, but this simply logs a long request, and this is the value for what constitutes that.
  • Possibly make this only 1.minute? On the performance team, we were told to try and make anything over 10 seconds in the UI faster, so waiting until 2.minutes to say something is "long" seems like it is a bit extreme. When paired with my suggestion to possibly include the .backtrace with it, it seems like this would allow us to more quickly see when something is long and causing issues, and if the issue is moving from one code path to the next. And most requests (ideally) shouldn't get to this point where it is a problem.

Member Author

I agree on both. I like the LONG_REQUEST name... it makes more sense. I was thinking we could expose the 2.minutes value because I also see value in changing it. Maybe 1.minute across the board will be better, as it will pick up long requests that don't currently time out apache as well as requests that do. I'll change it to 1.minute and we'll go from there.
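
The change being agreed to here amounts to roughly the following (the request_timeout accessor name comes from the diff snippets above; sketch only):

class RequestStartedOnMiddleware
  # Renamed from REQUEST_TIMEOUT: nothing is actually timed out here; this is
  # just the threshold at which a request is considered "long".
  LONG_REQUEST = 1.minute

  # The class method the detection code calls (per the earlier review comment
  # about using the method rather than the constant).
  def self.request_timeout
    LONG_REQUEST
  end
end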

@NickLaMuro
Member

NickLaMuro commented Sep 6, 2018

Sorry @jrafanie, you are getting 📖ed by me today...

So I was thinking of a few things we could do with this as alternatives or "in addition to" what we have here. Most are later TODO's, and really shouldn't be considered for this PR, but I think are worth discussing:

  • This isn't really an addition to this code change, but would it be possible for you to dig up some previous BZs/PRs where the deadlocking/SIGKILL'ing/hung/etc. requests happened so we have something to test this and other approaches against? I assume (know) you have had a lot of experience with these kinds of issues over the years, and if there are some particular issues that you have faced and are aware of, it seems like this PR would be a good place to archive them for reference.
  • Would creating a control signal that one could fire to the MiqUIWorker process that would trigger this logging make sense, ala https://tenderlovemaking.com/2016/02/05/i-am-a-puts-debuggerer.html#i-have-a-deadlock-but-i-dont-know-where ???
  • Would doing something similar on SIGKILL also make sense? An at_exit in the MiqUIWorker possibly in the after_fork callback?

I bring up the first point in particular because I don't know what you have been doing "to verify this" [ref] in the past (edit: looks like you did describe how in the BZ), but it would be nice if it wasn't just adding a sleep 250 in some random request, but actual real life issues that could then further determine if my other suggestions are worthwhile (in the above and in general).

Things like bloating memory, long unoptimized SQL queries, actual deadlocks, etc., that were reproducible and could help validate this code better. Here are a few examples of some I have done (mostly SQL related, so no deadlock examples):

@jrafanie
Member Author

jrafanie commented Sep 6, 2018

This isn't really an addition to this code change, but would it be possible for you to dig up some previous BZs/PRs where the deadlocking/SIGKILL'ing/hung/etc. requests happened so we have something to test this and other approaches against? I assume (know) you have had a lot of experience with these kinds of issues over the years, and if there are some particular issues that you have faced and are aware of, it seems like this PR would be a good place to archive them for reference.

I'll look. I'm not sure how to reliably do this without first rigging the system by either giving puma less time to do the request or making the request long via sleep.

Would creating a control signal that one could fire to the MiqUIWorker process that would trigger this logging make sense, ala https://tenderlovemaking.com/2016/02/05/i-am-a-puts-debuggerer.html#i-have-a-deadlock-but-i-dont-know-where ???

Maybe? Log all threads or just the slow requests?

Would doing something similar on SIGKILL also make sense? An at_exit in the MiqUIWorker possibly in the after_fork callback?

We can't catch SIGKILL via at_exit. We can do other signals though. Is this a problem for at_exit? Normally, our problem is that the silly UI worker stays up too long while all 5 of its threads are hung/busy, so premature killing of the worker hasn't been a problem in my experience.

@miq-bot
Member

miq-bot commented Sep 6, 2018

Checked commits jrafanie/manageiq@cf34c1f~...b53091e with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 1 offense detected

config/application.rb

@gtanzillo added this to the Sprint 96 Ending Oct 8, 2018 milestone on Sep 26, 2018
@gtanzillo merged commit cc6e263 into ManageIQ:master on Sep 26, 2018
@jrafanie deleted the detect_long_running_requests branch on September 26, 2018 22:05
simaishi pushed a commit that referenced this pull request Oct 1, 2018
@simaishi
Contributor

simaishi commented Oct 1, 2018

Hammer backport details:

$ git log -1
commit 6a809cbd3dfbaccc89ff53f384bea797e3166508
Author: Gregg Tanzillo <[email protected]>
Date:   Wed Sep 26 17:51:58 2018 -0400

    Merge pull request #17842 from jrafanie/detect_long_running_requests
    
    Detect and log long running http(s) requests
    
    (cherry picked from commit cc6e263931c3b6bffc60af64168f9ba26207f2a9)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1614918


8 participants