Continued cycling/killing of puma workers (wrong ones?) #14

travisp · 2015-06-01T16:14:33Z

Running 5 puma workers on a 2X heroku dyno, we encountered a situation when the memory ran out (after many hours) where puma_worker_killer was continually killing and cycling workers with indexes 1-3. The memory was not dropping to an acceptable level, and puma_worker_killer did not try sending a TERM to the other workers. My best guess is that the reported memory usage for the individual workers was wrong (i.e. largest_worker was wrong) and so it was killing the wrong ones. This has happened multiple times for us.

It seems the potential mitigating solutions are:

only run on 1X dynos with fewer workers (but we've also seen this here)
add an option, similar to unicorn_worker_killer, allowing terminating a worker after a certain number of requests
store information about recently terminated workers and consider trying to have it terminate a different worker than largest_worker, especially if the previous attempt didn't have success, or the previous attempt was for largest_worker

The text was updated successfully, but these errors were encountered:

schneems · 2015-06-04T15:04:40Z

Check out #8 (comment)

It's due to the way CoW optimized Ruby apps work. It sounds like you've got a leak or other memory problem that PWK is covering up. Have you seen https://github.com/schneems/derailed_benchmarks ? Maybe try to decrease your overall memory use.

travisp · 2015-06-10T15:16:45Z

I don't disagree that we've likely got a memory leak that becomes an issue after several hours. We're doing what we can to identify the issue, but it would be nice if PWK could help with this issue in the meantime, as the read me suggests:

If you have a memory leak in your code, finding and plugging it can be a herculean effort. Instead what if you just killed your processes when they got to be too large? The Puma Worker Killer does just that. Similar to Unicorn Worker Killer but for the Puma web server.

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

schneems · 2015-06-23T18:36:22Z

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

I honestly don't know. It would depend on the implementation. The UWK relies on some metaprogramming to open up Unicorn classes to record the count. My gut instinct is that it won't fit into the current core codebase and might be a better separate lib but IDK.

Would killing off workers after x minutes instead of x requests work for your use case? I think that would be much easier, spawn a new thread, record the current time, every 10 seconds wake up to see if it's past the maximum time. When that occurs do a rolling worker restart (kill each worker with a slight delay between them so throughput doesn't just go away.

travisp · 2015-07-01T15:16:31Z

@schneems I think allowing a rolling worker restart after some period of time could be a very helpful solution. We've managed to mostly control the problem for now through some pretty aggressive tuning of Ruby GC settings (at a moderately large cost to performance), but I could see still being useful to us and others at times.

schneems · 2015-07-03T18:40:35Z

Great, I think that such a feature will be fairly straight forward. I'm off on paternity leave right now. I'll be back after July 15th. If you don't hear from me before then, would you mind pinging this thread?

forrestblount · 2015-07-23T16:06:21Z

I would also be interested in this rolling restart solution. We use some pdf manipulation libraries that always end up chewing into swap.

schneems · 2015-07-23T19:43:10Z

Take a look at #16 if it looks good and tests pass, i'll merge it into master and you can try it out.

forrestblount · 2015-07-24T00:07:09Z

Doesn't look like the tests pass, but have set it up in our staging environment and will let you know if I have any issues.

forrestblount · 2015-07-28T16:40:10Z

Because I'm using this on Heroku I added an initializer with only one line:
PumaWorkerKiller.enable_rolling_restart(4 * 3600)

It doesn't look like it is restarting every few hours though. Did I misunderstand the documentation?

schneems · 2015-07-28T16:51:16Z

Are you using any kind of logging addon like papertrail? You can grep for "PumaWorkerKiller: Rolling Restart". You can also try it locally and set to something really low, like 10 seconds just to verify it's set up correctly.

forrestblount · 2015-08-03T16:51:03Z

We use logentries. It's definitely not working for me. Even if I use the full config, I never seeing rolling restarts, even locally with 10s.

Things I've tried:

using PumaWorkerKiller.enable_rolling_restart(10) in an initializer
starting using foreman start
starting using rails s
using the more elaborate config

PumaWorkerKiller.config do |config|
config.ram = 6144 # mb
config.frequency = 5 # seconds
config.percent_usage = 0.98
config.rolling_restart_frequency = 10 #4 * 3600
end
PumaWorkerKiller.start

Could it be related to my puma server configuration? Wondering if the @cluster isn't getting set correctly -- that seems to be the only place where the Rolling Restart would kill itself.

Here's my puma config (v 2.12.2):
workers Integer(ENV['WEB_CONCURRENCY'] || 2)
threads_count = Integer(ENV['MAX_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
ActiveSupport.on_load(:active_record) do
ActiveRecord::Base.establish_connection
end
end

Thanks in advance for your help.

schneems · 2015-08-05T15:48:28Z

I realized I didn't call start on the AutoReap.new instance. It's fixed in the branch and should be executing now. Unfortunately my tests are pretty non-deterministic. Once I get those sorted out i'll merge. Thanks for the feedback and your patience.

forrestblount · 2015-08-05T20:43:15Z

Thanks for creating this helpful tool! I can confirm it's now working as expected.

schneems · 2015-08-14T22:22:34Z

🚀 I'm curious. What do you think a good default reap time would good be? I think i've currently got it set for 12 hours. I want to not have to bring down your workers more than needed since it does affect throughput when it happens, however I don't want to wait so long that you're swapping. 2x a day seemed like an okay guess.

Second question: did you see any noticeable dip in throughput when the reaper kicks in? I put some randomization in there so not every server tries to restart at the same time, but i'm not sure if it's enough (it's plus/minus 5 seconds). If you're seeing a big drop in throughput we might want to make this a larger delta. You might not see a problem until adding a large number of servers, or maybe it's not that bad.

forrestblount · 2015-08-16T18:08:50Z

12 hours isn't bad for a default, but it really depends on the size of the
dyno and number of workers running. For us to stay on the 2x dynos, I've
got it set to run every 4 hours in staging.

Because we're typically running < 5 dynos, I am seeing some issues with
throughput. I think a delta of 1-3m would work better here - the idea being
that only 1-2 dynos are ever out of commission at the same time. Is this a
moot point if the app has preboot enabled? I've only been testing in
staging so far (where I have noticed throughput dropping to near 0, but
where we typically run 1-2 dynos).

On Fri, Aug 14, 2015 at 3:22 PM Richard Schneeman [email protected]
wrote:

[image: 🚀] I'm curious. What do you think a good default reap time
would good be? I think i've currently got it set for 12 hours. I want to
not have to bring down your workers more than needed since it does affect
throughput when it happens, however I don't want to wait so long that
you're swapping. 2x a day seemed like an okay guess.

Second question: did you see any noticeable dip in throughput when the
reaper kicks in? I put some randomization in there so not every server
tries to restart at the same time, but i'm not sure if it's enough (it's
plus/minus 5 seconds). If you're seeing a big drop in throughput we might
want to make this a larger delta. You might not see a problem until adding
a large number of servers, or maybe it's not that bad.

—
Reply to this email directly or view it on GitHub
#14 (comment)
.

jvenezia · 2016-05-19T15:15:05Z

Hi,

I'm trying the exact same thing than @forrestblount in a local development env.
I can't see any restart in the server logs, and @cluster.running? is always returning nil witch I assume prevents restarting my workers.

Any Idea why? Thanks!

schneems · 2016-05-26T15:51:03Z

The thread that basically sleeps for 12 hours is probably either not getting started, or getting started in a weird place. Is preload app set to true on your puma? I think when it isn't enabled there's an edgecase we're not properly handling.

jvenezia · 2016-05-26T16:13:35Z

Hi, I'm using app preload. Here is my puma settings:

workers Integer(ENV['PUMA_WORKERS'] || 2)
threads_count = Integer(ENV['PUMA_WORKER_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    config = ActiveRecord::Base.configurations[Rails.env] || Rails.application.config.database_configuration[Rails.env]
    config['pool'] = threads_count
    ActiveRecord::Base.establish_connection(config)
  end
end

schneems · 2016-06-07T18:16:50Z

Can you give me an example app that reproduces this in development?

travisp mentioned this issue Jun 1, 2015

workers being constantly booted #8

Closed

schneems closed this as completed Nov 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Continued cycling/killing of puma workers (wrong ones?) #14

Continued cycling/killing of puma workers (wrong ones?) #14

travisp commented Jun 1, 2015

schneems commented Jun 4, 2015

travisp commented Jun 10, 2015

schneems commented Jun 23, 2015

travisp commented Jul 1, 2015

schneems commented Jul 3, 2015

forrestblount commented Jul 23, 2015

schneems commented Jul 23, 2015

forrestblount commented Jul 24, 2015

forrestblount commented Jul 28, 2015

schneems commented Jul 28, 2015

forrestblount commented Aug 3, 2015

schneems commented Aug 5, 2015

forrestblount commented Aug 5, 2015

schneems commented Aug 14, 2015

forrestblount commented Aug 16, 2015

jvenezia commented May 19, 2016

schneems commented May 26, 2016

jvenezia commented May 26, 2016

schneems commented Jun 7, 2016

Continued cycling/killing of puma workers (wrong ones?) #14

Continued cycling/killing of puma workers (wrong ones?) #14

Comments

travisp commented Jun 1, 2015

schneems commented Jun 4, 2015

travisp commented Jun 10, 2015

schneems commented Jun 23, 2015

travisp commented Jul 1, 2015

schneems commented Jul 3, 2015

forrestblount commented Jul 23, 2015

schneems commented Jul 23, 2015

forrestblount commented Jul 24, 2015

forrestblount commented Jul 28, 2015

schneems commented Jul 28, 2015

forrestblount commented Aug 3, 2015

schneems commented Aug 5, 2015

forrestblount commented Aug 5, 2015

schneems commented Aug 14, 2015

forrestblount commented Aug 16, 2015

jvenezia commented May 19, 2016

schneems commented May 26, 2016

jvenezia commented May 26, 2016

schneems commented Jun 7, 2016