Continued cycling/killing of puma workers (wrong ones?) #14
Check out #8 (comment). It's due to the way CoW-optimized Ruby apps work. It sounds like you've got a leak or other memory problem that PWK is covering up. Have you seen https://github.com/schneems/derailed_benchmarks ? Maybe try to decrease your overall memory use.
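For reference, a typical way to run derailed_benchmarks against an app looks roughly like this (a sketch based on the gem's documented tasks; task names may have changed between versions):

gem 'derailed_benchmarks', group: :development   # add to the Gemfile, then bundle
bundle exec derailed bundle:mem                   # memory used at require time, broken down per gem
bundle exec derailed exec perf:mem_over_time      # watch RSS grow while requests are made against the app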
I don't disagree that we've likely got a memory leak that becomes an issue after several hours. We're doing what we can to identify it, but it would be nice if PWK could help in the meantime, as the readme suggests:
Would you be open to some sort of max_requests option, similar to unicorn worker killer?
I honestly don't know. It would depend on the implementation. UWK relies on some metaprogramming to open up Unicorn classes to record the request count. My gut instinct is that it won't fit into the current core codebase and might be a better separate lib, but I don't know. Would killing off workers after x minutes instead of x requests work for your use case? I think that would be much easier: spawn a new thread, record the current time, and every 10 seconds wake up to see if it's past the maximum time. When that occurs, do a rolling worker restart (kill each worker with a slight delay between them so throughput doesn't just go away).
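A minimal sketch of that idea (the names max_lifetime and worker_pids are hypothetical placeholders; this is not the code that later shipped in PWK):

max_lifetime = 6 * 3600            # hypothetical: restart workers older than 6 hours
worker_pids  = []                  # hypothetical: PIDs of the puma workers, however you obtain them

Thread.new do
  started_at = Time.now
  loop do
    sleep 10                                   # wake up every 10 seconds
    next unless Time.now - started_at > max_lifetime
    worker_pids.each do |pid|
      Process.kill("TERM", pid)                # restart one worker...
      sleep 5                                  # ...with a delay so throughput doesn't just go away
    end
    started_at = Time.now                      # reset the clock after a full rolling restart
  end
end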
@schneems I think allowing a rolling worker restart after some period of time could be a very helpful solution. We've managed to mostly control the problem for now through some pretty aggressive tuning of Ruby GC settings (at a moderately large cost to performance), but I could see this still being useful to us and others at times.
Great, I think that such a feature will be fairly straightforward. I'm off on paternity leave right now. I'll be back after July 15th. If you don't hear from me before then, would you mind pinging this thread?
I would also be interested in this rolling restart solution. We use some PDF manipulation libraries that always end up chewing into swap.
Take a look at #16. If it looks good and the tests pass, I'll merge it into master and you can try it out.
Doesn't look like the tests pass, but I've set it up in our staging environment and will let you know if I have any issues.
Because I'm using this on Heroku, I added an initializer with only one line. It doesn't look like it is restarting every few hours, though. Did I misunderstand the documentation?
Are you using any kind of logging addon like Papertrail? You can grep for "PumaWorkerKiller: Rolling Restart". You can also try it locally with the restart frequency set to something really low, like 10 seconds, just to verify it's set up correctly.
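For example, something along these lines in an initializer should trigger rolling restarts quickly enough to verify locally (a sketch using the README's enable_rolling_restart helper; the 10-second value is only for testing):

# config/initializers/puma_worker_killer.rb -- local verification only
PumaWorkerKiller.enable_rolling_restart(10)   # rolling restart roughly every 10 seconds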
We use Logentries. It's definitely not working for me. Even if I use the full config, I never see rolling restarts, even locally with a 10s frequency. Things I've tried:

PumaWorkerKiller.config do |config|

Could it be related to my puma server configuration? Wondering if the @cluster isn't getting set correctly -- that seems to be the only place where the rolling restart would kill itself. Here's my puma config (v 2.12.2):

preload_app!
rackup DefaultRackup
on_worker_boot do

Thanks in advance for your help.
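For comparison, a full setup typically looks roughly like this (a sketch following the options documented in the README; the ram, frequency, and percent_usage values here are only placeholders):

# config/initializers/puma_worker_killer.rb
PumaWorkerKiller.config do |config|
  config.ram                       = 1024       # total RAM available, in MB (placeholder)
  config.frequency                 = 5          # how often to check memory, in seconds
  config.percent_usage             = 0.98       # reap the largest worker above this fraction of ram
  config.rolling_restart_frequency = 12 * 3600  # rolling restart every 12 hours, in seconds
end
PumaWorkerKiller.start   # nothing runs until start (or enable_rolling_restart) is called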
I realized I didn't call
Thanks for creating this helpful tool! I can confirm it's now working as expected.
🚀 I'm curious: what do you think a good default reap time would be? I think I've currently got it set for 12 hours. I don't want to bring down your workers more often than needed, since it does affect throughput when it happens, but I also don't want to wait so long that you're swapping. 2x a day seemed like an okay guess. Second question: did you see any noticeable dip in throughput when the reaper kicks in? I put some randomization in there so not every server tries to restart at the same time, but I'm not sure if it's enough (it's plus/minus 5 seconds). If you're seeing a big drop in throughput we might want to make this a larger delta. You might not see a problem until adding a large number of servers, or maybe it's not that bad.
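The randomization described here is essentially a splay around the configured restart interval, something like this (a sketch of the idea only, not PWK's actual implementation; rolling_restart_frequency is assumed to be the interval in seconds):

splay = rand(-5.0..5.0)                   # +/- 5 seconds of jitter
sleep(rolling_restart_frequency + splay)  # so a fleet of servers doesn't all restart in lockstep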
12 hours isn't bad for a default, but it really depends on the size of the app. Because we're typically running < 5 dynos, I am seeing some issues with throughput when the reaper kicks in.
Hi, I'm trying the exact same thing as @forrestblount in a local development env, and it isn't working for me either. Any idea why? Thanks!
The thread that basically sleeps for 12 hours is probably either not getting started, or getting started in a weird place. Is preload_app set to true in your puma config? I think when it isn't enabled there's an edge case we're not properly handling.
Hi, I'm using app preload. Here are my puma settings:

workers Integer(ENV['PUMA_WORKERS'] || 2)
threads_count = Integer(ENV['PUMA_WORKER_THREADS'] || 5)
threads threads_count, threads_count
preload_app!
rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'
on_worker_boot do
ActiveSupport.on_load(:active_record) do
config = ActiveRecord::Base.configurations[Rails.env] || Rails.application.config.database_configuration[Rails.env]
config['pool'] = threads_count
ActiveRecord::Base.establish_connection(config)
end
end
Can you give me an example app that reproduces this in development?
Running 5 puma workers on a 2X Heroku dyno, we encountered a situation (after many hours, once memory ran out) where puma_worker_killer was continually killing and cycling the workers with indexes 1-3. Memory was not dropping to an acceptable level, and puma_worker_killer never tried sending a TERM to the other workers. My best guess is that the reported memory usage for the individual workers was wrong (i.e. largest_worker was wrong) and so it was killing the wrong ones. This has happened multiple times for us. It seems a potential mitigating solution would be to not always kill largest_worker, especially if the previous attempt didn't have success, or the previous attempt was for largest_worker.
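One way to sanity-check those per-worker numbers would be to log what each worker reports and compare that against whichever worker PWK picks as largest_worker (a rough sketch using the get_process_mem gem, which puma_worker_killer itself relies on; worker_pids and the log format here are made up for illustration):

# run from the puma master process, e.g. in a small debugging thread
require 'get_process_mem'

def log_worker_memory(worker_pids)
  worker_pids.each do |pid|
    mb = GetProcessMem.new(pid).mb                 # RSS for that worker, in MB
    puts "worker pid=#{pid} rss=#{mb.round(2)}MB"  # compare with the worker PWK actually reaps
  end
end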