Continued cycling/killing of puma workers (wrong ones?) #14

Closed
travisp opened this issue Jun 1, 2015 · 19 comments

@travisp

travisp commented Jun 1, 2015

Running 5 puma workers on a 2X Heroku dyno, we encountered a situation (after many hours, once memory ran out) where puma_worker_killer was continually killing and cycling the workers with indexes 1-3. Memory was not dropping to an acceptable level, and puma_worker_killer did not try sending a TERM to the other workers. My best guess is that the reported memory usage for the individual workers was wrong (i.e. largest_worker was wrong), so it was killing the wrong ones. This has happened multiple times for us.

It seems the potential mitigations are:

  • only run on 1X dynos with fewer workers (but we've also seen this happen there)
  • add an option, similar to unicorn_worker_killer, allowing a worker to be terminated after a certain number of requests (see the sketch after this list)
  • store information about recently terminated workers and have PWK terminate a different worker than largest_worker, especially if the previous attempt didn't succeed or also targeted largest_worker
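
A minimal sketch of the request-count idea, assuming a plain Rack middleware rather than anything built into puma_worker_killer or unicorn_worker_killer; the MaxRequestsReaper name and the 5,000-request threshold are hypothetical:

# Hypothetical sketch, not part of puma_worker_killer: a Rack middleware that
# asks the current worker process to exit after serving a fixed number of
# requests. In cluster mode Puma's master boots a replacement worker when
# one exits.
class MaxRequestsReaper
  def initialize(app, max_requests: 5_000)
    @app   = app
    @max   = max_requests
    @count = 0
    @mutex = Mutex.new
  end

  def call(env)
    response = @app.call(env)
    limit_hit = @mutex.synchronize { (@count += 1) >= @max }
    # TERM only this worker; the Puma master restarts it.
    Process.kill(:TERM, Process.pid) if limit_hit
    response
  end
end

In a Rails app this could be wired up with config.middleware.use MaxRequestsReaper, max_requests: 5_000.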
@schneems
Member

schneems commented Jun 4, 2015

Check out #8 (comment)

It's due to the way CoW-optimized Ruby apps work. It sounds like you've got a leak or another memory problem that PWK is covering up. Have you seen https://github.com/schneems/derailed_benchmarks? Maybe try to decrease your overall memory use.

@travisp
Author

travisp commented Jun 10, 2015

I don't disagree that we've likely got a memory leak that becomes an issue after several hours. We're doing what we can to identify it, but it would be nice if PWK could help in the meantime, as the README suggests:

If you have a memory leak in your code, finding and plugging it can be a herculean effort. Instead what if you just killed your processes when they got to be too large? The Puma Worker Killer does just that. Similar to Unicorn Worker Killer but for the Puma web server.

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

@schneems
Member

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

I honestly don't know. It would depend on the implementation. The UWK relies on some metaprogramming to open up Unicorn classes to record the count. My gut instinct is that it won't fit into the current core codebase and might be better as a separate lib, but IDK.

Would killing off workers after x minutes instead of x requests work for your use case? I think that would be much easier: spawn a new thread, record the current time, and every 10 seconds wake up to see if it's past the maximum time. When that occurs, do a rolling worker restart (kill each worker with a slight delay between them so throughput doesn't just go away).
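
Roughly, something like this sketch of the idea (placeholder names and values, not the eventual puma_worker_killer code; the workers list is assumed to come from Puma's cluster):

# Rough sketch of the timed rolling restart described above.
Thread.new do
  booted_at = Time.now
  max_age   = 4 * 3600 # placeholder: seconds before a rolling restart

  loop do
    sleep 10
    next unless Time.now - booted_at > max_age

    workers.each do |worker|          # assumed: Puma's cluster worker objects
      Process.kill(:TERM, worker.pid) # Puma boots a replacement worker
      sleep 30                        # slight delay so throughput doesn't just go away
    end
    booted_at = Time.now
  end
end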

@travisp
Author

travisp commented Jul 1, 2015

@schneems I think allowing a rolling worker restart after some period of time could be a very helpful solution. We've managed to mostly control the problem for now through some pretty aggressive tuning of Ruby GC settings (at a moderately large cost to performance), but I could see this still being useful to us and others at times.

@schneems
Member

schneems commented Jul 3, 2015

Great, I think such a feature will be fairly straightforward. I'm off on paternity leave right now. I'll be back after July 15th. If you don't hear from me before then, would you mind pinging this thread?

@forrestblount

I would also be interested in this rolling restart solution. We use some pdf manipulation libraries that always end up chewing into swap.

@schneems
Member

Take a look at #16. If it looks good and the tests pass, I'll merge it into master and you can try it out.

@forrestblount

Doesn't look like the tests pass, but I've set it up in our staging environment and will let you know if I have any issues.

@forrestblount

Because I'm using this on Heroku I added an initializer with only one line:
PumaWorkerKiller.enable_rolling_restart(4 * 3600)

It doesn't look like it is restarting every few hours though. Did I misunderstand the documentation?

@schneems
Member

Are you using any kind of logging addon like Papertrail? You can grep for "PumaWorkerKiller: Rolling Restart". You can also try it locally and set the frequency to something really low, like 10 seconds, just to verify it's set up correctly.

@forrestblount

We use Logentries. It's definitely not working for me. Even if I use the full config, I never see rolling restarts, even locally with 10s.

Things I've tried:

  • using PumaWorkerKiller.enable_rolling_restart(10) in an initializer
  • starting with foreman start
  • starting with rails s
  • using the more elaborate config:

PumaWorkerKiller.config do |config|
  config.ram = 6144 # mb
  config.frequency = 5 # seconds
  config.percent_usage = 0.98
  config.rolling_restart_frequency = 10 # 4 * 3600
end
PumaWorkerKiller.start

Could it be related to my puma server configuration? Wondering if the @cluster isn't getting set correctly -- that seems to be the only place where the Rolling Restart would kill itself.

Here's my puma config (v 2.12.2):
workers Integer(ENV['WEB_CONCURRENCY'] || 2)
threads_count = Integer(ENV['MAX_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

Thanks in advance for your help.

@schneems
Member

schneems commented Aug 5, 2015

I realized I didn't call start on the AutoReap.new instance. It's fixed in the branch and should be executing now. Unfortunately my tests are pretty non-deterministic. Once I get those sorted out I'll merge. Thanks for the feedback and your patience.

@forrestblount

Thanks for creating this helpful tool! I can confirm it's now working as expected.

@schneems
Member

🚀 I'm curious: what do you think a good default reap time would be? I think I've currently got it set to 12 hours. I don't want to bring down your workers more than needed, since it does affect throughput when it happens, but I also don't want to wait so long that you're swapping. 2x a day seemed like an okay guess.

Second question: did you see any noticeable dip in throughput when the reaper kicks in? I put some randomization in there so not every server tries to restart at the same time, but I'm not sure if it's enough (it's plus/minus 5 seconds; see the sketch below). If you're seeing a big drop in throughput, we might want to make this a larger delta. You might not see a problem until adding a large number of servers, or maybe it's not that bad.
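
For reference, the randomization amounts to something like this sketch (not the library's exact code; rolling_restart_frequency is assumed to be in seconds):

# Each process sleeps the configured frequency plus or minus a few random
# seconds, so not every server restarts at exactly the same time.
splay = rand(-5.0..5.0) # the plus/minus 5 second delta mentioned above
sleep(rolling_restart_frequency + splay)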

@forrestblount

12 hours isn't bad for a default, but it really depends on the size of the dyno and the number of workers running. For us to stay on the 2X dynos, I've got it set to run every 4 hours in staging.

Because we're typically running < 5 dynos, I am seeing some issues with throughput. I think a delta of 1-3 minutes would work better here - the idea being that only 1-2 dynos are ever out of commission at the same time. Is this a moot point if the app has preboot enabled? I've only been testing in staging so far (where I have noticed throughput dropping to near 0, but where we typically run 1-2 dynos).

@jvenezia

Hi,

I'm trying the exact same thing as @forrestblount in a local development env. I can't see any restarts in the server logs, and @cluster.running? is always returning nil, which I assume prevents my workers from being restarted.

Any idea why? Thanks!

@schneems
Member

The thread that basically sleeps for 12 hours is probably either not getting started, or getting started in a weird place. Is preload_app! set in your puma config? I think when it isn't enabled there's an edge case we're not properly handling.

@jvenezia

Hi, I'm using app preload. Here are my puma settings:

workers Integer(ENV['PUMA_WORKERS'] || 2)
threads_count = Integer(ENV['PUMA_WORKER_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    config = ActiveRecord::Base.configurations[Rails.env] || Rails.application.config.database_configuration[Rails.env]
    config['pool'] = threads_count
    ActiveRecord::Base.establish_connection(config)
  end
end

@schneems
Member

schneems commented Jun 7, 2016

Can you give me an example app that reproduces this in development?

@schneems schneems closed this as completed Nov 5, 2019