Socket removed after reload, part II #1842
@mcg and anyone else experiencing this on 4.0.1/4.0.0 - please drop a log in the discussion here so we can get a better idea of what we're dealing with.
What logs do you need? Using 4.0.1, and the socket is still missing after a Puma restart.
I just did a quick test on our staging server and it looks like 4.0.1 fixed the issue for us. To clarify: For us, it worked on 3.12, it didn't work on 4.0.0, and it works again on 4.0.1. Will do some more tests tomorrow and report back if we see any issues. Edit: We're using Ruby 2.6, if that could be of relevance.
To clarify our setup, we are using Puma socket activation as described in https://github.com/puma/puma/blob/master/docs/systemd.md#socket-activation and issuing a hot restart as:
This worked with 3.12, but not with 4.0.0 or 4.0.1. The socket is gone after the restart. Ruby is 2.6.3.
Can you provide your puma.service config?
ruby 2.6.3
config/puma.rb:
# frozen_string_literal: true
# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch('RAILS_MAX_THREADS') { 5 }
min_threads_count = ENV.fetch('RAILS_MIN_THREADS') { max_threads_count }
threads min_threads_count, max_threads_count
# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch('RAILS_ENV') { 'development' }
# early_hints true
# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch('WEB_CONCURRENCY') { 2 }
# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!
# Allow workers to reload bundler context when master process is issued
# a USR1 signal. This allows proper reloading of gems while the master
# is preserved across a phased-restart. (incompatible with preload_app)
# (off by default)
#
prune_bundler
# Bind unix domain socket
bind ENV.fetch('PUMA_SOCK') { 'unix://tmp/sockets/puma.sock' }
# Use `path` as the file to store the server info state. This is
# used by `pumactl` to query and control the server.
#
state_path ENV.fetch('SERVER_STATE') { 'tmp/pids/puma.state' }
# === Puma control rack application ===
# Start the puma control rack application on `url`. This application can
# be communicated with to control the main server. Additionally, you can
# provide an authentication token, so all requests to the control server
# will need to include that token as a query parameter. This allows for
# simple authentication.
#
# Check out https://github.com/puma/puma/blob/master/lib/puma/app/status.rb
# to see what the app has available.
#
activate_control_app ENV.fetch('PUMACTL_SOCK') { 'unix://tmp/sockets/pumactl.sock' }
# Store the pid of the server in the file at `path`.
#
pidfile 'tmp/pids/puma.pid'
# Change the default timeout of workers
#
worker_timeout 10
# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
With the latest 4.0.1 I still get the missing socket file error
But the systemd service reports all is well
@Fudoshiki noticed you're using puma.rb:
puma.service:
puma.socket:
puma.service logs on a restart:
Our tests are sort of screwed up now because we never unlink any created unix socket FDs. I feel like we probably should have reverted #1685 rather than merging #1829. See for example https://travis-ci.org/puma/puma/jobs/557439675#L490
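As an aside, a minimal sketch of the kind of teardown that would avoid leaking those unix socket files between tests; the class name, path, and test are hypothetical illustrations, not Puma's actual test suite:

# Hypothetical Minitest teardown: unlink any unix socket file the test created,
# so a leftover socket file from one test can't affect the next one.
require "minitest/autorun"

class TestUnixListener < Minitest::Test
  SOCKET_PATH = "tmp/test_puma.sock" # assumed path used by the test

  def teardown
    File.unlink(SOCKET_PATH) if File.socket?(SOCKET_PATH)
  end

  def test_placeholder
    skip "illustrative only"
  end
end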
Nate and others: Re #1851, I think there are four general issues to be dealt with, which may require additional PRs: A. fixing the issue here. By accurate and clean, right now some tests are passing with IO errors, but the tests don't reflect that, similar to what you said in issue #1841. If PR #1851 actually fixes A, working on the remaining three becomes easier. So, if you think the changes to binder.rb may resolve the issue, is it time to plead with some people to test this issue with the PR branch?
Looks like you guys are on it, but just another report... I just discovered this today, using a pretty vanilla setup with simple-type systemd. It's a little bit subtle because it does eventually recover when systemd starts it back up again; I only noticed a bunch of requests failing and then recovering soon after. This type of error is the most common user config error, so I just assumed I had regressed something in my deploy. Let me know if there's a branch you guys want tested and I'll throw some traffic on it.
The timestamps seem to indicate so, but just to clarify: your post here, #1851 (comment), shows we've now got the above example working correctly? Thanks again.
Linking #1886
So, to summarize: #1775 reported that, during a phased restart with Unix sockets, Puma blew up with ENOENT on this line. #1685 was pointed to as the cause. We merged #1829 as a fix, but that was raising errors in the tests, and now some people are getting a new ENOENT on this line. It feels like Binder#close is overloaded with responsibility and is being called in scenarios it isn't intended for. That's why the fix at #1829 "fixed" the issue for one use case but broke it (or didn't fix it) for others. So, we need to a) close sockets during restarts and not accept new requests, but also b) not remove the socket file during a restart. Options:
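To make that split concrete, here is a rough sketch of the separation being described; the class, method names, and instance variables are illustrative assumptions, not Puma's actual internals:

# Illustrative split of responsibilities: stopping the listeners on a restart
# is kept separate from unlinking the socket files on a real shutdown.
class BinderSketch
  def initialize
    @ios = []         # listening sockets (TCP and UNIX)
    @unix_paths = []  # unix socket files this process created
  end

  # Hot/phased restart: stop accepting new requests, but leave the socket
  # file on disk so the restarted server (or systemd) can keep the same path.
  def close_listeners
    @ios.each { |io| io.close unless io.closed? }
  end

  # Full shutdown: close the listeners and also remove the socket files we created.
  def close
    close_listeners
    @unix_paths.each { |path| File.unlink(path) if File.socket?(path) }
  end
end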
Ok, I just came back to this and can confirm this is still happening. When using puma.service.j2
puma.socket.j2
production.rb.j2
#! /usr/bin/env puma
environment 'production'
directory '{{deploy_directory}}/current'
rackup DefaultRackup
# for t3.medium optimum configuration is 3 workers 10 threads
{% if inventory_hostname in groups['tag_Role_web_worker']|default([]) %}
workers 3
{% else %}
workers 3
{% endif %}
threads_count = Integer({{max_puma_thread}})
threads threads_count, threads_count
bind 'unix://{{deploy_directory}}/shared/tmp/sockets/puma.sock'
pidfile "{{deploy_directory}}/shared/tmp/pids/puma.pid"
state_path "{{deploy_directory}}/shared/tmp/pids/puma.state"
#plugin :statsd
preload_app!
# Code to run immediately before the master starts workers.
#
before_fork do
puts "Starting workers..."
end
# Code to run in a worker before it starts serving requests.
#
# This is called every time a worker is to be started.
#
on_worker_boot do
puts 'On worker boot...'
ActiveRecord::Base.establish_connection
end
# Code to run in a worker right before it exits.
#
# This is called every time a worker is about to shut down.
#
on_worker_shutdown do
puts 'On worker shutdown...'
end
# Code to run in the master right before a worker is started. The worker's
# index is passed as an argument.
#
# This is called every time a worker is to be started.
#
on_worker_fork do
puts 'Before worker fork...'
end
# Code to run in the master after a worker has been started. The worker's
# index is passed as an argument.
#
# This is called every time a worker is to be started.
#
after_worker_fork do
puts 'After worker fork...'
end
Logs on restart
The one thing I should probably look into, though, is first allowing the gem to update and get a clean start. I'm going to try upgrading one more time, but removing traffic and shutting down services first, to give it a clean start.
Fixed in #1970
Confirmed fixed here. Thanks y'all
Not fixed here. Socket is still missing on restart.
We are using a UNIX socket but not having the same error as #1976 ("There is already a server bound to: puma.sock"). Just that
@mcg please open a new issue with repro steps along the lines of #1842 (comment). Systemd issues are a bit harder for us to debug, as they're often highly config-dependent and we need Linux VMs to test them.
I reported #1962 and I think the same still happens. I'm not 100% sure it relates to the upgrade. It might also happen with 3.12, but I'm not sure, partly because it's hard to test, unfortunately. The only solution (workaround) I found so far is to always restart both systemd services, i.e.
Sorry I can't shed much more light on this. I can try a thing or two if you have any suggestions, but my experimentation window is limited at the moment.
@gingerlime We run 3.12 and haven't had this issue at all. @nateberkopec #1988 created.
@mcg I thought so too, but then I downgraded back to 3.12 and tried to restart, and had the same issue. Perhaps it was still running 4.2.0 when the gem version changed, so it didn't pick this up or something? As I mentioned, it was a bit tricky to test, so I did this in a bit of a rush. I'll try to see if I can re-test with 3.12 in a cleaner state if it helps. But I can definitely reproduce the missing socket with 4.2.0.
@nateberkopec I can't remember if I posted this yet, but I'm also still not convinced this isn't my own configuration error and that there may be something subtle between the releases that exposes the problem. My biggest suspicion is around my
One important caveat that makes this reproduction difficult is that it also requires active inbound connections. I've made the mistake twice now where I test a deploy that is not behind a load balancer, think it's fixed, then deploy to a machine behind a load balancer and it explodes. From what I can tell, I want to create my exact EC2 setup with an AMI, just replacing the app with a "hello world" app, so at least someone can actually poke around. I'll just add a bash script to mock inbound requests during the restart.
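For what it's worth, a Ruby equivalent of that request-mocking loop might look like the sketch below; the socket path is an assumption, and the only goal is to keep traffic flowing across the restart so a disappearing socket file shows up immediately:

# Keep simple HTTP requests flowing over the unix socket while a restart happens.
require "socket"

SOCKET_PATH = "/srv/app/shared/tmp/sockets/puma.sock" # assumed path

loop do
  begin
    UNIXSocket.open(SOCKET_PATH) do |sock|
      sock.write("GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
      puts sock.gets # status line, e.g. "HTTP/1.1 200 OK"
    end
  rescue Errno::ENOENT, Errno::ECONNREFUSED, Errno::EPIPE => e
    # ENOENT here is the symptom under discussion: the socket file is gone mid-restart.
    puts "request failed: #{e.class}"
  end
  sleep 0.05
end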
@rromanchuk it certainly could be. Like I said, I'd like to better understand how people use systemd with Puma so I can have a better sense of what to recommend people to make repros with. i.e. "If you do XYZ and it still reproduces, it's probably a Puma issue, if you do ABC and it doesn't reproduce, it's probably a problem with your env/config/app."
My company is also having a similar problem, using Ubuntu 16.04. In
The reload for Puma works fine with our config. These are the snippets of the systemd services we're using; let me know if you need any other info.
puma.service
puma.socket
@AwolDes Is the socket |
Another problem I reported on #1962, and see again with 4.2.0, is these errors in the puma log after a couple of deploys (slightly sanitized):
It seems like the app is running fine, but those errors keep appearing. Another odd thing here is that those logs appear in the already logrotated
My
@nateberkopec I managed to create a reproducible environment with systemd that exhibits the issue with the restart. Please take a look at https://github.com/gingerlime/puma-test-socket - it's basically a bash script that bootstraps a new Ubuntu box with Ruby, Puma, and a sample Sinatra app. It takes a bit of time to run everything, but I hope it helps with debugging things (tested on Digital Ocean).
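For reference, the app side of such a repro can be as small as a one-file rack app; this is an illustrative stand-in, not the actual app from that repository:

# config.ru -- minimal "hello world" rack app served by puma over the unix socket.
run ->(env) { [200, { "Content-Type" => "text/plain" }, ["hello world\n"]] }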
@gingerlime in your repro setup, is the socket file( |
@mcg yep.
@gingerlime thanks for confirming. I assume since this is closed we should discuss on #1988?
@mcg yes, things got a little scattered across several tickets, including #1988 and #1962 and here... I'm pretty sure #1988 and #1962 are duplicates anyway, although not 100% sure yet. Happy to consolidate whichever way the project thinks makes sense, and I'll try to post further comments on whichever issue is "the chosen one" :)
Continuing the discussion from #1775, it seems like the fix in 4.0.1 may not have done the trick for everyone.
I made a mistake here by not forking the discussion into two parts, because some people were reporting problems on 3.12 and some were not. We may have two separate issues here. I'd like to reset the discussion here to what people are experiencing on 4.0.1.