
Socket removed after reload, part II #1842

Closed
nateberkopec opened this issue Jul 11, 2019 · 39 comments

@nateberkopec
Member

Continuing the discussion from #1775, it seems like the fix in 4.0.1 may not have done the trick for everyone.

I made a mistake here by not forking the discussion into two parts, because some people were reporting problems on 3.12 and some were not. We may have two separate issues here. I'd like to reset the discussion here to what people are experiencing on 4.0.1.

@nateberkopec
Member Author

@mcg and anyone else experiencing this on 4.0.1/4.0.0 - please drop a log in the discussion here so we can get a better idea of what we're dealing with.

@mcg

mcg commented Jul 11, 2019

What logs do you need? Using 4.0.1 and having Puma restart, the socket is still missing.

@calmyournerves

calmyournerves commented Jul 11, 2019

I just did a quick test on our staging server and it looks like 4.0.1 fixed the issue for us. To clarify: For us, it worked on 3.12, it didn't work on 4.0.0, and it works again on 4.0.1.

Will do some more tests tomorrow and report back if we see any issues.

Edit: We're using Ruby 2.6, if that could be of relevance.

@mcg

mcg commented Jul 11, 2019

To clarify our setup, we are using Puma socket activation as described (https://github.com/puma/puma/blob/master/docs/systemd.md#socket-activation) and issuing a hot restart with:

# A "hot" restart, with systemd keeping puma.socket listening and
# providing to the new puma (master) instance.
systemctl restart puma.service

This worked with 3.12 but not with 4.0.0 or 4.0.1; the socket is gone after the restart. Ruby is 2.6.3.
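
For reference, this is roughly what the LISTEN_FDS handoff is supposed to look like on the server side. A minimal Ruby sketch, not Puma's actual binder code; the socket path is just an example from our layout:

require 'socket'

# systemd's sd_listen_fds convention: inherited fds start at 3 and are
# described by the LISTEN_PID / LISTEN_FDS environment variables.
SD_LISTEN_FDS_START = 3

def activated_sockets
  return [] unless ENV['LISTEN_PID'].to_i == Process.pid
  count = ENV['LISTEN_FDS'].to_i
  (SD_LISTEN_FDS_START...(SD_LISTEN_FDS_START + count)).map do |fd|
    # Wrap the already-listening fd. Do NOT bind or unlink the path here;
    # systemd owns the socket file, the service only accepts on it.
    Socket.for_fd(fd)
  end
end

listeners = activated_sockets
if listeners.empty?
  # No activation: bind the path ourselves (then we do own the file and
  # are responsible for unlinking it on shutdown).
  listeners << UNIXServer.new('/var/app/shared/tmp/sockets/puma.sock')
end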

@Fudoshiki
Contributor

Fudoshiki commented Jul 12, 2019

Can you provide your puma.service config?

systemctl restart puma.service works fine on 4.0.1 with my config.

ruby 2.6.3
rails 6.1
puma 4

config/puma.rb

# frozen_string_literal: true

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch('RAILS_MAX_THREADS') { 5 }
min_threads_count = ENV.fetch('RAILS_MIN_THREADS') { max_threads_count }
threads min_threads_count, max_threads_count

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch('RAILS_ENV') { 'development' }
# early_hints true

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch('WEB_CONCURRENCY') { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# Allow workers to reload bundler context when master process is issued
# a USR1 signal. This allows proper reloading of gems while the master
# is preserved across a phased-restart. (incompatible with preload_app)
# (off by default)
#
prune_bundler

# Bind unix domain socket
bind ENV.fetch('PUMA_SOCK') { 'unix://tmp/sockets/puma.sock' }

# Use `path` as the file to store the server info state. This is
# used by `pumactl` to query and control the server.
#
state_path ENV.fetch('SERVER_STATE') { 'tmp/pids/puma.state' }

# === Puma control rack application ===

# Start the puma control rack application on `url`. This application can
# be communicated with to control the main server. Additionally, you can
# provide an authentication token, so all requests to the control server
# will need to include that token as a query parameter. This allows for
# simple authentication.
#
# Check out https://github.com/puma/puma/blob/master/lib/puma/app/status.rb
# to see what the app has available.
#
activate_control_app ENV.fetch('PUMACTL_SOCK') { 'unix://tmp/sockets/pumactl.sock' }

# Store the pid of the server in the file at `path`.
#
pidfile 'tmp/pids/puma.pid'

# Change the default timeout of workers
#
worker_timeout 10

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart

@adamcreekroad

With the latest 4.0.1 I still get the missing socket file error

Errno::ENOENT: No such file or directory @ apply2files - /home/deploy/sockets/puma.app-staging.sock
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:222:in `unlink'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:222:in `block in close_binder_listeners'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:218:in `each'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:218:in `close_binder_listeners'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:414:in `block in setup_signals'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:62:in `sleep'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:62:in `block in stop_workers'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:50:in `loop'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:50:in `stop_workers'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cluster.rb:546:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/launcher.rb:186:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/lib/puma/cli.rb:80:in `run'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/gems/puma-4.0.1/bin/puma:10:in `<top (required)>'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/bin/puma:23:in `load'
  /var/www/app/staging/shared/bundle/ruby/2.6.0/bin/puma:23:in `<top (required)>'

But the systemd service reports all is well

● puma.app-staging.service - puma (app staging)
   Loaded: loaded (/etc/systemd/system/puma.app-staging.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-07-12 17:58:20 UTC; 2min 35s ago
  Process: 12819 ExecStop=/home/deploy/.rbenv/shims/bundle exec pumactl -S /home/deploy/pids/puma.app-staging.state stop (code=exited, status=0/SUCCESS)
 Main PID: 12833 (bundle)
    Tasks: 54 (limit: 4915)
   CGroup: /system.slice/puma.app-staging.service
           ├─12833 puma 4.0.1 (unix:///home/deploy/sockets/puma.app-staging.sock) [app-staging]
           ├─12889 puma: cluster worker 0: 12833 [app-staging]
           └─12905 puma: cluster worker 1: 12833 [app-staging]

Jul 12 17:58:20 ip-172-30-1-149 systemd[1]: Started puma (app staging).
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] Puma starting in cluster mode...
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Version 4.0.1 (ruby 2.6.3-p62), codename: 4 Fast 4 Furious
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Min threads: 16, max threads: 16
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Environment: staging
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Process workers: 2
Jul 12 17:58:20 ip-172-30-1-149 puma[12833]: [12833] * Preloading application
Jul 12 17:58:27 ip-172-30-1-149 puma[12833]: [12833] * Activated unix:///home/deploy/sockets/puma.app-staging.sock
Jul 12 17:58:27 ip-172-30-1-149 puma[12833]: [12833] Use Ctrl-C to stop

@mcg

mcg commented Jul 12, 2019

@Fudoshiki noticed you're using prune_bundler; we're using preload_app! here.

puma.rb:

directory '/var/app/current'
environment 'staging'

pidfile '/var/app/shared/tmp/pids/puma.pid'
state_path '/var/app/shared/tmp/pids/puma.state'
stdout_redirect '/var/app/shared/log/puma_access.log', '/var/app/shared/log/puma_error.log', true

workers 2
threads 1, 2

bind 'unix:///var/app/shared/tmp/sockets/puma.sock'
activate_control_app 'tcp://127.0.0.1:9293', { auth_token: '...' }

preload_app!

on_restart do
  puts 'Refreshing Gemfile'
  ENV["BUNDLE_GEMFILE"] = '/var/app/current/Gemfile'
end

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
end

on_worker_boot do
  SemanticLogger.reopen if defined?(SemanticLogger)

  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

puma.service:

[Unit]
Description=Puma HTTP Server
After=network.target

Requires=puma.socket

[Service]
Type=simple

User=appuser
Group=appuser

WorkingDirectory=/var/app/current

Environment=PUMA_DEBUG=1

ExecStart=/var/app/rvm1scripts/rvm-auto.sh . bundle exec --keep-file-descriptors puma -C /var/app/shared/config/puma.rb

Restart=always

TimeoutStopSec=180

[Install]
WantedBy=multi-user.target

puma.socket:

[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
SocketUser=appuser
SocketGroup=appuser
ListenStream=/var/app/shared/tmp/sockets/puma.sock

NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

puma.service logs on a restart:

Jul 12 19:00:54: % Registered unix:/var/app/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Jul 12 19:00:54: [1424] Puma starting in cluster mode...
Jul 12 19:00:54: [1424] * Version 4.0.1 (ruby 2.6.3-p62), codename: 4 Fast 4 Furious
Jul 12 19:00:54: [1424] * Min threads: 1, max threads: 2
Jul 12 19:00:54: [1424] * Environment: staging
Jul 12 19:00:54: [1424] * Process workers: 2
Jul 12 19:00:54: [1424] * Preloading application
Jul 12 19:01:05: [1424] * Activated unix:///var/app/shared/tmp/sockets/puma.sock
Jul 12 19:01:05: [1424] Use Ctrl-C to stop

@nateberkopec
Member Author

Our tests are sort of screwed up now because we never unlink any created unix socket FDs.

I feel like we probably should have reverted #1685 rather than merged #1829.

See for example https://travis-ci.org/puma/puma/jobs/557439675#L490
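
Roughly the kind of per-test cleanup we're missing, as a minimal sketch (made-up test name and temp path, not our actual suite):

require 'minitest/autorun'
require 'socket'
require 'tmpdir'

class TestUnixBinding < Minitest::Test
  def setup
    @sock_path = File.join(Dir.mktmpdir, 'test_puma.sock')
    @server = UNIXServer.new(@sock_path)
  end

  def teardown
    @server.close
    # Closing the fd does not remove the filesystem entry, so unlink the
    # socket file explicitly to avoid leaking artifacts between tests.
    File.unlink(@sock_path) if File.socket?(@sock_path)
  end

  def test_socket_file_exists_while_listening
    assert File.socket?(@sock_path)
  end
end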

@Fudoshiki
Contributor

Maybe revert both (#1685 and #1829) and rewrite #1685

@nateberkopec
Member Author

#1851

@MSP-Greg
Member

Nate and others,

Re #1851, I think there are four general issues to be dealt with, which may require additional PRs:

A. fixing the issue here
B. getting tests to show breakage re this issue
C. making sure Puma is unlinking files when it should
D. getting tests to be accurate, clean, repeatable, immune to 'leakage' from test to test, and removing all test artifacts.

By accurate and clean, I mean that right now some tests pass despite IO errors, and the test results don't reflect that. Similar to what you said in issue #1841.

If PR #1851 actually fixes A, working on the remaining three becomes easier.

So, if you think the changes to binder.rb may resolve the issue, is it time to plead with some people to test this issue with the PR branch?

@rromanchuk

Looks like you guys are on it, but just another report... I just discovered this today, using a pretty vanilla setup with a simple-type systemd service. It's a little subtle because it does eventually recover once systemd starts it back up again; I only noticed a bunch of requests failing and then recovering soon after. This type of error is the most common user config error, so I just assumed I had regressed something in my deploy. Let me know if there's a branch you guys want tested and I'll throw some traffic on it.

[Unit]
Description=Puma HTTP Server
After=network.target

[Service]
Type=simple

User=ubuntu

Environment=STATSD_HOST=127.0.0.1

WorkingDirectory={{deploy_directory}}/current
ExecStart=/home/ubuntu/.rbenv/bin/rbenv exec bundle exec puma -C {{deploy_directory}}/shared/config/puma/production.rb

SyslogIdentifier=puma
StandardOutput=syslog
StandardError=syslog

Restart=always

[Install]
WantedBy=multi-user.target
#! /usr/bin/env puma

{% if inventory_hostname in groups['tag_Role_web_worker']|default([]) %}
workers 3
{% else %}
workers 4
{% endif %}


threads_count = Integer({{max_puma_thread}})
threads threads_count, threads_count

plugin :statsd

bind 'unix://{{deploy_directory}}/shared/tmp/sockets/puma_{{app_name}}.sock'
environment 'production'

directory '{{deploy_directory}}/current'

rackup      DefaultRackup

preload_app!

# [.....trimmed....]

[screenshot of terminal output attached]

@MSP-Greg
Member

@rromanchuk

The time stamps seem to indicate so, but just to clarify, your post here #1851 (comment) shows we've now got the above example working correctly? Thanks again.

@nateberkopec
Member Author

Linking #1886

@nateberkopec
Member Author

So, to summarize:

#1775 reported that, during phased restart w/Unix sockets, Puma blew up with ENOENT on this line. The puma.sock socket was also removed (not just closed). A minimal reproduction case was provided by @Fudoshiki: pumactl -S tmp/pids/puma.state -F config/puma.rb phased-restart removes puma.sock.

#1685 was pointed to as the cause.

We merged #1829 as a fix, but that was raising errors in the tests and now some people are getting a new ENOENT on this line.

It feels like Binder#close is overloaded with responsibility and is being called in scenarios it isn't intended for. That's why the fix in #1829 "fixed" the issue for one use case but broke it (or didn't fix it) for others.

So, we need to a) close sockets during restarts and stop accepting new requests, but also b) not remove the socket file during a restart (see the sketch below).
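
To make that distinction concrete, a minimal sketch with hypothetical names (this is not Binder's actual interface):

class ListenerSet
  def initialize
    @listeners = {} # IO => { path:, inherited: }
  end

  def add(io, path: nil, inherited: false)
    @listeners[io] = { path: path, inherited: inherited }
  end

  # Restart: stop accepting connections, but leave any UNIX socket files on
  # disk so nginx/systemd can still see the path while the new process boots.
  def close_for_restart
    @listeners.each_key(&:close)
  end

  # Final shutdown: also remove UNIX socket files, but only ones this process
  # created itself (never a systemd-activated, inherited socket).
  def close_for_shutdown
    @listeners.each do |io, meta|
      io.close
      next if meta[:inherited] || meta[:path].nil?
      File.unlink(meta[:path]) if File.socket?(meta[:path])
    end
  end
end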

Options:

@rromanchuk

rromanchuk commented Aug 22, 2019

OK, I just came back to this and can confirm this is still happening on Version 4.1.0 (ruby 2.6.3-p62), codename: Fourth and One. With active traffic from the load balancer, sudo systemctl restart puma begins dropping connections.

When using 3.12.1, with the same exact configuration, no connections are dropped.

puma.service.j2

[Unit]
Description=Puma HTTP Server
After=network.target
Requires=puma.socket

[Service]
Type=simple

User=ubuntu

Environment=STATSD_HOST=127.0.0.1
Environment=PUMA_DEBUG=1
Environment=PATH=/home/ubuntu/.rbenv/shims:/bin:/usr/bin

WorkingDirectory={{deploy_directory}}/current
ExecStart={{deploy_directory}}/current/bin/puma -C {{deploy_directory}}/shared/config/puma/production.rb

SyslogIdentifier=puma
StandardOutput=syslog
StandardError=syslog


Restart=always

[Install]
WantedBy=multi-user.target

puma.socket.j2

[Unit]
Description=Puma HTTP Server Accept Sockets

[Socket]
ListenStream={{deploy_directory}}/shared/tmp/sockets/puma.sock


[Install]
WantedBy=sockets.target

production.rb.j2

#! /usr/bin/env puma
environment 'production'
directory '{{deploy_directory}}/current'
rackup      DefaultRackup

# for t3.medium optimum configuration is 3 workers 10 threads
{% if inventory_hostname in groups['tag_Role_web_worker']|default([]) %}
workers 3
{% else %}
workers 3
{% endif %}


threads_count = Integer({{max_puma_thread}})
threads threads_count, threads_count

bind 'unix://{{deploy_directory}}/shared/tmp/sockets/puma.sock'
pidfile "{{deploy_directory}}/shared/tmp/pids/puma.pid"
state_path "{{deploy_directory}}/shared/tmp/pids/puma.state"



#plugin :statsd

preload_app!


# Code to run immediately before the master starts workers.
#
before_fork do
  puts "Starting workers..."
end


# Code to run in a worker before it starts serving requests.
#
# This is called everytime a worker is to be started.
#
on_worker_boot do
  puts 'On worker boot...'
  ActiveRecord::Base.establish_connection
end

# Code to run in a worker right before it exits.
#
# This is called every time a worker is about to shut down.
#
on_worker_shutdown do
  puts 'On worker shutdown...'
end

# Code to run in the master right before a worker is started. The worker's
# index is passed as an argument.
#
# This is called everytime a worker is to be started.
#
on_worker_fork do
  puts 'Before worker fork...'
end

# Code to run in the master after a worker has been started. The worker's
# index is passed as an argument.
#
# This is called everytime a worker is to be started.
#
after_worker_fork do
  puts 'After worker fork...'
end

Logs on restart

Aug 22 15:30:17 web_worker2-0c2 puma[12869]: On worker shutdown...
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13000 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13006 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker status: pid 13012 exit 0
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: [12869]     worker shutdown time:   0.50
Aug 22 15:30:17 web_worker2-0c2 puma[12869]: After worker fork...
Aug 22 15:30:17 web_worker2-0c2 systemd[1]: Stopped Puma HTTP Server.
-- Subject: Unit puma.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit puma.service has finished shutting down.
Aug 22 15:30:17 web_worker2-0c2 systemd[1]: Started Puma HTTP Server.
-- Subject: Unit puma.service has finished start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- Unit puma.service has finished starting up.
-- 
-- The start-up result is RESULT.
Aug 22 15:30:17 web_worker2-0c2 sudo[16625]: pam_unix(sudo:session): session closed for user root
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: % Registered unix:/var/www/foo/shared/tmp/sockets/puma.sock for activation from LISTEN_FDS
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] Puma starting in cluster mode...
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Version 4.1.0 (ruby 2.6.3-p62), codename: Fourth and One
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Min threads: 10, max threads: 10
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Environment: production
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Process workers: 3
Aug 22 15:30:18 web_worker2-0c2 puma[16633]: [16633] * Preloading application

Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/status", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:20 [crit] 5009#5009: *29806 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: 172.31.11.144, server: , request: "GET /status HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/500.html", host: "x.x.x.x"
Aug 22 15:30:20 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: x.x.x.x - - [22/Aug/2019:15:30:20 +0000] "GET /status HTTP/1.1" 500 186 "-" "ELB-HealthChecker/2.0"
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] * Activated unix:///var/www/foo/shared/tmp/sockets/puma.sock
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] ! WARNING: Detected 1 Thread(s) started in app boot:
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] ! #<Thread:0x00005583a11d4b90@/var/www/foo/shared/vendor/bundle/ruby/2.6.0/gems/activerecord-6.0.0/lib/active_record/connection_adapters/abstract/connection_pool.rb:335 sleep> - /var/www/foo/shared/vendor/bundle/ruby/2.6.0/gems/activerecord-6.0.0/lib/active_record/connection_adapters/abstract/connection_pool.rb:337:in `sleep'
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] Use Ctrl-C to stop
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Starting workers...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: After worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: After worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: Before worker fork...
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 0 (pid: 16752) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 1 (pid: 16766) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 puma[16633]: [16633] - Worker 2 (pid: 16773) booted, phase: 0
Aug 22 15:30:21 web_worker2-0c2 nginx[5009]: web_worker2-0c2 nginx: 2019/08/22 15:30:21 [crit] 5009#5009: *29818 connect() to unix:///var/www/foo/shared/tmp/sockets/puma.sock failed (2: No such file or directory) while connecting to upstream, client: x, server: , request: "GET /api/v2/challenges/index?page=1 HTTP/1.1", upstream: "http://unix:///var/www/foo/shared/tmp/sockets/puma.sock:/api/v2/challenges/index?page=1", host: "api.getfoo.com"

@rromanchuk

The one thing I should probably look into, though, is first allowing the gem to update and get a clean start. I'm going to try upgrading one more time, but with traffic removed and services shut down, to give it a clean start.

@nateberkopec
Member Author

Fixed in #1970

@rromanchuk

Confirmed fixed here. Thanks y'all

@mcg

mcg commented Sep 24, 2019

Not fixed here. Socket is still missing on restart.

@MSP-Greg
Member

@mcg

Are you having the same issue as #1976? You mentioned systemd above.

I did write some tests for 'pre-existing' UNIX listeners (path or socket); I need to review those...

@mcg

mcg commented Sep 24, 2019

We are using a UNIX socket but not having the same error as #1976 ("There is already a server bound to: puma.sock"). Just that systemctl restart puma.service removes the socket.

@nateberkopec
Member Author

@mcg please open a new issue with repro steps along the lines of #1842 (comment)

Systemd issues are a bit harder for us to debug as they're often highly config-dependent and we need Linux VMs to test them.

@gingerlime
Contributor

I reported #1962 and I think the same still happens. systemctl restart puma.service removes the socket. It's also hard for me to debug on our staging environment, because it's used for other things and I can't break things for too long unfortunately.

I'm not 100% sure it relates to the upgrade. It might also happen with 3.12, but I can't tell, partly because it's hard to test, unfortunately.

The only solution (workaround) I found so far is to always restart both systemd services, i.e. systemctl restart puma.service puma.socket. We're also usually doing a phased-reload rather than a full restart (systemctl reload puma.service), and that seems to work ok so far.

Sorry I can't shed much more light on this. I can try a thing or two if you have any suggestions, but my experimentation window is limited at the moment.

@mcg

mcg commented Sep 24, 2019

@gingerlime We run 3.12 and haven't had this issue at all.

@nateberkopec #1988 created

@gingerlime
Contributor

@mcg I thought so too, but then I downgraded back to 3.12 and tried to restart, and had the same issue. Perhaps it was still running 4.2.0 when the gem version changed, so it didn't pick this up or something? As I mentioned, it was a bit tricky to test, so I did this in a bit of a rush.

I'll try to see if I can re-test with 3.12 in a cleaner state if it helps. But I can definitely reproduce the missing socket with 4.2.0.

@rromanchuk

@nateberkopec I can't remember if I posted this yet, but I'm also still not convinced this isn't my own configuration error, and that there may be something subtle between the releases that exposes the problem. My biggest suspicion is around my ExecStart coupled with rbenv + bundle. There is a lot of discussion around this with systemd, and I have a feeling I'm not following directions.

Note: Any wrapper scripts which exec, or other indirections in ExecStart, may result in activated socket file descriptors being closed before they reach the puma master process. For example, if using bundle exec, pass the --keep-file-descriptors flag. bundle exec can be avoided by using a puma executable generated by bundle binstubs puma. This is tracked in [#1499].

One important caveat that makes this difficult to reproduce is that it also requires active inbound connections. I've made the mistake twice now where I test a deploy that is not behind a load balancer, think it's fixed, then deploy to a machine behind a load balancer and it explodes. From what I can tell, systemctl restart puma will not reveal this issue if it's not serving any requests.

I want to recreate my exact EC2 setup with an AMI, just replacing the app with a "hello world" app, so at least someone can actually poke around. I'll just add a bash script to mock inbound requests during the restart (the sketch below shows the idea).
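
The idea, as a rough Ruby sketch (made-up socket path): hammer the socket while systemctl restart puma runs in another terminal and log any window where connects fail.

require 'socket'

SOCK = '/var/www/app/shared/tmp/sockets/puma.sock'

loop do
  begin
    UNIXSocket.open(SOCK) do |s|
      s.write "GET /status HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
      status = (s.readline rescue 'no response')
      puts "#{Time.now} OK   #{status.strip}"
    end
  rescue Errno::ENOENT, Errno::ECONNREFUSED => e
    # ENOENT means the socket file itself is gone (the failure mode in this
    # issue); ECONNREFUSED means the file exists but nothing is accepting.
    puts "#{Time.now} FAIL #{e.class}"
  end
  sleep 0.05
end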

@nateberkopec
Member Author

@rromanchuk it certainly could be. Like I said, I'd like to better understand how people use systemd with Puma so I can have a better sense of what to recommend people make repros with, i.e. "If you do XYZ and it still reproduces, it's probably a Puma issue; if you do ABC and it doesn't reproduce, it's probably a problem with your env/config/app."

@AwolDes

AwolDes commented Sep 24, 2019

My company is also having a similar problem, using Ubuntu 16.04.

In v3.12, we were able to stop the puma service and start it again without issue. Using v4.2, our app serves a 500 even though the puma service is running and stable. When the server originally boots up, puma serves the app as expected.

The reload for Puma works fine with our config.

These are snippets of the systemd services we're using; let me know if you need any other info.

puma.service

[Unit]
Description=Puma HTTP Server - serves the core application

# After=network.target cloud-final.service puma.socket
After=cloud-final.service puma.socket

# Socket activation prior to boot, introduces a delay
Requires=puma.socket

# The service can start N many times in the interval of N length
StartLimitBurst=5
# How long the interval to count restart attempts within
StartLimitInterval=30

ConditionPathExists=/home/{{ ruby_user }}/shared/core-systemd-switches/puma-service
ConditionPathExists=/var/lib/cloud/instance/boot-finished

[Service]
# Foreground process (do not use --daemon in ExecStart or config.rb)
Type=simple

# Preferably configure a non-privileged user
User={{ ruby_user }}
Group={{ ruby_user }}

# Time to wait for puma to signal boot
TimeoutStartSec=60

# Specify the path to your puma application root
WorkingDirectory=/home/{{ ruby_user }}/app

PIDFile=/home/{{ ruby_user }}/shared/puma.pid

EnvironmentFile=/home/{{ ruby_user }}/shared/userdata.env

# The command to start Puma
ExecStart=/bin/bash -lc 'bundle exec --keep-file-descriptors puma \
                            -C /home/{{ ruby_user }}/app/config/puma.rb \
                            --pidfile /home/{{ ruby_user }}/shared/puma.pid'

ExecStop=/bin/bash -lc 'cat /home/{{ ruby_user }}/shared/puma.pid | xargs kill -s SIGTERM && \
                        while ps axg | grep -vw grep | grep -w puma > /dev/null; do sleep 1; done && \
                        exit 0'

# The command to hot (online) reload Puma
ExecReload=/bin/bash -lc 'cat /home/{{ ruby_user }}/shared/puma.pid | xargs kill -s SIGUSR1'

KillMode=control-group
# Important to keep this inline with the timeouts of the entire system (see ELB, Rack & Puma timeouts)
TimeoutStopSec=175
KillSignal=SIGTERM
SendSIGKILL=yes

StandardOutput=journal
StandardError=journal
SyslogIdentifier=puma

# Restart on process failure, good for prod reliability
Restart=always
# The time to sleep before restarting a service
RestartSec=1

[Install]
WantedBy=multi-user.target

puma.socket

[Unit]
Description=Puma HTTP Server Accept Sockets - network bindings for the core application

After=cloud-final.service

PartOf=puma.service

# Notify via email when service fails to start
OnFailure=unit-status-mail@%n.service

ConditionPathExists=/home/{{ruby_user}}/shared/core-systemd-switches/puma-socket
ConditionPathExists=/var/lib/cloud/instance/boot-finished

[Socket]
# AF_UNIX domain socket
ListenStream=/home/{{ ruby_user }}/shared/puma.sock

SocketUser={{ ruby_user }}
SocketGroup={{ ruby_user }}

# Might resolve keep_alive_issues
# KeepAlive=
# KeepAliveTimeSec=
# KeepAliveIntervalSec=
# KeepAliveProbes=

# Queueing Mechanisms

# Socket options matching Puma defaults - {{ socket_backlog }} is passed through as a query param in puma.rb
Backlog={{ socket_backlog }}
# Puma's documentation recommends same port & nodelay which only apply to TCP mode, we're using domain sockets

[Install]
WantedBy=sockets.target

@mcg

mcg commented Sep 25, 2019

@AwolDes Is the socket /home/{{ ruby_user }}/shared/puma.sock removed after a Puma restart?

@MSP-Greg
Member

MSP-Greg commented Sep 25, 2019

@AwolDes

Are you in a position to test from a branch/PR here?

The reload for Puma works fine with our config.

phased-restart/USR1 shouldn't have any issues with UNIX listeners.

Stop/Start or restart/USR2 may.

@mcg Can you check against a branch/PR?

@gingerlime
Contributor

Another problem I reported on #1962 and am seeing again with 4.2.0 is these errors in the puma log after a couple of deploys (slightly sanitized):

[32189] + Gemfile in context: /var/local/app/app.d/843/Gemfile
[32189] Early termination of worker
[32266] + Gemfile in context: /var/local/app/app.d/843/Gemfile
[32266] Early termination of worker
[32290] + Gemfile in context: /var/local/app/app.d/843/Gemfile
/usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:91:in `block in materialize': Could not find nio4r-2.5.1 in any of the sources (Bundler::GemNotFound)
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:85:in `map!'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/spec_set.rb:85:in `materialize'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:170:in `specs'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:237:in `specs_for'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/definition.rb:226:in `requested_specs'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/runtime.rb:108:in `block in definition_method'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/runtime.rb:20:in `setup'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler.rb:107:in `setup'
        from /usr/lib/ruby/site_ruby/2.5.0/bundler/setup.rb:20:in `<top (required)>'
        from /usr/lib/ruby/site_ruby/2.5.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /usr/lib/ruby/site_ruby/2.5.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:302:in `rack_builder'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:321:in `load_rackup'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/configuration.rb:246:in `app'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/runner.rb:158:in `app'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/runner.rb:165:in `start_server'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:269:in `worker'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `block (2 levels) in spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `fork'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:138:in `block in spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:134:in `times'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:134:in `spawn_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:204:in `check_workers'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cluster.rb:487:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/launcher.rb:172:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/lib/puma/cli.rb:80:in `run'
        from /var/local/app/app.d/840/vendor/bundle/ruby/2.5.0/gems/puma-4.2.0/bin/puma-wild:31:in `<main>'

It seems like the app is running fine, but those errors keep appearing.

Another odd thing here is that those logs appear in the already-rotated puma.log.1 file, while the new puma.log file is empty... so something definitely looks off here.

My postrotate in logrotate.d simply does systemctl reload puma.service ...

@gingerlime
Contributor

gingerlime commented Sep 25, 2019

@nateberkopec I managed to create a reproducible environment with systemd that exhibits the issue with the restart. Please take a look at https://github.com/gingerlime/puma-test-socket - it's basically a bash script that bootstraps a new Ubuntu box with ruby, puma and a sample sinatra app. It takes a bit of time to run everything, but I hope it helps with debugging.

(tested on Digital Ocean)

@mcg

mcg commented Sep 25, 2019

@gingerlime in your repro setup, is the socket file (/var/run/puma/puma.sock) deleted/missing on restart?

@gingerlime
Contributor

gingerlime commented Sep 25, 2019

@mcg yep.

root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# systemctl start puma.socket puma.service
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# ps -ef |grep pum
root       373  1183  0 10:43 pts/0    00:00:00 grep --color=auto pum
root     32747     1 10 10:43 ?        00:00:00 puma 4.2.0 (unix:///var/run/puma/puma.sock) [app]
root     32759 32747  5 10:43 ?        00:00:00 puma: cluster worker 0: 32747 [app]
root     32760 32747  5 10:43 ?        00:00:00 puma: cluster worker 1: 32747 [app]
root     32761 32747  5 10:43 ?        00:00:00 puma: cluster worker 2: 32747 [app]
root     32762 32747  5 10:43 ?        00:00:00 puma: cluster worker 3: 32747 [app]
root     32766 32747  5 10:43 ?        00:00:00 puma: cluster worker 4: 32747 [app]
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# socat TCP-LISTEN:1234,reuseaddr UNIX-CLIENT:/var/run/puma/puma.sock
^Croot@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# systemctl restart puma.service
root@ubuntu-s-1vcpu-1gb-nyc3-01:~/puma-test-socket# socat TCP-LISTEN:1234,reuseaddr UNIX-CLIENT:/var/run/puma/puma.sock
2019/09/25 10:44:30 socat[415] E connect(5, AF=1 "/var/run/puma/puma.sock", 25): No such file or directory

@mcg

mcg commented Sep 25, 2019

@gingerlime thanks for confirming.

I assume that since this is closed we should discuss on #1988?

@gingerlime
Contributor

@mcg yes, things got a little scattered across several tickets, including #1988 and #1962 and here... I'm pretty sure #1988 and #1962 are duplicates anyway, although not 100% sure yet. Happy to consolidate whichever way the project thinks makes sense, and I'll try to post further comments on whichever issue is "the chosen one" :)

@mcg

mcg commented Sep 25, 2019

@MSP-Greg Can probably test, yeah.

@AwolDes

Are you in a position to test from a branch/PR here?

The reload for Puma works fine with our config.

phased-restart/USR1 shouldn't have any issues with UNIX listeners.

Stop/Start or restart/USR2 may.

@mcg Can you check against a branch/PR?

@puma puma locked as resolved and limited conversation to collaborators Sep 25, 2019
@nateberkopec
Member Author

Yup, let's keep conversation on #1988 and #1962 respectively for my sanity. If one is confirmed as a dupe I'll close and merge the convos. But until then it's much easier for me to keep them separate.
