Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop incorrect "Process exited with status 1" puma errors from occurring on SIGTERM #1026

Closed
david-a-wheeler opened this issue Feb 2, 2018 · 9 comments
Labels

Comments

@david-a-wheeler
Copy link
Collaborator

Every day we typically get incorrect error reports of the following form:

Feb 01 10:22:14 production-bestpractices heroku/web.1: Process exited with status 1 

This is caused because Heroku cycles the webserver (puma) by sending SIGTERM, but when puma exits it incorrectly returns an error code of 1 ("error occurred"). That is simply wrong; a process that exits because it was specifically asked to do so (via SIGTERM) has not experienced an error. Instead, exiting on a request to do so, without any other error, is correct behavior and should return a status code of 0. However, because puma incorrectly reports a non-zero error code, lots of other logs whine about a significant problem (even though no problem has occurred). Here's more details showing this happening:

Feb 01 15:22:13 production-bestpractices heroku/web.1: Cycling 
Feb 01 15:22:13 production-bestpractices heroku/web.1: State changed from up to starting 
Feb 01 15:22:14 production-bestpractices heroku/web.1: Stopping all processes with SIGTERM 
Feb 01 15:22:14 production-bestpractices app/web.1: - Gracefully stopping, waiting for requests to finish 
Feb 01 15:22:14 production-bestpractices app/web.1: === puma shutdown: 2018-02-01 15:22:14 +0000 === 
Feb 01 15:22:14 production-bestpractices app/web.1: - Goodbye! 
Feb 01 15:22:14 production-bestpractices app/web.1: bundler: failed to load command: puma (/app/vendor/bundle/ruby/2.4.0/bin/puma) 
Feb 01 15:22:14 production-bestpractices app/web.1: SignalException: SIGTERM 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/launcher.rb:397:in `block in setup_signals' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/single.rb:106:in `join' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/single.rb:106:in `run' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/launcher.rb:183:in `run' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/cli.rb:77:in `run' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/bin/puma:10:in `<top (required)>' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/bin/puma:23:in `load' 
Feb 01 15:22:14 production-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/bin/puma:23:in `<top (required)>' 
Feb 01 15:22:14 production-bestpractices heroku/web.1: Process exited with status 1 
Feb 01 15:22:24 production-bestpractices heroku/web.1: Starting process with command `bundle exec puma -C config/puma.rb`

We are running the current version of puma (3.11.2). See puma issue 1438 and bundler issue 6090 for more.

Note that we run puma using this Procfile:

web: bundle exec puma -C config/puma.rb

One solution (noted in the issues above) would be to make "bundle exec" use Kernel.exec instead of Kernel.load. This essentially disables a minor optimization that bundle usually uses. This is slightly slower on startup (it replaces a whole OS process, instead of just reloading from Ruby), but we rarely start processes so it would have practically no effect in the real world. Here's how to force that:

bundle config disable_exec_load true

Other solutions are welcome.

@david-a-wheeler
Copy link
Collaborator Author

For more information, see bundle config documentation

@david-a-wheeler
Copy link
Collaborator Author

Per "bundle config --help", I believe we can just set the environment variable BUNDLE_DISABLE_EXEC_LOAD - the environment variables take precedence, and then we don't have to worry about race conditions setting a global variable if multiple procs start at the same time.

The documentation isn't clear about the expected value of the environment variable; it may not matter. Let's start with "true" as the value, that's clear.

david-a-wheeler added a commit that referenced this issue Feb 4, 2018
Disable exec-load in Procfile to eliminate a false puma error
at runtime on Heroku.  This is a tiny but non-obvious change, so this
commit message provides lots of detail.

Every day we typically get incorrect error reports of the following form:

> heroku/web.1: Process exited with status 1

This is caused because Heroku cycles the webserver (puma) by sending
SIGTERM, but when puma exits it incorrectly returns an error code of 1
("error occurred"). That is simply wrong; a process that exits because
it was specifically asked to do so (via SIGTERM) has not experienced an
error. Instead, exiting on a request to do so, without any other error,
is correct behavior and should return a status code of 0. However,
because puma incorrectly reports a non-zero error code, lots of other
logs whine about a significant problem (even though no problem has
occurred).

At this time there seems to be some disagreement on whether the problem
is really from puma or bundler.  See puma issue 1438 and
bundler issue 6090 for more.  In a grand sense we don't care,
we just don't want false positives.

One solution (noted in the issues above) would be to make "bundle exec"
use Kernel.exec instead of Kernel.load. This essentially disables a minor
optimization that bundler usually uses. This is slightly slower on startup
(it replaces a whole OS process, instead of just reloading from Ruby),
but we rarely start web processes so it would have practically no effect in
the real world.

Per "bundle config --help", we do this just by setting the environment
variable BUNDLE_DISABLE_EXEC_LOAD. The environment variables take
precedence, and then we don't have to worry about race conditions setting
a global variable if multiple procs start at the same time.

This commit uses the "env" command; technically we probably don't need to,
but this ensures that regardless of shell (or straight invocation of
exec) we get our environment variable set.

Signed-off-by: David A. Wheeler <[email protected]>
@dankohn
Copy link
Contributor

dankohn commented Feb 4, 2018 via email

@david-a-wheeler
Copy link
Collaborator Author

Great! I've labelled this issue as "bug". It's not really a bug in our program, it's really a workaround for a bug in a program we depend on. That said, the actions manifest as a bug in our program, and in the end, we're responsible for the impact of the components we depend on .

@david-a-wheeler
Copy link
Collaborator Author

I think we've resolved it. With the environment variable change, here are example log entries we see:

Feb 04 15:11:50 master-bestpractices heroku/web.1: Restarting 
Feb 04 15:11:50 master-bestpractices heroku/web.1: State changed from up to starting 
Feb 04 15:11:50 master-bestpractices app/web.1: - Gracefully stopping, waiting for requests to finish 
Feb 04 15:11:50 master-bestpractices app/web.1: === puma shutdown: 2018-02-04 20:11:50 +0000 === 
Feb 04 15:11:50 master-bestpractices app/web.1: - Goodbye! 
Feb 04 15:11:51 master-bestpractices heroku/web.1: Stopping all processes with SIGTERM 
Feb 04 15:11:51 master-bestpractices heroku/web.1: Process exited with status 143 
Feb 04 15:12:02 master-bestpractices heroku/web.1: Starting process with command `env BUNDLE_DISABLE_EXEC_LOAD=true bundle exec puma -C config/puma.rb` 
Feb 04 15:12:04 master-bestpractices app/web.1: Puma starting in single mode... 
Feb 04 15:12:04 master-bestpractices app/web.1: * Version 3.11.2 (ruby 2.4.2-p198), codename: Love Song 
Feb 04 15:12:04 master-bestpractices app/web.1: * Min threads: 5, max threads: 5 
Feb 04 15:12:04 master-bestpractices app/web.1: * Environment: production 
Feb 04 15:12:06 master-bestpractices app/web.1: [Scout] [02/04/18 20:12:05 +0000 web.1 (4)] WARN : Deferring agent start. Standing by for first request 
Feb 04 15:12:07 master-bestpractices app/web.1: * Listening on tcp://0.0.0.0:52465 
Feb 04 15:12:07 master-bestpractices heroku/web.1: State changed from starting to up 

That compares to these older messages, with the "failed to load" messages from bundler:

    Feb 04 15:10:24 staging-bestpractices heroku/web.1: State changed from starting to up 
    Feb 04 15:10:35 staging-bestpractices heroku/web.1: Restarting 
    Feb 04 15:10:35 staging-bestpractices heroku/web.1: State changed from up to starting 
    Feb 04 15:10:35 staging-bestpractices heroku/web.1: Stopping all processes with SIGTERM 
    Feb 04 15:10:35 staging-bestpractices app/web.1: - Gracefully stopping, waiting for requests to finish 
    Feb 04 15:10:35 staging-bestpractices app/web.1: === puma shutdown: 2018-02-04 20:10:35 +0000 === 
    Feb 04 15:10:35 staging-bestpractices app/web.1: - Goodbye! 
    Feb 04 15:10:35 staging-bestpractices app/web.1: bundler: failed to load command: puma (/app/vendor/bundle/ruby/2.4.0/bin/puma) 
    Feb 04 15:10:35 staging-bestpractices app/web.1: SignalException: SIGTERM 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/launcher.rb:397:in `block in setup_signals' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/single.rb:106:in `join' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/single.rb:106:in `run' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/launcher.rb:183:in `run' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/lib/puma/cli.rb:77:in `run' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/gems/puma-3.11.2/bin/puma:10:in `<top (required)>' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/bin/puma:23:in `load' 
    Feb 04 15:10:35 staging-bestpractices app/web.1:   /app/vendor/bundle/ruby/2.4.0/bin/puma:23:in `<top (required)>' 
    Feb 04 15:10:36 staging-bestpractices heroku/web.1: Process exited with status 1 
    Feb 04 15:10:46 staging-bestpractices heroku/web.1: Starting process with command `bundle exec puma -C config/puma.rb` 
    Feb 04 15:10:49 staging-bestpractices app/web.1: Puma starting in single mode... 
    Feb 04 15:10:49 staging-bestpractices app/web.1: * Version 3.11.2 (ruby 2.4.2-p198), codename: Love Song 
    Feb 04 15:10:49 staging-bestpractices app/web.1: * Min threads: 5, max threads: 5 
    Feb 04 15:10:49 staging-bestpractices app/web.1: * Environment: production 
    Feb 04 15:10:54 staging-bestpractices app/web.1: [Scout] [02/04/18 20:10:54 +0000 web.1 (4)] WARN : Deferring agent start. Standing by for first request 
    Feb 04 15:10:57 staging-bestpractices heroku/web.1: State changed from starting to up 
    Feb 04 15:10:57 staging-bestpractices app/web.1: * Listening on tcp://0.0.0.0:27365 
    Feb 04 15:10:57 staging-bestpractices app/web.1: Use Ctrl-C to stop 

@david-a-wheeler
Copy link
Collaborator Author

Strictly speaking the process now exiting with exit code 143, not 0. But that's a widely-expected signal value for "received and processed SIGTERM normally", and the rest of the system seems to handle it well. Our problem was that we were getting exit code 1, which is widely interpreted as "things went badly".

We'll keep monitoring, but seems okay now.

@david-a-wheeler
Copy link
Collaborator Author

Reopening. Heroku sees the exit code 143, and from its viewpoint that's non-zero and thus STILL an error.

We'll have to insert a little shim that captures error code 143, and turns it into error code 0.

@david-a-wheeler
Copy link
Collaborator Author

#1048 is intended to solve it.

@david-a-wheeler
Copy link
Collaborator Author

Completed. It's switched to error code 143, and the rest of the system now knows that 143 isn't a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants