
DAMN ! worker 7 (pid: 1343) died, killed by signal 11 :( trying respawn ... #1792

Open
jedie opened this issue May 16, 2018 · 34 comments

@jedie

jedie commented May 16, 2018

I'm using Docker. After I switched from https://github.com/phusion/baseimage-docker (phusion/baseimage:0.10.1 with Python v3.5) to https://hub.docker.com/_/python/ (python:3.6-alpine with Python v3.6), I very often get the error:

DAMN ! worker X (pid: Y) died, killed by signal 11 :( trying respawn ...

The rest of the setup is the same and uses uWSGI==2.0.17.

Any idea?!?

@robnardo

Hey @jedie - I get the same error. I am building from python:3.6-alpine as well. My ENV and CMD in the Dockerfile look like this:

ENV UWSGI_WSGI_FILE=base/wsgi.py UWSGI_HTTP=:8000 UWSGI_MASTER=1 UWSGI_WORKERS=2 UWSGI_THREADS=8 UWSGI_UID=1000 UWSGI_GID=2000

CMD ["uwsgi", "--http-auto-chunked", "--http-keepalive", "--static-map", "/media/=/code/media/", "--static-map", "/static/=/code/static/"]

I am a bit worried about using this in a production environment.

Rob

@jedie
Author

jedie commented Jul 12, 2018

I switched to 3.6-slim-stretch as a workaround...

@v9Chris

v9Chris commented Jul 18, 2018

I'm also getting this, all of a sudden.

18/07/2018 12:46:08 DAMN ! worker 1 (pid: 75) died, killed by signal 11 :( trying respawn ...
18/07/2018 12:46:08 Respawned uWSGI worker 1 (new pid: 80)

@robnardo

I found that switching the uwsgi config to only one thread makes this go away. Here is my uwsgi config (from my Dockerfile):

ENV UWSGI_WSGI_FILE=base/wsgi.py UWSGI_HTTP=:8000 UWSGI_MASTER=1 UWSGI_WORKERS=8 UWSGI_UID=1000 UWSGI_GID=2000 UWSGI_TOUCH_RELOAD=touch-reload.txt UWSGI_LAZY_APPS=1 UWSGI_WSGI_ENV_BEHAVIOR=holy

@deathemperor

I can confirm that configuring it to use one thread makes this go away.
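
For anyone driving uWSGI from an ini file rather than environment variables, a minimal sketch of this single-thread workaround could look like the following (the wsgi-file path and port are placeholders, not taken from this thread):

[uwsgi]
master = true
wsgi-file = app/wsgi.py   ; placeholder path, not from this thread
http = :8000
workers = 8               ; scale out with extra processes instead of threads
threads = 1               ; a single thread per worker, as in the reports above

The trade-off is memory: every additional worker is a full process rather than a thread.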

@beaugunderson

beaugunderson commented Oct 10, 2018

I'm also seeing this on python:3.6-alpine3.7. Works with threads = 1, random 502s from signal 11s with threads = 2.

@beaugunderson

python:3.7-alpine3.8 did not help, but switching to python:3.7-slim-stretch did. We'd prefer to use Alpine, but this will be our workaround for now.

@zhongdixiu

Hi, I also encountered the same problem. When I run a Flask app under uWSGI that calls a Keras (TensorFlow backend) object detection API, I get the error "DAMN ! worker 1 (pid: 5240) died, killed by signal 11 :( trying respawn ...". I then tried using only one thread, but that doesn't work; instead another error occurs: "!!! uWSGI process 347 got Segmentation Fault !!!". My configuration file is as follows:
[config attachment]
Can anyone give me some help? Thanks!

@kball

kball commented Nov 20, 2018

I ran into a similar issue, though for me the segfaults were traced back to anything that tried to use SSL (e.g. to talk to a remote API). Changing to stretch-slim seemed to resolve the issue.

@cridenour

Just wanted to note that I ran across this issue with python:3.6-alpine3.8, but it was solved with python:3.6-alpine3.9, using uwsgi==2.0.17.1.

@xeor

xeor commented Mar 4, 2019

I'm still getting this using uwsgi 2.0.18 on Alpine 3.7. Is anyone else still having the same problem?

@asyncmind0

asyncmind0 commented Mar 5, 2019

Still having this problem. Is there a way to make uwsgi exit entirely when this happens? I have my service configured to restart on failure.

That would be better than being left in an inconsistent state, running but not alive.

I'm using

:» uwsgi --version
2.0.18

:» lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

@tamentis

tamentis commented Mar 8, 2019

Can confirm that switching to Alpine 3.9 fixed that problem for me. I had the same symptoms, completely out of the blue.

One of the most significant changes in 3.9 is the switch back to OpenSSL (from LibreSSL); I can imagine how changing such a foundational library could make a difference. It's also entirely possible that there is a lurking bug somewhere in my software that is simply no longer triggered by the different underlying libraries.

maxking added a commit to maxking/docker-mailman that referenced this issue Mar 19, 2019

@Mon-ius

Mon-ius commented Mar 19, 2019

I'm also running into this problem.

Python 3.7.2
uwsgi --version
2.0.18

@mightydeveloper

I'm also running into this problem (strangely, on Alpine 3.9).
Base image: python:3.6.8-alpine3.9
uwsgi --version: 2.0.18
Switching to threads=1 works around it, though.

@lekksi

lekksi commented Apr 10, 2019

Getting the same with python:3.6.8-alpine3.9 and uwsgi==2.0.15

Seems to get fixed by increasing uwsgi's thread-stacksize to 512. Now rolling with 2 or more threads without workers dying.
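
A hedged uwsgi.ini sketch of that mitigation (512 is the value reported above; the worker and thread counts are only illustrative). One plausible explanation is that musl's default thread stack on Alpine is far smaller than glibc's, so deep native call stacks can overflow and segfault:

[uwsgi]
master = true
workers = 2
threads = 2
thread-stacksize = 512    ; larger per-thread stack; the default on musl-based Alpine images is small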

@koorukuroo

In my case, I turned off the "enable-threads" option.
I'm not sure whether this will help anyone else.

Python version: 3.6.7, uWSGI 2.0.18 (64bit)
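
If a config sets it explicitly, the change is just removing (or commenting out) that single line, since enable-threads is off by default (a sketch, not taken from this thread):

[uwsgi]
master = true
workers = 2
; enable-threads = true   ; removed - only needed when the application itself spawns Python threads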

@aliashkar

Any update on this issue? I have also run into the same problem with uWSGI 2.0.18 and the python:3.6 image.

@adimux

adimux commented Nov 15, 2019

Same issue with python:3.7-alpine3.9. I had to switch to a different distro: Debian.

@robnardo

I think this error is due to the uwsgi config. For my Django projects, I have been using Docker (based on python:3.7-alpine) in production with no issues. Below are my Dockerfile, docker-entrypoint.sh and uwsgi.ini files, which were borrowed from and inspired by other online articles and research. Hope this helps other folks.

Dockerfile:

FROM python:3.7-alpine
COPY ./src/requirements.txt /requirements.txt
RUN set -ex \
	&& apk add --no-cache --virtual .build-deps \
		gcc g++ make libc-dev musl-dev linux-headers pcre-dev \
        mariadb-dev \
		openssl-dev \
		uwsgi-python3 \
	&& pip3 install --upgrade pip \
	&& pip3 install --upgrade wheel \
	&& if [ ! -e /usr/bin/pip ]; then ln -s pip3 /usr/bin/pip ; fi \
	&& if [[ ! -e /usr/bin/python ]]; then ln -sf /usr/bin/python3 /usr/bin/python; fi \
	&& LIBRARY_PATH=/lib:/usr/lib /bin/sh -c "pip install --no-cache-dir -r /requirements.txt" \
	&& runDeps="$( \
		scanelf --needed --nobanner --recursive /usr/local \
			| awk '{ gsub(/,/, "\nso:", $2); print "so:" $2 }' \
			| sort -u \
			| xargs -r apk info --installed \
			| sort -u \
	)" \
	# add dependencies to the '.python-rundeps' virtual package (we will keep these)
	&& apk add --virtual .python-rundeps $runDeps \
	&& apk del .build-deps \
	# add non-build packages..
	&& apk add mariadb-client

RUN mkdir /code
WORKDIR /code/
ADD ./src /code/

EXPOSE 8000
ENV DJANGO_SETTINGS_MODULE=_base.settings
RUN DATABASE_URL='' python manage.py collectstatic --noinput && chmod a+x /code/docker-entrypoint.sh

ENTRYPOINT ["/code/docker-entrypoint.sh"]

docker-entrypoint.sh

#!/bin/sh

while ! mysqladmin ping -h"$MYSQL_HOST" --silent; do
    echo "database is unavailable - sleeping for 2 secs"
    sleep 2
done

if [ "x$DJANGO_MANAGEPY_MIGRATE" = 'xon' ]; then
    echo 'attempting to run "migrate" ..'
    python manage.py migrate --noinput
else
    echo 'DJANGO_MANAGEPY_MIGRATE is not "on", skipping'        
fi

echo "copying mime.types to /etc dir .."
cp mime.types /etc/mime.types

echo "starting uwsgi.."
uwsgi uwsgi.ini

uwsgi.ini

[uwsgi]
strict = true
master = true
enable-threads = true
vacuum = true                        ; Delete sockets during shutdown
single-interpreter = true
die-on-term = true                   ; Shutdown when receiving SIGTERM (default is respawn)
need-app = true

disable-logging = true               ; Disable built-in logging 
log-4xx = true                       ; but log 4xx's anyway
log-5xx = true                       ; and 5xx's

harakiri = 120                       ; forcefully kill workers after XX seconds
; py-callos-afterfork = true           ; allow workers to trap signals

max-requests = 1000                  ; Restart workers after this many requests
max-worker-lifetime = 3600           ; Restart workers after this many seconds
reload-on-rss = 2048                 ; Restart workers after this much resident memory
worker-reload-mercy = 60             ; How long to wait before forcefully killing workers

cheaper-algo = busyness
processes = 64                       ; Maximum number of workers allowed
cheaper = 8                          ; Minimum number of workers allowed
cheaper-initial = 16                 ; Workers created at startup
cheaper-overload = 1                 ; Length of a cycle in seconds
cheaper-step = 8                     ; How many workers to spawn at a time

cheaper-busyness-multiplier = 30     ; How many cycles to wait before killing workers
cheaper-busyness-min = 20            ; Below this threshold, kill workers (if stable for multiplier cycles)
cheaper-busyness-max = 70            ; Above this threshold, spawn new workers
cheaper-busyness-backlog-alert = 16  ; Spawn emergency workers if more than this many requests are waiting in the queue
cheaper-busyness-backlog-step = 2    ; How many emergency workers to create if there are too many requests in the queue

wsgi-file = /code/_base/wsgi.py
http = :8000
static-map = /static/=/code/static/
uid = 1000
gid = 2000
touch-reload = /code/reload-uwsgi

@jacopofar

jacopofar commented Dec 2, 2019

Same problem here, using the debian:buster image as a base and Python 3.7. I tried both values of enable-threads and a few other settings, but it still breaks. Oddly enough, the very same Docker image runs normally on my computer but gives this obscure error on our Kubernetes cluster, so I suspect it has something to do with the kernel or the network.

I noticed that Python 3.7 is not among the officially supported versions, so I downgraded to Python 3.5, but the error manifests nonetheless.

@asherp

asherp commented Dec 9, 2019

@jacopofar I too am getting the same error on kubernetes but not when I run locally. My image is based on https://github.com/dockerfiles/django-uwsgi-nginx

@awelzel
Contributor

awelzel commented Dec 14, 2019

@jacopofar, @asherp, @aliashkar - any chance there is a stack trace in the logs before the "killed by signal 11" line, and could you paste it here?

It would also be very helpful if you could reveal some information about your apps: Are you by any chance using psycopg2 2.7.x wheels and/or other Python wheels that ship their own libssl?

It appears there's a known issue with wheels that include their own libssl (or other libs) - see #1569 and #1590 (also this: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/)

@jacopofar

@awelzel I tried to reproduce it, but I can't get it to happen anymore ¯\_(ツ)_/¯

I don't remember any additional stack trace; it only printed that message. This is my requirements.txt for that version:

uwsgi==2.0.18
boto3==1.9.67
pytest==5.2.2
pytest-cov==2.8.1
flake8==3.7.9
pandas==0.25.2
plotly==4.2.1
psycopg2-binary==2.8.3
sqlalchemy==1.2.15
dash==1.5.1
dash_auth==1.3.2
dash-bootstrap-components==0.7.2
requests==2.22.0
pyarrow==0.15.1

I'm not aware of any embedded libssl except for psycopg2, sorry for not being able to provide more details :/

@eburghar

Getting the same with python:3.6.8-alpine3.9 and uwsgi==2.0.15

Seems to get fixed by increasing uwsgi's thread-stacksize to 512. Now rolling with 2 or more threads without workers dying.

It also apparently solved my case. Is there a way to track uwsgi's stack memory consumption, to be sure this happens because of an out-of-memory condition?
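
There doesn't appear to be a counter for thread stack usage specifically, but uWSGI can at least log overall per-request memory, which may help correlate the crashes with memory pressure (a sketch; the reload-on-rss value is arbitrary):

[uwsgi]
memory-report = true      ; log address-space and RSS usage of the worker after each request
reload-on-rss = 512       ; optionally recycle any worker whose resident memory exceeds this many MB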

@wss404

wss404 commented Mar 26, 2020

The same error occurred when I tried to run a job with frequent HTTP requests.
I guess the error is due to a long timeout.
I solved it by setting a much bigger harakiri value in uwsgi.ini, and it has been working well since.
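
For reference, harakiri is a per-request timeout; a hedged sketch with an arbitrary larger value:

[uwsgi]
harakiri = 300            ; allow slow requests up to 300 seconds before the worker is forcefully recycled
harakiri-verbose = true   ; log extra detail whenever harakiri triggers

Note that a longer timeout only helps if the worker deaths are really harakiri kills of slow requests; a genuine segfault (signal 11) inside a C extension is unlikely to be cured this way.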

rjw1 added a commit to nhsx/nhsx-website that referenced this issue Apr 8, 2020

When running in staging without this set we would see workers randomly die, causing 502s:
`DAMN ! worker 5 (pid: 429) died, killed by signal 11 :( trying respawn ...`
unbit/uwsgi#1792 (comment)
@jasonTu

jasonTu commented Jun 5, 2020

I'm still getting this using uwsgi 2.0.18 on Alpine 3.7. Is anyone else still having the same problem?

I'm hitting this issue in the same environment.

@arviCV

arviCV commented Jul 1, 2021

@jacopofar I too am getting the same error on kubernetes but not when I run locally. My image is based on https://github.com/dockerfiles/django-uwsgi-nginx

I tried almost every single thing explained here and still got exactly the same error from the uwsgi server. It happens for one particular Flask endpoint whenever I deploy to the k8s cluster, while it works perfectly on my dev machine.
Surprisingly, requesting more resources fixed the issue:

resources:
  limits:
    memory: 1Gi
  requests:
    memory: 512Mi

duttonw referenced this issue in qld-gov-au/opswx-ckan-cookbook Oct 11, 2022

@ylmuhaha

ylmuhaha commented Sep 5, 2023

I have also encountered this problem. My uwsgi version is 2.0.18, with threads per worker set to 6.
This is my analysis:
One thread finished a request and called uwsgi_close_request. There it found that the worker's delta_requests had reached max_requests, so it called goodbye_cruel_world, cursed the worker, and then called simple_goodbye_cruel_world, which waits for the remaining threads to end.
However, another thread was handling a time-consuming request; it was slow but not actually stuck. So after the reload-mercy time (60s in my case), uwsgi_master_check_mercy killed the worker outright.

I wonder if there is a more graceful way to handle this. For example, in simple_goodbye_cruel_world, set manage_next_request to zero before wait_for_threads so the worker stops accepting new requests, and then have uwsgi_master_check_mercy wait for the threads to end before killing the worker with signal 9. If the worker really is stuck, it can still be killed by harakiri.
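
Until something like that is implemented, a hedged mitigation based on this analysis is simply to give reloading workers more time to drain their in-flight threads before the master force-kills them (values are illustrative):

[uwsgi]
max-requests = 1000          ; triggers the worker-reload path described above
worker-reload-mercy = 300    ; raise from the 60s mentioned above: how long the master waits before force-killing a reloading worker
harakiri = 120               ; genuinely stuck requests are still killed by harakiri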

@Stephane-Ag

So far, upping threads from 1 to 4 seems to have helped for me.

@toabi

toabi commented Sep 18, 2023

I seem to have a similar issue.

alpine 3.17.5, uwsgi 2.0.22 and python 3.10.13

I compile uwsgi on Alpine and copy it into my app container.
The Django app works for GET requests, but when I try some POST requests, it fails with the segfault.

  • setting threads from 1 to 4 did not help
  • disabling threads did not help
  • giving it lots of resources did not help

Everything works locally on macOS/arm64. It fails in our linux/amd64 kubernetes cluster.

@Sprabu4u

Sprabu4u commented Nov 8, 2023

I am also facing a similar issue with alpine 3.17.5, uwsgi 2.0.22 and python 3.10.13.

But my application works fine with the older python 3.9 / alpine 3.15 combination. I tried all the above suggestions, but no luck.

@Sprabu4u

Sprabu4u commented Nov 8, 2023

I seem to have a similar issue.

alpine 3.17.5, uwsgi 2.0.22 and python 3.10.13

I compile uwsgi on Alpine and copy it into my app container. The Django app works for GET requests, but when I try some POST requests, it fails with the segfault.

  • setting threads from 1 to 4 did not help
  • disabling threads did not help
  • giving it lots of resources did not help

Everything works locally on macOS/arm64. It fails in our linux/amd64 kubernetes cluster.

Is this issue resolved? Is there any workaround for it?

snim2 pushed a commit to nhsx/nhsx-website that referenced this issue Nov 20, 2023
When running in staging without this set we would see workers randomly die, causing 502s:
`DAMN ! worker 5 (pid: 429) died, killed by signal 11 :( trying respawn ...`
unbit/uwsgi#1792 (comment)
@jedie
Author

jedie commented Sep 19, 2024

nhsx/nhsx-website@eb91494 is a solution?!?
