Segfault on Alpine Linux Docker #1935
After resuming the container, we can see the loading also resume, with the following output:
What were you doing with the Pillow library, and what file were you using?
On a normal Ubuntu install it works well; however, Alpine is kind of the standard for Docker security thanks to PaX/grsecurity...
Segfaults are actually pretty easy to pin down, more or less.
Fedora is the best out of the box for gdb, although there are macros that you can put in for other Linuxes. Those can give you a Python-level backtrace from within gdb, as well as the ability to inspect Python objects. The unified view is really useful. If you can, it helps to run a version of Python with debug symbols, and to compile with optimizations off. (On Ubuntu, that means installing python-dbg and compiling and running Pillow against that.)
Hm, I wasn't yet able to get Python with debug symbols, as I think running on Alpine Linux is relevant.
That's odd. If you're interpreting a png, then you should be well through the startup. Try strace? It's going to have a ton of output, but you should be able to tell what tried to execute and failed there.
Neither...
Can this be related to PaX? I've seen I already tried
There's nothing in Pillow that would need executable memory maps, and at any rate, PNG decoding would not trigger memory mapping in our code. That's not to say that somewhere deep inside Python the image file isn't getting memory mapped in, but it's not something that we request. (And even at that, it wouldn't require execute privs.)
To narrow this down, you could build just enough of a Docker image to run Pillow, and attempt to open and load the images in question. If they succeed, then you at least know that the problem is somewhere else. And if they fail, it's a smaller test case.
In fact I had already done so, but I couldn't really draw a conclusion...
With this for running: I'll sleep on this... If you are interested in this error, maybe you can quickly build/run this Dockerfile to see if you can find something out of it... Headache, something broke this test case again 😸
I'm not sure what's up with that test script, especially the round trip through base64. But both of these test scripts work without crashing:
and:
The load() at the end is important, since Pillow does lazy loading of the image -- open only hits the Python layer and gets the image metadata. If you don't actually access the image pixels, then it never loads them and you never hit the C layer for decompression. So based on this, I'd say that either you're doing something else with the image that's causing issues, or it's failing in an entirely different place.
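The scripts themselves aren't preserved above, but a minimal sketch of the kind of open-then-load test being described would look like this (the file path is just a placeholder):

```python
from PIL import Image

# open() only reads the header in the Python layer: format, size, mode.
# No pixel data is decoded yet, so a decoder-level crash can't happen here.
im = Image.open("example.png")  # placeholder path
print(im.format, im.size, im.mode)

# load() forces the actual decode, which is where Pillow's C layer
# (zlib decompression for PNG) finally runs -- and where a segfault
# in the decoder would show up.
im.load()
```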
The round trip was because I tried to mimic the original code. Well, thanks a lot for your help! Indeed, it also works on my Docker for Windows beta. So at least I can exclude this possible factor as a cause for now...
The base64 should be a noop, and if it's not, then that's going to be a completely separate problem. In general though, Pillow shouldn't segfault, and I'll try to track them down given some sort of reproducible test case.
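As a side note, a base64 round trip over arbitrary bytes is lossless, so by itself it shouldn't alter the image data at all; a quick sanity check (using dummy bytes rather than a real PNG):

```python
import base64
import os

data = os.urandom(1024)  # stand-in for the PNG file contents
assert base64.b64decode(base64.b64encode(data)) == data
```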
True. OK, thanks again. I'll keep on digging...
Finally, with valgrind, I was able to get at least some error message. It seems like something "deep inside python"; based on the logs, can anything be said from a Pillow point of view, such as "definitely not related to Pillow", "related but not responsible", or something alike? Effectively it seems to be triggered by the PaX flags and probably by the simple image loading, though that passed without problem in the test. I still couldn't figure out how to get a useful stack trace in order to see which code is calling it, but on a best-guess basis, the program also implements image manipulation at some point: https://github.com/odoo/odoo/blob/9.0/openerp/tools/image.py#L10
The trace and activity from valgrind make it look like it's a gc pass that's segfaulting. It's possible that it's related to the PaX flags, but the underlying cause of that would be a mismatch in incref/decref in an extension, so that there is a use after free. The other thing that I noticed is that the error is at least 3 seconds after the logging about pngs, so it's unlikely that that is the direct cause. It would fit with a gc issue, but that just means that the crash is far removed from the source of the problem.
💫 I think this is closable, thank you for your help! I'm trying to get help from the Alpine Linux maintainers now...
This gets ever stranger: on the test Docker image we made, the first image passes, and a second one does too.
I don't know exactly what I did, probably something out of (or a combination of):
EDIT: But I now got a backtrace for the situation; maybe you can have a look at it? Without knowing anything about core dump analysis, I would guess the last item is the culprit (musl), but yeah, things are never that easy 😄
Do you have an actual core file? Or a Dockerfile that can replicate this? The backtrace looks like pure Python; it's a bunch of frames and PyObject_Call. The bit about musl looks like the entry point for Python, so I think that's a red herring. (Aside: I'm not seeing how to get symbols for the apks, nor are breakpoints working, nor the Python tweaks to gdb that let me see the Python objects. Alpine seems to be something of a step backwards in terms of debuggability.)
I managed to dump a core file off the Docker in the VM: https://filebin.net/9svhknfet7ofut0j
I found a list of differences between musl libc and glibc; it might be of interest, yet maybe not directly related: http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc#Name_Resolver_.2F_DNS
I wasn't able to get much more out of the core dump, since I can't find any symbol packages for the Python from the .apk. I'm pretty sure that symbols are required for the py-bt command. I'm not sure what to tell you at this point, other than to try to narrow down what causes it. That may require switching to a Python build that has symbols, or running under strace (which is blocked by something), or really verbose logging to narrow down the trigger.
OK, I'm really grateful for your help, as I have never touched these topics before and don't have many real options due to gaps in my knowledge. I've tried to build a Python with symbols; I've tried it this way. I'll try to check back with the Alpine community.
The symbols might be in a central location like /usr/lib/debug or /usr/share/gdb.
I think I'm lucky: http://pkgs.alpinelinux.org/package/edge/main/x86_64/python-dbg - I will make another attempt with this.
Well, the steps are pretty small ones; I'm constantly running into this, and Google doesn't say a useful word about it...
Yeah, there's an initial boot-up sequence that you're seeing there. I've been isolating the logs with an initial request to
I haven't been getting any logging traces from my failures either; it's as if the app just vanishes.
Bravo on getting the specific curl isolated, btw! I've just been hitting the whole page 😑
This new curl is perfect - I think @wiredfool is correct with the hypothesis that this is likely caused by
I traced the code backwards from
This would also explain the original error @blaggacao was seeing with the database not being created, which is also pretty heavy on the
This sort of debugging would be far easier if Python were compiled
I totally agree - I guess we all knew there were sacrifices when moving to Alpine though. How else would we have such a small OS? After playing around a bit more, I am confident this isn't your problem @wiredfool. Thank you incredibly for your help in this; I am going to take the issue into a more appropriate realm (once I identify that it truly is lxml). You can go ahead and close this ticket, again 😸
I've just hit a similar issue with a Django/Pillow-based app on Alpine with Python 3.5.2. I quickly suspected a C extension (and Pillow) of causing the segfault, but in my case it was simply due to a stack overflow (819+ stack frames of ~100 bytes each). With musl libc the default stack size is 80 KB (81920 B), while it is 8 MB with glibc. 80 KB is a "reasonable" default, but it might not be a suitable choice for an existing application that assumes an 8 MB stack size. In my case increasing the stack size from the main thread "solved" the problem. I've done some research to see if Pillow or its dependencies were allocating large buffers on the stack, but it doesn't look like it. I would appreciate it if someone could confirm that point. I am sorry if I burden you
@inakoll That would make some sense, as the stack trace linked (way) above is something like 414 items deep in CPython. XML processing would tend to create that sort of stack trace. At any rate, it's an easy-to-test hypothesis.
Any idea on how to add that to the image? I'm not much on the C side of the street...
The threading line is python, so I threw that into the odoo-bin launcher as a test:
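The snippet itself isn't preserved above, but presumably it was a call along these lines, using the standard library's threading.stack_size; the 8 MB figure here is only an assumption, chosen to match glibc's default:

```python
import threading

# Request an 8 MB stack for any threads created after this point,
# matching glibc's default rather than musl's ~80 KB default.
# This has to run before the worker threads are spawned.
threading.stack_size(8 * 1024 * 1024)
```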
And it appears to be successful, at least in the case found above. There may be other calls that have issues.
It seems successful in mine too, albeit with limited testing. I'll submit a PR to your image @yajo so we can perform some more extensive testing.
Hrmmm ok maybe I don't quite know your image well enough to submit that change @yajo, but here's what you need:
* Fix threading stack overflow size from python-pillow/Pillow#1935
* [FIX] Threading stack overflow size
* Fix threading stack overflow size from python-pillow/Pillow#1935
* Fix awk usage
An excellent whodunnit!
I might recall http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc as the way to go.
I think the best would be to open a PR to Odoo directly. I guess this could affect other deployments with different stack sizes too. Adding a configuration parameter, an env variable, or simply a sane default that works everywhere seems like a good improvement, even for stable releases. What do you think @lasley?
IMO we would need to identify the real CPython lib causing this issue and then submit it there. This isn't really an Odoo issue, as @inakoll confirmed by mentioning that the issue also occurs in a Django app. Setting the thread stack size is totally just duct tape on the situation.
The current theory is that it's a stack size issue. Overflowing the stack can cause a segfault and can happen in pure Python (hence Stackless Python, which was tuned for recursive loads). It doesn't have to be a CPython extension. One likely cause would be recursive parsing of deep XML documents, or an AST generated from them. (This is at least plausible from the backtrace listed above.) You may be able to cause the same error by parsing any XML in the accounting package, or it might require building bits back up with the render.
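To make the "pure Python can overflow the stack" point concrete, here is a small, hypothetical demonstration (not taken from the thread); the frame and stack figures in the comments are the rough numbers reported above:

```python
import threading

def depth(n=0):
    # Recurse until Python's own recursion limit (default ~1000) stops us.
    try:
        return depth(n + 1)
    except RuntimeError:  # RecursionError on Python 3.5+ is a RuntimeError subclass
        return n

def worker():
    print("reached depth", depth())

# On glibc, a new thread gets roughly an 8 MB stack by default, so this
# prints a depth near 1000 and exits cleanly. On musl, the default thread
# stack is about 80 KB, which (at roughly 100 bytes of C stack per Python
# frame, per the measurement above) can be exhausted after ~800 frames --
# i.e. the process may segfault before RecursionError is ever raised.
t = threading.Thread(target=worker)
t.start()
t.join()
```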
Edit: just realised that python-pillow is obviously not the right place for this issue, but anyway I think you guys would be interested in more evidence to support where the issue stems from. Hi all, I can also report what I think is the same issue occurring with another Docker container: ressu/sickgear. This also depends on
To reproduce, build the container and run it:
I too have tried to debug this, first with
Which doesn't enable any debugging in terms of following stack traces. I'm not sure why it treats the obviously fully loaded software as still starting. I do get an error about being unable to disable address randomisation, but I don't think that's relevant unless it indirectly indicates another problem.
The plot thickens! An interesting note - we were able to solve this problem in our Alpine container by switching to a multi-threaded worker model. That revelation was honestly what caused me to just call it a lost cause and go along with the switch to Debian.
Edit: just confirming that this issue can also be fixed on SickGear by increasing the threading stack size: Does that resolve the issue because each thread gets its "own stack" and therefore there is less chance of running out of stack space (different from a stack overflow, I assume)? In which case I wonder if I can patch SickGear with the larger thread stack size. How does one come up with an optimal size? Some calculation based on the likely size/depth of the XML documents being parsed?
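One rough way to reason about sizing, using only the figures quoted earlier in the thread (about 100 bytes of C stack per Python frame, ~800 frames fitting in 80 KB): estimate the deepest call chain you expect, multiply, and leave generous headroom. A hedged back-of-the-envelope sketch, where the depth guess and safety factor are assumptions, not measurements:

```python
# Rough sizing rule of thumb based on the numbers reported above;
# the per-frame cost is an approximation for this workload, not a guarantee.
BYTES_PER_FRAME = 100          # measured earlier in this thread
EXPECTED_MAX_DEPTH = 5000      # guess for deeply nested XML parsing
SAFETY_FACTOR = 4

needed = BYTES_PER_FRAME * EXPECTED_MAX_DEPTH * SAFETY_FACTOR
print(needed)  # ~2 MB; glibc's 8 MB default covers this comfortably
```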
@lasley I get that. I have managed to stick with Alpine by manually increasing the stack size as mentioned above. It seems like there's a bit of pressure being applied to the Alpine maintainers atm to increase the stack size, but we shall have to wait and see what happens. It's pretty sucky to have to patch so many upstream apps because this core setting is so conservative.
What did you do?
Pillow is embedded into the Odoo library. I was trying to load a data file which referenced a png, when suddenly a segfault terminated my container.
What did you expect to happen?
The processing of the png would complete successfully.
What actually happened?
With DEBUG mode on, we can see the following output.
What versions of Pillow and Python are you using?
Pillow 3.2.0 and (probably through) 2.7.0
Python 2.7.11
Alpine Linux 3.3