Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault on Alpine Linux Docker #1935

Closed
blaggacao opened this issue May 30, 2016 · 71 comments
Closed

Segfault on Alpine Linux Docker #1935

blaggacao opened this issue May 30, 2016 · 71 comments

Comments

@blaggacao
Copy link

blaggacao commented May 30, 2016

What did you do?

Pillow is embedded into the Odoo library. I was trying to load a data file which referenced a png, when suddenly a segfault terminated my container.

What did you expect to happen?

The processing of the png would complete successfully.

What actually happened?

Putting on DEBUG mode, we can see the following output.

odoo    | 2016-05-30 08:29:20,832 7 INFO test openerp.modules.loading: loading base/base_data.xml
odoo    | 2016-05-30 08:29:21,050 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,052 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,052 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,053 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,053 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,054 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,054 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:29:21,087 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,088 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,088 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,088 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,093 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,094 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,094 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:29:21,275 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,276 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,276 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,276 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,277 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,277 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,278 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:29:21,338 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,338 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,339 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,339 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,339 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,340 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,340 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:29:21,384 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,385 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,385 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,385 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,386 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,386 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,386 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:29:21,563 7 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:29:21,563 7 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:29:21,564 7 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:29:21,564 7 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:29:21,564 7 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:29:21,564 7 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:29:21,565 7 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | Segmentation fault

What versions of Pillow and Python are you using?

Pillow 3.2.0 and (probably through) 2.7.0
Python 2.7.11
Alpine Linux 3.3

@blaggacao
Copy link
Author

After resuming the container, we can see also resuming the loading with the following output:

odoo    | 2016-05-30 08:34:52,329 6 INFO test openerp.modules.loading: loading base/base_data.xml
odoo    | 2016-05-30 08:34:52,618 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:52,619 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:52,621 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:52,621 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:52,622 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:52,623 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:52,624 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:52,659 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:52,661 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:52,662 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:52,665 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:52,670 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:52,674 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:52,675 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:52,857 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:52,857 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:52,857 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:52,858 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:52,858 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:52,866 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:52,867 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:52,937 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:52,938 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:52,940 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:52,945 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:52,946 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:52,946 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:52,947 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:52,984 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:52,985 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:52,986 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:52,991 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:52,991 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:52,992 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:52,992 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:53,242 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:53,242 6 DEBUG test PIL.PngImagePlugin: STREAM bKGD 41 6
odoo    | 2016-05-30 08:34:53,243 6 DEBUG test PIL.PngImagePlugin: bKGD 41 6 (unknown)
odoo    | 2016-05-30 08:34:53,245 6 DEBUG test PIL.PngImagePlugin: STREAM pHYs 59 9
odoo    | 2016-05-30 08:34:53,246 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
odoo    | 2016-05-30 08:34:53,247 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
odoo    | 2016-05-30 08:34:53,247 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
odoo    | 2016-05-30 08:34:53,652 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:53,653 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 41 3136
odoo    | 2016-05-30 08:34:53,680 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:53,681 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 41 3136
odoo    | 2016-05-30 08:34:53,836 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:53,836 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 41 3136
odoo    | 2016-05-30 08:34:53,858 6 DEBUG test PIL.PngImagePlugin: STREAM IHDR 16 13
odoo    | 2016-05-30 08:34:53,858 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 41 3136

@wiredfool
Copy link
Member

What were you doing with the pillow library, and what file were you using?

@blaggacao
Copy link
Author

Hi Eric, I'm still trying to narrow the problem down, as I'm really not even sure, if this is the right repo down the stack... Any hint's you can probably give me to help hunting this down, I would very much appreciate...

main_partner-image
public_user-image

both seem fail

@blaggacao
Copy link
Author

On a normal ubuntu install, it works well, however, alpine is kind of the standard in docker security thanks to pax/Grsecurity...

@wiredfool
Copy link
Member

Segfaults are actually pretty easy to pin down, more or less.

  1. Run it under gdb. e.g.:
gdb python
...
r my_test_.py
[things go boom]
bt
[big long stack trace where things went wrong]

Fedora is the best out of the box for gdb, although there are macros that you can put in for other linuicies. That can give you a python level back trace from within gdb, as well as being able to inspect python objects. The unified view is really useful.

If you can, it helps to run a version of python with debug symbols, and compile with optimizations off. (on ubuntu, it's by installing python-dbg and compiling and running pillow against that)

  1. Once you have a backtrace of any form, you'll at least be able to trace down which c extension it's from. Hopefully, you can narrow it down and figure out which image is causing the problem, then we can have a very simple test case to dive into and figure out what's going wrong.

@blaggacao
Copy link
Author

Hm, I wasn't able yet to get python with debug symbols, as I think running on Alpine Linux is relevant.
Yet, the gdb output is not quite useful:

2016-05-30 10:58:16,468 6 DEBUG test PIL.PngImagePlugin: STREAM tIME 80 7
2016-05-30 10:58:16,468 6 DEBUG test PIL.PngImagePlugin: tIME 80 7 (unknown)
2016-05-30 10:58:16,470 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb)
(gdb) bt
No stack.
(gdb)

@wiredfool
Copy link
Member

That's odd. If you're interpreting a png, then you should be well through the startup.

Try strace? It's going to have a ton of output, but you should be able to tell what's tried to execute and fail there.

@blaggacao
Copy link
Author

Neither...

2016-05-30 10:58:16,470 6 DEBUG test PIL.PngImagePlugin: STREAM IDAT 99 7809
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb)
(gdb) bt
No stack.
(gdb) strace
No default breakpoint address now.

@blaggacao
Copy link
Author

Can this be related to PaX ?
https://en.wikibooks.org/wiki/Grsecurity/Appendix/PaX_Flags
eg.: https://pax.grsecurity.net/docs/mprotect.txt

I've seen paxctl -cm in order to disable mprotect on some nodejs dockers beeing in connection with segfaults...

I already tried paxctl -cm on some libraries (all of PIL/*.so and /lib/zlib.so) to no avail.

@wiredfool
Copy link
Member

There's nothing in Pillow that would need executable memory maps, and at any rate, png decoding would not trigger memory mapping in our code. That's not to say that somewhere deep inside python the image file isn't getting memory mapped in, but it's not something that we request. (and even at that, it wouldn't require execute privs).

@wiredfool
Copy link
Member

To narrow this down, you could build just enough of a docker image to run pillow, and attempt to open and load the images in question. If they succeed, then you at least know that it's somewhere else. And if they fail, it's going to be a smaller test case.

@blaggacao
Copy link
Author

blaggacao commented May 30, 2016

In fact I had already done so, but I couldn't really draw a conclusion...

FROM alpine
USER root

# Part of my stack, no idea if maybe relevant.
RUN set -x \
    && apk add --no-cache su-exec \
    && chmod 500 $(which su-exec)

RUN apk --no-cache add python \
                       build-base \
                       python-dev \
                       py-pip \
                       # Pillow image libraries
                       jpeg-dev \
                       zlib-dev 
RUN LIBRARY_PATH=/lib:/usr/lib /bin/sh -c "pip install Pillow==3.2.0"

# --- User Creation ---
RUN addgroup -S -g 9001 otheruser && \
    adduser -S -u 9001 -G otheruser -h /home -s /sbin/nologin otheruser && \
    chown -R otheruser:otheruser /home
RUN apk --no-cache add gdb
ADD https://raw.githubusercontent.com/odoo/odoo/d2f0dba64c7afb3a9de49e7c3d46f74542f8bbbd/openerp/addons/base/static/img/main_partner-image.png /testimage.png
ADD https://gist.githubusercontent.com/blaggacao/aee86f3d8687efefb8186c9ffc19ac9b/raw/04b14f2407294c563e4b646162c754f0bc8fb00d/run.py /run.py
RUN echo "#!/bin/sh" >> /running && \
    echo "gdb -ex r --args python -v run.py" >> /running
RUN chmod +x /running

CMD ["/running"]

With this for running:
https://gist.githubusercontent.com/blaggacao/aee86f3d8687efefb8186c9ffc19ac9b/raw/04b14f2407294c563e4b646162c754f0bc8fb00d/run.py

I'll sleep over this... If you are interested in this error, maybe you can quickly build /run this dockerfile to see if you can find something out of this...

HEadache, something broke this test case again 😸

@wiredfool
Copy link
Member

I'm not sure what's up with that test script, especially the round trip through base64.

But, both of these test scripts work, without crashing:

from PIL import Image
im = Image.open('testimage.png')
im.load()

and:

import cStringIO as StringIO
from PIL import Image
with open('/testimage.png', 'rb') as f:
    data = f.read()
image_stream = StringIO.StringIO(data)
image = Image.open(image_stream)
image.load()

The load() at the end is important, since pillow does lazy loading of the image -- open only hits the python layer and gets the image metadata. If you don't actually access the image pixels, then it never loads them and you never hit the c layer for decompression.

So based on this, I'd say that either you're doing something else with the image that's causing issues, or it's failing in an entirely different place.

@blaggacao
Copy link
Author

The roundtrip was because I tried to mimic the original code. Well, thanks a lot for your help! Indeed, it also works on my docker for windows beta. So at least I can exclude this possible entropy as a cause for now...

@wiredfool
Copy link
Member

The base64 should be a noop, and if it's not, then that's going to be a completely separate problem.

In general though, Pillow shouldn't segfault, and I'll try to track them down given some sort of reproducible test case.

@blaggacao
Copy link
Author

True. OK, thanks again. I'll keep on digging...

@blaggacao
Copy link
Author

blaggacao commented May 30, 2016

Finally, with valgrind, I was able to get at least some error message:
https://gist.github.com/blaggacao/1df4f9dffb0e515aaf6decc6e67ff2d4#file-valgrind-log-L2354

It seems that like something "deep inside python", based on the logs, can there be anything said from a Pillow point of view, such as "definitely not related to pillow", "related but not responsible" or something alike?

Effectively it seems to be triggered by PaX flags and probably by the simple image loading, as this passed without problem in the test. I still couldn't figure out how to get a useful stack trace in order to see which code is calling but on a best guess basis, the program implements also image manipulation at some point https://github.com/odoo/odoo/blob/9.0/openerp/tools/image.py#L10

@wiredfool
Copy link
Member

The trace and activity from valgrind make it look like its a gc pass that's segfaulting. Its possible that it's related to the pax flags, but the underlying cause of that would be a mismatch in incref/decref in an extension, so that there is a use after freeing.

The other thing that I noticed is that the error is at least 3 seconds after the logging about pngs, so it's unlikely that that is the direct cause. It would fit with a gc issue, but that just means tha the crash is far removed from the source of the problem.

@blaggacao
Copy link
Author

💫 I think this is closable, thank you for you help! I'm trying to get help from the alpine linux mainainers now...

@blaggacao
Copy link
Author

This get's ever stranger, on the test docker we made, the first images passes and a second one also.
NOW, in the actual code the second image fails, effectively near the Pillow interaction, I'll try to narrow this down further and eventually reopen...

@blaggacao
Copy link
Author

blaggacao commented Jun 1, 2016

I don't know exactly what I did, probably something out of (or a combination of):

  • ulimit -c unlimited
  • call gdb from within a script so it's not PID1 on docker
  • run container in privileged mode
  • change /proc/sys/kernel/core_pattern, although no core dump file seems to be written on the new location

EDIT
I double checked and definitely it's the privileged flag which made the interactive gdb session working. Seems that gdb needs some kind of kernel capabilities. Not that far a catch... The ulimit -c unlimited in conjunction with the core_patten then would be responsible for generating the dump file, which is not created when the process is properly caught by gdb.
END EDIT

But, I now got a backtrace for the situation, maybe you can have a look at it?
https://gist.github.com/blaggacao/c081658dea0e7b95cb006a1b1673786f

Without knowing about coredump analysis at all, I would guess, the last item is the culprit (musl), but yeah, things are never that easy 😄

@wiredfool
Copy link
Member

Do you have an actual core file? Or a docker file that can replicate this?

The backtrace looks like pure python, it's a bunch of frames and PyObject_Call. The bit about musl looks like the entry point for python, so I think that's a red herring.

(aside, I'm not seeing how to get symbols for the apks, nor are breakpoints working, nor the python tweaks to gdb to let me see the python objects. Alpine seems to be something of a step backwards in terms of debugability)

@blaggacao
Copy link
Author

I can managed to dump a core file off the Docker in VM, https://filebin.net/9svhknfet7ofut0j
I can draft a dockerfile, which should be able to replicate, however it would be our full blown deployment stack. Thanks for your help!

@blaggacao
Copy link
Author

I found a list of differences between muslibc and glibc, might be of interest, yet maybe not directly related: http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc#Name_Resolver_.2F_DNS

@wiredfool
Copy link
Member

I wasn't able to get much more out of the core dump, since I can't find any symbols packages for the python from the .apk. I'm pretty sure that symbols are required for the py-bt command.

I'm not sure what to tell you at this point, other than to try to narrow down what causes it. That may require switching to a python build that has symbols, or running under strace (which is blocked by something) or really verbose logging to narrow down the trigger.

@blaggacao
Copy link
Author

OK, I'm really grateful for your help as I have never ever touched these topics and don't have that much real options due to lacks in knowledge. I've tried to build a python with symbols, yet objdump -t tells me "no symbols found" so I'm confused again...

I've tried it this way, I'll try to check back with the alpine community.

# # --- Build Debug Liraries for GDB Debugging ---
# ENV CFLAGS='-Wall -O0 -g'
# RUN apk add --update alpine-sdk \
#   && adduser -G abuild -g "Alpine Package Builder" -s /bin/ash -D builder \
#   && echo "builder ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers \
#   && mkdir /packages \
#   && chown builder:abuild /packages \
#   && git clone --depth 1 git://dev.alpinelinux.org/aports \
#   && abuild-keygen -a -i \
#   && cd aports/main/python
# USER builder
# RUN cp -r aports/main/python /packages/python \
#   && cd /packages/python \
#   && abuild-keygen -a -i \
#   && abuild checksum \
#   && abuild -r \
#   && abuild -i python \
#   && abuild -i python-dev
# USER root

@wiredfool
Copy link
Member

The symbols might be in a central location like /usr/lib/debug or /usr/share/gdb.

@blaggacao
Copy link
Author

I think, I'm lucky: http://pkgs.alpinelinux.org/package/edge/main/x86_64/python-dbg Will make another attempt with this.

@blaggacao
Copy link
Author

Well steps are pretty small ones, I'm constantly running into this, google doesn't say a useful word about this...

Starting program: /usr/lib/debug/usr/bin/python2.7.debug
warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
Warning:
Cannot insert breakpoint -2.
Cannot access memory at address 0x65d

@lasley
Copy link

lasley commented Feb 2, 2017

Yeah there's an initial boot up sequence that you're seeing there. I've been isolating the logs with an initial request to /, which triggers that - then a subsequent request to the fail-page.

I haven't been getting any logging traces from my failures either, it's as if the app just vanishes.

@lasley
Copy link

lasley commented Feb 2, 2017

Bravo on getting the specific curl isolated btw! I've just been hitting the whole page 😑

@lasley
Copy link

lasley commented Feb 2, 2017

This new curl is perfect - I think @wiredfool is correct with the hypothesis that this is likely caused by lxml.

I traced the code backwards from web.planner/render, and it's basically just a template render. Traced that all the way back to the compilation, which is a bunch of heavy XML handling. No Pillow in sight for any of that!

This would also explain the original error @blaggacao was seeing with the database not being created, which is also pretty heavy on the lxml side of things.

@wiredfool
Copy link
Member

This sort of debugging would be far easier if Python was compiled --with-dtrace for USDT support. That combined with BPF in recent linux kernels would mean that we could dump enter and exit from each python function from outside the container. Then again, simply having the dbg symbols would have helped as well.

@lasley
Copy link

lasley commented Feb 3, 2017

I totally agree - I guess we all knew there were sacrifices when moving to Alpine though. How else would we have such a small OS?

After playing around a bit more, I am confident this isn't your problem @wiredfool. Thank you incredibly for your help in this, I am going to take the issue into a more appropriate realm (once I identify it is truly lxml).

You can go ahead and close this ticket, again 😸

@inakoll
Copy link

inakoll commented Feb 4, 2017

I've just hit a similar issue with a django/Pillow based app on Alpine with python 3.5.2. I've quickly suspected a C extension (and Pillow) to cause the segfault but in my case it was simply due to a stack overflow (819+ stack frames of ~100 bytes each). With musl libc the default stack size is 80 KB (81920 B) while it is 8 MB with glibc. 80 KB is a "reasonable" default but it might not be a suitable choice for an existing application that assume a 8 MB stack size. In my case increasing the stack size from the main thread "solved" the problem (threading.stack_size(4*80*1024) 4x80 KB stack size for example).

I've done some research to see if Pillow or its dependencies were allocating large buffer on the stack but it doesn't look like. I would appreciate if someone could confirm that point. I am sorry if I burden you

@wiredfool
Copy link
Member

@inakoll That would make some sense, as the stack trace linked (way) above is something like 414 items deep in c python. XML processing would tend to create that sort of stack trace. At any rate, it's an easy to test hypothesis.

@yajo
Copy link

yajo commented Feb 6, 2017

Any idea on how to add that to the image? I'm not much in the C side of the street...

@wiredfool
Copy link
Member

The threading line is python, so I threw that into the odoo-bin launcher as a test:

#!/usr/bin/env python

# set server timezone in UTC before time module imported
__import__('os').environ['TZ'] = 'UTC'
__import__('pkg_resources').declare_namespace('odoo.addons')

import inject_logger

import threading
threading.stack_size(4*80*1024)

import odoo

if __name__ == "__main__":
    odoo.cli.main()

And it appears to be successful, at least in the case found above. There may be other calls that have issues.

@lasley
Copy link

lasley commented Feb 7, 2017

It seems successful in mine too, albeit with limited testing. I'll submit a PR to your image @yajo so we can perform some more extensive testing.

@lasley
Copy link

lasley commented Feb 7, 2017

Hrmmm ok maybe I don't quite know your image well enough to submit that change @yajo, but here's what you need:

# Fix stack size overflow
# https://github.com/python-pillow/Pillow/issues/1935
RUN awk '/import odoo/ { print; print "import threading; threading.stack_size(4*80*1024)"; next }1' /usr/local/bin/odoo > /tmp-bin \
    && mv /tmp-bin /usr/local/bin/odoo

/usr/local/bin/odoo being the Odoo installation path

lasley added a commit to LasLabs/docker-odoo that referenced this issue Feb 7, 2017
* Fix threading stack overflow size from python-pillow/Pillow#1935
lasley added a commit to LasLabs/docker-odoo that referenced this issue Feb 7, 2017
* [FIX] Threading stack overflow size
* Fix threading stack overflow size from python-pillow/Pillow#1935

* Fix awk usage
@blaggacao
Copy link
Author

An excellent whodunnit!

@blaggacao
Copy link
Author

I might recall http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc for the way to go.

@yajo
Copy link

yajo commented Feb 13, 2017

I think the best would be to open PR to Odoo directly. I guess this could affect any other deployments with different stack sizes. Adding a configuration parameter, env variable, or simply a sane default that works everywhere seems a good improvement, even for stable releases. What do you think @lasley?

@lasley
Copy link

lasley commented Feb 13, 2017

IMO we would need to identify the real CPython lib causing this issue then submit it there. This isn't really an Odoo issue, as @inakoll confirmed by mentioning that the issue is also occurring in a Django app. Setting the thread stack size is totally just duct tape on the situation.

@wiredfool
Copy link
Member

The current theory is that it's a stack size issue. Overflowing the stack can cause a segfault and happen in pure python (hence stackless python which was tuned for recursive loads). It doesn't have to be a CPython extension. One likely cause would be recursive parsing of deep XML documents, or an AST generated from them. (This is at least plausible from the backtrace listed above).

You may be able to cause the same error by parsing any xml in the accounting package, or it might require building bits back up with the render.

@deed02392
Copy link

deed02392 commented Jul 22, 2017

Edit: just realised that python-pillow is obviously not the right place for this issue but anyway I think you guys would be interested in more evidence to support where the issue stems from.

Hi all,

I can also report what I think is the same issue occurring with another Docker container: ressu/sickgear . This also depends on lxml which hardly seems like a coincidence. What is interesting however is that an old release of sickgear itself (0.11.x) did not suffer from the fault.

To reproduce, build the container and run it: run -it --rm --entrypoint=/bin/sh -p 8081:8081 ressu/sickgear then navigate to localhost:8081 and under the Cog wheel choose Notifications (or just navigate to http://localhost:8081/config/notifications and the application will SIGSEGV.

I too have tried to debug this, first with strace which indicated it happening around the time of doing template processing but this wasn't helpful enough. I then attempted gdb and although the application starts up great, when I trigger the segfault, gdb says:

During startup program terminated with signal SIGSEGV

Which doesn't enable any debugging in terms of following stack traces. I'm not sure why it treats the obviously fully loaded software a still starting. I do get an error about being unable to disable address randomisation but I don't think that's relevant unless this is indirectly indicating another problem.

@lasley
Copy link

lasley commented Jul 22, 2017

The plot thickens!

An interesting note - we were able to solve this problem in our Alpine container by switching to multi-threaded worker model. That revelation was honestly what caused me to just call it a lost cause and conform with the switch to Debian ☹️

@deed02392
Copy link

deed02392 commented Jul 22, 2017

Edit: just confirming that this issue can also be fixed on SickGear by increasing threading stack size: threading.stack_size(4*80*1024).

Does that resolve the issue because each thread gets its 'own stack' and therefore there is less chance for running out of stack space (different from stack overflow, I assume)? In which case I wonder if I can patch SickGear with the larger thread stack size. How does one come up with an optimal size? Some calculation based on the likely size/depth of the XML documents being parsed?

@deed02392
Copy link

@lasley I get that. I have managed to keep with Alpine by manually increasing the stack size as mentioned above. It seems like there's a bit of pressure being applied to the Alpine maintainers atm to increase the stack size, but we shall have to wait and see what happens. It's pretty sucky to have to patch upstream so many apps because this core setting is so conservative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants