Segfault on Alpine Linux Docker #1935
After resuming the container, we can see the loading also resume, with the following output:
What were you doing with the Pillow library, and what file were you using?
On a normal Ubuntu install it works well; however, Alpine is kind of the standard for Docker security thanks to PaX/grsecurity...
Segfaults are actually pretty easy to pin down, more or less.
Fedora is the best out of the box for gdb, although there are macros that you can put in for other Linuxes. Those can give you a Python-level backtrace from within gdb, as well as the ability to inspect Python objects. The unified view is really useful. If you can, it helps to run a version of Python with debug symbols, and to compile with optimizations off. (On Ubuntu, that means installing python-dbg and compiling and running Pillow against that.)
Hm, I wasn't yet able to get Python with debug symbols, as I think running on Alpine Linux is relevant.
That's odd. If you're interpreting a png, then you should be well through the startup. Try strace? It's going to have a ton of output, but you should be able to tell what tried to execute and failed there.
Neither...
Can this be related to PaX? I've seen I already tried
There's nothing in Pillow that would need executable memory maps, and at any rate, PNG decoding would not trigger memory mapping in our code. That's not to say that somewhere deep inside Python the image file isn't getting memory mapped in, but it's not something that we request. (And even at that, it wouldn't require execute privs.)
To narrow this down, you could build just enough of a Docker image to run Pillow, and attempt to open and load the images in question. If they succeed, then you at least know that the problem is somewhere else. And if they fail, it's a smaller test case.
In fact I had already done so, but I couldn't really draw a conclusion...
With this for running: I'll sleep on this... If you are interested in this error, maybe you can quickly build/run this Dockerfile to see if you can find something out of it... Headache, something broke this test case again 😸
I'm not sure what's up with that test script, especially the round trip through base64. But both of these test scripts work without crashing:
and:
The load() at the end is important, since Pillow does lazy loading of the image -- open only hits the Python layer and gets the image metadata. If you don't actually access the image pixels, then it never loads them and you never hit the C layer for decompression. So based on this, I'd say that either you're doing something else with the image that's causing issues, or it's failing in an entirely different place.
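The scripts themselves aren't preserved above, but a minimal sketch of the kind of open-then-load test being described would look like this (the file path is just a placeholder):

```python
from PIL import Image

# open() only reads the header in the Python layer: format, size, mode.
# No pixel data is decoded yet, so a decoder-level crash can't happen here.
im = Image.open("example.png")  # placeholder path
print(im.format, im.size, im.mode)

# load() forces the actual decode, which is where Pillow's C layer
# (zlib decompression for PNG) finally runs -- and where a segfault
# in the decoder would show up.
im.load()
```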
The round trip was because I tried to mimic the original code. Well, thanks a lot for your help! Indeed, it also works on my Docker for Windows beta. So at least I can exclude this possible factor as a cause for now...
The base64 should be a noop, and if it's not, then that's going to be a completely separate problem. In general though, Pillow shouldn't segfault, and I'll try to track them down given some sort of reproducible test case.
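As a side note, a base64 round trip over arbitrary bytes is lossless, so by itself it shouldn't alter the image data at all; a quick sanity check (using dummy bytes rather than a real PNG):

```python
import base64
import os

data = os.urandom(1024)  # stand-in for the PNG file contents
assert base64.b64decode(base64.b64encode(data)) == data
```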
True. OK, thanks again. I'll keep on digging...
Finally, with valgrind, I was able to get at least some error message. It seems like something "deep inside python"; based on the logs, can anything be said from a Pillow point of view, such as "definitely not related to Pillow", "related but not responsible", or something alike? Effectively it seems to be triggered by the PaX flags and probably by the simple image loading, though that passed without problem in the test. I still couldn't figure out how to get a useful stack trace in order to see which code is calling it, but on a best-guess basis, the program also implements image manipulation at some point: https://github.com/odoo/odoo/blob/9.0/openerp/tools/image.py#L10
The trace and activity from valgrind make it look like it's a gc pass that's segfaulting. It's possible that it's related to the PaX flags, but the underlying cause of that would be a mismatch in incref/decref in an extension, so that there is a use after free. The other thing that I noticed is that the error is at least 3 seconds after the logging about pngs, so it's unlikely that that is the direct cause. It would fit with a gc issue, but that just means that the crash is far removed from the source of the problem.
💫 I think this is closable, thank you for your help! I'm trying to get help from the Alpine Linux maintainers now...
This gets ever stranger: on the test Docker image we made, the first image passes, and a second one does too.
I don't know exactly what I did, probably something out of (or a combination of):
EDIT: But I now got a backtrace for the situation; maybe you can have a look at it? Without knowing anything about core dump analysis, I would guess the last item is the culprit (musl), but yeah, things are never that easy 😄
Do you have an actual core file? Or a Dockerfile that can replicate this? The backtrace looks like pure Python; it's a bunch of frames and PyObject_Call. The bit about musl looks like the entry point for Python, so I think that's a red herring. (Aside: I'm not seeing how to get symbols for the apks, nor are breakpoints working, nor the Python tweaks to gdb that let me see the Python objects. Alpine seems to be something of a step backwards in terms of debuggability.)
I managed to dump a core file off the Docker in the VM: https://filebin.net/9svhknfet7ofut0j
I found a list of differences between musl libc and glibc; it might be of interest, yet maybe not directly related: http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc#Name_Resolver_.2F_DNS
I wasn't able to get much more out of the core dump, since I can't find any symbol packages for the Python from the .apk. I'm pretty sure that symbols are required for the py-bt command. I'm not sure what to tell you at this point, other than to try to narrow down what causes it. That may require switching to a Python build that has symbols, or running under strace (which is blocked by something), or really verbose logging to narrow down the trigger.
OK, I'm really grateful for your help, as I have never touched these topics before and don't have many real options due to gaps in my knowledge. I've tried to build a Python with symbols; I've tried it this way. I'll try to check back with the Alpine community.
The symbols might be in a central location like /usr/lib/debug or /usr/share/gdb.
I think I'm lucky: http://pkgs.alpinelinux.org/package/edge/main/x86_64/python-dbg - I will make another attempt with this.
Well, the steps are pretty small ones; I'm constantly running into this, and Google doesn't say a useful word about it...
Yeah, there's an initial boot-up sequence that you're seeing there. I've been isolating the logs with an initial request to
I haven't been getting any logging traces from my failures either; it's as if the app just vanishes.
Bravo on getting the specific curl isolated, btw! I've just been hitting the whole page 😑
This new curl is perfect - I think @wiredfool is correct with the hypothesis that this is likely caused by
I traced the code backwards from
This would also explain the original error @blaggacao was seeing with the database not being created, which is also pretty heavy on the
This sort of debugging would be far easier if Python were compiled
I totally agree - I guess we all knew there were sacrifices when moving to Alpine though. How else would we have such a small OS? After playing around a bit more, I am confident this isn't your problem @wiredfool. Thank you incredibly for your help in this; I am going to take the issue into a more appropriate realm (once I identify that it truly is lxml). You can go ahead and close this ticket, again 😸
I've just hit a similar issue with a Django/Pillow-based app on Alpine with Python 3.5.2. I quickly suspected a C extension (and Pillow) of causing the segfault, but in my case it was simply due to a stack overflow (819+ stack frames of ~100 bytes each). With musl libc the default stack size is 80 KB (81920 B), while it is 8 MB with glibc. 80 KB is a "reasonable" default, but it might not be a suitable choice for an existing application that assumes an 8 MB stack size. In my case increasing the stack size from the main thread "solved" the problem. I've done some research to see if Pillow or its dependencies were allocating large buffers on the stack, but it doesn't look like it. I would appreciate it if someone could confirm that point. I am sorry if I burden you
@inakoll That would make some sense, as the stack trace linked (way) above is something like 414 items deep in CPython. XML processing would tend to create that sort of stack trace. At any rate, it's an easy-to-test hypothesis.
Any idea on how to add that to the image? I'm not much on the C side of the street...
The threading line is python, so I threw that into the odoo-bin launcher as a test:
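The snippet itself isn't preserved above, but presumably it was a call along these lines, using the standard library's threading.stack_size; the 8 MB figure here is only an assumption, chosen to match glibc's default:

```python
import threading

# Request an 8 MB stack for any threads created after this point,
# matching glibc's default rather than musl's ~80 KB default.
# This has to run before the worker threads are spawned.
threading.stack_size(8 * 1024 * 1024)
```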
And it appears to be successful, at least in the case found above. There may be other calls that have issues.
It seems successful in mine too, albeit with limited testing. I'll submit a PR to your image @yajo so we can perform some more extensive testing.
Hrmmm ok maybe I don't quite know your image well enough to submit that change @yajo, but here's what you need:
* Fix threading stack overflow size from python-pillow/Pillow#1935
* [FIX] Threading stack overflow size
* Fix threading stack overflow size from python-pillow/Pillow#1935
* Fix awk usage
An excellent whodunnit!
I might recall http://wiki.musl-libc.org/wiki/Functional_differences_from_glibc as the way to go.
I think the best would be to open a PR to Odoo directly. I guess this could affect other deployments with different stack sizes too. Adding a configuration parameter, an env variable, or simply a sane default that works everywhere seems like a good improvement, even for stable releases. What do you think @lasley?
IMO we would need to identify the real CPython lib causing this issue and then submit it there. This isn't really an Odoo issue, as @inakoll confirmed by mentioning that the issue also occurs in a Django app. Setting the thread stack size is totally just duct tape on the situation.
The current theory is that it's a stack size issue. Overflowing the stack can cause a segfault and can happen in pure Python (hence Stackless Python, which was tuned for recursive loads). It doesn't have to be a CPython extension. One likely cause would be recursive parsing of deep XML documents, or an AST generated from them. (This is at least plausible from the backtrace listed above.) You may be able to cause the same error by parsing any XML in the accounting package, or it might require building bits back up with the render.
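To make the "pure Python can overflow the stack" point concrete, here is a small, hypothetical demonstration (not taken from the thread); the frame and stack figures in the comments are the rough numbers reported above:

```python
import threading

def depth(n=0):
    # Recurse until Python's own recursion limit (default ~1000) stops us.
    try:
        return depth(n + 1)
    except RuntimeError:  # RecursionError on Python 3.5+ is a RuntimeError subclass
        return n

def worker():
    print("reached depth", depth())

# On glibc, a new thread gets roughly an 8 MB stack by default, so this
# prints a depth near 1000 and exits cleanly. On musl, the default thread
# stack is about 80 KB, which (at roughly 100 bytes of C stack per Python
# frame, per the measurement above) can be exhausted after ~800 frames --
# i.e. the process may segfault before RecursionError is ever raised.
t = threading.Thread(target=worker)
t.start()
t.join()
```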
Edit: just realised that python-pillow is obviously not the right place for this issue, but anyway I think you guys would be interested in more evidence to support where the issue stems from. Hi all, I can also report what I think is the same issue occurring with another Docker container: ressu/sickgear. This also depends on
To reproduce, build the container and run it:
I too have tried to debug this, first with
Which doesn't enable any debugging in terms of following stack traces. I'm not sure why it treats the obviously fully loaded software as still starting. I do get an error about being unable to disable address randomisation, but I don't think that's relevant unless it indirectly indicates another problem.
The plot thickens! An interesting note - we were able to solve this problem in our Alpine container by switching to a multi-threaded worker model. That revelation was honestly what caused me to just call it a lost cause and go along with the switch to Debian.
Edit: just confirming that this issue can also be fixed on SickGear by increasing the threading stack size: Does that resolve the issue because each thread gets its "own stack" and therefore there is less chance of running out of stack space (different from a stack overflow, I assume)? In which case I wonder if I can patch SickGear with the larger thread stack size. How does one come up with an optimal size? Some calculation based on the likely size/depth of the XML documents being parsed?
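One rough way to reason about sizing, using only the figures quoted earlier in the thread (about 100 bytes of C stack per Python frame, ~800 frames fitting in 80 KB): estimate the deepest call chain you expect, multiply, and leave generous headroom. A hedged back-of-the-envelope sketch, where the depth guess and safety factor are assumptions, not measurements:

```python
# Rough sizing rule of thumb based on the numbers reported above;
# the per-frame cost is an approximation for this workload, not a guarantee.
BYTES_PER_FRAME = 100          # measured earlier in this thread
EXPECTED_MAX_DEPTH = 5000      # guess for deeply nested XML parsing
SAFETY_FACTOR = 4

needed = BYTES_PER_FRAME * EXPECTED_MAX_DEPTH * SAFETY_FACTOR
print(needed)  # ~2 MB; glibc's 8 MB default covers this comfortably
```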
@lasley I get that. I have managed to stick with Alpine by manually increasing the stack size as mentioned above. It seems like there's a bit of pressure being applied to the Alpine maintainers atm to increase the stack size, but we shall have to wait and see what happens. It's pretty sucky to have to patch so many upstream apps because this core setting is so conservative.
What did you do?
Pillow is embedded into the Odoo library. I was trying to load a data file which referenced a png, when suddenly a segfault terminated my container.
What did you expect to happen?
The processing of the png would complete successfully.
What actually happened?
With DEBUG mode on, we can see the following output.
What versions of Pillow and Python are you using?
Pillow 3.2.0 and (probably through) 2.7.0
Python 2.7.11
Alpine Linux 3.3