Consider adding '--enable-optimizations' configure argument #160
Python has recently added a new configure argument called `--enable-optimizations`; one of the relevant issues is https://bugs.python.org/issue26359. The new argument turns on compile-time optimizations relevant to production builds. Currently this means profile-guided optimization (PGO). PGO builds take a lot longer to produce (possibly 20-40 minutes), but the resulting Python binary is about 10% faster at executing Python code (according to Python's new benchmark suite).

The new configure flag is available in the recently released Python 3.6.0b4 and was backported to the 3.5 and 2.7 branches, so it should become available in Python 2.7.13 and 3.5.3.

In the future the Python developers may decide to turn on further optimizations based on this argument, for example link-time optimization (LTO), though they haven't worked out all the bugs for that one yet.

I think the Docker images should use this argument, to benefit from the added optimizations.

Thoughts?
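For context, a minimal sketch of what a source build with the new flag looks like (the parallel make and install steps are illustrative, not taken from the image's actual Dockerfile):

$ ./configure --enable-optimizations
$ make -j "$(nproc)"   # the PGO build runs the profiling workload as part of make
$ make install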
I think it may be worth considering. I see why it takes so much longer; it forces a run of all the tests:

Running code to generate profile data (this can take a while):
make run_profile_task
make[1]: Entering directory '/usr/src/python'
: # FIXME: can't run for a cross build
LD_LIBRARY_PATH=/usr/src/python ./python -m test.regrtest --pgo || true
Run tests sequentially
0:00:00 [ 1/405] test_grammar
....
Yes, PGO means you need to compile all the code once with profiling enabled, then execute some code that exercises it in a typical way, and finally compile everything again using the generated profile data as guidance. Python for now chose to use its regression test suite as the "typical code" to run when generating the profiling data. Personally I don't think running all their tests is the best thing to do, but that's an issue to bring up with the Python folks and, at least in my mind, outside the scope of the Docker image.
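As a generic illustration of that three-step PGO flow (standard GCC options; the program and workload names are made up):

$ gcc -O2 -fprofile-generate -o app app.c   # 1. build with profiling instrumentation
$ ./app < typical-input.txt                 # 2. run a representative workload; writes .gcda profile data
$ gcc -O2 -fprofile-use -o app app.c        # 3. rebuild, guided by the collected profile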
It definitely is much longer:

$ time docker build --no-cache 3.6-rc/
...
real 1m38.180s
user 0m0.157s
sys 0m0.069s
$ # after adding `--enable-optimizations`
$ time docker build 3.6-rc/
real 34m21.418s
user 0m0.267s
sys 0m0.124s
$ # the first build even includes the `apt-get update && apt-get install tcl tk` layer, while the second used cache
Well, we have another problem (besides the performance issue) to solve before enabling that flag. Just out of curiosity: when building on Debian (changing the Dockerfile for python:3.6 normally), everything went OK. But when changing the Dockerfile for python:3.6-alpine, a segfault occurs in the ctypes tests while Python runs the test suite needed to create the profile.
After that segfault the tests stop, and I think the profile generation needed to enable the optimizations stops too, though I didn't dig deeper into the problem. I think it's worth mentioning. :)
Please also consider enabling LTO.
@philtay From what I read in the release notes of the related issue and in the resulting patch, LTO is enabled, except on macOS.
@Kentzo the release notes and bugs are a bit hard to follow, but there's a comment in there suggesting LTO still has unresolved problems.
And the way that comment sounds, I think it's too early for the general-purpose Python Docker image to adopt LTO.
@hannosch Interesting, missed that. Thank you.
@hannosch LTO is pretty safe. It's not enabled by default simply because it's not widely supported (build-tools-wise). For instance, Debian & friends have enabled it on Python for years.
@philtay ah, I didn't know that. Since most of the Docker images are based on Debian, it stands to reason that at least those variants should be compatible and support the same options as the Debian Python packages. It does look as though the Debian Python packages do a couple of conditional checks before enabling PGO and LTO, especially checking the architecture and gcc version. They also seem to change a bunch of other options (LTO_CFLAGS, EXTRA_OPT_CFLAGS, AR, RANLIB), as seen in http://bazaar.launchpad.net/~doko/python/pkg2.7-debian/view/head:/rules#L156 - so it's probably not as easy as just adding the flag.
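A rough sketch of that kind of gating (the architecture list and variable names here are illustrative assumptions, not Debian's actual rules file):

ARCH="$(dpkg --print-architecture)"
case "$ARCH" in
  amd64|i386|arm64) CONFIG_OPTS="--enable-optimizations" ;;  # arches assumed safe
  *)                CONFIG_OPTS="" ;;                        # skip PGO elsewhere
esac
./configure $CONFIG_OPTS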
@hannosch Debian enables LTO on the following architectures: …
As far as I can tell, Debian does not enable LTO on every architecture. I am also still very hesitant to add the `--enable-optimizations` flag, given the build-time cost.
I've had the same issue with the Alpine build and Python 3.6. Alpine Edge actually has a release-candidate build of 3.6.1; the slightly modified way they build it is by using the system packages for some of the dependencies. I've opened a PR with the changes that fix it for me. This also adds the optimization flags discussed above.
Given the huge time and resource increase that `--enable-optimizations` incurs, I'm still hesitant to enable it here.
Actually, do we even know how PGO and LTO work? Would the optimizations be CPU-specific? E.g. if you build the image on an 8th-gen Intel Core i7 CPU, could it result in a Python executable that sucks on an AMD Ryzen CPU or on older generations of Intel Core CPUs? That's something to consider, given that Docker images are supposed to be run on any x86-64 machine.
@ksze that is a really excellent point we hadn't considered! ❤️ Given that pretty compelling argument, I'm going to close this. 👍
I don't think PGO is architecture-specific. At least, I can't find any reference that says it is. Popular software such as Chrome and Firefox is compiled with PGO on some platforms, and I'm pretty sure there aren't architecture-specific (e.g. Haswell/Ryzen/Skylake) binaries of those.
@JayH5 That sounds reasonable. My initial question was also partly rhetorical. I didn't mean to completely shoot down and close the issue. I was just pointing out something we should investigate. If we want a clear answer, we may need to ask the GCC/Clang people. I have also heard suggestions that PGO mostly just collects stats about branch hits and rearranges the code to be more cache-efficient, which shouldn't be particularly CPU-specific. A flag that explicitly targets the build machine's CPU would be a different matter. @tianon Could you re-open the issue?
LTO by itself is definitely not CPU-specific.
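To illustrate the distinction being drawn here (generic GCC flags, unrelated to the Python build itself):

$ # PGO records which code paths are hot; the output still targets the baseline
$ # ISA, so the binary runs on any x86-64 CPU:
$ gcc -O2 -fprofile-use -o app app.c
$ # By contrast, this emits instructions specific to the build machine and can
$ # SIGILL on older CPUs:
$ gcc -O2 -march=native -o app app.c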
That's fair, but my objections from #160 (comment) definitely still stand.
I found out why Python 2.7 was segfaulting in Alpine: the fix that was applied to Python 3 wasn't applied to Python 2 as well. The PR I have applies the same fix to Python 2.7, and now there's a green build and no segfaults! I think that regardless of whether the optimizations are accepted, the fixes for 2.7 should probably be merged. Thoughts?
Alright! I ran some tests overnight and here are the results: https://gist.github.com/blopker/9fff37e67f14143d2757f9f2172c8ea0 The JSON improvements are pretty impressive, in my opinion. I've seen applications where half the time is spent dumping JSON. I feel like I've beaten this one down enough to have an informed discussion, so I'll wait until someone wants to go over the results. Here's the last successful build: https://travis-ci.org/docker-library/python/builds/462346459 Cheers!
Any updates regarding this?
Nothing has changed since #160 (comment) (see also #357 (comment)).
Every single computer in the world using these Docker Python images is wasting 10-20% of its Python process CPU cycles as a result of these being non-optimized builds. Complaining about the build time is missing the entire point: higher latency and CPU consumption on this single project's end building things is worth ~infinitely less than all of the wasted CPU cycles and carbon consumed around the world from not making this change.

If you do not like the build time, you can change the profile training workload. Currently it defaults to effectively the entire test suite; that is overkill and runs tests that are slow, unnecessary, and useless for profile data. You'll get a similar efficiency boost from an optimized build if you specify a much smaller profiling workload at build time: when you do your make step, pass in a PROFILE_TASK that is more limited, after configuring with `--enable-optimizations`.
The fact that Python starts up and runs the test.regrtest test suite launcher is good already - that exercises the bulk of the ceval interpreter bytecode eval loop. The goal of a profile task (in this case, test suite test selection) is merely to exercise specific portions of CPython internals implemented in C, such as regular expressions, json & pickle, f-string formatting, unicode, the decimal type, math functions, etc. There is no need for that list to be perfect. If particular tests in there take a long time or are flaky, feel free to exclude them; you'll still see a large benefit overall. The examples above showing super-long builds would be reduced to builds taking probably 3-4x as long instead of 20-30x as long. Do it. Every Docker Python user in the world would see a huge benefit.

Caveats? You may find yourself needing to customize the PROFILE_TASK setting slightly for the oldest CPython builds. Differences should be minor and only need to change at most once per X.Y release.
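A sketch of what that could look like (the specific test selection below is an illustrative assumption, not an official recommendation):

$ ./configure --enable-optimizations
$ # hypothetical reduced workload: a few tests exercising C internals
$ # (regex, json, pickle, unicode) instead of the whole suite
$ make PROFILE_TASK='-m test.regrtest --pgo test_re test_json test_pickle test_unicode'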
So if this profiling data doesn't change very often, why isn't it published with the upstream release? Presumably the tests were run during the release process, and the relevant (now "official") profiles could simply be published for downstream consumption? What is the gain in having everyone regenerate the same profiles themselves by running the entire test suite?
The profiling data changes any time any C code changes and depends on the local build configuration and system (pyconfig.h) and specific compiler versions. It is a binary build artifact that can't meaningfully be published.
Case in point on Docker being the odd duck here: all noteworthy Linux distros build their distribution Python packages with these optimizations enabled.
OK, thanks for the added background. To be clear, this is something we want to do, but we are very limited on actual compute resources, which has been our largest hesitation (and most noteworthy distros don't have that same constraint). Is there an official (or even semi-official) recommendation for a sane limited value to supply to PROFILE_TASK?
There's never been anything official, otherwise we'd presumably just put that in as the default. I suggest testing the build speed using that list, and if it still takes whatever you deem to be too long, whittle it down. If you look in the build log, it'll list which tests took the longest to run during the profiling step. E.g. test_pickle seems to take a long time; it does way more work than is likely to be necessary for profiling purposes, so you could omit it if needed.
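For instance (again just a sketch), regrtest's -x option turns the named tests into an exclusion list, so a slow test like that can be dropped while keeping the rest of the workload:

$ make PROFILE_TASK='-m test.regrtest --pgo -x test_pickle'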
Nice, the additional background and assistance is definitely appreciated! ❤️ 👍 I did some testing with the list you've provided (on 3.5, 3.7, and 3.8-rc builds), and the build time on my local (fairly beefy) workstation was still on the order of minutes (between 4 and 6 here; probably a bit more on our build servers, but still entirely reasonable IMO). 🤘
FWIW, https://bugs.python.org/issue36044 seems relevant too (it discusses how the way the tests run isn't exactly indicative of real-world code, given the way they force GC, which lends further credence to using a slimmer list). 👍
For posterity, I ran some pyperformance suite benchmarks on a CPython 3.7 built without --enable-optimizations, then again with it. Benchmarks range from 1.02x to 1.28x faster in the optimized build, with the bulk of them being > 1.11x faster. :)
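For anyone wanting to reproduce that kind of comparison, a rough sketch using the pyperformance/pyperf tools (the output file names are placeholders):

$ python -m pip install pyperformance
$ pyperformance run -o baseline.json    # run with the non-optimized interpreter
$ pyperformance run -o optimized.json   # run again with the --enable-optimizations build
$ python -m pyperf compare_to baseline.json optimized.json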
Whoo! Very happy this worked out. Now we don't have to make our own images :) Thanks everyone!
Awesome! Are these changes now live in the latest published images, e.g. slim-stretch and alpine?
@ranjeetjangra this is the GitHub repo for the Docker image that contains Python. For general help with building, you'll need to look at Stack Overflow or a Python (or CentOS) specific forum or mailing list.