-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pylint 2.7 hangs on many-core Windows machines #6965
Comments
Thanks for the report @randomascii! Seems reasonable to fix and also pretty straightforward. I would like to be able to link to a |
I am assuming that no upstream Python bug was ever filed, and I believe that that person is not currently working on Chromium and therefore will not be filing one. BTW, if you have any advice on how to pass a jobs count through lint.Run() then that would be helpful. I suspect it is easy but I couldn't divine the incantation. Until then I'm stucking using my machine with just 28 logical processors :-) |
It seems easy to fix once we know the precise number we should use. Maybe 50 is close enough ? Regarding using pylint API there's this documentation: I suppose it would work with: from pylint.lint import Run
Run(argv=["--jobs 42", "myfile.py"]) |
I will run some tests to try to determine the correct number. Thanks for the pointer. It looks like the correct syntax is this:
That is, '--jobs' and '42' need to be separate arguments. argv is our existing list of files to analyze. |
So, the good news is that the syntax above lets me increase the number of jobs. This means that I can reproduce the hangs on my four-core eight-logical-processor laptop. So, yay! This let me determine that 61 jobs reliably causes hangs, whereas 60 does not. This odd limit is presumably because WaitForMultipleObjects allows a maximum of MAXIMUM_WAIT_OBJECTS in its array (MAXIMUM_WAIT_OBJECTS == 64) and there must be four other handles (besides the child processes) which multiprocessing ends up waiting on. The bad news is that the --jobs feature does not let me decrease the number of jobs. That is, if I pass '--jobs', '42' on my 96 logical-processor machine then I still get 96 jobs (verified with procexp.exe and with using pskill.exe when things get hung). I don't know why this is the case - I couldn't find any corresponding logic in run.py, so I guess I'm looking in the wrong place. I've noticed that spawning the child processes is extremely slow - about 1-2 processes per second. So, 60-way parallelism generally means that there is about a 30 s delay while things spin up, then almost no time spent doing work. This slow startup time was happening despite my CPUs being about 70% idle. This slow startup time presumably explains the bug I saw about how ineffectual multiprocessing was for pylint. You would need a huge batch of files to justify doing significant multi-processing. I don't know whether the slow startup is a bug in multiprocessing or in pylint. I just measured some of our presubmits and 60-way parallelism more than doubles the time that they take, from ~30 to ~70 s. I don't know how long the average file takes to process but presumably jobs should be set something like this:
|
This indeed seems like a bug in Either way, I think a bug report to
Hmm, that's weird. If you're able to do so it would be helpful if you could add a That's what we are passing to
Yeah, this is a big issue, which I'm not actually sure how to resolve. We would need a complete rewrite of the
I feel like it is a combination but we could certainly do better here. |
Thanks for all the help! Now that I know enough about this issue I will file a CPython bug and get the root cause fixed. I dropped in a bit of logging as you suggested and eventually realized that the -jobs weirdness is self inflicted. We are already passing --jobs to PyLint when we run it in our self-defined parallel mode. When we run it in serial mode we don't pass --jobs at all and then PyLint uses its default. So, all I was doing was changing our serial mode into parallel. Sigh... And, it is ironic that we were already passing --jobs to lint.Run, just in a totally different part of the code. So, I can work around this and I will cap the value at some reasonable number to avoid this bug and reduce startup overhead. I don't know if PyLint should add a mitigation or if it should just wait for a CPython fix. |
👍
🎉
Well, I don't think this will be classified as a security bug so it won't be backported to all the versions we support. So we should at least mitigate on older versions. Btw, since you did most of the work here you're obviously more than welcome to make the bug fix for |
Thank you for the analysis @randomascii, much appreciated. I opened #6967 for the other issue with multi-processing startup time. |
This is the change I am proposing in Chromium's depot_tools. By limiting cpu_count and _pool_size this may mitigate this python 3 bug across multiple parts of our presubmit system. Plus, I cap the number of jobs at 1 + file_count // 10 to avoid the situation where 96 child processes are created for just a few inputs (I assume that input files are not shared across processes, but I might be wrong). Here is the proposed depot_tools/Chromium CL: https://chromium-review.googlesource.com/c/chromium/tools/depot_tools/+/3711282 |
I landed my depot_tools change so Chromium no longer hangs when running presubmits, but as promised I'm investigating what's going on under the hood and how it should be changed. Whoo boy - this whole thing is a bug farm. For starters, this CPython comment is wrong: However, if the maximum was 63 then given that there are two other handles that multiprocessing waits for then 61 would be the correct maximum pool size, as this comment says: However there is a CPython bug. When I use a pool size of 61 with Python 3.8.10 I get this error: So, CPython is range-checking with the wrong value (63 instead of 64) and it is (was, actually) checking with >= instead of >. The comparison-operator bug was fixed here (issue was python/cpython#84444): Therefore, while the CPython pool-size limit on Windows should be 62, it is actually 60. Once the fix above ships then the limit will go up to 61, which is still not where it should be. The other problem, however, is that the ValueError exception gets thrown but Python still hangs. If output is not being displayed then the ValueError exception is invisible and we just have a hang. BTW, after landing my fix I noticed some new Pylint warnings in some scripts that hadn't changed in a long time. I eventually tracked the new warnings to the change in parallelism. That is, when I analyzed these files with jobs=1 I got eight extra warnings that I didn't get when analyzing them with jobs=2 or higher. More details and the change that I had to land to fix these issues are here: I have a couple of CPython bugs to file, I guess. As for Pylint, I suspect that a version of this fix that I made to wpt is appropriate: This fix wouldn't actually help Chromium (because we passed in our own jobs parameter) but it could help others. |
Yeah this is sadly a known issue. There are some messages that depend on the state of other files/messages which we don't really handle well for multiprocessing. I know there was a contributor who was planning to look into this for |
After discussion on python/cpython#94242 it appears that 61 is the correct maximum because Python requires space for three extra handles. The comments give the wrong reason for the 61 limit, but 61 is correct (assuming Python 3.9 or higher). |
Creating a multiprocessing Pool with too many processes can hit ValueError exceptions or hangs or both. The number that counts as "too many" depends on the Python version so this change uses 56 as a guaranteed safe limit. Details in #6965 Note that this only avoids the issue if an explicit jobs count is not passed in. Co-authored-by: Daniël van Noord <[email protected]>
Creating a multiprocessing Pool with too many processes can hit ValueError exceptions or hangs or both. The number that counts as "too many" depends on the Python version so this change uses 56 as a guaranteed safe limit. Details in pylint-dev#6965 Note that this only avoids the issue if an explicit jobs count is not passed in. Co-authored-by: Daniël van Noord <[email protected]>
Creating a multiprocessing Pool with too many processes can hit ValueError exceptions or hangs or both. The number that counts as "too many" depends on the Python version so this change uses 56 as a guaranteed safe limit. Details in #6965 Note that this only avoids the issue if an explicit jobs count is not passed in. Co-authored-by: Daniël van Noord <[email protected]>
Bug description
We use PyLint in Chromium and we have been switching to Python 3 and PyLint 2.7 recently. While testing on a many-core (96 logical processor) machine I found that PyLint reliably hangs. This is due to a known Python 3 bug with multi-processing on machines with more than 56 (possibly more than 63 or 64) logical processors. We first hit this bug last year (discussed in crbug.com/1190269) but only recently discovered that PyLint is vulnerable.
If we were running PyLint from the command line then we could just pass a -j value that is capped at 56 (or 63, or 64) but I have not been able to figure out how to cap the jobs value when invoking PyLint through lint.Run().
Internally PyLint should cap its parallelism until the Python 3 bug (not yet reported I believe, sorry) is fixed.
Configuration
Command used
Pylint output
Expected behavior
I expect the PyLint run to complete.
Pylint version
OS / Environment
Windows 10
Additional dependencies
No response
The text was updated successfully, but these errors were encountered: