
Avoid kernel failures with multiple processes #437

Merged
2 commits merged into jupyter:master on May 13, 2019

Conversation

@alexrudy commented May 8, 2019

This PR adds several new tests which ensure that kernels can be used when started and operated in a multiprocessing context.

Before this change, if the global ZMQ context was initialized before the process forked, the child processes could fail. This is easy to do accidentally: run a kernel once in the current process to check that it works, then farm the same function out to a pool of processes, and things will go poorly. I think this is because ZMQ contexts are not safe to use after a fork. However, switching from the global ZMQ context to a separate ZMQ context per client eliminates the failures.
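As a rough illustration of the accidental pattern described above (a minimal sketch, not code from this PR's test suite; the helper name, kernel name, and timeout values are illustrative):

```python
# Hypothetical reproduction sketch: run a kernel once in the parent, then fork.
import multiprocessing

from jupyter_client.manager import KernelManager


def run_one_cell(code):
    """Start a kernel, execute a single statement, then shut everything down."""
    km = KernelManager(kernel_name="python3")
    km.start_kernel()
    client = km.client()
    client.start_channels()
    try:
        client.wait_for_ready(timeout=30)
        client.execute(code, reply=True, timeout=30)
    finally:
        client.stop_channels()
        km.shutdown_kernel()


if __name__ == "__main__":
    # Running a kernel once in the parent can lazily create the process-global
    # ZMQ context...
    run_one_cell("x = 1")

    # ...which the forked workers then inherit in an unusable state, so the
    # same call may hang or fail inside the pool.
    with multiprocessing.Pool(2) as pool:
        pool.map(run_one_cell, ["y = 2", "z = 3"])
```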

One could argue that kernels should never be used in multiprocessing contexts, because kernels already run in a subprocess, so multiprocessing doesn't provide a better way to escape the GIL. However, I can't think of a way to detect that case and raise an error without waiting for a timeout, and the error returned there has to be fairly generic (currently RuntimeError: Kernel didn't respond in 30 seconds).

Sorry to drop this PR in without opening an issue, but the issue itself is hard enough to reproduce that a minimal reproduction was easiest in the context of the test suite. Most of the code here is adding various combinations of parallel execution to the test suite and ensuring that they work.

We discovered this problem when @MSeal and I tried to get parallel uses of papermill running during the PyCon sprints. The original papermill issue is nteract/papermill#329. The ability to run kernels in parallel without surprising results is also helpful for nbconvert (see jupyter/nbconvert#1018).

@MSeal (Contributor) commented May 12, 2019

@minrk @mpacer @Carreau We traced this issue down during the PyCon sprints as one of the reasons parallel nbconvert / papermill calls were failing. Would love your eyes on the PR and some momentum on resolving this, plus the associated nbconvert PRs, so we can enable parallel execution end-to-end. Alexander has more experience with ZeroMQ than I do, so it was great that he ended up digging into this problem. My main concern is that I don't know the trade-offs behind the original ZeroMQ setup here, and whether the changes would have subtle issues that aren't obvious.

@alexrudy Do you think we'll need to add an accessor to the active session counter inside the cython cdef or does isolating to individual clients resolve that entirely for the multiprocessing and threaded case?

> One could argue that kernels should never be used in multiprocessing contexts

I disagree there. It's a valid use case, and there are other problem constraints that can require using multiprocessing together with a jupyter client.

The test failure is with Python 3.4, which most of the Jupyter ecosystem tools (or their upstreams) have dropped support for. I was seeing the same failure pattern without any code changes in nbconvert 5.4 during development. I'd suggest we drop 3.4 support for the next release rather than block the PR on kernel timeouts with 3.4 (which is what we did in nbconvert).

Side note, I don't have permissions on this repo to even assign reviewers. Would love to have broader permissions in jupyter_client and help out here when possible.

@minrk added this to the 5.3 milestone on May 13, 2019
@minrk (Member) commented May 13, 2019

@MSeal I gave you permissions on this repo and pushed a commit dropping Python 3.4 support. Feel free to make this your first merge if all looks well here.

@alexrudy (Author) commented

@MSeal

> @alexrudy Do you think we'll need to add an accessor to the active session counter inside the cython cdef or does isolating to individual clients resolve that entirely for the multiprocessing and threaded case?

After thinking about this for a while, I didn't go the "reference counting" route. The problem with that approach is that it is possible to start kernel A (probably in a thread), then fork, then start kernel B, then interact with kernel A & kernel B – in this case, the "reference count" for the ZMQ context won't have dropped to zero when the fork happens, and the ZMQ context in the forked process will still be in an invalid state. I tried to demonstrate that in the test test_start_parallel_process_kernels.

It's not really that the context doesn't get "cleaned up", but that no open context can be passed across a process fork. IMO, the "right" solution here really is to use a separate context for each client, which does prevent ZMQ's speedy in-process communication between different clients, but otherwise makes them safe to share across processes.
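As a rough sketch of that distinction (this only illustrates the idea with plain pyzmq calls; it is not the actual diff in this PR):

```python
import zmq

# Shared global context: reused by everything in the process, so a fork that
# happens after it exists leaves child processes holding an unusable context.
shared = zmq.Context.instance()

# Per-client context: each client creates its own context in whichever process
# actually uses it. This gives up ZMQ's fast in-process transport between
# clients that previously shared a context, but it stays valid because the
# context is never carried across a fork.
per_client = zmq.Context()
```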

@MSeal (Contributor) left a comment

LGTM now. Thanks for the detailed explanation and thought @alexrudy

@MSeal merged commit 1cec386 into jupyter:master on May 13, 2019
@lumberbot-app (bot) commented May 13, 2019

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Checkout the backport branch and update it:
$ git checkout 5.x
$ git pull
  2. Cherry-pick this PR's merge commit (using its first parent) on top of the older branch:
$ git cherry-pick -m1 1cec38633c049d916f5e65d4d74129737ee9851e
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
$ git commit -am 'Backport PR #437: Avoid kernel failures with multiple processes'
  4. Push to a named branch:
$ git push YOURFORK 5.x:auto-backport-of-pr-437-on-5.x
  5. Create a PR against branch 5.x. I would have named this PR:

"Backport PR #437 on branch 5.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
