
Avoid kernel failures with multiple processes #437

Merged
2 commits merged into jupyter:master on May 13, 2019

Conversation

@alexrudy commented May 8, 2019

This PR adds several new tests which ensure that kernels can be used when started and operated in a multiprocessing context.

Before this change, if the global ZMQ context was initialized before the process forked, the child processes could fail. This is easy to do accidentally: run a kernel once in the current process to check that it works, then farm the same function out to a pool of processes, and things will go poorly. I think this is because ZMQ contexts are not safe to use after a fork. However, switching from the global ZMQ context to a separate ZMQ context per client eliminates the failures.
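As a rough illustration of the accidental pattern described above (a minimal sketch, not code from this PR's test suite; the helper name, kernel name, and timeout values are illustrative):

```python
# Hypothetical reproduction sketch: run a kernel once in the parent, then fork.
import multiprocessing

from jupyter_client.manager import KernelManager


def run_one_cell(code):
    """Start a kernel, execute a single statement, then shut everything down."""
    km = KernelManager(kernel_name="python3")
    km.start_kernel()
    client = km.client()
    client.start_channels()
    try:
        client.wait_for_ready(timeout=30)
        client.execute(code, reply=True, timeout=30)
    finally:
        client.stop_channels()
        km.shutdown_kernel()


if __name__ == "__main__":
    # Running a kernel once in the parent can lazily create the process-global
    # ZMQ context...
    run_one_cell("x = 1")

    # ...which the forked workers then inherit in an unusable state, so the
    # same call may hang or fail inside the pool.
    with multiprocessing.Pool(2) as pool:
        pool.map(run_one_cell, ["y = 2", "z = 3"])
```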

One could argue that kernels should never be used in multiprocessing contexts, because kernels already run in a subprocess, so multiprocessing doesn't provide a better way to escape the GIL. However, I can't think of a way to detect that case and raise an error without waiting for a timeout, and the error returned there has to be fairly generic (currently RuntimeError: Kernel didn't respond in 30 seconds).

Sorry to drop this PR in without opening an issue, but the issue itself is hard enough to reproduce that a minimal reproduction was easiest in the context of the test suite. Most of the code here is adding various combinations of parallel execution to the test suite and ensuring that they work.

We discovered this problem when @MSeal and I tried to get parallel uses of papermill running during the PyCon sprints. The original papermill issue is nteract/papermill#329. The ability to run kernels in parallel without surprising results is also helpful for nbconvert (see jupyter/nbconvert#1018).

@MSeal (Contributor) commented May 12, 2019

@minrk @mpacer @Carreau We traced this issue down during the PyCon sprints as one of the reasons parallel nbconvert / papermill calls were failing. Would love your eyes on the PR and some momentum on resolving this, plus the associated nbconvert PRs, so we can enable parallel execution end-to-end. Alexander has more experience with ZeroMQ than I do, so it was great that he ended up digging into this problem. My main concern is that I don't know the trade-offs behind the original ZeroMQ setup here, and whether the changes would have subtle issues that aren't obvious.

@alexrudy Do you think we'll need to add an accessor to the active session counter inside the cython cdef or does isolating to individual clients resolve that entirely for the multiprocessing and threaded case?

> One could argue that kernels should never be used in multiprocessing contexts

I disagree there. It's a valid use case, and there are other problem constraints that can require using multiprocessing together with a jupyter client.

The test failure is with Python 3.4, which most of the Jupyter ecosystem tools (or their upstreams) have dropped support for. I was seeing the same failure pattern without any code changes in nbconvert 5.4 during development. I'd suggest we drop 3.4 support for the next release rather than block the PR on kernel timeouts with 3.4 (which is what we did in nbconvert).

Side note, I don't have permissions on this repo to even assign reviewers. Would love to have broader permissions in jupyter_client and help out here when possible.

@minrk added this to the 5.3 milestone on May 13, 2019
@minrk (Member) commented May 13, 2019

@MSeal I gave you permissions on this repo and pushed a commit dropping Python 3.4 support. Feel free to make this your first merge if all looks well here.

@alexrudy (Author) commented

@MSeal

> @alexrudy Do you think we'll need to add an accessor to the active session counter inside the cython cdef or does isolating to individual clients resolve that entirely for the multiprocessing and threaded case?

After thinking about this for a while, I didn't go the "reference counting" route. The problem with that approach is that it is possible to start kernel A (probably in a thread), then fork, then start kernel B, then interact with kernel A & kernel B – in this case, the "reference count" for the ZMQ context won't have dropped to zero when the fork happens, and the ZMQ context in the forked process will still be in an invalid state. I tried to demonstrate that in the test test_start_parallel_process_kernels.

It's not really that the context doesn't get "cleaned up", but that no open context can be passed across a process fork. IMO, the "right" solution here really is to use a separate context for each client, which does prevent ZMQ's speedy in-process communication between different clients, but otherwise makes them safe to share across processes.
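As a rough sketch of that distinction (this only illustrates the idea with plain pyzmq calls; it is not the actual diff in this PR):

```python
import zmq

# Shared global context: reused by everything in the process, so a fork that
# happens after it exists leaves child processes holding an unusable context.
shared = zmq.Context.instance()

# Per-client context: each client creates its own context in whichever process
# actually uses it. This gives up ZMQ's fast in-process transport between
# clients that previously shared a context, but it stays valid because the
# context is never carried across a fork.
per_client = zmq.Context()
```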

@MSeal (Contributor) left a comment

LGTM now. Thanks for the detailed explanation and thought @alexrudy

@MSeal merged commit 1cec386 into jupyter:master on May 13, 2019
@lumberbot-app (bot) commented May 13, 2019

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Checkout the backport branch and update it:
$ git checkout 5.x
$ git pull
  2. Cherry-pick this PR's merge commit (using its first parent) on top of the older branch:
$ git cherry-pick -m1 1cec38633c049d916f5e65d4d74129737ee9851e
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
$ git commit -am 'Backport PR #437: Avoid kernel failures with multiple processes'
  4. Push to a named branch:
$ git push YOURFORK 5.x:auto-backport-of-pr-437-on-5.x
  5. Create a PR against branch 5.x. I would have named this PR:

"Backport PR #437 on branch 5.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
