-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zombie Process Fix #34
Conversation
nbclient/client.py
Outdated
try: | ||
# For AsyncKernelManager | ||
loop.run_until_complete(self.km.shutdown_kernel(now=True)) | ||
except TypeError: | ||
self.km.shutdown_kernel(now=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if self.km.shutdown_kernel
is a coroutine, instead of using try/except
?
Also, we might want to have an async version of _cleanup_kernel
, which is use in blocking and async functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change to that pattern if you prefer.
nbclient/client.py
Outdated
@@ -330,22 +380,47 @@ def start_kernel_manager(self): | |||
if self.km.ipykernel and self.ipython_hist_file: | |||
self.extra_arguments += ['--HistoryManager.hist_file={}'.format(self.ipython_hist_file)] | |||
|
|||
self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs) | |||
# Support AsyncKernelManager | |||
if inspect.iscoroutinefunction(self.km.start_kernel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also use asyncio.iscoroutinefunction
, but I guess this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on https://stackoverflow.com/questions/36076619/test-if-function-or-method-is-normal-or-asynchronous the asyncio
is probably a better choice to cover more cases. I'll change em over.
nbclient/client.py
Outdated
|
||
self.kc = self.km.client() | ||
self.kc.start_channels() | ||
try: | ||
await self.kc.wait_for_ready(timeout=self.startup_timeout) | ||
except RuntimeError: | ||
self.kc.stop_channels() | ||
self.km.shutdown_kernel() | ||
self._cleanup_kernel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we might use an async version of _cleanup_kernel
, since we are in an async function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 will do
nbclient/client.py
Outdated
try: | ||
await yield_(None) # would just yield in python >3.5 | ||
finally: | ||
self.kc.stop_channels() | ||
self.kc = None | ||
self._cleanup_kernel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we might use an async version of _cleanup_kernel
, since we are in an async function.
nbclient/exceptions.py
Outdated
""" | ||
A custom exception used to indicate that the exception is used for cell | ||
control actions (not the best model, but it's needed to cover existing | ||
behvior without major refactors). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo behvior
-> behavior
.
self.async_wait_for_reply(msg_id, cell=cell) | ||
) | ||
|
||
async def async_wait_for_reply(self, msg_id, cell=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be _async_wait_for_reply
for name consistency with _wait_for_reply
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of exposing async_wait_for_reply
as not private so it could be used in contracts for extensibility since papermill ended up needing the blocking verison to extend behavior. If I'm doing that I should probably commit and make wait_for_reply
, or revert the decision overall and keep with _async_wait_for_reply
.
Thanks for this PR @MSeal. |
We could do that (or rather have two files, asyncclient.py and client.py would be simple enough given the lack of directory structure here). Only downside is that we may end up repeating some logic that we're currently wrapping in run_until_complete. It would be a be cleaner and we could have two distinct classes that clearly separate async from blocking concerns... Might need a base class to hold common synchronous code patterns. What's you preference @davidbrochart ? Also might be a couple days before responding on this PR fyi |
I've made some refactoring along this line in #37. |
Assuming #37 merges, do you want me to address separating the class into blocking / non-blocking versions or finish providing lower level interfaces in each to blocking and non-blocking functions to be used by downstream extensions still? |
I think what we now have in #37 with blocking methods using |
I believe this is blocking the release. Do you need help @MSeal ? |
I've got time to take a pass tonight on updating. Should be able to more quickly respond / finish up the release now that some other threads are resolved. |
See how those changes look. Had to do more than I expected unfortunately so the diff is longer |
nbclient/util.py
Outdated
async def await_or_block(func, *args, **kwargs): | ||
"""Awaits the function if it's an asynchronous function. Otherwise block | ||
on execution. | ||
""" | ||
if asyncio.iscoroutinefunction(func): | ||
return await func(*args, **kwargs) | ||
else: | ||
result = func(*args, **kwargs) | ||
# Mocks mask that the function is a coroutine :/ | ||
if isinstance(result, Coroutine): | ||
return await result | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In jupyter_server, we will have something similar that we call ensure_async. We use it as a wrapper on a function call when we don't know if this function is async or not. I prefer that over passing the function object and its parameters to await_or_block
and let it make the call. I will try it and maybe make a PR on your zombieProcFix
branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with that change, thanks for helping
I haven't gotten a chance to dig into the failures -- Looks like something is trying to await a non-async call for at least one of them. |
I sent a PR on your branch: MSeal#2 |
Zombie proc fix
:doh: -- I added that fork as a watched repo so I see those changes in a timely manner going forward. |
I think we may need to drop 3.5 support if async keeps hanging with it :/. Same failures are occurring in the nbconvert port to use nbclient with python 3.5. End of life is around the corner 2020-09-13, maybe we should just pull the trigger now and drop it. @captainsafia @willingc I checked some known images / VMs that were pinned to 3.5 and those I know about have moved the defaults forward to 3.6 or 3.7. Is there internals you know of that depend heavily on 3.5? If we drop support here / in jupyter_client (where I think the issue also manifesting) it would mean papermill, nbconvert, and scrapbook would have to follow to 3.6+. |
Based on my inquiries, it seems like most people are on 3.6 already. |
I asked around a bit more to confirm the change would generally be accepted. I'm going to go forward with dropping 3.5 support so we don't mislead people into failing processes. |
@davidbrochart Good with a merge / release? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, let's release!
BTW, do you need help in the future for the releases? |
@davidbrochart That would be great. I'll get this one as PyPI permissions are more restrictive than github for jupyter project. I would really like to have more python developers doing release management across the projects so I'm not a bottleneck. I'll get that conversation going about promoting PyPI permissions to more people with the core team. |
@davidbrochart Actually if you wanted to help with this one, we could use an announcement to discourse and the jupyter mailing list. You did the lion share of the async work and it'd be great to publicize the changes in the release there. Up to you, I can also write it up |
@MSeal sure, let me try. |
This PR (probably) fixes both the dead / zombie process issues that come up when calling nbclient in nested subprocesses as well as the memory leak when running nbclient repeatidely in the same process (might fix nteract/papermill#478). These were both being caused by zmq processes not being cleaned up, which the change in moving to nbclient left around by not deleting the kernel objects after execution. The cleanup process has been made much more robust to a number of failure modes and will attempt a direct kill call on the child process if graceful shutdown is not achieved.
Also fixed a number of async issues / warnings that were cropping up. Downstream libraries were relying on some functions being synchronous, so I renamed a number of functions to
async_
and used loop calls to avoid compatibility issues for the next release.Most of these were really hard to write a meaningful unittest. I made some notebooks that recreated complex setups that caused process issues (or emitted a warning that would do so in some circumstances) and cleaned things up until warnings and hanging processes were no long around. I don't think there's an easy way to encode that into unittests unfortunately.
Needed for moving forward with #31
Fixes runtime issues, but not all test warnings for #33