[BugFix] [Resource Leak] Gracefully Close ZMQ Context upon kernel shutdown #548

jalpan-randeri · 2020-05-24T08:21:11Z

As part of kernel manager instance creation, we create new context of ZMQ.
This kernel manager are responsible for managing lifecycle of kernel process.
At the end of kernel process lifecycle we are just simply killing kernel process and not closing ZMQ context. This leaked ZMQ contexts holds sockets and end up exhausting system resources

This PR fixes this by gracefully shutting down ZMQ context during kernel shutdown process

Tests

Enhanced existing tests to track zmq context checks

Manual Tests Steps:

start jupyter server
run kernel_lifecycle.sh script 3 times
record socket usage

kernel_lifecycle.sh

kernel_id=$(curl --silent --show-error  -X POST 'localhost:8888/api/kernels' | jq .id | tr -d '"')
echo $kernel_id
read -n 2 -p 'Stop?'
echo 'stopping '
curl -X DELETE "localhost:8888/api/kernels/$kernel_id"

Track socket usage using following cmd

macos:   lsof -p <pid of jupyter> | grep unix | wc -l
linux:   lsof -p <pid of jupyter> | grep event | wc -l

Local Macbook Test Runs:

After 3 executions of kernel_lifecycle script, socket usage

without fix

$ lsof -p 17207 | grep unix | wc -l 
      21

with fix

$ lsof -p 16076 | grep unix | wc -l 
       3

During kernel startup we create new context of ZMQ. However during kernel shutdown we are not shuting down ZMQ context This leads to leaks of ZMQ reaper sockets.

jalpan-randeri · 2020-05-24T08:24:07Z

#jupyter-server/jupyter_server#234

SylvainCorlay · 2020-05-24T08:26:15Z

Ping @JohanMabille.

blink1073 · 2020-05-24T13:52:15Z

Well done, @jalpan-randeri! I'll buy you a beverage of choice if we meet in person.

kevin-bates · 2020-05-24T15:18:49Z

Outstanding @jalpan-randeri - thank you! I ditto @blink1073's offer! This appears to resolve the leaks we're seeing in EG: jupyter-server/enterprise_gateway#762 (comment)

jupyter_client/manager.py

kevin-bates · 2020-05-26T02:30:05Z

The recent commit makes sense for where we want to be and restart=False is common, but I think we need a b/c change for existing users of KerneManager subclasses that need this awesome fix. Any chance we could add the original commit (or compatible) into a 6.1.4 release? Then, I guess we'd need 6.2.0 for the parameter change.

jalpan-randeri · 2020-05-26T08:33:13Z

@kevin-bates Sure, will you guide me on how to achieve this?
I found a tag for 6.1.4, but i am not sure how to cherry-pick original commit to this release.
Alternatively, I was thinking reusing connection_file parameter to cleanup ZMQ context, but then it made logic convoluted

minrk

Great job tracking down the issue! I think the ideal fix is to share the global Context rather than requiring separate Contexts per kernel manager, which this PR does. I thought we were already doing that, but apparently not.

The standard pattern for contexts in zmq applications is to use a single process-global Context object (facilitated with zmq.Context.instance() as the default provider if passing around is unwieldy, but passing around is more explicit).

The fact that we are creating Contexts for each KernelManager is actually part of the problem - creating several too many threads and sockets for internal zmq operations per kernel, rather than using the same Context throughout the application except where explicitly required.

jupyter_client/manager.py

minrk · 2020-05-26T09:35:07Z

I think this issue might be resolved by a smaller change, switching the default context:

    def _context_default(self):
        return zmq.Context()

to default to the shared global context (more idiomatic use of zmq):

    def _context_default(self):
        return zmq.Context.instance()

(this default generator occurs in a few places)

If this change still leaks FDs, that means there are sockets being left open that should be closed, not just Contexts. These should be identified and closed explicitly in stop_channels. destroy is a way to ensure there are no sockets still open, even if we lost track of them, but in the long run it's probably better to not lose track of those sockets in the first place.

kevin-bates · 2020-05-26T13:36:33Z

Thanks @minrk.

I think the ideal fix is to share the global Context rather than requiring separate Contexts per kernel manager, which this PR does. I thought we were already doing that, but apparently not.

You're correct. You had added a global context 9 years ago, that was switched just over a year ago. The change to a local context first took place in #437 to address multiprocessing aspects of kernels. It seems that reverting to a global context would jeopardize those efforts (there were 3 additional PRs related to this, although those appear to be mostly version-management related, back-ports, etc)?

FWIW, I've reverted my EG env back to using a global ZMQ context and do not see the leaks occurring (implying we haven't lost track of the sockets). 👍

MSeal · 2020-05-26T17:43:02Z

@minrk ! Glad to see you on more threads again :)

Yeah as @kevin-bates pointed out we definitely don't want to go back to globals as they break all concurrency in higher abstractions, causing deadlocks against the global state. So it looks like we need to dig into zmq context cleanup more to figure out why we're not cleaning the FDs up under ideal exit conditions?

MSeal · 2020-05-26T17:45:11Z

Or did I misunderstand the latest comments? Is it that we still see FD leaks with this PR or that this PR resolves them similarly to switching back to a global context?

kevin-bates · 2020-05-26T18:00:12Z

We no longer see the leaks with this PR or if we were to switch back to the global context as Min suggested.

My primary concern is compatibility with existing installations. After thinking further I think we need at least two releases - a backport to 5.3.x and a current release. I'd prefer we have a compatible fix in 6.1 as well, then have the incompatible fix in, I guess, 6.2?

jalpan-randeri · 2020-05-26T18:03:50Z

Looking at system level, without this fix we are leaking only ZMQbg/Reaper sockets, these seems to be coming from ZMQ context only. We create 5 sockets during kernel startup and we shutdown these 5 sockets at shutdown, Left behinds are only ZMQbg/Reaperone, shutting down zmq.context cleans them up.

The ask was to make ZMQ.context global, which will lead to race conditions, we have exclusive 1:1 zmq context per kernel manager.

This PR fixes it by destroying zmq context at kernel shutdown. I tested with running my script concurrently and did not observe any failures

I am just stuck at making change backward compatible.

MSeal · 2020-05-26T18:50:17Z

I don't think we need to backport to 5.x unless we really want to (and shouldn't block this PR if we do) since Python 2 is fully unsupported now. The 6.x release has been fairly stable relative with only some carryover of 5.x issues that weren't resolved like this one. Do you have some code @kevin-bates that still relies on Python 2 for EG that would need this fix?

What's the incompatible fix vs compatible fix referring to?

I think this would be a 6.2 release since there's some minor contract changes and a cleanup change.

Are we good with a merge for now and follow up with release strategy?

kevin-bates · 2020-05-26T19:53:21Z

I agree with moving forward and merging now, but I believe we need at least one backport.

Do you have some code @kevin-bates that still relies on Python 2 for EG that would need this fix?

We have users that are running EG relative to 5.3.4. I don't know if they're running Python 2 but when taking master to check this change out, they ran into another 6.x change (port caching) that breaks them and they shouldn't have to take a new EG release to have their leaks fixed. Enterprise users are slow-moving, so would really prefer an option for them via 5.3.5.

What's the incompatible fix vs compatible fix referring to?

I may not be understanding python bindings correctly, but when moved into cleanup() it replaces the positional parameter and flips its default value from True to False. This is incompatible with subclasses since they will be calling the superclass relative to the old parameter value (that defaults to True) - not to mention I think there would be a positional argument mismatch. As a result, applications that subclass KernelManager and override cleanup() will break.

I believe the "compatible" change needs to either be the original commit (that lies outside cleanup()) or placed in cleanup(), but with the current parameter (and its default) retained.

There may be other applications besides EG affected by this and those applications shouldn't require their own patch releases to resolve their leaks, IMO.

FWIW, I just pulled the candidate fix w/o changing EG and this manifests itself as a TypeError due to the positional argument change.

[E 200526 12:47:55 web:1670] Uncaught exception DELETE /api/kernels/99367f67-c6f5-4029-8bbf-c9ad1d82afa1 (9.160.104.100)
    HTTPServerRequest(protocol='http', host='yarn-eg-node-1.fyre.ibm.com:8888', method='DELETE', uri='/api/kernels/99367f67-c6f5-4029-8bbf-c9ad1d82afa1', version='HTTP/1.1', remote_ip='9.160.104.100')
    Traceback (most recent call last):
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/web.py", line 1592, in _execute
        result = yield result
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
        value = future.result()
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
        yielded = next(result)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 66, in delete
        yield maybe_future(km.shutdown_kernel(kernel_id))
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 302, in shutdown_kernel
        return super(MappingKernelManager, self).shutdown_kernel(kernel_id, now=now)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 183, in shutdown_kernel
        km.shutdown_kernel(now=now, restart=restart)
      File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/manager.py", line 381, in shutdown_kernel
        self.cleanup(restart=restart)
    TypeError: cleanup() got an unexpected keyword argument 'restart'

MSeal · 2020-05-26T20:01:15Z

Ahh I see.

/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 183, in shutdown_kernel
km.shutdown_kernel(now=now, restart=restart)

That just looks like the shutdown_kernel call was missed in the PR's changes: https://github.com/jupyter/jupyter_client/blob/master/jupyter_client/multikernelmanager.py#L183

We can try making the interface more backwards compatible before releasing. Should we just accept connection_file as an argument mapped to not restart with a deprecation warning printed that this will be removed in... 7.0? Should make it completely compatible and clear to users what to change to in the future.

MSeal · 2020-05-26T20:07:56Z

Enterprise users are slow-moving, so would really prefer an option for them via 5.3.5.

Alright, if you think this change is important for those users we can make a back-port for this after we get a 6.x release.

kevin-bates · 2020-05-26T20:59:55Z

Thank you Matthew - this change is critical for EG users since EG represents a long-running server who's sole purpose in life is to create and shutdown kernels and leaks like these add up quickly.

Regarding the change itself, shutdown_kernel() is fine. It's all about the signature change on cleanup() that's the compatibility issue. I agree that leaving connection_file=True in place with a deprecation warning is the best way forward. I'm sorry that's the case but it's the correct thing to do.

MSeal · 2020-05-26T21:09:49Z

@jalpan-randeri Would you mind adding the backwards comparability for connection_file with a warning to the PR? Or would you prefer we tackle that after merge?

minrk · 2020-06-08T10:21:13Z

This diff applies the following changes:

destroy context in __del__ instead of cleanup_resources (having a cleanup method makes sense to me, but doing this action in shutdown seems less clear)
share context across children by default in MultiKernelManager

and has no warnings with lab or classic in my tests (because contexts are never terminated)

What do folks think?

MSeal · 2020-06-08T23:42:19Z

destroy context in del instead of cleanup_resources

This won't be guaranteed to be called until after all references are cleaned, and even then it's not guaranteed to run if the process is exiting. So if someone has a local kc = km.client snippet in their code the cleanup behavior would stop executing. Typically I've seen with IO cleanup, you explicitly call it when you're done with that IO and make the del method call as well just in case for this reason.

share context across children by default in MultiKernelManager

That seems like a good idea to add.

minrk · 2020-06-12T13:43:11Z

What's the best way forward? It does make sense to me to have explicit cleanup, not relying on del, but shutdown does not seem like the right place to close a context (it's quite reasonable to close sockets after shutdown, and we see that jupyterlab does exactly that, hence the errors reported above). We have already that 'explicit cleanup' is manager.context.term(), but a wrapper method that closes everything we might have opened is a good idea, too. Should we stop calling cleanup_resources() in shutdown and make it an explicit call for consumers of the KernelManager API?

So maybe the right thing is to keep this as-is—manager terminates the context on shutdown when the context is not shared, and then ensure for contexts like the notebook server that the context is shared (cherry-pick 4cdbf54 onto this PR)? We'll get implicit cleanup on shutdown in single KernelManager use cases, one context by default in MultiKernel cases (notebook servers, most notably).

MSeal · 2020-06-14T22:55:55Z

So maybe the right thing is to keep this as-is—manager terminates the context on shutdown when the context is not shared, and then ensure for contexts like the notebook server that the context is shared (cherry-pick 4cdbf54 onto this PR)? We'll get implicit cleanup on shutdown in single KernelManager use cases, one context by default in MultiKernel cases (notebook servers, most notably).

I think that could be a working plan. I think we want to emphasize that client cleanup and if they don't we do a best effort to cleanup for them. That should cover current behavior for lab and classic unless I am mistaken, and we can ask that lab move to a more deliberate cleanup mode over time? Unless I missed anything?

MSeal · 2020-06-14T22:56:30Z

Should one of us apply the commit to this PR and get it so folks can test it out to confirm behavior before merge?

jalpan-randeri · 2020-06-15T05:38:19Z

just to confirm, we want to provide cleanup_resources as explicit call and not do it as part of shutdown method. something like this commit 4cdbf54

minrk · 2020-06-15T11:07:15Z

@jalpan-randeri yes, please cherry-pick that commit onto your PR. Then I think we can proceed with testing.

@MSeal to be clear, I believe lab is doing nothing wrong - explicitly closing sockets after shutdown in general should be fine, so I don't think we need to ask them to change. If there is a change to make forcing the closure of sockets before shutdown, that would belong in the notebook server, collecting and closing all WebSocketHandlers associated with a kernel as part of shutdown.

jupyter@4cdbf54

jalpan-randeri · 2020-06-17T05:07:37Z

cherry-picking commit gave me following error

Not sure if this is expected behavior

AttributeError: 'super' object has no attribute '__del__'
Exception ignored in: <function MultiKernelManager.__del__ at 0x10a0d9290>
Traceback (most recent call last):
  File "/jupyter_client/jupyter_client/multikernelmanager.py", line 132, in __del__
    super().__del__()
AttributeError: 'super' object has no attribute '__del__'

jupyter_client/multikernelmanager.py

kevin-bates

I just have the one comment regarding the warning message.

I tried this out using EnterpriseGateway and it works great - other than not seeing any warnings. I do not see leaks happening nor do I see issues stemming from JupyterLab closing the WebSocket after shutdown. Two thumbs up! 👍 👍

kevin-bates · 2020-06-18T22:41:32Z

jupyter_client/manager.py

+    def cleanup(self, connection_file=True):
+        """Clean up resources when the kernel is shut down"""
+        warnings.warn("Method cleanup(connection_file=True) is deprecated, use cleanup_resources(restart=False).",
+                      DeprecationWarning)


As noted in this previous comment, I'm not seeing any warning messages produced in the server console/logs. I think we should adjust the class such that it produces at least one message. (Using FutureWarning does just that.)

MSeal

I'm good with a merge now. Will let kevin or min do the merge here though.

kevin-bates

Looks good. Let's let @minrk confirm.

minrk · 2020-06-22T06:56:58Z

Looks great to me, thanks everyone!

minrk · 2020-06-22T07:03:03Z

@kevin-bates if you want to do a backport in order to make sure more versions of the notebook server get it, backporting just the shared context for MultiKernelManager should be the smallest way to achieve that.

kevin-bates · 2020-06-22T13:54:37Z

Thanks @minrk. I'm kinda new to back-porting, but it sounds like I'd cherry-pick 2d5ba4b and 1ce1d97 to a 5.x branch. Since no branch for back-ports exists, would we just create 5.x from the 5.3.4 tag? or should that branch be named 5.3.x?

If someone could set up the target branch, I'd be happy to back-port the appropriate commits and help get a 5.3.5 out.

minrk · 2020-06-25T07:17:27Z

5.x branch is now at 5.3.4, if you want to make the PR backporting those two commits. I don't think we need A.B.x branches except where we are backporting past the latest minor revision for a given major revision (e.g. 5.2.x).

MSeal · 2020-06-25T17:24:23Z

I can help kick off a releases for these things. Shall I do the 6.1.4 release now?

MSeal · 2020-06-30T00:14:03Z

6.1.5 and 5.3.5 are both released with this change included. I messed up the 6.1.4 upload (had the wrong commits) and deleted that release right away.

kevin-bates · 2020-06-30T00:18:44Z

Thank you Matt!

jalpan-randeri · 2020-06-30T07:24:10Z

wow!. thank you Kevin, Matt, Min RK 😃

[BugFix] Close ZMQ Context upon kernel shutdown

3e1436f

During kernel startup we create new context of ZMQ. However during kernel shutdown we are not shuting down ZMQ context This leads to leaks of ZMQ reaper sockets.

jalpan-randeri marked this pull request as ready for review May 24, 2020 08:21

kevin-bates mentioned this pull request May 24, 2020

Query for Spark application state failed jupyter-server/enterprise_gateway#762

Closed

davidbrochart reviewed May 25, 2020

View reviewed changes

jupyter_client/manager.py Outdated Show resolved Hide resolved

jupyter_client/manager.py Outdated Show resolved Hide resolved

Jalpan Randeri added 2 commits May 25, 2020 14:15

cr feedback move to cleanup method

de91042

changes to cleanup method in multikernel manager

02f3c50

jalpan-randeri requested review from kevin-bates, davidbrochart and MSeal May 25, 2020 21:39

jalpan-randeri mentioned this pull request May 25, 2020

[Bugfix] Update cleanup method signature jupyter-server/enterprise_gateway#815

Merged

minrk reviewed May 26, 2020

View reviewed changes

jupyter_client/manager.py Outdated Show resolved Hide resolved

MSeal mentioned this pull request May 26, 2020

Errors after upgrading to 0.3.0 jupyter/nbclient#58

Open

cherry-pick from @minrk work on sharing context by default

2d5ba4b

jupyter@4cdbf54

minrk reviewed Jun 17, 2020

View reviewed changes

jupyter_client/multikernelmanager.py Outdated Show resolved Hide resolved

handle case when del is not availiable in parent

1ce1d97

kevin-bates reviewed Jun 18, 2020

View reviewed changes

change warning class to FutureWarning

9c0ef5d

MSeal approved these changes Jun 19, 2020

View reviewed changes

jalpan-randeri requested a review from kevin-bates June 19, 2020 20:55

kevin-bates approved these changes Jun 19, 2020

View reviewed changes

jalpan-randeri requested a review from minrk June 20, 2020 00:20

minrk approved these changes Jun 22, 2020

View reviewed changes

minrk merged commit b5dfd16 into jupyter:master Jun 22, 2020

kevin-bates mentioned this pull request Jun 25, 2020

Backport zmq leaks fix for MultiKernelManager #555

Merged

kevin-bates mentioned this pull request Jul 6, 2020

KernelManager.shutdown_kernel raises FutureWarning #557

Closed

kevin-bates mentioned this pull request Aug 20, 2020

Jupyter Notebok server process stays up after "shut down" due to shutdown_no_activity_timeout jupyter/notebook#5386

Closed

[BugFix] [Resource Leak] Gracefully Close ZMQ Context upon kernel shutdown #548

[BugFix] [Resource Leak] Gracefully Close ZMQ Context upon kernel shutdown #548

Conversation

jalpan-randeri commented May 24, 2020 • edited Loading

Tests

Manual Tests Steps:

kernel_lifecycle.sh

Track socket usage using following cmd

Local Macbook Test Runs:

jalpan-randeri commented May 24, 2020

SylvainCorlay commented May 24, 2020

blink1073 commented May 24, 2020

kevin-bates commented May 24, 2020

kevin-bates commented May 26, 2020

jalpan-randeri commented May 26, 2020

minrk left a comment

Choose a reason for hiding this comment

minrk commented May 26, 2020

kevin-bates commented May 26, 2020

MSeal commented May 26, 2020

MSeal commented May 26, 2020

kevin-bates commented May 26, 2020

jalpan-randeri commented May 26, 2020

MSeal commented May 26, 2020

kevin-bates commented May 26, 2020

MSeal commented May 26, 2020

MSeal commented May 26, 2020

kevin-bates commented May 26, 2020

MSeal commented May 26, 2020

minrk commented Jun 8, 2020 • edited Loading

MSeal commented Jun 8, 2020

minrk commented Jun 12, 2020

MSeal commented Jun 14, 2020

MSeal commented Jun 14, 2020

jalpan-randeri commented Jun 15, 2020

minrk commented Jun 15, 2020

jalpan-randeri commented Jun 17, 2020

kevin-bates left a comment

Choose a reason for hiding this comment

kevin-bates Jun 18, 2020

Choose a reason for hiding this comment

jalpan-randeri Jun 19, 2020

Choose a reason for hiding this comment

MSeal left a comment

Choose a reason for hiding this comment

kevin-bates left a comment

Choose a reason for hiding this comment

minrk commented Jun 22, 2020

minrk commented Jun 22, 2020

kevin-bates commented Jun 22, 2020

minrk commented Jun 25, 2020

MSeal commented Jun 25, 2020

MSeal commented Jun 30, 2020

kevin-bates commented Jun 30, 2020

jalpan-randeri commented Jun 30, 2020

jalpan-randeri commented May 24, 2020 •

edited

Loading

minrk commented Jun 8, 2020 •

edited

Loading