Hang after too many gl requests in Docker #39
Comments
Note that this blocks using the new imageservers in prod, in On-Prem, and also blocks plotly/plotly.js#1972 |
I plugged a bunch of memory leaks in |
Self-assigned it on info from @scjody that @jackparmer suggested I continue with the IOW running it through all public plots is an incredibly effective way for ferreting out all memory leaks that can happen on a single render pass (it doesn't solve leaks from interactions though). |
@monfera The best way I've found is to run |
@monfera I'd like to do that as a workaround for #41, but I don't think it's feasible here since the server hangs so quickly. We observe hangs after between 30 and 40 gl plots, so to be safe we'd need to restart every 10 plots or so, and that would reduce server performance to an unacceptable level. I also don't think this particular issue is caused by a memory leak. Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise). Some other kind of resource leak is certainly possible! |
thanks @scjody for the added notes! | Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise) Reading #41 I got the impression that, even without the |
#41 is a problem, but I'm reasonably sure restarting every 1000 plots is an adequate workaround for that issue. |
@monfera @etpinard I've added some debugging stuff to the Docker container in #44.
Usage:
In my case the image exporter window where the work was occurring was in fluxbox tab 2. Some errors were printed during normal operation (at this point images were being generated successfully): When the server hung, a different message was printed: I have not had the chance to investigate the significance of either of these things. |
It might be worth trying to listen to https://electronjs.org/docs/api/app#event-gpu-process-crashed in the app code. I'll give this a shot this afternoon. |
I set up an Ubuntu 16.04 VM (since that's what @etpinard uses as a desktop and I can't easily get gl renders working with XQuartz on OS X), opened up the X server to TCP connections, and ran With So the issue is unlikely to be caused by |
@scjody just jumping in to see if I can help with this stuff but may just state the obvious (perhaps incorrectly assuming that your development machine uses the graphics drivers of that machine for WebGL content rendering.) | What's different about the app (or Electron) when it's built and run inside Docker vs. outside? One difference is that when Electron is used on a desktop, the WebGL API calls go through whatever graphics chip driver the desktop has, eg. from Nvidia, AMD or intel integrated graphics. In a Docker container or on CI systems in general, it's running in headless mode, still expecting a display driver, in this case, Many of our WebGL plots run on Various drivers, by extension, Something else: apparently we're running tests with |
@monfera If you can think of experiments to try (related to I realize My next experiment is to put an externally-built (and non-hanging) |
Oh, I should mention: I tried running the gl image tests in Docker without the ignore-gpu-blacklist flag yesterday. Without that flag, the gl images fail to generate. |
@scjody I started making a WebGL resetting tool for at least testing, then felt like maybe there's one already :-) https://github.com/stackgl/gl-reset I'm about to make an ubuntu VM like you to test with it. It'd help if we had a minimal case that triggers the error. What would be a minimal run? You mention it fails at |
@monfera Note that we've only been able to reproduce the hang when it's running in a Docker container, not directly on an Ubuntu VM. I don't have any more information than what's in #39 (comment) regarding reproducing this issue. Trying |
If I copy a working My next step is to create an Ubuntu VM that's as close as possible to the Docker container (in terms of packages installed and versions) and see if I can reproduce the issue there. Incidentally I ran |
@scjody based on the nature of the WebGL resource management, I also suspect that no single mock will reproduce the issue, but a (perhaps N times repeated) transition between two mocks might. There may be many such pairs; the easiest is to use the mock where it breaks as one of the mocks, and a preceding mock as the other mock, and alternate between the two. If we're lucky, it's the directly preceding mock. Eg. If that doesn't reproduce the issue, bisecting the preceding ones may help, eg. if |
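The alternate-then-bisect idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `hangs` is a hypothetical callback that renders a sequence of mocks against a freshly started container and reports whether the server hung, the function names are made up for this sketch and are not part of image-exporter, and the bisection assumes that if a short run of preceding mocks triggers the hang, any longer run does too.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical callback: render the mocks in order on a fresh container
# and return True if the image server hung.
HangCheck = Callable[[List[str]], bool]


def alternating_sequence(failing: str, other: str, repeats: int) -> List[str]:
    """Build other, failing, other, failing, ... with `repeats` pairs."""
    seq: List[str] = []
    for _ in range(repeats):
        seq.extend([other, failing])
    return seq


def find_partner_mock(preceding: Sequence[str], failing: str,
                      hangs: HangCheck, repeats: int = 20) -> Optional[str]:
    """Try each preceding mock, nearest first, alternated with the failing one."""
    for other in reversed(preceding):
        if hangs(alternating_sequence(failing, other, repeats)):
            return other
    return None


def shortest_hanging_suffix(preceding: Sequence[str], failing: str,
                            hangs: HangCheck) -> Optional[List[str]]:
    """Bisect for the shortest run of preceding mocks that still hangs.

    Assumes monotonicity: if the last k mocks plus `failing` hang,
    so do the last k + 1.
    """
    n = len(preceding)
    if not hangs(list(preceding) + [failing]):
        return None
    lo, hi = 0, n  # invariant: the suffix of length `hi` hangs
    while lo < hi:
        mid = (lo + hi) // 2
        if hangs(list(preceding[n - mid:]) + [failing]):
            hi = mid
        else:
            lo = mid + 1
    return list(preceding[n - hi:])
```

Starting with the directly preceding mock keeps the number of container restarts low in the lucky case; the bisection is the fallback when no single pair reproduces the hang.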
(btw. working on the setup w/ your previous help but may not be around for too long as it's getting late here so you have a better chance to try with the directly preceding one) |
Maybe we should try pinging Mikola or one of his friends (e.g. Hugh Kennedy) about this topic? Perhaps they came across the same problems before. |
Using an Ubuntu 16.04 VM with as close a copy of the Docker image as I could create (for obvious reasons directories like This suggests that the issue is extremely specific to running under Docker. There really shouldn't be any significant differences there (unless Electron is somehow accessing the graphics hardware directly rather than through X). Unfortunately creating the VM was fairly time consuming and I still managed to delete some important things (VMware tools for one thing) without which the VM can't be used. So I'm going to abandon this line of investigation unless someone else sees a lot of value in it. Right now I'm going to reach out to Replicated to see if they have any ideas. |
Also I'm definitely in favour of reaching out to folks like Mikola who may be able to help with this. |
Note that the deadline for solving this issue in time for On-Prem 2.3.0 is 5 PM on Friday. Any later and we won't be able to do enough testing for a 2017 release. |
From Replicated, here are some profiles for running apps like Chrome and Slack (which uses Electron) under Docker. I'm going to work through these and see if any make a difference. |
Running with a huge number of Docker options from that list allows a full
I'm now going to try Note that this may not provide a complete solution to the problem because not all these options are available in Replicated and Kubernetes. |
|
I confirm that running the container with the docker options In the next few hours, I'll try to dissect which of the option(s) above make this work. |
Got it: Everything except So mapping in the shared memory device works around the issue. I'll check if this is possible in On-Prem and GKE but it seems unlikely especially with GKE 😿. It might be just that this is a good clue towards what's going on with Electron or Chrome... at first glance this really feels like a Chrome bug but it's too early to say. |
Just to confirm, it works with This is the minimal set of arguments:
|
This will cause problems on CircleCI as CircleCI 2.0 doesn't allow mounting volumes, see plotly/plotly.js#1798 (comment) for more info. |
There's nothing special about the host On the Chrome side, the underlying issue has just been fixed in Chromium but there's no Chrome release yet, and it's certainly not in Electron... https://bugs.chromium.org/p/chromium/issues/detail?id=736452 Also Chrome may need up to 512 MiB so I suggest that as a My next steps are to see if we can increase |
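To check whether a given container actually has enough shared memory, something like the following could be run inside it. This is a minimal sketch, not part of image-exporter: the function names are hypothetical, the 512 MiB threshold is simply the figure suggested in the comment above, and it assumes the shared memory device is mounted at /dev/shm on a Unix-like system.

```python
import os

MIB = 1024 * 1024


def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Return the total size of the filesystem mounted at `path`, in bytes."""
    st = os.statvfs(path)  # Unix-only stdlib call
    return st.f_frsize * st.f_blocks


def shm_is_large_enough(path: str = "/dev/shm",
                        required_bytes: int = 512 * MIB) -> bool:
    """True if the filesystem at `path` is at least `required_bytes` in size."""
    return shm_total_bytes(path) >= required_bytes


if __name__ == "__main__":
    total = shm_total_bytes()
    print(f"/dev/shm: {total / MIB:.0f} MiB total, "
          f"large enough: {shm_is_large_enough()}")
```

Running this once at container start-up would make an undersized shared memory device visible before the first gl request, instead of after a hang.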
In GKE, you can't use the
With this, I am able to run Replicated is next... |
With Replicated Mounting the host's shm device into the imageserver container works, but the path that needs to be used is different on CentOS/RHEL vs. Ubuntu/Debian: I'll start a discussion about our options once I have more information from Replicated. |
The Replicated workaround is no longer relevant since we're not shipping this in On-Prem 2.3.0 😿. They will definitely implement the |
Looking at the puppeteer docs, I came across puppeteer/puppeteer#1603 - so yeah, we're not the only ones having issues with Docker + Chromium 🙃 |
Yeah, I found at least a dozen cases throughout my travels yesterday. |
I've been watching this issue, looks like it got closed / fixed today |
The thing we're waiting for is for this to get into Electron (or whatever we use if we switch away from that). According to https://bugs.chromium.org/p/chromium/issues/detail?id=736452#c61 it will be in Chrome 65.0.3299.6 but even the latest beta Electron is still on 59.0.3071.115 😿 Anyway, we do have a workaround for this and it will work in all the environments we care about:
|
What does this mean (if anything) for a Python distribution of this app? E.g. if we were to try to distribute this app through pip with a Python interface for running it? (Essentially a solution for this issue: plotly/plotly.py#880) |
It doesn't mean anything unless we expect Python users to use a Docker container (which I don't think is a good idea). |
From the tests documented starting at https://github.com/plotly/streambed/issues/9865#issuecomment-349995119:
When image-exporter is run as an imageserver in Docker, after a small number of gl requests (30-40), the image-exporter hangs completely. The request in progress times out, and the server won't accept any more connections and must be restarted.
Two examples:
- When test/image-make_baseline.js is used to run gl*, it hangs at gl3d_chrisp-nan-1.json.
- When test/image-make_baseline.js is used to run gl3d*, it makes it past gl3d_chrisp-nan-1.json and hangs at gl3d_snowden_altered.json.

This means that the issue is unlikely to be specific to any one plot; rather, some resource becomes exhausted or something builds up to the point where image generation can't proceed.
@etpinard @monfera FYI