Hang after too many gl requests in Docker #39

Closed
scjody opened this issue Dec 8, 2017 · 39 comments

@scjody
Contributor

scjody commented Dec 8, 2017

From the tests documented starting at https://github.com/plotly/streambed/issues/9865#issuecomment-349995119:

When image-exporter is run as an imageserver in Docker, after a small number of gl requests (30-40), the image-exporter hangs completely. The request in progress times out, and the server won't accept any more connections and must be restarted.

Two examples:

  • If test/image-make_baseline.js is used to run gl*, it hangs at gl3d_chrisp-nan-1.json.
  • If test/image-make_baseline.js is used to run gl3d*, it makes it past gl3d_chrisp-nan-1.json and hangs at gl3d_snowden_altered.json.

This means that the issue is unlikely to be specific to any one plot, but rather some resource becomes exhausted or something builds up to the point where image generation can't proceed.

@etpinard @monfera FYI

@scjody
Contributor Author

scjody commented Dec 8, 2017

Note that this blocks using the new imageservers in prod, in On-Prem, and also blocks plotly/plotly.js#1972

@monfera
Contributor

monfera commented Dec 8, 2017

I plugged a bunch of memory leaks in gl plots about a year ago, and we got information that those fixes worked but that there were other leaks too, so I'm not too surprised. As there are many types of gl plots, I wonder if periodically restarting the server would be a feasible short-term option, as hacky as it sounds. Also, a new plot type with unintended leaks may always come along in the future, so it would be a good preventive step. Maybe the real solution would be a plotly.js test bench, separate from the rest due to its resource needs, that would run through a large number of plot API calls so that monotonic memory growth could be identified.

@monfera monfera self-assigned this Dec 8, 2017
@monfera
Contributor

monfera commented Dec 8, 2017

Self-assigned it on info from @scjody that @jackparmer suggested I continue with the gl plot leak detection & fix work. Jody, when work starts on this (I understand after the couple of issues assigned to me are fixed), I'll need a bag of representative plots that I can run the server through, to reproduce the issue locally, so that we don't miss something.

IOW, running it through all public plots is an incredibly effective way of ferreting out all memory leaks that can happen on a single render pass (it doesn't catch leaks from interactions though).

@scjody
Contributor Author

scjody commented Dec 8, 2017

@monfera The best way I've found is to run test/image/mocks/gl* or test/image/mocks/gl3d* (from plotly.js). Both sets of mocks will reproduce the issue, but at different places.

@scjody
Contributor Author

scjody commented Dec 8, 2017

I wonder if periodically restarting the server would be a feasible short-term option

@monfera I'd like to do that as a workaround for #41, but I don't think it's feasible here since the server hangs so quickly. We observe hangs after between 30 and 40 gl plots, so to be safe we'd need to restart every 10 plots or so, and that would reduce server performance to an unacceptable level.

I also don't think this particular issue is caused by a memory leak. Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise). Some other kind of resource leak is certainly possible!

@monfera
Contributor

monfera commented Dec 8, 2017

thanks @scjody for the added notes!

| Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise)

Reading #41, I got the impression that, even without the gl plots, memory consumption grows steadily throughout the first 1.5 hrs you added, with the text above referring to 351 mocks going from 0.8 GB to 1.6 GB. I get your point that the gl and non-gl cases may be separate issues, but I thought the non-gl ones represent a problem for us too (the raison d'être of #41).

@scjody
Contributor Author

scjody commented Dec 8, 2017

#41 is a problem, but I'm reasonably sure restarting every 1000 plots is an adequate workaround for that issue.

@scjody scjody mentioned this issue Dec 12, 2017
@scjody
Contributor Author

scjody commented Dec 12, 2017

@monfera @etpinard I've added some debugging stuff to the Docker container in #44.

  • Xvfb runs with X errors sent to STDOUT (instead of being ignored), and auditing is turned on. This means it prints a message for every X connection and disconnection, which shows if Electron is connecting and disconnecting (spoiler alert: it isn't).
  • Xvfb's screen dimensions have been doubled.
  • I added a VNC server, window manager, and a wrapper script.

Usage:

  • Build the image (docker build -f deployment/Dockerfile -t isdebug .), or grab it from quay.
  • Run the image as container isdebug and expose the VNC and imageserver ports: docker rm -f isdebug ; docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti isdebug
  • From a second window, connect and run the VNC wrapper: docker exec -ti isdebug /vnc
  • Connect to the VNC display localhost:0 using your favourite client, such as gvncviewer on Ubuntu or Chicken of the VNC on OS X.
  • From a third window, run whatever tests are needed, such as (in the plotly.js repo): node test/image/make_baseline.js gl* (you'll want to modify testContainerUrl in tasks/util/constants.js, like in https://github.com/plotly/plotly.js/tree/image-exporter-testing)

In my case the image exporter window where the work was occurring was in fluxbox tab 2. Some errors were printed during normal operation (at this point images were being generated successfully):

[screenshot: screen shot 2017-12-11 at 21 54 05]

When the server hung, a different message was printed:

[screenshot: screen shot 2017-12-11 at 21 54 38]

I have not had the chance to investigate the significance of either of these things.

@etpinard
Contributor

It might be worth trying to listen to

https://electronjs.org/docs/api/app#event-gpu-process-crashed

in the app code. I'll give this a shot this afternoon.
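For reference, a minimal sketch of such a listener, assuming it is added wherever the exporter creates its Electron app (the log wording here is illustrative):

const { app } = require('electron')

// Log GPU process crashes instead of letting the server hang silently.
// `killed` tells whether the process was killed by the system or crashed on its own.
app.on('gpu-process-crashed', (event, killed) => {
  console.error('[image-exporter] GPU process crashed; killed by system:', killed)
})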

@scjody
Contributor Author

scjody commented Dec 13, 2017

I set up an Ubuntu 16.04 VM (since that's what @etpinard uses as a desktop and I can't easily get gl renders working with XQuartz on OS X), opened up the X server to TCP connections, and ran image-export-server using that X server to see if the issues were caused by Xvfb.
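Roughly, and assuming the VM's X server was started with TCP listening enabled, that arrangement amounts to the following (<vm-ip> is a placeholder, and isdebug is the debug image from above):

xhost +    # on the VM: allow remote X clients (debug-only, disables access control)
docker run -p 9091:9091 -e DISPLAY=<vm-ip>:0 isdebug    # point the container at the VM's X server instead of Xvfb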

With image-export-server running in Docker the tests hung at the usual place (testing gl* caused a hang at gl3d-chrisp-nan-1). With image-export-server running on my development machine the tests completed successfully.

So the issue is unlikely to be caused by Xvfb. What's different about the app (or Electron) when it's built and run inside Docker vs. outside?

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody just jumping in to see if I can help with this stuff but may just state the obvious (perhaps incorrectly assuming that your development machine uses the graphics drivers of that machine for WebGL content rendering.)

| What's different about the app (or Electron) when it's built and run inside Docker vs. outside?

One difference is that when Electron is used on a desktop, the WebGL API calls go through whatever graphics driver the desktop has, e.g. from Nvidia, AMD or Intel integrated graphics. In a Docker container, or on CI systems in general, it runs in headless mode but still expects a display driver, in this case Xvfb.

Many of our WebGL plots run on stack.gl and gl-vis which, unlike regl, don't provide automatic resource management. The typical WebGL resources are the shader programs, and the buffer bindings that link typed JS arrays to contents on the GPU used by the shaders (uniforms, attributes, textures etc.). As there's no automatic resource management, a small issue in one specific trace type may yield state inconsistency, e.g. no buffer bound to an enabled attribute (as in your log) or sometimes the other way around.

Various drivers, and by extension Xvfb too, have different idiosyncrasies and tolerances for handling slightly out-of-spec WebGL state. As we did encounter such things in the past, it might be that Xvfb is more sensitive to some of them.

Something else: apparently we're running tests with ignore-gpu-blacklist, which in turn might disable some WebGL extensions depending on the graphics stack; if we rely on one of these and a related warning gets swallowed, we might get seemingly unrelated state-inconsistency reports just like the above. A candidate is OES_vertex_array_object. We seem to only use it via a polyfill, but the same graphics card may process real VAOs and the polyfill differently, and not properly releasing the polyfill may lead to inconsistencies too.
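As an aside, here is a generic sketch of what that manual bookkeeping looks like (plain WebGL, not plotly.js/stack.gl code; the 'position' attribute name is illustrative). Skipping any of the cleanup calls is exactly how an enabled attribute can end up with no buffer bound:

// Nothing is freed or disabled unless the code does it explicitly.
function drawOnce(gl, program, vertices) {
  const buf = gl.createBuffer()
  gl.bindBuffer(gl.ARRAY_BUFFER, buf)
  gl.bufferData(gl.ARRAY_BUFFER, vertices, gl.STATIC_DRAW)

  const loc = gl.getAttribLocation(program, 'position')
  gl.enableVertexAttribArray(loc)
  gl.vertexAttribPointer(loc, 2, gl.FLOAT, false, 0, 0)
  gl.useProgram(program)
  gl.drawArrays(gl.TRIANGLES, 0, vertices.length / 2)

  // Manual cleanup: forget disableVertexAttribArray and delete the buffer,
  // and the context is left with an enabled attribute that has no buffer bound.
  gl.disableVertexAttribArray(loc)
  gl.bindBuffer(gl.ARRAY_BUFFER, null)
  gl.deleteBuffer(buf)
}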

@scjody
Contributor Author

scjody commented Dec 13, 2017

@monfera If you can think of experiments to try (related to ignore-gpu-blacklist or otherwise), please try them and let us know!

I realize Xvfb has different properties from an X server running with a real graphics card. That's why I tried it yesterday using an X server running on Ubuntu 16.04 (with access to my laptop's graphics card). To summarize, with image-export-server running in Docker using an external Ubuntu X server, the hang still occurred. With it running on my development machine using the same external Ubuntu X server, no hang occurred. Is it possible that Electron is accessing the graphics card in some way that does not go through the X11 protocol? This seems unlikely but if so it could explain the difference.

My next experiment is to put an externally-built (and non-hanging) image-exporter-server and Electron into a Docker container and test that.

@etpinard
Contributor

Oh, I should mention: I tried running the gl image tests in Docker without the ignore-gpu-blacklist flag yesterday. Without that flag the gl images fail to generate.

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody I started making a WebGL resetting tool, at least for testing, then felt like maybe there's one already :-) https://github.com/stackgl/gl-reset

I'm about to make an Ubuntu VM like yours to test with it. It'd help if we had a minimal case that triggers the error. What would be a minimal run? You mention it fails at gl3d-chrisp-nan-1 - maybe it fails if this single mock is executed, or if only the previous plot (or a previous plot) and this one are executed (i.e. it likely doesn't need to go through everything that came before).
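A usage sketch, assuming gl-reset exposes the factory API its README suggests (a function taking the gl context and returning a reset callable — worth verifying before relying on it):

const reset = require('gl-reset')(gl)  // gl: the exporter window's WebGLRenderingContext
// ... render one mock ...
reset()  // wipe buffers, programs and bindings back to defaults before the next one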

@scjody
Contributor Author

scjody commented Dec 13, 2017

@monfera gl-reset looks good!

Note that we've only been able to reproduce the hang when it's running in a Docker container, not directly on an Ubuntu VM.

I don't have any more information than what's in #39 (comment) regarding reproducing this issue. Trying gl3d_chrisp-nan-1 repeatedly is on my list, and if you want to try that (or anything similar) go for it.

@scjody
Contributor Author

scjody commented Dec 13, 2017

If I copy a working image-export-server.js and its entire source + build tree to Docker, I'm still able to reproduce the hang. (I'm still using the Ubuntu 16.04 X server just to keep that constant too, so it's not related to Xvfb.) So there's something different about running in Docker, or the packages installed in the Docker image, that causes the hang.

My next step is to create an Ubuntu VM that's as close as possible to the Docker container (in terms of packages installed and versions) and see if I can reproduce the issue there.

Incidentally I ran gl3d_bunny 200 times then gl3d_chrisp-nan-1 200 times successfully so I'm still no closer to finding a single mock that reproduces the issue. Running with gl* still reliably reproduces the issue for me though.

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody based on the nature of the WebGL resource management, I also suspect that no single mock will reproduce the issue, but a (perhaps N-times repeated) transition between two mocks might. There may be many such pairs; the easiest is to use the mock where it breaks as one of the pair, and a preceding mock as the other, and alternate between the two. If we're lucky, it's the directly preceding mock, e.g. gl3d-chrisp-nan-1 where it broke in your run, and whatever the preceding one was.

If that doesn't reproduce the issue, bisecting the preceding ones may help. E.g. if gl3d-chrisp-nan-1 is the 20th in the set and running all of 1..20 causes the error, then maybe 10..20 causes it too; if 10..20 causes it but 15..20 doesn't, then maybe 10..15 plus 20 will (perhaps not on the first run, but after 5 runs or so). It's tedious, but if this transition hypothesis holds we might find, in logarithmic time, a pair of plots that reproduces it.
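A throwaway alternation script could look like the sketch below. It assumes the imageserver accepts a mock's figure JSON via an HTTP POST on port 9091, the way the plotly.js baseline harness drives it; the payload shape and the mock pair are assumptions and should be taken from test/image/make_baseline.js and the failing run.

const fs = require('fs')
const http = require('http')

// Hypothetical pair: the mock that hung plus a preceding one; adjust as needed.
const mocks = ['gl3d_snowden_altered.json', 'gl3d_chrisp-nan-1.json']

function post(figure) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ figure: figure, format: 'png' })  // payload shape is an assumption
    const req = http.request(
      { host: 'localhost', port: 9091, method: 'POST', headers: { 'Content-Type': 'application/json' } },
      res => { res.resume(); res.on('end', () => resolve(res.statusCode)) }
    )
    req.setTimeout(30000, () => reject(new Error('request timed out - server may have hung')))
    req.on('error', reject)
    req.end(body)
  })
}

async function main() {
  for (let i = 0; i < 200; i++) {
    const mock = mocks[i % 2]
    const figure = JSON.parse(fs.readFileSync('test/image/mocks/' + mock, 'utf8'))
    console.log(i, mock, await post(figure))
  }
}

main().catch(err => { console.error(err); process.exit(1) })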

@monfera
Contributor

monfera commented Dec 13, 2017

(btw. working on the setup w/ your previous help but may not be around for too long as it's getting late here so you have a better chance to try with the directly preceding one)

@etpinard
Contributor

Maybe we should try pinging Mikola or one of his friends (e.g. Hugh Kennedy) about this topic? Perhaps they came across the same problems before.

cc @bpostlethwaite

@scjody
Contributor Author

scjody commented Dec 13, 2017

Using an Ubuntu 16.04 VM with as close a copy of the Docker image as I could create (for obvious reasons directories like /boot needed to be preserved in their VM state), the error does not occur. (I'm still using my external Ubuntu 16.04 X server.)

This suggests that the issue is extremely specific to running under Docker. There really shouldn't be any significant differences there (unless Electron is somehow accessing the graphics hardware directly rather than through X).

Unfortunately creating the VM was fairly time consuming and I still managed to delete some important things (VMware tools for one thing) without which the VM can't be used. So I'm going to abandon this line of investigation unless someone else sees a lot of value in it.

Right now I'm going to reach out to Replicated to see if they have any ideas.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Also I'm definitely in favour of reaching out to folks like Mikola who may be able to help with this.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Note that the deadline for solving this issue in time for On-Prem 2.3.0 is 5 PM on Friday. Any later and we won't be able to do enough testing for a 2017 release.

@scjody
Contributor Author

scjody commented Dec 13, 2017

From Replicated, here are some profiles for running apps like Chrome and Slack (which uses Electron) under Docker. I'm going to work through these and see if any make a difference.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Running with a huge number of Docker options from that list allows a full gl* test run to complete (at least with an external X server).

docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti --net=host -e DISPLAY --privileged --cap-add=ALL --security-opt seccomp=unconfined -v /etc/localtime:/etc/localtime:ro -v /dev/shm:/dev/shm --device /dev/snd --device /dev/dri --device /dev/video0 --device /dev/usb --device /dev/bus/usb --group-add audio --group-add video isdebug

I'm now going to try Xvfb, and also bisect to see what options are actually needed.

Note that this may not provide a complete solution to the problem because not all these options are available in Replicated and Kubernetes.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Xvfb works. I'm going to continue testing with it since it's closer to what we need in production.

@etpinard
Contributor

etpinard commented Dec 14, 2017

I confirm that running the container with the docker options ⤴️ makes node test/image/make_baseline.js gl* complete successfully 🎉

In the next few hours, I'll try to dissect which of the option(s) above make this work.

@scjody
Contributor Author

scjody commented Dec 14, 2017

Got it: docker run -p 9091:9091 --name isdebug -ti -v /dev/shm:/dev/shm isdebug

Everything except -v /dev/shm:/dev/shm is for my debugging convenience and shouldn't affect the issue.

So mapping in the shared memory device works around the issue. I'll check if this is possible in On-Prem and GKE, but it seems unlikely, especially with GKE 😿. It might just be that this is a good clue towards what's going on with Electron or Chrome... at first glance this really feels like a Chrome bug, but it's too early to say.

@scjody
Contributor Author

scjody commented Dec 14, 2017

Just to confirm, it works with docker run -p 9091:9091 -v /dev/shm:/dev/shm isdebug

This is the minimal set of arguments:

  • -p 9091:9091 to map the port used, so I can access it from my machine
  • -v /dev/shm:/dev/shm to map in the /dev/shm device
  • isdebug the name of the image I'm using

@etpinard
Contributor

-v /dev/shm:/dev/shm to map in the /dev/shm device

This will cause problems on CircleCI as CircleCI 2.0 doesn't allow mounting volumes, see plotly/plotly.js#1798 (comment) for more info.

@scjody
Contributor Author

scjody commented Dec 14, 2017

There's nothing special about the host /dev/shm other than size. On my Ubuntu system it defaults to 2 GiB, whereas Docker defaults to 64 MiB. Docker 1.10 makes shm size configurable using the --shm-size option, and I was able to do a successful run with 128 MiB.

On the Chrome side, the underlying issue has just been fixed in Chromium but there's no Chrome release yet, and it's certainly not in Electron... https://bugs.chromium.org/p/chromium/issues/detail?id=736452

Also Chrome may need up to 512 MiB so I suggest that as a --shm-size value if possible.
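For example, with the suggested value, the minimal run from above becomes (isdebug being the debug image used earlier in this thread):

docker run -p 9091:9091 --shm-size=512m isdebug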

My next steps are to see if we can increase /dev/shm size (via --shm-size, mounting the host's /dev/shm, or mounting a new volume there inside the container) in GKE and Replicated.

@scjody
Contributor Author

scjody commented Dec 14, 2017

In GKE, you can't use the --shm-size option (Kubernetes doesn't support it yet) but you can mount a Memory device on /dev/shm, which creates a 2 GiB device.

root@imageserver-3438547728-h88cd:/var/www/image-exporter# df -k /dev/shm --si
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           2.0G  652k  2.0G   1% /dev/shm

With this, I am able to run gl* successfully. (I scaled back to 1 pod to make sure all requests hit the same pod.)
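For reference, the memory-backed /dev/shm mount described above is typically expressed in the pod spec roughly like this (volume and container names are illustrative):

containers:
  - name: imageserver
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
volumes:
  - name: dshm
    emptyDir:
      medium: Memory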

Replicated is next...

@scjody
Contributor Author

scjody commented Dec 14, 2017

With Replicated --shm-size is not currently supported. They're checking on a timeframe for adding it.

Mounting the host's shm device into the imageserver container works, but the path that needs to be used is different on CentOS/RHEL vs. Ubuntu/Debian: /dev/shm vs. /run/shm. That would require the administrator to configure their shm path, which will cause confusion among some customers. Again Replicated could fix this so it doesn't need to be configured, but they're checking on the timeframe.

I'll start a discussion about our options once I have more information from Replicated.

@scjody scjody assigned scjody and unassigned monfera Dec 14, 2017
@scjody
Contributor Author

scjody commented Dec 15, 2017

The Replicated workaround is no longer relevant since we're not shipping this in On-Prem 2.3.0 😿. They will definitely implement the --shm-size option in time for On-Prem 2.4.0, and in any case the underlying issue in Chrome may be fixed by then.

@scjody scjody removed their assignment Dec 15, 2017
@etpinard
Contributor

Looking at puppeteer docs, I came across puppeteer/puppeteer#1603 - so yeah, we're not the only ones having issues with docker + chromium 🙃

@scjody
Contributor Author

scjody commented Dec 15, 2017

Yeah, I found at least a dozen cases throughout my travels yesterday.

@chriddyp
Member

chriddyp commented Jan 8, 2018

Looking at puppeteer docs, I came across puppeteer/puppeteer#1603 - so yeah, we're not the only ones having issues with docker + chromium 🙃

I've been watching this issue, looks like it got closed / fixed today

@scjody
Contributor Author

scjody commented Jan 8, 2018

The thing we're waiting for is for this to get into Electron (or whatever we use if we switch away from that).

According to https://bugs.chromium.org/p/chromium/issues/detail?id=736452#c61 it will be in Chrome 65.0.3299.6 but even the latest beta Electron is still on 59.0.3071.115 😿

Anyway, we do have a workaround for this and it will work in all the environments we care about:

  • On-Prem: A replicated.yaml change is needed (in the streambed repo). Let's do this once we start work on On-Prem 2.4.0.
  • prod/stage: Create a memory-backed directory and mount it into the container (a rough sketch follows this list). PR coming.
  • developer workstations: share the host shm device by adding this to the Docker command line: --volume=/dev/shm:/dev/shm
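For the prod/stage item, a rough sketch of the memory-backed directory approach (the path is a placeholder; the size follows the 512 MiB guidance above):

sudo mount -t tmpfs -o size=512m tmpfs /mnt/imageserver-shm
docker run -p 9091:9091 -v /mnt/imageserver-shm:/dev/shm <imageserver image>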

@jackparmer
Contributor

What does this mean (if anything) for a Python distribution of this app? E.g. if we were to try to distribute this app through pip with a Python interface for running it? (Essentially a solution for this issue: plotly/plotly.py#880)

@etpinard
Contributor

etpinard commented Jan 8, 2018

What does this mean (if anything) for a Python distribution of this app?

It doesn't mean anything unless we expect python users to use a docker container (which I don't think is a good idea).

@scjody scjody closed this as completed in #50 Feb 7, 2018