Hang after too many gl requests in Docker #39

Closed
scjody opened this issue Dec 8, 2017 · 39 comments

@scjody
Contributor

scjody commented Dec 8, 2017

From the tests documented starting at https://github.com/plotly/streambed/issues/9865#issuecomment-349995119:

When image-exporter is run as an imageserver in Docker, after a small number of gl requests (30-40), the image-exporter hangs completely. The request in progress times out, and the server won't accept any more connections and must be restarted.

Two examples:

  • If test/image-make_baseline.js is used to run gl*, it hangs at gl3d_chrisp-nan-1.json.
  • If test/image-make_baseline.js is used to run gl3d*, it makes it past gl3d_chrisp-nan-1.json and hangs at gl3d_snowden_altered.json.

This means that the issue is unlikely to be specific to any one plot, but rather some resource becomes exhausted or something builds up to the point where image generation can't proceed.

@etpinard @monfera FYI

@scjody
Contributor Author

scjody commented Dec 8, 2017

Note that this blocks using the new imageservers in prod, in On-Prem, and also blocks plotly/plotly.js#1972

@monfera
Contributor

monfera commented Dec 8, 2017

I plugged a bunch of memory leaks in gl plots about a year ago, and we got information that those fixes worked but that there were other leaks too, so I'm not too surprised. As there are many types of gl plots, I wonder if periodically restarting the server would be a feasible short-term option, as hacky as it sounds. Also, a new plot type with unintended leaks may always come along in the future, so it would be a good preventive step. Maybe the real solution would be a plotly.js test bench, separate from the rest due to its resource needs, that would run through a large number of plot API calls so that monotonic memory growth could be identified.

@monfera monfera self-assigned this Dec 8, 2017
@monfera
Contributor

monfera commented Dec 8, 2017

Self-assigned it on info from @scjody that @jackparmer suggested I continue with the gl plot leak detection & fix work. Jody, when work starts on this (I understand after the couple of issues assigned to me are fixed), I'll need a bag of representative plots that I can run the server through, to reproduce the issue locally, so that we don't miss something.

IOW, running it through all public plots is an incredibly effective way of ferreting out all memory leaks that can happen on a single render pass (it doesn't catch leaks from interactions though).

@scjody
Contributor Author

scjody commented Dec 8, 2017

@monfera The best way I've found is to run test/image/mocks/gl* or test/image/mocks/gl3d* (from plotly.js). Both sets of mocks will reproduce the issue, but at different places.

@scjody
Contributor Author

scjody commented Dec 8, 2017

I wonder if periodically restarting the server would be a feasible short-term option

@monfera I'd like to do that as a workaround for #41, but I don't think it's feasible here since the server hangs so quickly. We observe hangs after between 30 and 40 gl plots, so to be safe we'd need to restart every 10 plots or so, and that would reduce server performance to an unacceptable level.

I also don't think this particular issue is caused by a memory leak. Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise). Some other kind of resource leak is certainly possible!

@monfera
Contributor

monfera commented Dec 8, 2017

thanks @scjody for the added notes!

| Memory usage does grow steadily (#41) but does not increase significantly after 30 to 40 plots (gl or otherwise)

Reading #41, I got the impression that, even without the gl plots, memory consumption grows steadily throughout the first 1.5 hrs you added, with the text above referring to 351 mocks going from 0.8 GB to 1.6 GB. I get your point that the gl and non-gl cases may be separate issues, but I thought the non-gl ones represent a problem for us too (the raison d'être of #41).

@scjody
Contributor Author

scjody commented Dec 8, 2017

#41 is a problem, but I'm reasonably sure restarting every 1000 plots is an adequate workaround for that issue.

@scjody scjody mentioned this issue Dec 12, 2017
@scjody
Contributor Author

scjody commented Dec 12, 2017

@monfera @etpinard I've added some debugging stuff to the Docker container in #44.

  • Xvfb runs with X errors sent to STDOUT (instead of being ignored), and auditing is turned on. This means it prints a message for every X connection and disconnection, which shows if Electron is connecting and disconnecting (spoiler alert: it isn't).
  • Xvfb's screen dimensions have been doubled.
  • I added a VNC server, window manager, and a wrapper script.

Usage:

  • Build the image (docker build -f deployment/Dockerfile -t isdebug .), or grab it from quay.
  • Run the image as container isdebug and expose the VNC and imageserver ports: docker rm -f isdebug ; docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti isdebug
  • From a second window, connect and run the VNC wrapper: docker exec -ti isdebug /vnc
  • Connect to the VNC display localhost:0 using your favourite client, such as gvncviewer on Ubuntu or Chicken of the VNC on OS X.
  • From a third window, run whatever tests are needed, such as (in the plotly.js repo): node test/image/make_baseline.js gl* (you'll want to modify testContainerUrl in tasks/util/constants.js, like in https://github.com/plotly/plotly.js/tree/image-exporter-testing)

In my case the image exporter window where the work was occurring was in fluxbox tab 2. Some errors were printed during normal operation (at this point images were being generated successfully):

[screenshot: screen shot 2017-12-11 at 21 54 05]

When the server hung, a different message was printed:

[screenshot: screen shot 2017-12-11 at 21 54 38]

I have not had the chance to investigate the significance of either of these things.

@etpinard
Contributor

It might be worth trying to listen to

https://electronjs.org/docs/api/app#event-gpu-process-crashed

in the app code. I'll give this a shot this afternoon.
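For reference, a minimal sketch of such a listener, assuming it is added wherever the exporter creates its Electron app (the log wording here is illustrative):

const { app } = require('electron')

// Log GPU process crashes instead of letting the server hang silently.
// `killed` tells whether the process was killed by the system or crashed on its own.
app.on('gpu-process-crashed', (event, killed) => {
  console.error('[image-exporter] GPU process crashed; killed by system:', killed)
})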

@scjody
Contributor Author

scjody commented Dec 13, 2017

I set up an Ubuntu 16.04 VM (since that's what @etpinard uses as a desktop and I can't easily get gl renders working with XQuartz on OS X), opened up the X server to TCP connections, and ran image-export-server using that X server to see if the issues were caused by Xvfb.
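Roughly, and assuming the VM's X server was started with TCP listening enabled, that arrangement amounts to the following (<vm-ip> is a placeholder, and isdebug is the debug image from above):

xhost +    # on the VM: allow remote X clients (debug-only, disables access control)
docker run -p 9091:9091 -e DISPLAY=<vm-ip>:0 isdebug    # point the container at the VM's X server instead of Xvfb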

With image-export-server running in Docker the tests hung at the usual place (testing gl* caused a hang at gl3d-chrisp-nan-1). With image-export-server running on my development machine the tests completed successfully.

So the issue is unlikely to be caused by Xvfb. What's different about the app (or Electron) when it's built and run inside Docker vs. outside?

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody just jumping in to see if I can help with this stuff but may just state the obvious (perhaps incorrectly assuming that your development machine uses the graphics drivers of that machine for WebGL content rendering.)

| What's different about the app (or Electron) when it's built and run inside Docker vs. outside?

One difference is that when Electron is used on a desktop, the WebGL API calls go through whatever graphics driver the desktop has, e.g. from Nvidia, AMD or Intel integrated graphics. In a Docker container, or on CI systems in general, it runs in headless mode but still expects a display driver, in this case Xvfb.

Many of our WebGL plots run on stack.gl and gl-vis which, unlike regl, don't provide automatic resource management. The typical WebGL resources are the shader programs, and the buffer bindings that link typed JS arrays to contents on the GPU used by the shaders (uniforms, attributes, textures etc.). As there's no automatic resource management, a small issue in one specific trace type may yield state inconsistency, e.g. no buffer bound to an enabled attribute (as in your log) or sometimes the other way around.

Various drivers, and by extension Xvfb too, have different idiosyncrasies and tolerances for handling slightly out-of-spec WebGL state. As we did encounter such things in the past, it might be that Xvfb is more sensitive to some of them.

Something else: apparently we're running tests with ignore-gpu-blacklist, which in turn might disable some WebGL extensions depending on the graphics stack; if we rely on one of these and a related warning gets swallowed, we might get seemingly unrelated state-inconsistency reports just like the above. A candidate is OES_vertex_array_object. We seem to only use it via a polyfill, but the same graphics card may process real VAOs and the polyfill differently, and not properly releasing the polyfill may lead to inconsistencies too.
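As an aside, here is a generic sketch of what that manual bookkeeping looks like (plain WebGL, not plotly.js/stack.gl code; the 'position' attribute name is illustrative). Skipping any of the cleanup calls is exactly how an enabled attribute can end up with no buffer bound:

// Nothing is freed or disabled unless the code does it explicitly.
function drawOnce(gl, program, vertices) {
  const buf = gl.createBuffer()
  gl.bindBuffer(gl.ARRAY_BUFFER, buf)
  gl.bufferData(gl.ARRAY_BUFFER, vertices, gl.STATIC_DRAW)

  const loc = gl.getAttribLocation(program, 'position')
  gl.enableVertexAttribArray(loc)
  gl.vertexAttribPointer(loc, 2, gl.FLOAT, false, 0, 0)
  gl.useProgram(program)
  gl.drawArrays(gl.TRIANGLES, 0, vertices.length / 2)

  // Manual cleanup: forget disableVertexAttribArray and delete the buffer,
  // and the context is left with an enabled attribute that has no buffer bound.
  gl.disableVertexAttribArray(loc)
  gl.bindBuffer(gl.ARRAY_BUFFER, null)
  gl.deleteBuffer(buf)
}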

@scjody
Contributor Author

scjody commented Dec 13, 2017

@monfera If you can think of experiments to try (related to ignore-gpu-blacklist or otherwise), please try them and let us know!

I realize Xvfb has different properties from an X server running with a real graphics card. That's why I tried it yesterday using an X server running on Ubuntu 16.04 (with access to my laptop's graphics card). To summarize, with image-export-server running in Docker using an external Ubuntu X server, the hang still occurred. With it running on my development machine using the same external Ubuntu X server, no hang occurred. Is it possible that Electron is accessing the graphics card in some way that does not go through the X11 protocol? This seems unlikely but if so it could explain the difference.

My next experiment is to put an externally-built (and non-hanging) image-exporter-server and Electron into a Docker container and test that.

@etpinard
Contributor

Oh, I should mention: I tried running the gl image tests in Docker without the ignore-gpu-blacklist flag yesterday. Without that flag the gl images fail to generate.

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody I started making a WebGL resetting tool, at least for testing, then felt like maybe there's one already :-) https://github.com/stackgl/gl-reset

I'm about to make an Ubuntu VM like yours to test with it. It'd help if we had a minimal case that triggers the error. What would be a minimal run? You mention it fails at gl3d-chrisp-nan-1 - maybe it fails if this single mock is executed, or if only the previous plot (or a previous plot) and this one are executed (i.e. it likely doesn't need to go through everything that came before).
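A usage sketch, assuming gl-reset exposes the factory API its README suggests (a function taking the gl context and returning a reset callable — worth verifying before relying on it):

const reset = require('gl-reset')(gl)  // gl: the exporter window's WebGLRenderingContext
// ... render one mock ...
reset()  // wipe buffers, programs and bindings back to defaults before the next one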

@scjody
Contributor Author

scjody commented Dec 13, 2017

@monfera gl-reset looks good!

Note that we've only been able to reproduce the hang when it's running in a Docker container, not directly on an Ubuntu VM.

I don't have any more information than what's in #39 (comment) regarding reproducing this issue. Trying gl3d_chrisp-nan-1 repeatedly is on my list, and if you want to try that (or anything similar) go for it.

@scjody
Contributor Author

scjody commented Dec 13, 2017

If I copy a working image-export-server.js and its entire source + build tree to Docker, I'm still able to reproduce the hang. (I'm still using the Ubuntu 16.04 X server just to keep that constant too, so it's not related to Xvfb.) So there's something different about running in Docker, or the packages installed in the Docker image, that causes the hang.

My next step is to create an Ubuntu VM that's as close as possible to the Docker container (in terms of packages installed and versions) and see if I can reproduce the issue there.

Incidentally I ran gl3d_bunny 200 times then gl3d_chrisp-nan-1 200 times successfully so I'm still no closer to finding a single mock that reproduces the issue. Running with gl* still reliably reproduces the issue for me though.

@monfera
Contributor

monfera commented Dec 13, 2017

@scjody based on the nature of the WebGL resource management, I also suspect that no single mock will reproduce the issue, but a (perhaps N-times repeated) transition between two mocks might. There may be many such pairs; the easiest is to use the mock where it breaks as one of the pair, and a preceding mock as the other, and alternate between the two. If we're lucky, it's the directly preceding mock, e.g. gl3d-chrisp-nan-1 where it broke in your run, and whatever the preceding one was.

If that doesn't reproduce the issue, bisecting the preceding ones may help. E.g. if gl3d-chrisp-nan-1 is the 20th in the set and running all of 1..20 causes the error, then maybe 10..20 causes it too; if 10..20 causes it but 15..20 doesn't, then maybe 10..15 plus 20 will (perhaps not on the first run, but after 5 runs or so). It's tedious, but if this transition hypothesis holds we might find, in logarithmic time, a pair of plots that reproduces it.
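A throwaway alternation script could look like the sketch below. It assumes the imageserver accepts a mock's figure JSON via an HTTP POST on port 9091, the way the plotly.js baseline harness drives it; the payload shape and the mock pair are assumptions and should be taken from test/image/make_baseline.js and the failing run.

const fs = require('fs')
const http = require('http')

// Hypothetical pair: the mock that hung plus a preceding one; adjust as needed.
const mocks = ['gl3d_snowden_altered.json', 'gl3d_chrisp-nan-1.json']

function post(figure) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ figure: figure, format: 'png' })  // payload shape is an assumption
    const req = http.request(
      { host: 'localhost', port: 9091, method: 'POST', headers: { 'Content-Type': 'application/json' } },
      res => { res.resume(); res.on('end', () => resolve(res.statusCode)) }
    )
    req.setTimeout(30000, () => reject(new Error('request timed out - server may have hung')))
    req.on('error', reject)
    req.end(body)
  })
}

async function main() {
  for (let i = 0; i < 200; i++) {
    const mock = mocks[i % 2]
    const figure = JSON.parse(fs.readFileSync('test/image/mocks/' + mock, 'utf8'))
    console.log(i, mock, await post(figure))
  }
}

main().catch(err => { console.error(err); process.exit(1) })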

@monfera
Contributor

monfera commented Dec 13, 2017

(btw. working on the setup w/ your previous help but may not be around for too long as it's getting late here so you have a better chance to try with the directly preceding one)

@etpinard
Contributor

Maybe we should try pinging Mikola or one of his friends (e.g. Hugh Kennedy) about this topic? Perhaps they came across the same problems before.

cc @bpostlethwaite

@scjody
Contributor Author

scjody commented Dec 13, 2017

Using an Ubuntu 16.04 VM with as close a copy of the Docker image as I could create (for obvious reasons directories like /boot needed to be preserved in their VM state), the error does not occur. (I'm still using my external Ubuntu 16.04 X server.)

This suggests that the issue is extremely specific to running under Docker. There really shouldn't be any significant differences there (unless Electron is somehow accessing the graphics hardware directly rather than through X).

Unfortunately creating the VM was fairly time consuming and I still managed to delete some important things (VMware tools for one thing) without which the VM can't be used. So I'm going to abandon this line of investigation unless someone else sees a lot of value in it.

Right now I'm going to reach out to Replicated to see if they have any ideas.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Also I'm definitely in favour of reaching out to folks like Mikola who may be able to help with this.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Note that the deadline for solving this issue in time for On-Prem 2.3.0 is 5 PM on Friday. Any later and we won't be able to do enough testing for a 2017 release.

@scjody
Contributor Author

scjody commented Dec 13, 2017

From Replicated, here are some profiles for running apps like Chrome and Slack (which uses Electron) under Docker. I'm going to work through these and see if any make a difference.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Running with a huge number of Docker options from that list allows a full gl* test run to complete (at least with an external X server).

docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti --net=host -e DISPLAY --privileged --cap-add=ALL --security-opt seccomp=unconfined -v /etc/localtime:/etc/localtime:ro -v /dev/shm:/dev/shm --device /dev/snd --device /dev/dri --device /dev/video0 --device /dev/usb --device /dev/bus/usb --group-add audio --group-add video isdebug

I'm now going to try Xvfb, and also bisect to see what options are actually needed.

Note that this may not provide a complete solution to the problem because not all these options are available in Replicated and Kubernetes.

@scjody
Contributor Author

scjody commented Dec 13, 2017

Xvfb works. I'm going to continue testing with it since it's closer to what we need in production.

@etpinard
Contributor

etpinard commented Dec 14, 2017

I confirm that running the container with the docker options ⤴️ makes node test/image/make_baseline.js gl* complete successfully 🎉

In the next few hours, I'll try to dissect which of the option(s) above make this work.

@scjody
Contributor Author

scjody commented Dec 14, 2017

Got it: docker run -p 9091:9091 --name isdebug -ti -v /dev/shm:/dev/shm isdebug

Everything except -v /dev/shm:/dev/shm is for my debugging convenience and shouldn't affect the issue.

So mapping in the shared memory device works around the issue. I'll check if this is possible in On-Prem and GKE, but it seems unlikely, especially with GKE 😿. It might just be that this is a good clue towards what's going on with Electron or Chrome... at first glance this really feels like a Chrome bug, but it's too early to say.

@scjody
Contributor Author

scjody commented Dec 14, 2017

Just to confirm, it works with docker run -p 9091:9091 -v /dev/shm:/dev/shm isdebug

This is the minimal set of arguments:

  • -p 9091:9091 to map the port used, so I can access it from my machine
  • -v /dev/shm:/dev/shm to map in the /dev/shm device
  • isdebug the name of the image I'm using

@etpinard
Contributor

-v /dev/shm:/dev/shm to map in the /dev/shm device

This will cause problems on CircleCI as CircleCI 2.0 doesn't allow mounting volumes, see plotly/plotly.js#1798 (comment) for more info.

@scjody
Contributor Author

scjody commented Dec 14, 2017

There's nothing special about the host /dev/shm other than size. On my Ubuntu system it defaults to 2 GiB, whereas Docker defaults to 64 MiB. Docker 1.10 makes shm size configurable using the --shm-size option, and I was able to do a successful run with 128 MiB.

On the Chrome side, the underlying issue has just been fixed in Chromium but there's no Chrome release yet, and it's certainly not in Electron... https://bugs.chromium.org/p/chromium/issues/detail?id=736452

Also Chrome may need up to 512 MiB so I suggest that as a --shm-size value if possible.
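For example, with the suggested value, the minimal run from above becomes (isdebug being the debug image used earlier in this thread):

docker run -p 9091:9091 --shm-size=512m isdebug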

My next steps are to see if we can increase /dev/shm size (via --shm-size, mounting the host's /dev/shm, or mounting a new volume there inside the container) in GKE and Replicated.

@scjody
Contributor Author

scjody commented Dec 14, 2017

In GKE, you can't use the --shm-size option (Kubernetes doesn't support it yet) but you can mount a Memory device on /dev/shm, which creates a 2 GiB device.

root@imageserver-3438547728-h88cd:/var/www/image-exporter# df -k /dev/shm --si
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           2.0G  652k  2.0G   1% /dev/shm

With this, I am able to run gl* successfully. (I scaled back to 1 pod to make sure all requests hit the same pod.)
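For reference, the memory-backed /dev/shm mount described above is typically expressed in the pod spec roughly like this (volume and container names are illustrative):

containers:
  - name: imageserver
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
volumes:
  - name: dshm
    emptyDir:
      medium: Memory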

Replicated is next...

@scjody
Contributor Author

scjody commented Dec 14, 2017

With Replicated --shm-size is not currently supported. They're checking on a timeframe for adding it.

Mounting the host's shm device into the imageserver container works, but the path that needs to be used is different on CentOS/RHEL vs. Ubuntu/Debian: /dev/shm vs. /run/shm. That would require the administrator to configure their shm path, which will cause confusion among some customers. Again Replicated could fix this so it doesn't need to be configured, but they're checking on the timeframe.

I'll start a discussion about our options once I have more information from Replicated.

@scjody scjody assigned scjody and unassigned monfera Dec 14, 2017
@scjody
Contributor Author

scjody commented Dec 15, 2017

The Replicated workaround is no longer relevant since we're not shipping this in On-Prem 2.3.0 😿. They will definitely implement the --shm-size option in time for On-Prem 2.4.0, and in any case the underlying issue in Chrome may be fixed by then.

@scjody scjody removed their assignment Dec 15, 2017
@etpinard
Contributor

Looking at puppeteer docs, I came across puppeteer/puppeteer#1603 - so yeah, we're not the only ones having issues with docker + chromium 🙃

@scjody
Contributor Author

scjody commented Dec 15, 2017

Yeah, I found at least a dozen cases throughout my travels yesterday.

@chriddyp
Member

chriddyp commented Jan 8, 2018

Looking at puppeteer docs, I came across puppeteer/puppeteer#1603 - so yeah, we're not the only ones having issues with docker + chromium 🙃

I've been watching this issue, looks like it got closed / fixed today

@scjody
Contributor Author

scjody commented Jan 8, 2018

The thing we're waiting for is for this to get into Electron (or whatever we use if we switch away from that).

According to https://bugs.chromium.org/p/chromium/issues/detail?id=736452#c61 it will be in Chrome 65.0.3299.6 but even the latest beta Electron is still on 59.0.3071.115 😿

Anyway, we do have a workaround for this and it will work in all the environments we care about:

  • On-Prem: A replicated.yaml change is needed (in the streambed repo). Let's do this once we start work on On-Prem 2.4.0.
  • prod/stage: Create a memory-backed directory and mount it into the container (a rough sketch follows this list). PR coming.
  • developer workstations: share the host shm device by adding this to the Docker command line: --volume=/dev/shm:/dev/shm
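For the prod/stage item, a rough sketch of the memory-backed directory approach (the path is a placeholder; the size follows the 512 MiB guidance above):

sudo mount -t tmpfs -o size=512m tmpfs /mnt/imageserver-shm
docker run -p 9091:9091 -v /mnt/imageserver-shm:/dev/shm <imageserver image>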

@jackparmer
Contributor

What does this mean (if anything) for a Python distribution of this app? E.g. if we were to try to distribute this app through pip with a Python interface for running it? (Essentially a solution for this issue: plotly/plotly.py#880)

@etpinard
Contributor

etpinard commented Jan 8, 2018

What does this mean (if anything) for a Python distribution of this app?

It doesn't mean anything unless we expect python users to use a docker container (which I don't think is a good idea).

@scjody scjody closed this as completed in #50 Feb 7, 2018