
yamux: disable yamux keep alive in server channel #263

Merged: 1 commit merged into kata-containers:master from the yamux/disableKeepAlive branch on Jul 27, 2018

Conversation


@devimc devimc commented Jun 9, 2018

The yamux client runs on the proxy side; sometimes the client is busy handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keep alive should be disabled.

fixes kata-containers/proxy#70
fixes #231

Signed-off-by: Julio Montes [email protected]
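
For illustration only (not the actual channel.go diff), a minimal sketch of the idea above, assuming the standard hashicorp/yamux API: the agent-side (server) session is created with EnableKeepAlive turned off so a missed ping from the busy proxy-side client can never tear the channel down.

package main

import (
	"log"
	"net"

	"github.com/hashicorp/yamux"
)

// newServerSession wraps conn in a yamux server session with keepalive
// disabled: the proxy-side client may be too busy to answer pings, so the
// agent should not enforce them.
func newServerSession(conn net.Conn) (*yamux.Session, error) {
	config := yamux.DefaultConfig()
	config.EnableKeepAlive = false
	return yamux.Server(conn, config)
}

func main() {
	// net.Pipe gives an in-memory connection pair so the sketch is
	// self-contained; the real agent channel sits on a virtio-serial port.
	serverConn, clientConn := net.Pipe()
	defer clientConn.Close()

	session, err := newServerSession(serverConn)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	log.Println("yamux server session created with keepalive disabled")
}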

@jodh-intel
Contributor

Hi @devimc - could you add a little more detail to the commit? For example, what is the implication of the proxy/agent connection being severed (is it that it can never be re-opened and the system will then hang?) This behaviour seems counter-intuitive to a "keep alive" protocol so extra details would help.

In fact, I'd also add a brief comment in both files explaining why the keep alive option is being disabled as it would seem to be a useful feature to have enabled.

@grahamwhaley
Contributor

I'm thinking the CI failures are unrelated and probably already known about. For completeness, from the 17-10 logs (/cc @chavafg ):

Summarizing 3 Failures:

[Fail] Update CPU constraints Update CPU set [It] cpuset should be equal to 0,3 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:388

[Fail] Hot plug CPUs with a parent cgroup [It] should not fail with parent cgroup: abc/xyz/rgb 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:211

[Fail] Hot plug CPUs container with CPU constraint [It] should have 1 CPUs 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:106

@devimc devimc changed the title from "yamux: disable keep alive" to "yamux: disable yamux keep alive in server channel" on Jun 11, 2018
@devimc devimc force-pushed the yamux/disableKeepAlive branch from 570e716 to b57f8db on June 11, 2018 19:23
@codecov

codecov bot commented Jun 11, 2018

Codecov Report

Merging #263 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   43.15%   43.18%   +0.02%     
==========================================
  Files          14       14              
  Lines        2266     2267       +1     
==========================================
+ Hits          978      979       +1     
  Misses       1157     1157              
  Partials      131      131
Impacted Files Coverage Δ
channel.go 47.36% <100%> (+0.46%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17b44df...5f89c07. Read the comment docs.

@bergwolf
Member

bergwolf commented Jun 12, 2018

lgtm!

Approved with PullApprove

@sboeuf

sboeuf commented Jun 12, 2018

@devimc looks good, and I can confirm I ran into those issues when running a Kata Containers container for too long. It would be great to have some proper testing for this, but I am not sure how to perform it.
Please address @jodh-intel's comment and fix the CI, and we can merge this!

@WeiZhang555
Member

WeiZhang555 commented Jun 13, 2018

LGTM 👍

Approved with PullApprove

@devimc
Author

devimc commented Jun 13, 2018

@sboeuf now I see a race condition: kata-shim never finishes. I'm debugging this issue.

@devimc devimc force-pushed the yamux/disableKeepAlive branch from b57f8db to 83d7906 on June 18, 2018 14:11
@sboeuf

sboeuf commented Jun 20, 2018

Any update on this @devimc ?

@devimc
Author

devimc commented Jun 22, 2018

The exec process never finishes and is killed (timeout). This issue doesn't happen if the exec process has a tty (docker exec -t ...).

------------------------------
docker exec check stdout forwarded using exec 
  should displayed it
  /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:61
Running command '/usr/bin/docker [docker run --runtime kata-runtime -td --name CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ busybox sh]'
[docker run --runtime kata-runtime -td --name CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ busybox sh]
Timeout: 60 seconds
Exit Code: 0
Stdout: ffb7746578159035b981f18e752c285a8f89ab3a160401cf17d88a0562ff80ea

Stderr: 
Running command '/usr/bin/docker [docker exec CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ sh -c ls /etc/resolv.conf 2>/dev/null]'
Killing process timeout reached '60' seconds
Running command '/usr/bin/docker [docker rm -f CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ]'
[docker rm -f CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ]
Timeout: 60 seconds
Exit Code: 0
Stdout: CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ

Stderr: 
Running command '/usr/bin/docker [docker ps -a -f name=CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ --format {{.Status}}]'
[docker ps -a -f name=CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ --format {{.Status}}]
Timeout: 60 seconds
Exit Code: 0
Stdout: 
Stderr: 

• Failure [62.969 seconds]
docker exec
/tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:19
  check stdout forwarded using exec
  /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:60
    should displayed it [It]
    /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:61

    Expected
        <int>: -1
    to equal
        <int>: 0

    /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:64

@sboeuf

sboeuf commented Jun 25, 2018

@devimc I am actively debugging this FYI, but I don't think this is related to TTY since I have been able to reproduce this same behavior with a command using TTY.

@devimc devimc force-pushed the yamux/disableKeepAlive branch from 83d7906 to 1b9136a on June 26, 2018 21:21
@sboeuf

sboeuf commented Jun 30, 2018

@devimc @jodh-intel @bergwolf @WeiZhang555
Ok, I have spent a good amount of time this week investigating this issue. But let's first take a step back, because the issue we're trying to solve here might actually be considered a non-issue. The problem is that when we pause the container, we pause the entire VM, which is not really the expected behavior IMO. I am even wondering if the use case of pausing/resuming a VM is valid from an OCI command perspective. From this observation, I am wondering if it's worth disabling keepalive in Yamux; instead, we should work on making sure that a docker pause translates into the container process being paused rather than the whole VM.

Now, I also understand that we might run into some issues if, in the context of VM templating, we let the kernel boot and the agent start before we pause the VM. And for this reason, we might want to disable keepalive.

Now for the debug part. The environment to reproduce is pretty simple: we run a simple container, something like:

docker run -it busybox
#

and then from another terminal you start a loop of exec:

while :; do docker exec mycontainer ls /etc/resolv.conf 2> /dev/null; done

so you get regular output:

/etc/resolv.conf
/etc/resolv.conf
...
/etc/resolv.conf

until it hangs.

Looking at the whole chain, what happens is the shim (after it got all the output from the agent and transmitted all its input to the agent) calls into WaitProcess(), and I can see this call going through the proxy but it never makes it to the agent, and in particular to the gRPC server. I have heavily instrumented the code and checked what was happening, and I came to the conclusion that Yamux was not reading anything. I dumped a bunch of logs from the Yamux code, and I confirm that when this issue happens, Yamux is still blocked reading here: https://github.com/hashicorp/yamux/blob/master/session.go#L453 while it should be reading something.
Also, when this happens, a very interesting thing to notice is that we can unblock the hanging shim by simply entering one character from the container shell. The behavior is very weird but basically it seems that we have the WaitProcess() message buffered in between and never pushed through the VM, which makes Yamux not able to read the expected amount of bytes. But when something is triggering more input through the socket/serial_port, it magically unblocks everything.

This is where I am at, and I don't know how to continue on this. Do you have any idea if this behavior could be caused by the VM port being "buggy"?
HELP NEEDED :)
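
A self-contained illustration (generic Go, not kata or yamux code) of the blocking behaviour described above: a reader parked in io.ReadFull on a 12-byte yamux-style header, the way the yamux receive loop is, only returns once every header byte has arrived, so a message stuck in a buffer on the far side of the serial port leaves the reader blocked until some other traffic pushes the remaining bytes through.

package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	// An in-memory pipe stands in for the serial transport between proxy and agent.
	sender, receiver := net.Pipe()

	done := make(chan struct{})
	go func() {
		defer close(done)
		hdr := make([]byte, 12) // a yamux frame header is 12 bytes
		start := time.Now()
		// Like the yamux receive loop, ReadFull does not return until the
		// whole header has been delivered.
		_, err := io.ReadFull(receiver, hdr)
		fmt.Printf("header read completed after %v (err=%v)\n", time.Since(start), err)
	}()

	// Deliver only part of the header: the reader stays blocked.
	sender.Write(make([]byte, 6))
	time.Sleep(500 * time.Millisecond)

	// Any further traffic (typing a character in the console plays this role
	// in the real setup) delivers the rest and unblocks the reader.
	sender.Write(make([]byte, 6))
	<-done
}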

@grahamwhaley
Contributor

Nice work so far @devimc @sboeuf btw. A couple of thoughts then:

  • You talk about pausing the container, but with your busybox/ls resolv example, are we pausing? Maybe I'm just missing something there in the container state machine in my head?
  • That hang up being released by a single character smells very much like an I/O race with somebody (at or near the yamux end) missing a delivery, and blocking. Hmmm
  • If we think yamux may be part of the issue, there are (many!) other multiplexers out there (like muxado), and even a wrapper for a number of them: https://github.com/libp2p/go-stream-muxer OK, most of the time debugging the problem you have in hand is the best way forwards - but if we get stuck and need to compare/contrast, maybe they might be useful.

I'm going to go stare at yamux etc. just in case something pops out at me.

@grahamwhaley
Contributor

@sboeuf I tried to reproduce your busybox/resolv test, but it's still going at 4000+ execs for me. Was there anything 'special' one has to do, or patches one should (or should not) have in place?
I was also going to mention that I think our vendored yamux is out of date, and there might be some relevant fixes in the upstream:

4c2fe0d (origin/b-consul-3040) Dont output keepalive error when the session is closed
f21aae5 Make sure to drain the timer channel on defer, and a clarifying comment
601ccd8 Make receive window update logic a bit cleaner
02d320c Uses timer pool in sendNoWait, like in waitForSendErr
cf433c5 window update unit test for partial read; benchmark large buffer
ca8dfd0 improve memory utilization in receive buffer, fix flow control
683f491 Fix race around read and write deadlines in Stream (#52)
40b86b2 Add public session CloseChan method (#44)

@sboeuf

sboeuf commented Jul 3, 2018

@grahamwhaley thanks for taking a look at this too.

  • To clarify, the issue solved by this PR was to be able to pause a container (pause the VM) without hitting issues with the Yamux keepalive (heartbeat) feature. Now, relying on this PR, we see some issues when running a simple exec test.
  • About the Yamux version, I also realized this last week, and I am testing with the latest Yamux version for both proxy and agent, but this didn't solve it...
  • Pretty weird that you haven't gotten the tests to fail yet...

@sboeuf sboeuf force-pushed the yamux/disableKeepAlive branch from 1b9136a to 33e3c6c on July 3, 2018 17:01
@sboeuf

sboeuf commented Jul 4, 2018

@grahamwhaley more debug info on this.
I have tried on my baremetal machine to compare with your setup. It is harder to reproduce (way harder), but at some point I hit the issue. Also, it's been easier to reproduce by forcing virtio-serial-pci to disable-modern=true, making use of virtio 0.9 instead of 1.0.

We need to instrument the client side of yamux (proxy) and the server side of yamux (agent) in order to understand why the command from the shim is not transmitted/received. I said I had checked that the proxy receives the request from the shim, but I haven't checked that the yamux client actually sends it to the agent... This is the next thing to try.

@devimc
Author

devimc commented Jul 4, 2018

After reading golang/go#24942, I installed golang 1.11 (beta) to check if this issue would go away; unfortunately it didn't, the issue still occurs.

@sboeuf

sboeuf commented Jul 13, 2018

@devimc @grahamwhaley
I have spent more time on this issue, and I think the data is simply not making it through the emulated serial port, meaning the agent never gets to read what's sent. Based on my previous explanation about how triggering the communication by entering a random key in the container console was actually unblocking the situation, I have found an interesting issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1435432 leading to a cloned one: https://bugzilla.redhat.com/show_bug.cgi?id=1438539 basically saying that glib2 might have an issue where its main loop is not properly woken up.
@devimc could you try on a Fedora with at least glib2-2.52.2-2.fc26 and let me know if you can reproduce the issue?

@sboeuf

sboeuf commented Jul 18, 2018

FYI, I have added Depends-on: github.com/kata-containers/proxy#91 to the commit, and the CI passes properly!
@bergwolf @grahamwhaley @devimc wdyt ?

@stefanha

@grahamwhaley I haven't been able to reproduce the hang with qemu-2.11.2-1.fc28. My Kata includes the keep-alive patch.

I did hit another problem though: sometime after 10,000 execs Go gets a pthread_create() failure, Go backtraces are printed, and I guess the agent terminates because the QEMU process goes away and docker realizes the container is gone.

@grahamwhaley
Contributor

Thanks @stefanha - I could not reproduce the hang on bare metal, but it happens much more quickly (for me, anywhere between 80 and 1100 iterations) inside a VM - if that helps...
For the 10k exec death - ouch! - and thanks for reporting. @GabyCT - you have an exec soak test, I believe - does it go up to 10k? Could you give that a try to see if we can reproduce?

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@grahamwhaley one question: do I need to apply this PR and then run the test, or just run the test we currently have?

@sboeuf

sboeuf commented Jul 19, 2018

@GabyCT I think you don't need to apply this PR since the issue mentioned by @stefanha is/seems unrelated.

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@sboeuf thanks for the clarification :)

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@grahamwhaley, I ran the stress test where we perform several execs in a running container; so far there have been more than 15k execs and the VM has not died. Here is the configuration that I am using:

kata-runtime  : 1.1.0
commit   : e0010619fcebbad28be29ade111257d477ecbad3
OCI specs: 1.0.1
kata-agent version 1.1.0-17b44df35a174c09966cf9d2ccc8472a597feda1
kata-proxy version 1.1.0-7e2a93d1931d814debfcff2846b1514af1b9601d
kata-shim version 1.1.0-6ddca0334d0e8dee6e1e00f0c120fde4b8c1c176

@grahamwhaley
Contributor

I concur the patch feels unrelated to the death - but, given you've run it without the patch and it didn't die, I think we should try a couple of things (probably in this order):

  • run it again with the patch, just in case...
  • run it longer - leave it overnight and see if it's still going in the morning :-)
  • if all that fails, somebody else can try to reproduce (I can do that), and we should probably gather exact details from @stefanha - which distro, was it nested/VM, what kata components (a kata-env would be good). We should probably then do that on a new Issue to separate the >10k exec issue from this yamux one.

@bergwolf
Member

@sboeuf

Particularly with the metrics CI, we're seeing some docker rm hangs because of this

Could you please give some pointers to these issues? I get that it is broken by design to have keepalive when pausing the VM. But what errors did we get about docker rm that have their roots in this?

Basically what you did in kata-containers/proxy/pull/91 is copy the yamux session.keepalive() implementation to kata-proxy -- the same effect as enabling keepalive on the kata-proxy side. I do not understand why it would solve the whole issue. If we do need that, we can just enable keepalive on the kata-proxy side, can't we?

@sboeuf

sboeuf commented Jul 23, 2018

@grahamwhaley do you have any pointers about the root cause of the docker rm hangs for @bergwolf?

@bergwolf, about the proxy patch I have introduced: I agree it is almost the same as keepalive. The main difference is that I don't check for errors. This simply maintains regular communication every second, but will never error out. Does that make sense?
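
For illustration, a rough sketch of the behaviour @sboeuf describes above, assuming the hashicorp/yamux Session.Ping() API (this is not the actual kata-containers/proxy#91 code): ping the session on a fixed interval to keep traffic flowing, but never treat a failed ping as fatal.

package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// heartbeat pings the session at a fixed interval until done is closed.
// Errors are deliberately ignored: the goal is only to keep traffic moving
// on the channel, never to tear the session down on a missed ping.
func heartbeat(session *yamux.Session, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			_, _ = session.Ping() // ignore both the measured RTT and any error
		}
	}
}

func main() {
	// Emulate the two ends of the channel with an in-memory pipe.
	serverConn, clientConn := net.Pipe()

	// The server end answers pings as part of normal yamux processing.
	if _, err := yamux.Server(serverConn, yamux.DefaultConfig()); err != nil {
		log.Fatal(err)
	}

	client, err := yamux.Client(clientConn, yamux.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go heartbeat(client, time.Second, done)
	time.Sleep(3 * time.Second)
	close(done)
}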

@grahamwhaley
Contributor

Sure. Hi @bergwolf. The more complete thread is over on kata-containers/runtime#406 (comment).

I believe the basic issue is that if you issue something such as:

$ docker rm -f $(docker ps -qa)

and have a reasonable number of containers running (say >20 - I have seen this happen with 20, and see it often with 110 containers...), then afaict Docker will issue all of those rms in parallel. This takes time for kata, and eventually it seems we run into a yamux timeout, which then kills that container. That then leaves us in an untidy state and requires manual cleanup or a restart of Docker to get us back to a working system :-(

My gut feeling is that on a parallel rm our runtimes may race somewhat on the lock files down in /var/.../sb. @sboeuf did a bunch of rework on that a few months back, and it is better than it used to be - but I still suspect we have some race/locks, maybe. Unproven though, and possibly unavoidable.

@GabyCT
Contributor

GabyCT commented Jul 24, 2018

@grahamwhaley, regarding

run it longer - leave it overnight and see if it still going in the morning :-)

I ran the stress test (where we perform several execs) for 20 hours and it is still working; so far we have more than 150k execs done successfully.

@grahamwhaley
Contributor

@GabyCT - great, thanks for the confirmation.
@stefanha - looks like we will need a few more details about the system and versions etc. if we are to re-create that >10k exec failure.

@stefanha

stefanha commented Jul 24, 2018 via email

@sboeuf

sboeuf commented Jul 24, 2018

@bergwolf have you seen my answer?

about the proxy patch I have introduced: I agree it is almost the same as keepalive(). The main difference is that I don't check for errors. This simply maintains regular communication every second, but will never error out. Does that make sense?

@sboeuf

sboeuf commented Jul 25, 2018

@bergwolf ping ;)

@bergwolf
Member

LGTM. Sorry for the delay. I guess we'll have to carry this workaround (and kata-containers/proxy#91) since we cannot wait forever to find out the root cause...

@sboeuf

sboeuf commented Jul 27, 2018

kata-containers/proxy#91 has been merged. We can go ahead and merge this one.

@sboeuf sboeuf merged commit c6ac153 into kata-containers:master Jul 27, 2018
GabyCT added a commit to GabyCT/tests-1 that referenced this pull request Aug 22, 2018
Now that this issue has been solved (kata-containers/agent#263), it is necessary
to remove the workaround of sleeping while shutting down containers.

Fixes kata-containers#635

Signed-off-by: Gabriela Cervantes <[email protected]>
GabyCT added a commit to GabyCT/tests-1 that referenced this pull request Aug 23, 2018
Now that this issue has been solved (kata-containers/agent#263), it is necessary
to remove the workaround of sleeping.

Fixes kata-containers#644

Signed-off-by: Gabriela Cervantes <[email protected]>

Successfully merging this pull request may close these issues.

yamux: keepalive failed: i/o deadline reached (kata-containers/proxy#70)
yamux: keepalive failed: i/o deadline reached (#231)
8 participants