
yamux: disable yamux keep alive in server channel #263

Merged: 1 commit merged into kata-containers:master from the yamux/disableKeepAlive branch on Jul 27, 2018

Conversation


@devimc devimc commented Jun 9, 2018

The yamux client runs on the proxy side; sometimes the client is busy handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keep alive should be disabled.

fixes kata-containers/proxy#70
fixes #231

Signed-off-by: Julio Montes [email protected]
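
For illustration only (not the actual channel.go diff), a minimal sketch of the idea above, assuming the standard hashicorp/yamux API: the agent-side (server) session is created with EnableKeepAlive turned off so a missed ping from the busy proxy-side client can never tear the channel down.

package main

import (
	"log"
	"net"

	"github.com/hashicorp/yamux"
)

// newServerSession wraps conn in a yamux server session with keepalive
// disabled: the proxy-side client may be too busy to answer pings, so the
// agent should not enforce them.
func newServerSession(conn net.Conn) (*yamux.Session, error) {
	config := yamux.DefaultConfig()
	config.EnableKeepAlive = false
	return yamux.Server(conn, config)
}

func main() {
	// net.Pipe gives an in-memory connection pair so the sketch is
	// self-contained; the real agent channel sits on a virtio-serial port.
	serverConn, clientConn := net.Pipe()
	defer clientConn.Close()

	session, err := newServerSession(serverConn)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	log.Println("yamux server session created with keepalive disabled")
}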

@jodh-intel
Contributor

Hi @devimc - could you add a little more detail to the commit? For example, what is the implication of the proxy/agent connection being severed (is it that it can never be re-opened and the system will then hang?) This behaviour seems counter-intuitive to a "keep alive" protocol so extra details would help.

In fact, I'd also add a brief comment in both files explaining why the keep alive option is being disabled as it would seem to be a useful feature to have enabled.

@grahamwhaley
Contributor

I'm thinking the CI failures are unrelated and probably already known about. For completeness, from the 17-10 logs (/cc @chavafg ):

Summarizing 3 Failures:

[Fail] Update CPU constraints Update CPU set [It] cpuset should be equal to 0,3 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:388

[Fail] Hot plug CPUs with a parent cgroup [It] should not fail with parent cgroup: abc/xyz/rgb 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:211

[Fail] Hot plug CPUs container with CPU constraint [It] should have 1 CPUs 
/home/jenkins/jenkins_slave/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:106

@devimc devimc changed the title from "yamux: disable keep alive" to "yamux: disable yamux keep alive in server channel" on Jun 11, 2018
@devimc devimc force-pushed the yamux/disableKeepAlive branch from 570e716 to b57f8db on June 11, 2018 19:23
@codecov

codecov bot commented Jun 11, 2018

Codecov Report

Merging #263 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   43.15%   43.18%   +0.02%     
==========================================
  Files          14       14              
  Lines        2266     2267       +1     
==========================================
+ Hits          978      979       +1     
  Misses       1157     1157              
  Partials      131      131
Impacted Files Coverage Δ
channel.go 47.36% <100%> (+0.46%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17b44df...5f89c07. Read the comment docs.

@bergwolf
Member

bergwolf commented Jun 12, 2018

lgtm!

Approved with PullApprove

@sboeuf

sboeuf commented Jun 12, 2018

@devimc looks good, and I can confirm I ran into those issues when running a Kata Containers container for too long. It would be great to have some proper testing for this, but I am not sure how to perform it.
Please address @jodh-intel's comment and fix the CI, and we can merge this!

@WeiZhang555
Member

WeiZhang555 commented Jun 13, 2018

LGTM 👍

Approved with PullApprove

@devimc
Author

devimc commented Jun 13, 2018

@sboeuf now I see a race condition: kata-shim never finishes. I'm debugging this issue.

@devimc devimc force-pushed the yamux/disableKeepAlive branch from b57f8db to 83d7906 on June 18, 2018 14:11
@sboeuf

sboeuf commented Jun 20, 2018

Any update on this @devimc ?

@devimc
Author

devimc commented Jun 22, 2018

The exec process never finishes and is killed (timeout). This issue doesn't happen if the exec process has a tty (docker exec -t ...).

------------------------------
docker exec check stdout forwarded using exec 
  should displayed it
  /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:61
Running command '/usr/bin/docker [docker run --runtime kata-runtime -td --name CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ busybox sh]'
[docker run --runtime kata-runtime -td --name CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ busybox sh]
Timeout: 60 seconds
Exit Code: 0
Stdout: ffb7746578159035b981f18e752c285a8f89ab3a160401cf17d88a0562ff80ea

Stderr: 
Running command '/usr/bin/docker [docker exec CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ sh -c ls /etc/resolv.conf 2>/dev/null]'
Killing process timeout reached '60' seconds
Running command '/usr/bin/docker [docker rm -f CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ]'
[docker rm -f CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ]
Timeout: 60 seconds
Exit Code: 0
Stdout: CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ

Stderr: 
Running command '/usr/bin/docker [docker ps -a -f name=CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ --format {{.Status}}]'
[docker ps -a -f name=CZ4k1mAabI2d5jMgpuMFzBnQpW87qJ --format {{.Status}}]
Timeout: 60 seconds
Exit Code: 0
Stdout: 
Stderr: 

• Failure [62.969 seconds]
docker exec
/tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:19
  check stdout forwarded using exec
  /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:60
    should displayed it [It]
    /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:61

    Expected
        <int>: -1
    to equal
        <int>: 0

    /tmp/jenkins/workspace/kata-containers-agent-fedora-27-PR/go/src/github.com/kata-containers/tests/integration/docker/exec_test.go:64

@sboeuf

sboeuf commented Jun 25, 2018

@devimc I am actively debugging this FYI, but I don't think this is related to TTY since I have been able to reproduce this same behavior with a command using TTY.

@devimc devimc force-pushed the yamux/disableKeepAlive branch from 83d7906 to 1b9136a on June 26, 2018 21:21
@sboeuf

sboeuf commented Jun 30, 2018

@devimc @jodh-intel @bergwolf @WeiZhang555
Ok, I have spent a good amount of time this week investigating this issue. But let's first take a step back, because the issue we're trying to solve here might actually be considered a non-issue. The problem is that when we pause the container, we pause the entire VM, which is not really the expected behavior IMO. I am even wondering if the use case of pausing/resuming a VM is valid from an OCI command perspective. From this observation, I am wondering if it's worth disabling keepalive in Yamux; instead, we should work on making sure that a docker pause translates into the container process being paused rather than the whole VM.

Now, I also understand that we might run into some issues if, in the context of VM templating, we let the kernel boot and the agent start before we pause the VM. And for this reason, we might want to disable keepalive.

Now for the debug part. The environment to reproduce is pretty simple: we run a simple container, something like:

docker run -it busybox
#

and then from another terminal you start a loop of exec:

while :; do docker exec mycontainer ls /etc/resolv.conf 2> /dev/null; done

so you get regular output:

/etc/resolv.conf
/etc/resolv.conf
...
/etc/resolv.conf

until it hangs.

Looking at the whole chain, what happens is the shim (after it got all the output from the agent and transmitted all its input to the agent) calls into WaitProcess(), and I can see this call going through the proxy but it never makes it to the agent, and in particular to the gRPC server. I have heavily instrumented the code and checked what was happening, and I came to the conclusion that Yamux was not reading anything. I dumped a bunch of logs from the Yamux code, and I confirm that when this issue happens, Yamux is still blocked reading here: https://github.com/hashicorp/yamux/blob/master/session.go#L453 while it should be reading something.
Also, when this happens, a very interesting thing to notice is that we can unblock the hanging shim by simply entering one character from the container shell. The behavior is very weird but basically it seems that we have the WaitProcess() message buffered in between and never pushed through the VM, which makes Yamux not able to read the expected amount of bytes. But when something is triggering more input through the socket/serial_port, it magically unblocks everything.

This is where I am at, and I don't know how to continue on this. Do you have any idea if this behavior could be caused by the VM port being "buggy"?
HELP NEEDED :)
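
A self-contained illustration (generic Go, not kata or yamux code) of the blocking behaviour described above: a reader parked in io.ReadFull on a 12-byte yamux-style header, the way the yamux receive loop is, only returns once every header byte has arrived, so a message stuck in a buffer on the far side of the serial port leaves the reader blocked until some other traffic pushes the remaining bytes through.

package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	// An in-memory pipe stands in for the serial transport between proxy and agent.
	sender, receiver := net.Pipe()

	done := make(chan struct{})
	go func() {
		defer close(done)
		hdr := make([]byte, 12) // a yamux frame header is 12 bytes
		start := time.Now()
		// Like the yamux receive loop, ReadFull does not return until the
		// whole header has been delivered.
		_, err := io.ReadFull(receiver, hdr)
		fmt.Printf("header read completed after %v (err=%v)\n", time.Since(start), err)
	}()

	// Deliver only part of the header: the reader stays blocked.
	sender.Write(make([]byte, 6))
	time.Sleep(500 * time.Millisecond)

	// Any further traffic (typing a character in the console plays this role
	// in the real setup) delivers the rest and unblocks the reader.
	sender.Write(make([]byte, 6))
	<-done
}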

@grahamwhaley
Contributor

Nice work so far @devimc @sboeuf btw. A couple of thoughts then:

  • You talk about pausing the container, but with your busybox/ls resolv example, are we pausing? Maybe I'm just missing something there in the container state machine in my head?
  • That hang up being released by a single character smells very much like an I/O race with somebody (at or near the yamux end) missing a delivery, and blocking. Hmmm
  • If we think yamux may be part of the issue, there are (many!) other multiplexers out there (like muxado), and even a wrapper for a number of them: https://github.com/libp2p/go-stream-muxer OK, most of the time debugging the problem you have in hand is the best way forwards - but if we get stuck and need to compare/contrast, maybe they might be useful.

I'm going to go stare at yamux etc. just in case something pops out at me.

@grahamwhaley
Contributor

@sboeuf I tried to reproduce your busybox/resolv test, but it's still going at 4000+ execs for me. Was there anything 'special' one has to do, or patches one should (or should not) have in place?
I was also going to mention that I think our vendored yamux is out of date, and there might be some relevant fixes in the upstream:

4c2fe0d (origin/b-consul-3040) Dont output keepalive error when the session is closed
f21aae5 Make sure to drain the timer channel on defer, and a clarifying comment
601ccd8 Make receive window update logic a bit cleaner
02d320c Uses timer pool in sendNoWait, like in waitForSendErr
cf433c5 window update unit test for partial read; benchmark large buffer
ca8dfd0 improve memory utilization in receive buffer, fix flow control
683f491 Fix race around read and write deadlines in Stream (#52)
40b86b2 Add public session CloseChan method (#44)

@sboeuf

sboeuf commented Jul 3, 2018

@grahamwhaley thanks for taking a look at this too.

  • To clarify, the issue solved by this PR was to be able to pause a container (pause the VM) without hitting issues with the Yamux keepalive (heartbeat) feature. Now, relying on this PR, we see some issues when running a simple exec test.
  • About the Yamux version, I also realized this last week, and I am testing with the latest Yamux version for both proxy and agent, but this didn't solve it...
  • Pretty weird that you haven't gotten the tests to fail yet...

@sboeuf sboeuf force-pushed the yamux/disableKeepAlive branch from 1b9136a to 33e3c6c on July 3, 2018 17:01
@sboeuf

sboeuf commented Jul 4, 2018

@grahamwhaley more debug info on this.
I have tried on my baremetal machine to compare with your setup. It is harder to reproduce (way harder), but at some point I hit the issue. Also, it's been easier to reproduce by forcing virtio-serial-pci to disable-modern=true, making use of virtio 0.9 instead of 1.0.

We need to instrument the client side of yamux (proxy) and the server side of yamux (agent) in order to understand why the command from the shim is not transmitted/received. I said I had checked that the proxy receives the request from the shim, but I haven't checked that the yamux client actually sends it to the agent... This is the next thing to try.

@devimc
Author

devimc commented Jul 4, 2018

After reading golang/go#24942, I installed golang 1.11 (beta) to check if this issue would go away; unfortunately it didn't, the issue still occurs.

@sboeuf

sboeuf commented Jul 13, 2018

@devimc @grahamwhaley
I have spent more time on this issue, and I think the data is simply not making it through the emulated serial port, meaning the agent never gets to read what's sent. Based on my previous explanation about how triggering the communication by entering a random key in the container console was actually unblocking the situation, I have found an interesting issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1435432 leading to a cloned one: https://bugzilla.redhat.com/show_bug.cgi?id=1438539 basically saying that glib2 might have an issue where its main loop is not properly woken up.
@devimc could you try on a Fedora with at least glib2-2.52.2-2.fc26 and let me know if you can reproduce the issue?

@sboeuf

sboeuf commented Jul 18, 2018

FYI, I have added Depends-on: github.com/kata-containers/proxy#91 to the commit, and the CI passes properly!
@bergwolf @grahamwhaley @devimc wdyt ?

@stefanha

@grahamwhaley I haven't been able to reproduce the hang with qemu-2.11.2-1.fc28. My Kata includes the keep-alive patch.

I did hit another problem though: sometime after 10,000 execs Go gets a pthread_create() failure, Go backtraces are printed, and I guess the agent terminates because the QEMU process goes away and docker realizes the container is gone.

@grahamwhaley
Contributor

Thanks @stefanha - I could not reproduce the hang on bare metal, but it happens much more quickly (for me, anywhere between 80 and 1100 iterations) inside a VM - if that helps...
For the 10k exec death - ouch! - and thanks for reporting. @GabyCT - you have an exec soak test, I believe - does it go up to 10k? Could you give that a try to see if we can reproduce?

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@grahamwhaley one question: do I need to apply this PR and then run the test, or just run the test we currently have?

@sboeuf

sboeuf commented Jul 19, 2018

@GabyCT I think you don't need to apply this PR since the issue mentioned by @stefanha is/seems unrelated.

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@sboeuf thanks for the clarification :)

@GabyCT
Contributor

GabyCT commented Jul 19, 2018

@grahamwhaley, I ran the stress test where we perform several execs in a running container; so far there have been more than 15k execs and the VM has not died. Here is the configuration that I am using:

kata-runtime  : 1.1.0
commit   : e0010619fcebbad28be29ade111257d477ecbad3
OCI specs: 1.0.1
kata-agent version 1.1.0-17b44df35a174c09966cf9d2ccc8472a597feda1
kata-proxy version 1.1.0-7e2a93d1931d814debfcff2846b1514af1b9601d
kata-shim version 1.1.0-6ddca0334d0e8dee6e1e00f0c120fde4b8c1c176

@grahamwhaley
Contributor

I concur the patch feels unrelated to the death - but, given you've run it without the patch and it didn't die, I think we should try a couple of things (probably in this order):

  • run it again with the patch, just in case...
  • run it longer - leave it overnight and see if it's still going in the morning :-)
  • if all that fails, somebody else can try to reproduce (I can do that), and we should probably gather exact details from @stefanha - which distro, was it nested/VM, what kata components (a kata-env would be good). We should probably then do that on a new Issue to separate the >10k exec issue from this yamux one.

@bergwolf
Member

@sboeuf

Particularly with the metrics CI, we're seeing some docker rm hangs because of this

Could you please give some pointers to these issues? I get that it is broken by design to have keepalive when pausing the VM. But what errors did we get about docker rm that have their roots in this?

Basically what you did in kata-containers/proxy/pull/91 is copy the yamux session.keepalive() implementation to kata-proxy -- the same effect as enabling keepalive on the kata-proxy side. I do not understand why it would solve the whole issue. If we do need that, we can just enable keepalive on the kata-proxy side, can't we?

@sboeuf

sboeuf commented Jul 23, 2018

@grahamwhaley do you have any pointers about the root cause of the docker rm hangs for @bergwolf?

@bergwolf, about the proxy patch I have introduced: I agree it is almost the same as keepalive. The main difference is that I don't check for errors. This simply maintains regular communication every second, but will never error out. Does that make sense?
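
For illustration, a rough sketch of the behaviour @sboeuf describes above, assuming the hashicorp/yamux Session.Ping() API (this is not the actual kata-containers/proxy#91 code): ping the session on a fixed interval to keep traffic flowing, but never treat a failed ping as fatal.

package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// heartbeat pings the session at a fixed interval until done is closed.
// Errors are deliberately ignored: the goal is only to keep traffic moving
// on the channel, never to tear the session down on a missed ping.
func heartbeat(session *yamux.Session, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			_, _ = session.Ping() // ignore both the measured RTT and any error
		}
	}
}

func main() {
	// Emulate the two ends of the channel with an in-memory pipe.
	serverConn, clientConn := net.Pipe()

	// The server end answers pings as part of normal yamux processing.
	if _, err := yamux.Server(serverConn, yamux.DefaultConfig()); err != nil {
		log.Fatal(err)
	}

	client, err := yamux.Client(clientConn, yamux.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	done := make(chan struct{})
	go heartbeat(client, time.Second, done)
	time.Sleep(3 * time.Second)
	close(done)
}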

@grahamwhaley
Contributor

Sure. Hi @bergwolf. The more complete thread is over on kata-containers/runtime#406 (comment).

I believe the basic issue is that if you issue something such as:

$ docker rm -f $(docker ps -qa)

and have a reasonable number of containers running (say >20 - I have seen this happen with 20, and see it often with 110 containers...), then afaict Docker will issue all of those rms in parallel. This takes time for kata, and eventually it seems we run into a yamux timeout, which then kills that container. That then leaves us in an untidy state and requires manual cleanup or a restart of Docker to get us back to a working system :-(

My gut feeling is that on a parallel rm our runtimes may race somewhat on the lock files down in /var/.../sb. @sboeuf did a bunch of rework on that a few months back, and it is better than it used to be - but I still suspect we have some race/locks, maybe. Unproven though, and possibly unavoidable.

@GabyCT
Contributor

GabyCT commented Jul 24, 2018

@grahamwhaley, regarding

run it longer - leave it overnight and see if it still going in the morning :-)

I ran the stress test (where we perform several execs) for 20 hours and it is still working; so far we have more than 150k execs done successfully.

@grahamwhaley
Contributor

@GabyCT - great, thanks for the confirmation.
@stefanha - looks like we will need a few more details about the system and versions etc. if we are to re-create that >10k exec failure.

@stefanha

stefanha commented Jul 24, 2018 via email

@sboeuf

sboeuf commented Jul 24, 2018

@bergwolf have you seen my answer?

about the proxy patch I have introduced: I agree it is almost the same as keepalive(). The main difference is that I don't check for errors. This simply maintains regular communication every second, but will never error out. Does that make sense?

@sboeuf

sboeuf commented Jul 25, 2018

@bergwolf ping ;)

@bergwolf
Member

LGTM. Sorry for the delay. I guess we'll have to carry this workaround (and kata-containers/proxy#91) since we cannot wait forever to find out the root cause...

@sboeuf

sboeuf commented Jul 27, 2018

kata-containers/proxy#91 has been merged. We can go ahead and merge this one.

@sboeuf sboeuf merged commit c6ac153 into kata-containers:master Jul 27, 2018
GabyCT added a commit to GabyCT/tests-1 that referenced this pull request Aug 22, 2018
Now that this issue has been solved (kata-containers/agent#263), it is necessary
to remove the workaround of sleeping while shutting down containers.

Fixes kata-containers#635

Signed-off-by: Gabriela Cervantes <[email protected]>
GabyCT added a commit to GabyCT/tests-1 that referenced this pull request Aug 23, 2018
Now that this issue has been solved (kata-containers/agent#263), it is necessary
to remove the workaround of sleeping.

Fixes kata-containers#644

Signed-off-by: Gabriela Cervantes <[email protected]>

Successfully merging this pull request may close these issues.

yamux: keepalive failed: i/o deadline reached (kata-containers/proxy#70)
yamux: keepalive failed: i/o deadline reached (#231)
8 participants