yamux: disable yamux keep alive in server channel #263
Conversation
Force-pushed from 61f7654 to 570e716
Hi @devimc - could you add a little more detail to the commit? For example, what is the implication of the proxy/agent connection being severed (is it that it can never be re-opened and the system will then hang?) This behaviour seems counter-intuitive to a "keep alive" protocol, so extra details would help. In fact, I'd also add a brief comment in both files explaining why the keep alive option is being disabled, as it would seem to be a useful feature to have enabled.
I'm thinking the CI failures are unrelated and probably already known about. For completeness, from the 17-10 logs (/cc @chavafg ):
Force-pushed from 570e716 to b57f8db
Codecov Report

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   43.15%   43.18%   +0.02%
==========================================
  Files          14       14
  Lines        2266     2267       +1
==========================================
+ Hits          978      979       +1
  Misses       1157     1157
  Partials      131      131

Continue to review full report at Codecov.
@devimc looks good, and I can confirm I ran into those issues when running a Kata Containers container for too long. It would be great to have some proper testing of this, but I am not sure how to perform it though.
@sboeuf now I see a race condition, kata-shim never finishes; I'm debugging this issue.
Force-pushed from b57f8db to 83d7906
Any update on this @devimc?
The exec process never finishes and it's killed (timeout). This issue doesn't happen if the exec process has a tty (
@devimc I am actively debugging this FYI, but I don't think this is related to TTY, since I have been able to reproduce this same behavior with a command using TTY.
Force-pushed from 83d7906 to 1b9136a
@devimc @jodh-intel @bergwolf @WeiZhang555 I also understand that we might run into some issues if, in the context of VM templating, we let the kernel boot and the agent start before we pause the VM. For this reason too, we might want to disable keepalive. Now for the debug part. The environment to reproduce is pretty simple: we need to run a simple container, something like:
and then, from another terminal, you start a loop of
so that you get regular output:
until it hangs. Looking at the whole chain, what happens is that the shim (after it got all the output from the agent and transmitted all its input to the agent) calls into
This is where I am at, and I don't know how to continue on this. Do you have any idea whether this behavior could have been caused by the VM port, which might be "buggy"?
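For anyone trying to reproduce this, a hypothetical harness along the lines of the loop described above could look like the following sketch. The container name and the exec'd command are placeholders (the exact commands were elided from the comment); the loop stops as soon as one exec hangs past a timeout, which is the symptom being debugged here.

```go
// Hypothetical reproduction harness: repeatedly exec a simple command in a
// long-running container until one invocation hangs.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	const container = "test" // placeholder: name of the already-running container

	for i := 1; ; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		// Placeholder command: a trivial read inside the container, e.g. a
		// busybox container's /etc/resolv.conf as mentioned later in the thread.
		out, err := exec.CommandContext(ctx, "docker", "exec", container,
			"cat", "/etc/resolv.conf").CombinedOutput()
		timedOut := ctx.Err() == context.DeadlineExceeded
		cancel()

		if timedOut {
			fmt.Printf("exec #%d hung (timeout)\n", i)
			return
		}
		if err != nil {
			fmt.Printf("exec #%d failed: %v\n%s", i, err, out)
			return
		}
		fmt.Printf("exec #%d ok\n", i)
	}
}
```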
Nice work so far @devimc @sboeuf btw. A couple of thoughts then:
I'm going to go stare at yamux etc. just in case something pops out at me.
@sboeuf I tried to reproduce your busybox/resolv test, but it's still going at 4000+ execs for me. Was there anything 'special' one has to do, or patches one should or should not have in place?
@grahamwhaley thanks for taking a look at this too.
Force-pushed from 1b9136a to 33e3c6c
@grahamwhaley more debug info on this. We need to instrument the client side of yamux (the proxy) and the server side of yamux (the agent) in order to understand why the command from the shim is not transmitted/received. I said I had checked that the proxy receives the request from the shim, but I haven't checked that the yamux client actually sends it to the agent... This is the next thing to try.
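As a rough illustration of that instrumentation idea, a sketch like the one below (assuming github.com/hashicorp/yamux; the wrapper type and names are illustrative, not the code actually used) would log the bytes crossing either endpoint, so a silent hang can be narrowed to one side of the connection.

```go
// Illustrative instrumentation of a yamux endpoint.
package instrument

import (
	"io"
	"log"
	"os"

	"github.com/hashicorp/yamux"
)

// loggedConn wraps the raw proxy<->agent connection and logs how many bytes
// move in each direction.
type loggedConn struct {
	io.ReadWriteCloser
	name string
}

func (c *loggedConn) Read(p []byte) (int, error) {
	n, err := c.ReadWriteCloser.Read(p)
	log.Printf("%s: read %d bytes, err=%v", c.name, n, err)
	return n, err
}

func (c *loggedConn) Write(p []byte) (int, error) {
	n, err := c.ReadWriteCloser.Write(p)
	log.Printf("%s: wrote %d bytes, err=%v", c.name, n, err)
	return n, err
}

// newInstrumentedClient shows the proxy (client) side; the agent side would
// call yamux.Server with the same kind of wrapper.
func newInstrumentedClient(conn io.ReadWriteCloser) (*yamux.Session, error) {
	cfg := yamux.DefaultConfig()
	cfg.LogOutput = os.Stderr // also surface yamux's own internal logging
	return yamux.Client(&loggedConn{ReadWriteCloser: conn, name: "proxy"}, cfg)
}
```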
After reading golang/go#24942, I installed golang 1.11 (beta) to check whether this issue was gone; unfortunately it wasn't, the issue still occurs.
@devimc @grahamwhaley
FYI, I have added |
@grahamwhaley I haven't been able to reproduce the hang with qemu-2.11.2-1.fc28. My Kata includes the keep-alive patch. I did hit another problem though: sometime after 10,000 execs Go gets a pthread_create() failure, Go backtraces are printed, and I guess the agent terminates, because the QEMU process goes away and docker realizes the container is gone.
Thanks @stefanha - I could not reproduce the hang on bare metal, but it happens much more quickly (for me anywhere between 80 and 1100 iterations) inside a VM - if that helps...
@grahamwhaley one question: do I need to apply this PR and then run the test, or just run the test with what we have currently?
@sboeuf thanks for the clarification :)
@grahamwhaley, I ran the
I concur the patch feels unrelated to the death - but, given you've run it without the patch and it didn't die, I think we should try a couple of things (probably in this order):
Could you please give some pointers to these issues? I get that it is broken by design to have keepalive when pausing the VM. But what errors did we get about
Basically, what you did in kata-containers/proxy/pull/91 is copying yamux
@grahamwhaley do you have any pointers about the root cause of the
@bergwolf about the proxy patch I have introduced, I agree it is almost the same as keepalive. The main difference is that I don't care/check for errors; this simply maintains regular communication every second, but will never error out. Does that make sense?
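A minimal sketch of that idea, assuming github.com/hashicorp/yamux (the names are illustrative, not the actual proxy patch): ping the peer every second purely to keep traffic flowing, and discard the result instead of tearing the session down.

```go
// Illustrative "keepalive without the error handling" loop.
package keepalive

import (
	"time"

	"github.com/hashicorp/yamux"
)

// keepTraffic pings the peer once per second. Unlike yamux's built-in
// keepalive, a failed ping is simply ignored rather than treated as a
// dead connection.
func keepTraffic(session *yamux.Session, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Result and error are deliberately discarded: this loop never
			// errors out, it only generates traffic.
			_, _ = session.Ping()
		}
	}
}
```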
Sure, hi @bergwolf. A more complete thread is over at:
Over on kata-containers/runtime#406 (comment), I believe the basic issue is that if you issue something such as:
and have a reasonable number of containers running (say >20 - I have seen this happen with 20, and I see it often with 110 containers...), then afaict Docker will issue all of those
My gut feeling is that on a parallel
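A hypothetical sketch of that kind of parallel teardown (the exact docker command was elided above, so this only illustrates the shape of the scenario): stop every running container concurrently, so many agents and proxies are shutting down their sessions at the same time.

```go
// Hypothetical reproduction of the "stop many containers at once" scenario.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

func main() {
	// List the IDs of all running containers.
	out, err := exec.Command("docker", "ps", "-q").Output()
	if err != nil {
		fmt.Println("docker ps failed:", err)
		return
	}
	ids := strings.Fields(string(out))

	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			// Each stop runs concurrently, mimicking the parallel stops
			// described in the comment above.
			if err := exec.Command("docker", "stop", id).Run(); err != nil {
				fmt.Printf("stop %s failed: %v\n", id, err)
			}
		}(id)
	}
	wg.Wait()
}
```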
@grahamwhaley, regarding
I ran the stress test (where we performed several execs) for 20 hours and it is still working; so far we have more than 150k execs done successfully.
On Tue, Jul 24, 2018 at 2:46 PM, Graham Whaley ***@***.***> wrote:
> @stefanha - looks like we will need a few more details about the system and versions etc. if we are to re-create that >10k exec failure.
It might be related to my Fedora initramfs. It's possible that the guest is running out of memory because something is writing to the initramfs (i.e. RAM), like systemd-journald. I'll investigate.
@bergwolf have you seen my answer?
@bergwolf ping ;)
LGTM. Sorry for the delay. I guess we'll have to carry this workaround (and kata-containers/proxy#91) since we cannot wait to find out the root cause forever...
kata-containers/proxy#91 has been merged. We can go ahead and merge this one.
Now that this issue has been solved kata-containers/agent#263, it is necessary to remove the workaround of sleeping while shutting down containers. Fixes kata-containers#635 Signed-off-by: Gabriela Cervantes <[email protected]>
Now that this issue has been solved kata-containers/agent#263, it is necessary to remove the workaround of sleeping. Fixes kata-containers#644 Signed-off-by: Gabriela Cervantes <[email protected]>
The yamux client runs on the proxy side; sometimes the client is busy
handling other requests and is not able to respond to the ping sent by
the server, so the communication is closed. To avoid I/O timeouts in
the communication between agent and proxy, keep alive should be
disabled.
fixes kata-containers/proxy#70
fixes #231
Signed-off-by: Julio Montes [email protected]
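For reference, a minimal sketch of what disabling keepalive on the server-side yamux session looks like, assuming github.com/hashicorp/yamux; the package and function names are illustrative, not the agent's actual code.

```go
// Illustrative sketch: open the server end of a yamux session with
// keepalive disabled.
package channel

import (
	"io"

	"github.com/hashicorp/yamux"
)

func newServerSession(conn io.ReadWriteCloser) (*yamux.Session, error) {
	cfg := yamux.DefaultConfig()
	// The yamux client runs in the proxy; if it is busy serving other
	// requests it can miss ping deadlines, and the resulting i/o timeout
	// would kill the agent<->proxy communication. Disable keepalive to
	// avoid that.
	cfg.EnableKeepAlive = false
	return yamux.Server(conn, cfg)
}
```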