Fix docker-healthcheck to work around Docker bug. #6448
Hi @tsuna. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @justinsb |
/ok-to-test |
e2e test due to what appears to be an infrastructure issue — this hit another one of my PRs as well:
|
/retest |
/lgtm
Thanks for the LGTM! |
It’s not aws but the test infra that is having some issues lately. Just a matter of time till the system is restored. |
Can we /retest plz |
/retest |
JFYI: I just rebased the change on top of the latest master. |
Hi @chrisz100, could you please re-instate the LGTM so this can be merged? I didn't change anything in the code other than rebasing. Thanks in advance. :) |
Looks good! /lgtm |
This is a workaround to better detect moby/moby#38642 when Docker starts up and remains stuck. In this case `docker ps` will return nothing and exit 0, but no container can actually start. A better (but more expensive and more intrusive) test would be to `docker run --rm` some cheap test container to confirm we can actually start a container.
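To make the trade-off above concrete, here is a rough sketch of the two kinds of check. It is only an illustration and not the script changed in this PR: the `DOCKER` and `HEALTHCHECK_IMAGE` variables, the `busybox` image, and the function names are all assumptions introduced for the example.

```shell
# Hypothetical sketch, not the healthcheck script from this PR.
# DOCKER and HEALTHCHECK_IMAGE are assumed names used for illustration only.
DOCKER="${DOCKER:-docker}"
HEALTHCHECK_IMAGE="${HEALTHCHECK_IMAGE:-busybox}"

# Cheap check: only proves that the daemon answers API calls.
# A stuck daemon can still pass this (empty output, exit 0).
docker_responds() {
  "$DOCKER" ps >/dev/null 2>&1
}

# Expensive, intrusive check: proves a container can actually start.
docker_can_run() {
  "$DOCKER" run --rm "$HEALTHCHECK_IMAGE" true >/dev/null 2>&1
}

# Healthy only if both checks pass; the cheap one runs first.
docker_healthy() {
  docker_responds && docker_can_run
}
```

A caller could run `docker_healthy || echo "docker looks unhealthy"`. Note that `docker run` pulls the image when it is not present locally, which is part of why this variant is more expensive.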
Rebased... |
@chrisz100 could you reinstate the LGTM please? 😄 |
🙏 |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chrisz100, tsuna The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@tsuna how did you go about updating the healthcheck to this new version? we are experiencing this bug 4/5 times a day in several of our clusters and it's currently causing outages, so would love to get this deployed! |
You need to build and deploy and use (you need to have Bazel installed prior to running this)
go get -d -u k8s.io/kops
cd $GOPATH/src/k8s.io/kops
export S3_BUCKET_NAME=<a public S3 bucket you created>
export KOPS_BASE_URL=https://${S3_BUCKET_NAME}.s3.amazonaws.com/kops/1.12.0-alpha.1/
make kops kops-install dev-upload UPLOAD_DEST=s3://${S3_BUCKET_NAME}
and then use
Do you believe you're running into moby/moby#38642 as well? Or is it a different underlying Docker bug? |
@tsuna thanks - I'll give it a go. I'm not 100% sure what our exact issue is, but it does sound similar to moby/moby#38642. We're finding ourselves having to restart Docker (18.06.1) on multiple machines daily, due to RPC/sandbox/network errors. |
In the case of the bug I was running into you could tell for sure by looking at the output of But whatever bug you're running into, is it also the case that |
After further inspection, I'm not 100% certain if we are experiencing the same problem. It's definitely network related, but I have a feeling it's a bug in the CNI amongst other things. |
Ok, I take the above comment back. I think we are experiencing the same issue. We see Docker just stop on a regular basis. After SSHing onto a node I see no networks:
|
Yeah that's a hallmark of the Docker bug I filed. Do you happen to know if the Docker daemon restarted shortly before the problem started? Maybe due to the healthcheck failing, due to the host being overloaded? Or due to a kernel OOM? If you can add any supporting evidence to moby/moby#38642 to help move the bug forward, that would be nice, thanks! |
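For anyone wanting to check nodes for the "no networks" symptom described above, a small sketch follows. It is an assumption-laden illustration: the exact commands used in the comments above are not shown in this thread, and `DOCKER` and `docker_has_networks` are hypothetical names. It relies on `docker network ls -q`, which prints only network IDs; a normally working daemon lists its default networks (`bridge`, `host`, `none`).

```shell
# Hypothetical sketch; DOCKER is an assumed override hook for illustration.
DOCKER="${DOCKER:-docker}"

# Exit 0 when the daemon reports at least one network, non-zero otherwise.
docker_has_networks() {
  # -q prints only network IDs, one per line.
  network_ids="$("$DOCKER" network ls -q 2>/dev/null)" || return 1
  [ -n "$network_ids" ]
}
```

Running `docker_has_networks || echo "no docker networks found"` on a node would flag the state shown above without starting any container.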