Skip to content
This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Change dns healthcheck to look at external domain #3282

Merged
merged 2 commits into from
Jul 25, 2018

Conversation

carlpett
Copy link
Contributor

What this PR does / why we need it: The healthcheck for kube-dns should ensure it is possible with recursive queries to detect network issues outside the cluster. Related to #2971.

Special notes for your reviewer: Implements the suggested solution from this comment

@acs-bot
Copy link

acs-bot commented Jun 15, 2018

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: carlpett
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: cecilerobertmichon

Assign the PR to them by writing /assign @cecilerobertmichon in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@carlpett
Copy link
Contributor Author

/assign @CecileRobertMichon

@jackfrancis
Copy link
Member

Thanks @carlpett We should probably make this an additive change, instead of replacing the previous checks. @khenidak thoughts?

@khenidak
Copy link
Contributor

I think this is based on the temp workaround for intermitted DNS issues we had. I believe, We should do both. Internal and external lookups.

@jackfrancis
Copy link
Member

@carlpett could you kindly update the manifest accordingly? Thanks so much!

@khenidak
Copy link
Contributor

@carlpett and thanks a lot for this !

@carlpett
Copy link
Contributor Author

Yes, I thought a while about that too, but I don't think there's an (obvious) way to do both? Each command/domain lookup is tied to one url, which is then used as a livenessProbe. And we can't have multiple probes per container.
I suppose we could do nslookup <internal> && nslookup <external>, or wrap in in sh -c '...' if it isn't executed in a shell already, what do you think?

@khenidak
Copy link
Contributor

Yes - Exactly. The health prop is on the command exit code, it does not matter how many we execute.

@codecov
Copy link

codecov bot commented Jun 17, 2018

Codecov Report

Merging #3282 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3282   +/-   ##
=======================================
  Coverage   53.31%   53.31%           
=======================================
  Files         104      104           
  Lines       15574    15574           
=======================================
  Hits         8304     8304           
  Misses       6537     6537           
  Partials      733      733

@carlpett
Copy link
Contributor Author

I found the readability a bit poor when chaining it with &&, so I made it loop over a list of domains instead. Do you find this more or less readable?

@carlpett
Copy link
Contributor Author

@jackfrancis Anything else needed here?

@jackfrancis
Copy link
Member

@carlpett I'm kicking E2E to see how it likes these changes. Thanks for you patience—

@carlpett
Copy link
Contributor Author

carlpett commented Jul 2, 2018

@jackfrancis If I read the test output correctly, this is the cause:

2018/07/02 18:21:47 Error while connecting to Windows dashboard:exit status 7
2018/07/02 18:21:47 Warning: Permanently added 'kubernetes-southeastasia-63593.southeastasia.cloudapp.azure.com,13.67.76.16' (ECDSA) to the list of known hosts.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 10.240.0.34 port 31687: Connection refused

I'm not sure how I could have caused connection refused on Windows nodes... Is it flaky?

Copy link
Member

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jackfrancis jackfrancis merged commit 513ed45 into Azure:master Jul 25, 2018
PaulCharlton added a commit to ElementAnalytics/acs-engine that referenced this pull request Jul 26, 2018
* 'master' of https://github.com/Azure/acs-engine: (59 commits)
  Docs: Update user guide list to include Windows, update description of clusters (Azure#3473)
  update to Azure CNI v1.0.10 (Azure#3551)
  Adding 'make dev' equivalent for Windows (Azure#3471)
  print out ubuntu ver in e2e (Azure#3555)
  fix an issue where networkPlugin was not defined correctly when using calico or cilium (Azure#3271)
  Bump ginkgo to a tagged release (Azure#3554)
  Reenable AzureFile tests for Windows on K8s 1.11.1, resolves Azure#3439 (Azure#3496)
  removing rbac error checking from merge fn (Azure#3530)
  Change dns healthcheck to look at external domain (Azure#3282)
  DOCUMENTATION: Fix Documented Default Value for clusterSubnet (Azure#3474)
  Document required manual calico 2.6.3 -> calico 3.1.1 upgrade when upgrading from < 0.17.0-provisioned clusters (Azure#3208)
  revert --image-pull-policy=IfNotPresent for win (Azure#3553)
  --image-pull-policy=IfNotPresent for kubectl run commands (Azure#3552)
  Kubernetes: --max-pods=30 should be Azure CNI-only (Azure#3543)
  disable Azure CNI network monitor addon default (Azure#3550)
  only do az vm list for k8s (Azure#3540)
  Retire Swarm E2E for PR test coverage (Azure#3539)
  retire Azure CDN for container image repository proxying (Azure#3535)
  removed datadisk to allow scale after upgrade (Azure#3482)
  Pump k8s-azure-kms version (Azure#3531)
  ...
PaulCharlton added a commit to ElementAnalytics/acs-engine that referenced this pull request Jul 26, 2018
* master: (59 commits)
  Docs: Update user guide list to include Windows, update description of clusters (Azure#3473)
  update to Azure CNI v1.0.10 (Azure#3551)
  Adding 'make dev' equivalent for Windows (Azure#3471)
  print out ubuntu ver in e2e (Azure#3555)
  fix an issue where networkPlugin was not defined correctly when using calico or cilium (Azure#3271)
  Bump ginkgo to a tagged release (Azure#3554)
  Reenable AzureFile tests for Windows on K8s 1.11.1, resolves Azure#3439 (Azure#3496)
  removing rbac error checking from merge fn (Azure#3530)
  Change dns healthcheck to look at external domain (Azure#3282)
  DOCUMENTATION: Fix Documented Default Value for clusterSubnet (Azure#3474)
  Document required manual calico 2.6.3 -> calico 3.1.1 upgrade when upgrading from < 0.17.0-provisioned clusters (Azure#3208)
  revert --image-pull-policy=IfNotPresent for win (Azure#3553)
  --image-pull-policy=IfNotPresent for kubectl run commands (Azure#3552)
  Kubernetes: --max-pods=30 should be Azure CNI-only (Azure#3543)
  disable Azure CNI network monitor addon default (Azure#3550)
  only do az vm list for k8s (Azure#3540)
  Retire Swarm E2E for PR test coverage (Azure#3539)
  retire Azure CDN for container image repository proxying (Azure#3535)
  removed datadisk to allow scale after upgrade (Azure#3482)
  Pump k8s-azure-kms version (Azure#3531)
  ...
jackfrancis pushed a commit to jackfrancis/acs-engine that referenced this pull request Aug 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants