
Intermittent failure of DNS resolution when running garden against domain-based k8s #2255

Closed
Chipcius opened this issue Feb 18, 2021 · 12 comments · Fixed by #2386

@Chipcius commented Feb 18, 2021

Bug

Garden intermittently fails to resolve the DNS name of my K8s cluster (the API server hostname configured in my kubernetes provider).

Current Behavior

This happens during CI/CD in Drone, using the Docker runner:

kind: pipeline
type: docker
name: garden

steps:
  - name: build
    image: gardendev/garden:0.12.16
    pull: if-not-exists
    commands:
      - ./ci/kubectl-init-1.sh
      - garden -e staging build
    environment:
      GARDEN_LOG_LEVEL: silly # set the log level to your preference here
      GARDEN_LOGGER_TYPE: basic # this is important, since the default logger doesn't play nice with CI :)
      GARDEN_DEV_KUBECTL:
        from_secret: GARDEN_DEV_KUBECTL

I intermittently get the following output. At this point it has already built 4+ other services in the k8s cluster, using kaniko on the cluster at rancher.<redacted>.

Example 1:

...
✖ web                       → Building image web:v-1b8a684d8a...

Failed building web. Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
getaddrinfo ENOTFOUND rancher.<redacted>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
getaddrinfo ENOTFOUND rancher.<redacted>

Error Details:

errno: ENOTFOUND
code: ENOTFOUND
syscall: getaddrinfo
hostname: rancher.<redacted>
...

Example 2:

...
Failed getting status for service 'mysql-dotnet' (from module 'mysql-dotnet'). Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: getaddrinfo ENOTFOUND rancher.<redacted>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: getaddrinfo ENOTFOUND rancher.<redacted>

Error Details:

name: RequestError
options:
  url: >-
    https://rancher.<redacted>/k8s/clusters/c-snrh4/api/v1/namespaces/<redacted>/services/mysql-dotnet
  method: GET
  json: true
  resolveWithFullResponse: true
  headers:
    Authorization: >-
      Bearer
      kubeconfig-u-udzztfyxtx:9h<redacted>
  simple: true
  transform2xxOnly: false
...

Expected behavior

Since Garden has already resolved this DNS name correctly earlier in the same run, it should not fail to resolve it after the first time.

Reproducible example

Since the failure is intermittent, it is not easily reproducible.
It only happens on our CI/CD system and not on our workstations, so resource limitations might be at play.

Workaround

Things started working more consistently after we disabled all but one kubernetes provider in our project.garden.yaml file; we had five active providers with different configs before.

(Updated)
Adding the host record to /etc/hosts works as a hotfix:

echo "123.123.123.123 myhost.mydomain.com" |sudo tee -a /etc/hosts

Suggested solution(s)

Look into reusing DNS resolution results better in the codebase.

Additional context

  • Full CI/CD logs at the silly log level are available on personal request. I'm not going to put them in the public domain.

Your environment

  • OS: Linux, Drone Docker runner using the official image from Garden.io (see Drone config above)
  • How I'm running Kubernetes: EKS (we use others as well, but EKS is the cluster our CI/CD is deploying to and failing against)
  • Garden version: 0.12.16
@edvald (Collaborator) commented Feb 18, 2021

Hmm, that's interesting. We don't do anything particular in terms of DNS resolution; we just let the underlying runtime take care of that (which just uses the standard getaddrinfo call). I can take a look to see if this is something Node.js allows us to work with in some reasonable manner. I guess this doesn't often come up because kubeconfigs frequently use direct IPs instead of hostnames, but I can see how these queries would start to get throttled, because interacting with Kubernetes necessarily involves a lot of individual REST queries.

An alternative might be to have a DNS cache outside the process (dnsmasq or the like) but obviously that's punting the problem back your way :) I'm not quite familiar enough with Drone to know what options are on the table at the CI runner level.

In any case, I'll take a quick look, see if there's a reasonable way to do this at our code level.

@Chipcius (Author)

Yeah, strange... the host should also have a cached copy of the DNS responses in most cases, so this is very puzzling to me. Might be some race condition? 🤔
We looked exhaustively to make sure this was not a problem with our build machines before reporting the issue; they had no problems doing the DNS lookup.

@edvald (Collaborator) commented Feb 18, 2021

Yeah, it must be something triggered by the sheer number of concurrent lookups. That would be my guess at least, based on your report. I'll do a bit of digging on my side later today, or tomorrow at the latest, and see what we can come up with. I guess a crude workaround would be to just hardcode the IP in the kubeconfig while you kick this around; not sure how feasible that is in your setup?

@Chipcius (Author)

@edvald That will not work without some major changes in how we host our Rancher setup:

✖ providers                 → Error

Failed resolving provider kubernetes. Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Unable to connect to Kubernetes cluster. Got error:

Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: IP: 34.<redacted> is not in the cert's list:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Failed resolving one or more providers:
- kubernetes

@edvald (Collaborator) commented Feb 18, 2021

Ah yes, let's not go down that road then :P I'll poke around on our side and try to come up with something sensible. The only remaining thing to try outside of Garden would then be some DNS caching on the build nodes; not sure if that's tractable either, depending on how it's all set up.

@Chipcius (Author)

Adding the host record to /etc/hosts seems to work as a hotfix for this issue.

@edvald (Collaborator) commented Feb 22, 2021

Ok cool. I've been exploring code-level solutions for this, and there seems to be a general consensus in the Node community that this is best solved at the OS level with something like dnsmasq. All the application-level options would involve some level of monkey-patching which feels risky to me, and may not actually catch all instances of DNS lookups. Is a DNS cache on your build nodes something relatively easy to rig up on your side?

@Chipcius (Author)

That would mean custom Docker images for the Garden build setup. I would rather use upstream images without changes. This is likely to be an issue for other users as well in the future.

Can't we just cache the DNS lookups in the app? It is usually not anywhere near the first lookup that fails during the execution; the first few succeed without fail 99% of the time. Usually it is midway through building when the DNS lookup fails.
It seems strange to resolve the same DNS name so often during a single build.

@edvald (Collaborator) commented Apr 6, 2021

Is this resolved for you now @Chipcius?

@Chipcius (Author)

The issue is still present in the latest version.

@Chipcius changed the title from "Intermittent failure of DNS resolution in CI system (Drone.io)" to "Intermittent failure of DNS resolution when running garden against domain-based k8s" on Apr 28, 2021
@Chipcius (Author)

This happens with the garden command in general, and is happening quite often these days.

I just confirmed that the problem persists on Windows and Ubuntu 20.04 using the latest garden version 0.12.21.
[Screenshot attached: Selection_441]

@eysi09 (Collaborator) commented May 4, 2021

Hi @Chipcius

We're working on improving how we interact with the K8s API, which should fix these types of issues. Essentially, the plan is to catch common errors like these and retry the request (with exponential backoff) until it hopefully succeeds.

This is high priority for us and we're hoping to ship it with our next release.
