
Intermittent failure of DNS resolution when running garden against domain-based k8s #2255

Closed
Chipcius opened this issue Feb 18, 2021 · 12 comments · Fixed by #2386

@Chipcius commented Feb 18, 2021

Bug

Garden intermittently fails to resolve the DNS name of my K8s cluster (the API server hostname configured in my kubernetes provider).

Current Behavior

This happens during CI/CD in Drone, using the Docker runner:

kind: pipeline
type: docker
name: garden

steps:
  - name: build
    image: gardendev/garden:0.12.16
    pull: if-not-exists
    commands:
      - ./ci/kubectl-init-1.sh
      - garden -e staging build
    environment:
      GARDEN_LOG_LEVEL: silly # set the log level to your preference here
      GARDEN_LOGGER_TYPE: basic # this is important, since the default logger doesn't play nice with CI :)
      GARDEN_DEV_KUBECTL:
        from_secret: GARDEN_DEV_KUBECTL

I intermittently get the following output. At this point it has already built 4+ other services in the k8s cluster, using kaniko on the cluster at rancher.<redacted>.

Example 1:

...
✖ web                       → Building image web:v-1b8a684d8a...

Failed building web. Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
getaddrinfo ENOTFOUND rancher.<redacted>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
getaddrinfo ENOTFOUND rancher.<redacted>

Error Details:

errno: ENOTFOUND
code: ENOTFOUND
syscall: getaddrinfo
hostname: rancher.<redacted>
...

Example 2:

...
Failed getting status for service 'mysql-dotnet' (from module 'mysql-dotnet'). Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: getaddrinfo ENOTFOUND rancher.<redacted>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: getaddrinfo ENOTFOUND rancher.<redacted>

Error Details:

name: RequestError
options:
  url: >-
    https://rancher.<redacted>/k8s/clusters/c-snrh4/api/v1/namespaces/<redacted>/services/mysql-dotnet
  method: GET
  json: true
  resolveWithFullResponse: true
  headers:
    Authorization: >-
      Bearer
      kubeconfig-u-udzztfyxtx:9h<redacted>
  simple: true
  transform2xxOnly: false
...

Expected behavior

Since Garden has already resolved this DNS name correctly earlier in the same run, it should not fail to resolve it after the first time.

Reproducible example

Since the failure is intermittent, it is not easily reproducible.
It only happens on our CI/CD system and not on our workstations, so resource limitations might be at play.

Workaround

Things started working more consistently after we disabled all but one kubernetes provider in our project.garden.yaml file; we had five active providers with different configs before.

(Updated)
Adding the host record to /etc/hosts works as a hotfix:

echo "123.123.123.123 myhost.mydomain.com" |sudo tee -a /etc/hosts

Suggested solution(s)

Look into reusing DNS resolution results better in the codebase.

Additional context

  • Full CI/CD logs at the silly log level are available on personal request. I'm not going to put them in the public domain.

Your environment

  • OS: Linux, Drone Docker runner using the official image from Garden.io (see Drone config above)
  • How I'm running Kubernetes: EKS (we use others as well, but EKS is the cluster our CI/CD is deploying to and failing against)
  • Garden version: 0.12.16
@edvald (Collaborator) commented Feb 18, 2021

Hmm, that's interesting. We don't do anything particular in terms of DNS resolution; we just let the underlying runtime take care of that (which just uses the standard getaddrinfo call). I can take a look to see if this is something Node.js allows us to work with in some reasonable manner. I guess this doesn't often come up because kubeconfigs frequently use direct IPs instead of hostnames, but I can see how these queries would start to get throttled, because interacting with Kubernetes necessarily involves a lot of individual REST queries.

An alternative might be to have a DNS cache outside the process (dnsmasq or the like) but obviously that's punting the problem back your way :) I'm not quite familiar enough with Drone to know what options are on the table at the CI runner level.

In any case, I'll take a quick look, see if there's a reasonable way to do this at our code level.

@Chipcius (Author)

Yeah, strange... the host should also have a cached copy of the DNS responses in most cases, so this is very puzzling to me. Might be some race condition? 🤔
We looked exhaustively to make sure this was not a problem with our build machines before reporting the issue; they had no problems doing the DNS lookup.

@edvald (Collaborator) commented Feb 18, 2021

Yeah, it must be something triggered by the sheer number of concurrent lookups. That would be my guess at least, based on your report. I'll do a bit of digging on my side later today, or tomorrow at the latest, and see what we can come up with. I guess a crude workaround would be to just hardcode the IP in the kubeconfig while you kick this around; not sure how feasible that is in your setup?

@Chipcius (Author)

@edvald That will not work without some major changes in how we host our Rancher setup:

✖ providers                 → Error

Failed resolving provider kubernetes. Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Unable to connect to Kubernetes cluster. Got error:

Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: IP: 34.<redacted> is not in the cert's list:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Failed resolving one or more providers:
- kubernetes

@edvald (Collaborator) commented Feb 18, 2021

Ah yes, let's not go down that road then :P I'll poke around on our side and try to come up with something sensible. The only remaining thing to try outside of Garden would then be some DNS caching on the build nodes; not sure if that's tractable either, depending on how it's all set up.

@Chipcius (Author)

Adding the host record to /etc/hosts seems to work as a hotfix for this issue.

@edvald (Collaborator) commented Feb 22, 2021

Ok cool. I've been exploring code-level solutions for this, and there seems to be a general consensus in the Node community that this is best solved at the OS level with something like dnsmasq. All the application-level options would involve some level of monkey-patching which feels risky to me, and may not actually catch all instances of DNS lookups. Is a DNS cache on your build nodes something relatively easy to rig up on your side?

@Chipcius (Author)

That would mean custom Docker images for the Garden build setup. I would rather use upstream images without changes. This is likely to be an issue for other users as well in the future.

Can't we just cache the DNS lookups in the app? It is usually not anywhere near the first lookup that fails during the execution; the first few succeed without fail 99% of the time. Usually it is midway through building when the DNS lookup fails.
It seems strange to resolve the same DNS name so often during a single build.

@edvald (Collaborator) commented Apr 6, 2021

Is this resolved for you now @Chipcius?

@Chipcius (Author)

The issue is still present in the latest version.

@Chipcius changed the title from "Intermittent failure of DNS resolution in CI system (Drone.io)" to "Intermittent failure of DNS resolution when running garden against domain-based k8s" on Apr 28, 2021
@Chipcius (Author)

This happens with the garden command in general, and is happening quite often these days.

I just confirmed that the problem persists on Windows and Ubuntu 20.04 using the latest garden version 0.12.21.
[Screenshot attached: Selection_441]

@eysi09 (Collaborator) commented May 4, 2021

Hi @Chipcius

We're working on improving how we interact with the K8s API, which should fix these types of issues. Essentially, the plan is to catch common errors like these and retry the request (with exponential backoff) until it hopefully succeeds.

This is high priority for us and we're hoping to ship it with our next release.
