Intermittent failure of DNS resolution when running garden against domain-based k8s #2255
Comments
Hmm, that's interesting. We don't do anything particular in terms of DNS resolution; we just let the underlying runtime take care of that (which uses the standard getaddrinfo call). I can take a look to see if this is something Node.js allows us to work with in some reasonable manner. I guess this doesn't often come up because kubeconfigs frequently use direct IPs instead of hostnames, but I can see how these queries would start to get throttled, because interacting with Kubernetes necessarily involves a lot of individual REST requests. An alternative might be to have a DNS cache outside the process (dnsmasq or the like), but obviously that's punting the problem back your way :) I'm not quite familiar enough with Drone to know what options are on the table at the CI-runner level. In any case, I'll take a quick look and see if there's a reasonable way to do this at our code level.
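As a rough illustration only (this is not Garden's code), an in-process cache could hook into the `lookup` option that Node's net/tls sockets, and therefore HTTP(S) agents, accept; `cachedLookup`, the TTL, and the `as any` cast are all made up for the sketch:

```ts
import * as dns from "dns";
import * as https from "https";

// Assumed fixed TTL; a real implementation would honor record TTLs.
const TTL_MS = 60_000;
const cache = new Map<string, { address: string; family: number; expires: number }>();

// Drop-in replacement for dns.lookup, usable as the "lookup" option on sockets/agents.
function cachedLookup(hostname: string, options: any, callback: any): void {
  const hit = cache.get(hostname);
  if (hit && hit.expires > Date.now()) {
    process.nextTick(() => callback(null, hit.address, hit.family));
    return;
  }
  dns.lookup(hostname, options, (err: any, address: any, family: any) => {
    // Only cache the simple single-address form; the "all: true" case returns an array.
    if (!err && typeof address === "string") {
      cache.set(hostname, { address, family, expires: Date.now() + TTL_MS });
    }
    callback(err, address, family);
  });
}

// Requests made through this agent reuse cached lookups instead of hitting the resolver.
const agent = new https.Agent({ lookup: cachedLookup } as any);
```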
Yeah, strange ... the host should also have a cached version of the DNS responses in most cases, so this is very odd to me. Might be some race condition? 🤔
Yeah, it must be something triggered by the sheer number of concurrent lookups. That would be my guess at least, based on your report. I'll do a bit of digging on my side later today, or tomorrow at the latest, and see what we can come up with. I guess a crude workaround would be to just hardcode the IP in the kubeconfig while you kick this around; not sure how feasible that is in your setup?
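For context, hardcoding the IP would mean editing the cluster entry in the kubeconfig roughly like this (the IP is a placeholder, and tls-server-name keeps certificate verification working against the original hostname):

```yaml
clusters:
  - name: rancher
    cluster:
      # Placeholder IP standing in for rancher.<redacted>
      server: https://203.0.113.10:443
      # The cert is issued for the hostname, so keep verifying against it
      tls-server-name: rancher.<redacted>
```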
@edvald That will not work without some major changes in how we host our Rancher setup.
Ah yes, let's not go down that road then :P I'll poke around on our side and try to come up with something sensible. The only remaining thing to try outside of Garden would then be some DNS caching on the build nodes; not sure if that's tractable either, depending on how it's all set up.
Adding the host record in /etc/hosts works as a hotfix.
Ok cool. I've been exploring code-level solutions for this, and there seems to be a general consensus in the Node community that this is best solved at the OS level with something like dnsmasq. All the application-level options would involve some level of monkey-patching, which feels risky to me and may not actually catch all instances of DNS lookups. Is a DNS cache on your build nodes something relatively easy to rig up on your side?
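For reference, a local dnsmasq cache usually needs only a few lines of config; the values below are illustrative, and the host's /etc/resolv.conf would then point at 127.0.0.1:

```conf
# Minimal caching-only dnsmasq setup; values are illustrative.
# Only answer queries from the local host.
listen-address=127.0.0.1
bind-interfaces
# Keep up to 1000 entries in the cache.
cache-size=1000
# Ignore /etc/resolv.conf and use the upstream server below.
no-resolv
# Placeholder: your real upstream resolver.
server=10.0.0.2
```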
That would mean custom Docker images for building the Garden setup, and I would rather use upstream images without changes. This is likely to be an issue for other users as well in the future. Can't we just cache the DNS lookups in the app? It is usually not anywhere near the first lookup that fails during execution; the first few succeed without fail 99% of the time. Usually it is midway through building when the DNS lookup fails.
Is this resolved for you now @Chipcius?
The issue is still present in the latest version.
Hi @Chipcius. We're working on improving how we interact with the K8s API, which should fix these types of issues. Essentially the plan is to catch common errors like these and retry the request (with exponential backoff) until it hopefully succeeds. This is high-priority for us and we're hoping to ship it with our next release.
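For illustration, a retry wrapper along these lines; `withRetry` is a hypothetical helper, and the error codes treated as transient and the backoff limits are assumptions, not Garden's actual values:

```ts
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const transient = ["EAI_AGAIN", "ENOTFOUND", "ECONNRESET", "ETIMEDOUT"].includes(err?.code);
      if (!transient || attempt >= maxAttempts) {
        throw err;
      }
      // Exponential backoff: 2s, 4s, 8s, ... capped at 30s.
      const delayMs = Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage with some K8s client call (k8sApi is a stand-in):
// const pods = await withRetry(() => k8sApi.listNamespacedPod("default"));
```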
Bug
Garden intermittently fails to resolve the DNS server name of my K8s cluster within my kubernetes provider.
Current Behavior
During CI/CD in Drone using the Docker runner, I get the following output intermittently. At this point it has already built 4+ other services in the K8s cluster using kaniko on the cluster rancher.<redacted>.
Example 1:
Example 2:
Expected behavior
Since the DNS name has already been resolved correctly earlier in the run, it should not fail to resolve after the first time.
Reproducible example
Intermittent means it is not easily reproducible. This only happened on our CI/CD system and not on our workstations, so resource limitations might be at play.
Workaround
This started working more consistently after we disabled all but one kubernetes provider in our project.garden.yaml file; we had 5 active providers with different configs before. (Updated)
Adding the host record into /etc/hosts works as a hotfix, for example:
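(203.0.113.10 below is a documentation-range placeholder; substitute the cluster endpoint's real address.)

```
# /etc/hosts
203.0.113.10    rancher.<redacted>
```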
Suggested solution(s)
Take a look at re-using/caching DNS resolutions in the codebase.
Additional context
Your environment