Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unreliable DNS during Container Builds #101

Open
tpanum opened this issue Dec 11, 2023 · 12 comments
Open

Unreliable DNS during Container Builds #101

tpanum opened this issue Dec 11, 2023 · 12 comments

Comments

@tpanum
Copy link

tpanum commented Dec 11, 2023

I have a GitHub Actions pipeline that roughly process like so:

  1. Connect to a private Tailscale network using this Github Action.
  2. Start a docker build of a multi-stage Dockerfile where an internal python package index[1] is accessed to pull dependencies.

[1]: This package index is available within the Tailscale network and is discovered using DNS of an internal DNS server configured in Tailscale.

Step 2 occasionally fail due to DNS lookups for the package index failing, thus DNS resolving not working reliably.

@mloberg
Copy link

mloberg commented Jan 12, 2024

Are you using buildx? I was having this issue, but it seems to be resolved by setting network=host in driver-opts

@tpanum
Copy link
Author

tpanum commented Jan 13, 2024

I have been using buildx, yeah. I have tried using the network=host in the past, but did not make it more reliable from my experiences. I ended up doing dig example.com > $IP and then use --add-host.

@henworth
Copy link

I was dealing with basically the same issue--container builds that rely on internal resources failing to resolve dns--and the thing that finally fixed it in my case was to configure buildx with the internal dns hosts:

- name: Setup buildx with internal DNS
  uses: docker/setup-buildx-action@v3
  with:
    config-inline: |
      [dns]
        nameservers="<comma separated list>"

@kdpuvvadi
Copy link

kdpuvvadi commented Mar 26, 2024

Even that is also incosistant, it works sometimes. Mostly doesn't.

For some reason, it's using 168.63.129.16 for dns resolution.

@tpanum
Copy link
Author

tpanum commented Mar 27, 2024

Following @henworth's example, I had to change it slightly to get it working:

with:
  buildkitd-config-inline: |
    [dns]
      nameservers=["..."]

@kdpuvvadi
Copy link

kdpuvvadi commented Mar 27, 2024

Still same though @tpanum.

mines looks like this

- name: Set up Docker Buildx
        uses: docker/[email protected]
        with:
          buildkitd-config-inline: |
            [dns]
              nameservers=["100.78.xx.xx","100.78.xx.xx"]

And errors out

#45 ERROR: failed to push git.local.puvvadi.net/***/blog:e234247: failed to do request: Head "https://git.local.puvvadi.net/v2/***/blog/blobs/sha256:bbba97e7b63ba8e2a28aa20a0a10b6ba491f29a395e8f7cc5bdf6ff4fe783000": dial tcp: lookup git.local.puvvadi.net on 168.63.129.16:53: no such host

@marcelofernandez
Copy link

We've also seen this same issue on docker run github actions steps, so it's not a docker build issue only. For example, a job containing this bash step:

runner@fv-az1797-395:~/work/repo/repo$ docker run -it --rm \
                --env ENV_VAR \
                <aws_account_id>.dkr.ecr.us-east-2.amazonaws.com/repo:latest /bin/bash
root@35f3013f9caf:/opt/code# cat /etc/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 168.63.129.16
nameserver 100.100.100.100
search vnw05d5vvvpeplv1mpaxmbipab.bx.internal.cloudapp.net tail1abc1.ts.net
root@35f3013f9caf:/opt/code#

This way we had many issues resolving Tailscale hosts inside this container. Fortunately, we managed to fix it using the --dns docker run option:

runner@fv-az1797-395:~/work/repo/repo$ docker run --dns=100.100.100.100 -it --rm \
                --env ENV_VAR \
                <aws_account_id>.dkr.ecr.us-east-2.amazonaws.com/repo:latest /bin/bash
root@35f3013f9caf:/opt/code# cat /etc/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 100.100.100.100
search vnw05d5vvvpeplv1mpaxmbipab.bx.internal.cloudapp.net tail1abc1.ts.net
root@35f3013f9caf:/opt/code#

Regards

@jaxxstorm
Copy link
Contributor

This also creates a scenarion where any action with

runs:
  using: docker

is unreliable

@jaxxstorm
Copy link
Contributor

jaxxstorm commented Jul 7, 2024

I think I've discovered why this happened, to me at least.

The GitHub actions runners are on 172.17.0.0/16

\n===== Network Interfaces =====
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo 
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
\n===== Routing Table =====
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 

I have a subnet router that was advertising that same cidr. As soon as I connected Tailscale via actions, any docker based workload could no longer get to the DNS server inside the runner

Status: Downloaded newer image for nicolaka/netshoot:latest
===== DNS Configuration =====
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 168.63.129.16
search grvplcbrxqwulonopmolb0o12f.dx.internal.cloudapp.net tail9e93b.ts.net

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
\n===== Network Interfaces =====
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen [100](https://github.com/jaxxstorm/tailscale-actions-example/actions/runs/9829717610/job/27135072514#step:5:101)0
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo 
       valid_lft forever preferred_lft forever
5: eth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
\n===== Routing Table =====
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 
\n===== DNS Resolution Test =====
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; no servers could be reached


;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; no servers could be reached

@youp-augur
Copy link

Still same though @tpanum.

mines looks like this

  • name: Set up Docker Buildx
    uses: docker/[email protected]
    with:
    buildkitd-config-inline: |
    [dns]
    nameservers=["100.78.xx.xx","100.78.xx.xx"]
    And errors out
#45 ERROR: failed to push git.local.puvvadi.net/***/blog:e234247: failed to do request: Head "https://git.local.puvvadi.net/v2/***/blog/blobs/sha256:bbba97e7b63ba8e2a28aa20a0a10b6ba491f29a395e8f7cc5bdf6ff4fe783000": dial tcp: lookup git.local.puvvadi.net on 168.63.129.16:53: no such host

@kdpuvvadi I had a similar issue and the solution was to set network=host on the Docker BuildX setup step

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver-opts: |
            network=host

@kdpuvvadi
Copy link

@kdpuvvadi I had a similar issue and the solution was to set network=host on the Docker BuildX setup step

  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3
    with:
      driver-opts: |
        network=host

This is not reliable at all. Hit or miss. switched to self-hosted runners and local dns worked always.

@ndekross
Copy link

I'm having a similar issue trying to connect to a database via GH action through tailscale I sometimes get Temporary failure in name resolution error which indicates DNS issues. Sometimes I don't get it. But it's frustrating and holding our CI/CD pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants