
Newly started container sometimes fails to resolve hostname of another container on the same podman net #13983

Closed
jiridanek opened this issue Apr 23, 2022 · 13 comments
Labels: kind/bug, network, locked - please file new issue/PR

Comments

@jiridanek

/kind bug

Description

In my Podman network called mynet, I have a server in a container named skupperrouter, and a client running the image docker.io/summerwind/h2spec:2.6.0 with the parameters

"-h", "skupperrouter", "-p", "24162", "--verbose", "--insecure", "--timeout", "10"

Usually this works. Sometimes, however, h2spec fails to connect with the error message

Error: dial tcp: lookup skupperrouter on 192.168.86.1:53: no such host

192.168.86.1:53 is the DNS server on my home Wi-Fi router, so the DNS request apparently left the Podman realm at that point and was forwarded out to the open plains of the Internet.

I checked the contents of /run/user/1000/containers/networks/aardvark-dns/mynet, and the skupperrouter entry with its address is listed there pretty much immediately after I start that container.

When I rerun the failed container with podman start h2spec, it always connects and runs fine. So I am guessing there is a dead window of time, right after the server container starts, during which the DNS name does not resolve for some reason.

I am using Fedora 36 Beta with Podman 4.0.2 running as a systemctl --user service. (I set DOCKER_HOST=/run/user/1000/podman/podman.sock and use Podman through that socket with an API client library.)
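For reference, that socket is the standard rootless API socket; enabling it looks roughly like this (a sketch; on Fedora the podman.socket user unit provides it):

systemctl --user enable --now podman.socket                    # provides /run/user/1000/podman/podman.sock
export DOCKER_HOST=/run/user/1000/podman/podman.sock           # some API clients want a unix:// prefix here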

Describe the results you expected:

Issue does not happen.

My container names don't end up leaked outside of my machine in DNS queries.

Additional information you deem important (e.g. issue happens only occasionally):

Issue happens only occasionally.

Output of podman version:

Client:       Podman Engine
Version:      4.0.2
API Version:  4.0.2
Go Version:   go1.18beta2

Built:      Thu Mar  3 15:56:09 2022
OS/Arch:    linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.24.1
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpus: 12
  distribution:
    distribution: fedora
    variant: workstation
    version: "36"
  eventLogger: journald
  hostname: fedora
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.17.2-300.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 318631936
  memTotal: 33403359232
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.4-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.4
      commit: 6521fcc5806f20f6187eb933f9f45130c86da230
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 1231269888
  swapTotal: 8589930496
  uptime: 163h 50m 52.53s (Approximately 6.79 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/jdanek/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 2
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/jdanek/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 309
  runRoot: /run/user/1000/containers
  volumePath: /home/jdanek/.local/share/containers/storage/volumes
version:
  APIVersion: 4.0.2
  Built: 1646319369
  BuiltTime: Thu Mar  3 15:56:09 2022
  GitCommit: ""
  GoVersion: go1.18beta2
  OsArch: linux/amd64
  Version: 4.0.2

Package info (e.g. output of rpm -q podman or apt list podman):

podman-4.0.2-1.fc36.x86_64
openshift-ci bot added the kind/bug label on Apr 23, 2022
Luap99 added the network label on Apr 24, 2022
@Luap99 (Member) commented Apr 25, 2022

I think this is expected. The current design simply updates the container records for aardvark-dns in a per-network file and triggers a reload of aardvark-dns via SIGHUP. There is no logic to make sure aardvark-dns has actually updated its in-memory records before we start the container.
When a name is not found, aardvark-dns forwards the query to your host resolver.
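In shell terms the flow is roughly this (a simplified sketch, not the actual netavark code; the path is the rootless one from this issue):

# 1. netavark rewrites the per-network entries file that aardvark-dns serves from
cat /run/user/1000/containers/networks/aardvark-dns/mynet
# 2. netavark sends SIGHUP to the running aardvark-dns so it reloads its records
kill -HUP "$(pgrep -x aardvark-dns)"
# 3. the container is started immediately; nothing waits for the reload to finish,
#    so a lookup in that window misses and falls through to the host resolver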

@jiridanek (Author) commented Apr 25, 2022

Let me check whether I understand correctly. My situation is that I start container A, then I start container B, and container B queries DNS for the hostname of container A, and the lookup fails. You are proposing that this is because the information about A has not yet been reflected in aardvark-dns.

To check, I tried to run

watch --interval 0.1 podman unshare --rootless-netns nslookup skupperrouter 10.89.11.1

and then

podman start skupperrouter

to see how long it takes for the new name to become resolvable. It often happens almost immediately, but it can also take 5 seconds or even more. I either need a reliable way to check that the DNS records have been updated, or I have to work without DNS.

I can do the first by running a probe container. As for the second, in some situations I can query the server container for its IP and configure that in the client. podman inspect can find the IP, and I am sure the API client can get the same information:

               "Networks": {
                    "mynet": {
                         "EndpointID": "",
                         "Gateway": "10.89.11.1",
                         "IPAddress": "10.89.11.71",

I could possibly run my own DNS server (as another container in the network), in the hope that I could implement a faster update myself.

@Luap99 (Member) commented Apr 26, 2022

I think we should fix this race eventually but I don't think it is a high priority at the moment.

If it actually takes more than 5 seconds to update the records, something else is broken IMO; unless your system is under incredible load, it should never take that long. Can you attach strace to the aardvark-dns process to see what it is doing for those 5 seconds?

@jiridanek (Author) commented May 5, 2022

I haven't been able to reproduce the >5 s DNS info propagation delay when I tried it again. I did a dnf upgrade and a few other things in the meantime, so who knows what fixed it.

When I tried this, I was running three terminals:

ps -ef | grep dns                             # find the aardvark-dns PID
sudo strace -p 1118584 |& ts '[%H:%M:%.S]'    # attach strace to it, with timestamps
while true; do timeout 1 podman unshare --rootless-netns nslookup nghttpd 10.89.11.1 |& ts '[%H:%M:%.S]'; sleep 1; done   # poll the name once per second

(and in the third terminal, manually, to add and remove the container:)

date; podman start nghttpd
# wait until nslookup succeeds...
podman stop nghttpd

I always saw the update within about 2 s, which is roughly the time resolution of the sleep 1 loop. I will update this issue if I hit the problem again and am able to collect an strace of it.

In any case, I am glad I learned about this race window now, when I can debug it easily, and not later when tests start failing in CI.

github-actions bot commented Jun 5, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Jun 6, 2022

@Luap99 What is the state of this issue?

github-actions bot commented Jul 7, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Jul 7, 2022

@Luap99 Ping

@rhatdan (Member) commented Jul 7, 2022

@flouthoc PTAL

@lukasmrtvy commented Jul 18, 2022

I have a similar issue (DNS resolution does not work until reboot) on Fedora CoreOS 36.20220618.3.1 with Podman 4.1.0. What's weird is that /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run is running twice, and journald shows aardvark-dns[2288]: Unable to start server unable to start CoreDns server: Address already in use (os error 98), but nothing else is listening on port 53 when the problem kicks in. It happens only on freshly bootstrapped instances (AWS EC2 via Ignition); a reboot always fixes it.

The /run/containers/networks/aardvark-dns/<net> file is populated correctly; some containers are able to resolve others and some are not. All containers run as systemd units (root) in the same network.
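For debugging, it might help to capture how many aardvark-dns instances exist and what is actually bound to port 53 when the problem occurs, e.g. (a sketch):

pgrep -af aardvark-dns           # list all running aardvark-dns instances with their arguments
ss -ulpn 'sport = :53'           # what is bound to UDP port 53
ss -tlpn 'sport = :53'           # and to TCP port 53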

github-actions bot commented

A friendly reminder that this issue had no activity for 30 days.

@flouthoc (Collaborator) commented

Could you please try again with netavark and aardvark-dns v1.1.1? In previous versions we had an issue where netavark was not waiting for aardvark-dns to finish starting. I think the following PR was an attempt at fixing that: containers/aardvark-dns#148, so retrying with the latest version is worth a try.
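To check which versions are installed (a sketch; package names as on Fedora/CoreOS):

rpm -q netavark aardvark-dns                      # installed backend versions
podman info --format '{{.Host.NetworkBackend}}'   # should print netavark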

@flouthoc (Collaborator) commented

I think this should be fixed in newer releases by the PR mentioned above; please re-open if this is still seen with netavark + aardvark-dns (v1.1.0 + v1.1.0).

github-actions bot added the locked - please file new issue/PR label on Sep 18, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 18, 2023