
Docker swarm overlay networking not working after --force-new-cluster #495

Open
kylewuolle opened this issue Nov 21, 2018 · 4 comments
kylewuolle commented Nov 21, 2018

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

After a --force-new-cluster, and subsequently adding a new node to the cluster, tasks.servicename should be resolved by the internal Docker DNS, and containers on the same overlay network should be able to reach each other.

Actual behavior

On the node on which --force-new-cluster was executed, the tasks.servicename endpoint will not resolve. On the added node, tasks.servicename does resolve, but only to the container on that one node. Also, the containers on the same overlay network cannot reach each other by their IPs.

Steps to reproduce the behavior

  1. Using the following Dockerfile, build an image called demo on each node:

```
FROM ubuntu

RUN apt update
RUN apt install dnsutils -y

CMD /bin/bash -c "while true; do nslookup tasks.demo; sleep 2; done"
```
  2. Execute swarm init on one of the nodes.
  3. Create a network: docker network create --scope swarm --driver overlay --attachable test
  4. Create a service: docker service create --network test --mode global --name demo demo
  5. Verify that the tasks.demo endpoint resolves to two IP addresses: docker service logs demo
  6. Now execute docker swarm init --force-new-cluster on one of the nodes.
  7. Demote and remove the other node; also remove the service and network.
  8. Recreate the service and network on the remaining node.
  9. Have a third node join the remaining node.
  10. At this point node 3 will resolve tasks.demo to its own container's IP, but tasks.demo will not resolve on the first node. Also, the container on each node cannot reach the container on the other node using its IP.
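The swarm-side commands above can be consolidated into a small shell script. This is only a sketch: it dry-runs by default (printing each command rather than executing it), the hostname node2 is a placeholder, and the RUN=1 switch is an assumption of this sketch, not part of the original report:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction commands; set RUN=1 to really execute.
# The hostname "node2" is a placeholder for the node being demoted/removed.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"                 # actually execute the docker command
  else
    echo "+ $*"          # dry run: print what would be executed
  fi
}

run docker swarm init                                                       # init on node 1
run docker network create --scope swarm --driver overlay --attachable test  # attachable overlay
run docker service create --network test --mode global --name demo demo     # global service
run docker service logs demo                                                # check tasks.demo resolution
run docker swarm init --force-new-cluster                                   # force a new cluster
run docker node demote node2                                                # demote and remove the other node
run docker node rm node2
run docker service rm demo                                                  # remove, then recreate,
run docker network rm test                                                  # the service and network
```

With RUN unset each line is echoed with a `+ ` prefix, so the sequence can be reviewed before being pointed at real nodes.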

Restarting the docker daemon on the first node does resolve the issue.

Output of docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Wed Jun 20 21:43:51 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Wed Jun 20 21:42:00 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 8
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: itrdsuwlqi234atk1nwc8foha
 Is Manager: true
 ClusterID: ysq5qap98z4gbilfi4z3o60j3
 Managers: 2
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.138.0.16
 Manager Addresses:
  10.138.0.11:2377
  35.227.182.132:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1024-gcp
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.607GiB
Name: instance-9
ID: TEPO:ELY7:EYOT:LPCS:OQ4B:DKKA:FK2U:XJ52:RXF7:7CGN:GEXO:YLAN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)
GCP (gcloud) instances. The same behavior has been reproduced on 18.09-ce as well.

kylewuolle (Author) commented Nov 24, 2018

One thing I've discovered via debugging is that a change introduced in this commit might be responsible: moby/libnetwork@5008b0c

If line 259 of controller.go is changed to simply be

```go
if provider == nil {
	return
}
```

then the problem goes away. This is because at some point the agent is stopped, and it is never restarted in the case of swarm init with --force-new-cluster. Maybe there could be some other way to prevent this race condition, such as checking whether the agent is really active? I will do more digging and add what I find.
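The early-return guard described above can be illustrated with a minimal, self-contained Go sketch. The types and method below are hypothetical stand-ins, not libnetwork's actual controller:

```go
package main

import "fmt"

// controller stands in for libnetwork's network controller; a nil provider
// means the agent has been stopped and no cluster provider is registered.
type controller struct {
	provider interface{}
}

// SetClusterProvider sketches the proposed guard: when called with a nil
// provider, return early and leave existing agent state alone instead of
// tearing it down, avoiding the race where the agent is stopped and never
// restarted after `docker swarm init --force-new-cluster`.
func (c *controller) SetClusterProvider(provider interface{}) string {
	if provider == nil {
		return "no-op: nil provider, agent state preserved"
	}
	c.provider = provider
	return "provider registered"
}

func main() {
	c := &controller{}
	fmt.Println(c.SetClusterProvider(nil))     // early return, nothing cleared
	fmt.Println(c.SetClusterProvider("swarm")) // normal registration
}
```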

thaJeztah (Member) commented

the libnetwork fix was included in Docker 18.09.4 through docker-archive/engine#169 ; should this one be closed?

thaJeztah (Member) commented

oh, sorry, it was not yet in 18.09; cherry-picking now

thaJeztah added a commit to thaJeztah/docker that referenced this issue Apr 23, 2019
full diff: moby/libnetwork@c902989...872f0a8

- moby/libnetwork#2354 [18.09 backport] Cleanup the cluster provider when the agent is closed
  - backport of moby/libnetwork#2307 Fix for problem where agent is stopped and does not restart
  - fixes docker/for-linux#495 Docker swarm overlay networking not working after --force-new-cluster
- moby/libnetwork#2369 [18.09 BACKPORT] Pick a random host port if the user does not specify a host port
  - backport of moby/libnetwork#2368 (windows) Pick a random host port if the user does not specify a host port

Signed-off-by: Sebastiaan van Stijn <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Apr 23, 2019, with the same commit message as above.
Upstream-commit: 5354408039681020f9ad6afe4bf696fc90f9ce69
Component: engine
ycordier-pro commented

Hello everyone, I'm wondering if this issue is really resolved, as I seem to be facing the same kind of name-resolution problem after issuing "docker swarm init --force-new-cluster" on an "isolated" manager.

One big difference in my scenario is that I'm NOT deploying services through Swarm: I'm deploying containers through classic docker-compose, and just making use of an overlay network managed by Swarm, onto which I attach the containers in docker-compose.
Basically my setup is two nodes joined in a swarm, both with the manager role, and an overlay network created manually with the "--attachable" flag. Then on the two nodes I start some containers using a simple docker-compose deployment (no swarm/service deploy), attaching them to the overlay network I created.
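For reference, attaching compose-managed containers to a pre-existing attachable overlay looks roughly like this. This is a sketch, not the reporter's actual file; the service name, image name, and network name test are assumptions:

```yaml
# docker-compose.yml (sketch): the containers join a Swarm-created overlay
# network declared as external, i.e. created beforehand with
# `docker network create --scope swarm --driver overlay --attachable test`.
version: "3.7"
services:
  one-container:
    image: demo          # placeholder image name
    networks:
      - test
networks:
  test:
    external: true       # do not create; reuse the attachable swarm overlay
```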

Things work fine and the containers are able to communicate, but now suppose one of the two manager nodes fails.
On the one that survives, all containers seem to still be running fine, even though Docker Swarm is in a "no quorum/isolated" state (swarm commands reply with the error "The swarm does not have a leader").

At this point I have to "docker swarm init --force-new-cluster" on the survivor, but as soon as I issue the command, I can see in the container logs that they become unable to resolve each other's names (I get "Name or service not known" errors).
And the DNS resolution seems to be broken forever: even if the previously failed node is restored and joins the swarm again, the only solution at this point is to restart the whole docker-compose stack on the survivor. The weird thing is that on the new node that joined in place of the previously failed one, things work fine.

Based on my tests, name resolution only works again when I restart the container I'm trying to resolve on the surviving node. It looks as if, on startup, the container somehow registers itself again on the "new" swarm overlay network that was recreated when I issued the "force-new-cluster" command.

Here's an example of the issue. Just after issuing the "force-new-cluster" command on the survivor, from the surviving containers I can't resolve any of the other containers' names:

[root]# docker-compose exec -u root one-container bash
root@one-container:/# ping another-container
ping: another-container: Name or service not known

Now if I just restart "another-container":

[root]# docker-compose restart another-container
Restarting another-container ... done

From the first one, name resolution works again:

[root]# docker-compose exec -u root one-container bash
root@one-container:/# ping another-container
PING another-container (172.20.0.15) 56(84) bytes of data.
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=1 ttl=64 time=0.034 ms
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=2 ttl=64 time=0.046 ms
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=3 ttl=64 time=0.034 ms
^C
--- another-container ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.034/0.038/0.046/0.005 ms

Any idea whether this issue could be related to the fact that I'm just attaching containers to the overlay network using docker-compose, and not actually managing them through plain Swarm?

Thanks for your time !
