Cannot connect to a container in an overlay network from a different swarm node: could not resolve peer "<nil>": timed out resolving peer by querying the cluster
#962
Comments
I'm seeing the same thing: moby/moby#22144 |
Hit this issue as well. Restarting the container seems to fix it, but that isn't ideal. We're using:
|
I have the same problem. Just commenting to see if there is any update on this. My env:
|
Another me too. Some weirdness with ARP? |
No, ARP seems to be fine at the container level. I have a "blocked" container in front of me, and I can reproduce it quite often by running Spark clusters. I would say that one cluster out of five comes out bad, with some worker unable to talk to the master. Since we do a lot of automated tests and performance measurements, we get many outliers, which basically makes Swarm with overlay networks too unreliable. |
There is something fishy in the overlay network namespace. Here is the vxlan fdb on the host on which the "blocked" container is running:
Any idea, @mavenugo ? |
I'm getting this issue too. Unfortunately I know virtually nothing about networking. My setup is pretty much identical to @paulcadman's. I've actually had the issue for some time over the last few docker releases. It got so bad at one point that I recreated the whole overlay network and containers, and it's been good for a few weeks, but it's happening again :( If there's anything I can debug, let me know. We're running on EC2, if that helps.
|
On a new deployment with 25 machines this problem blocks network communication for about half of the containers that I create, making it completely worthless. Any news? |
I am facing the same issue. I think the overlay network is unreliable on Ubuntu. Server Version: swarm/1.2.3 Client: Server: |
We have the same issue in a docker swarm with thousands of containers. On every docker node (1.10.3) we have an nginx container which can communicate with different application containers in the overlay network (using consul). Sometimes one of the nginx containers cannot connect to an app container on a different node, and we see the same message in the log:
Additionally, we are seeing that the app containers which fail always have the same IP addresses. Restarting the app container doesn't work if the container receives the same IP address again. What works for us is:
We tried to debug where the packets are going, without success. There is no activity on the remote node, so we think the packet is not leaving the node where the nginx container is running. |
We have the exact same issue, with a different environment:
|
As mentioned above, we have been working around this issue by periodically running
This seems to keep the VXLAN working (and/or resolve the issue when it does happen) without having to recreate containers or restart anything. |
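The exact command used in the workaround above is not shown in the thread. Going by the later comment that describes it as "pinging in the opposite direction", a periodic check might look roughly like the sketch below. This is only an illustration under stated assumptions: the function names and the peer addresses are hypothetical, the script is assumed to run inside a container attached to the overlay network, and the `ping` flags are the ones accepted by the common Linux `ping` (`-c` for count, `-W` for the reply timeout in seconds).

```python
import subprocess


def ping_once(host, runner=subprocess.run):
    """Send a single ICMP echo request to `host`; return True on a reply.

    `runner` is injectable so the function can be exercised without
    actually spawning `ping` (hypothetical helper, not part of Docker).
    """
    result = runner(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_peers(peers, runner=subprocess.run):
    """Ping every peer once and report which ones answered."""
    return {host: ping_once(host, runner) for host in peers}


# Example (hypothetical overlay addresses), e.g. run from cron every minute:
#   print(ping_peers(["10.0.0.2", "10.0.0.3"]))
```

Whether such traffic actually keeps the VXLAN entries alive is the commenters' observation, not something this sketch verifies.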
While we did manage to get the network working again by doing this (pinging in the opposite direction), we also encountered a case where we could not avoid rebooting the machine. Docker Swarm 1.2.6 with docker-ce-17-03-.1 is so not production-ready (e.g. 5 machines KO in a cluster of 8). |
@tfasz @antoinetran this behavior was fixed by #1792, which will be available in 17.06 and will also be backported to 17.03 |
Dear all, we migrated 3 Swarm clusters to docker-ce-17.06.0 two weeks ago, and this seems to work fine so far. I have reproduced this error only once, and I had to ping back to restore connectivity. The error seems to be rarer now. Does anyone want any info or logs? |
@antoinetran To make sure we clearly understand what the issue is, can you provide details on exactly what the connectivity issue is and what the trigger is? |
Environment: Connectivity issue: What the trigger is |
@fcrisciani I reproduced this issue a small number of times. Here is my new environment: all latest Same symptom as my last post. |
@antoinetran |
I don't know what you mean by slow, but if the ping does not work, I wait maybe a few seconds to be sure.
To be more precise, the DNS resolution always works. It is really the ping that does not work (no pong). |
@antoinetran I was thinking that C1 was still not ready, but if you can exec into it, that should not be the case. |
Hard to say; kernel/daemon logs are the first thing I look at. There are recurrent errors that do not seem to be real errors:
I will try to archive these logs when the event happen. Right now I forgot when it happened. |
@antoinetran I see memberlist complaining about one node and marking it as suspect. This means that the physical network is failing to deliver the health checks. If the networkdb is not able to communicate with the other nodes, that would explain why the ping fails: the configuration is not being propagated, so c1 is not aware of c2. |
@fcrisciani Ok! Thank you for your diagnosis. This event is probably due to an IP collision we had today (from the docker default network when we do compose up). That explains a lot. It might also be a cause of the network loss in my other posts about the old environments. |
@antoinetran
These are printed every 5 minutes, one line per network. This tells you that on the specific network there are 3 nodes that have containers deployed on it. Also, if there is a connectivity issue you will see a line mentioning |
Same issue with 17.12.1 |
Make sure your firewall is open for the ports needed for overlay networks
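The port list from this comment did not survive in the thread. Docker's overlay networking documentation names port 7946 (TCP and UDP) for control-plane communication among nodes and UDP port 4789 for VXLAN data traffic, plus TCP 2377 for cluster management in swarm mode. A rough reachability check for the TCP side might look like the sketch below. The function name is hypothetical, and a TCP connect cannot confirm that the UDP ports are open, so this covers only part of the firewall question.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration. A False result can mean a
    firewall drop, a closed port, or an unreachable host.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False


# Example (peer address is an assumption, not taken from the thread):
#   print(tcp_port_open("192.0.2.10", 7946))
```

UDP 7946 and UDP 4789 would need a different check, for example capturing traffic on the peer while sending from this node.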
|
Description of problem:
Very rarely (observed twice after using 1000s of containers) we start a new container
into an overlay network in a docker swarm. Existing containers in the overlay network that
are on different nodes cannot connect to the new container. However containers in the
overlay network on the same node as the new container are able to connect.
The new container receives an IP address in the overlay network subnet, but this does not
seem to work correctly when resolved from a different node.
The second time this happened we fixed the problem by stopping and starting the new
container.
We haven't found a way to reliably reproduce this problem. Is there any other debugging
I can provide that would help diagnose this issue?
The error message is the same as the one reported on #617.
docker version:
docker info:
uname -a:
Environment details (AWS, VirtualBox, physical, etc.):
Physical - docker swarm cluster.
How reproducible:
Rare - happened 2 times after creating/starting 1000s of containers.
Steps to Reproduce:
A process in the container is listening on port 80 and this port is exposed to the overlay network.
Actual Results:
Get a connection timeout. For example with the golang http client:
10.158.0.60 is the address of the container in step 2 in the overlay network subnet.
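The report used the Go HTTP client. A comparable probe can be sketched in Python to tell a timeout apart from other failures when testing from different nodes. The function name and the result tuples are hypothetical and chosen for illustration; the address and port come from the report above.

```python
import socket
import urllib.error
import urllib.request


def probe_http(url, timeout=5.0):
    """Issue one GET and classify the outcome.

    Returns ("ok", status_code) if the server answered at all,
    ("timeout", detail) if the attempt timed out, and
    ("error", detail) for any other network failure.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ("ok", resp.status)
    except urllib.error.HTTPError as exc:
        # A 4xx/5xx reply still proves the container is reachable.
        return ("ok", exc.code)
    except urllib.error.URLError as exc:
        # Connect-phase failures arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return ("timeout", str(exc.reason))
        return ("error", str(exc.reason))
    except (socket.timeout, TimeoutError) as exc:
        # Read-phase timeouts are raised unwrapped.
        return ("timeout", str(exc))
    except OSError as exc:
        return ("error", str(exc))


# Example, using the container address from the report:
#   print(probe_http("http://10.158.0.60:80/"))
```

Running the same probe from a container on the same node and from one on a different node would show the asymmetry described in the report.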
The docker logs on the swarm node that launched the container in step 2 contain (from journalctl -u docker):
We see a line like this for each failed request between the containers.
When we make the same request from a container in the overlay network on the same swarm node as the
container running the http server the expected connection is established and a response is received.
Expected Results:
The http client receives a response from the container it's trying to connect to.
Additional info:
The second time this occurred we fixed the problem by stopping and starting the container running
the http server.
We are using Consul as the KV store of the overlay network and swarm.
When removing the container that cannot be connected to, docker logs (journalctl -u docker) contain the line:
The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can't find an existing issue tracking this.