New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool #10603
Comments
Hi, based on https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237 it sounds like this issue is not specific to Kubernetes. I'm going to move this to |
Thank you for reporting this issue! I was just running a Consul agent locally to debug a different issue and I noticed this problem happens at the same time as these 2 debug lines:
The problem seems to be that when we rebalance the servers the active transport is cancelled, which causes this error to be printed. |
Is the issue here that the behavior is potentially incorrect, or that a common occurrence is erroneously categorized at |
I've installed in a couple other locations with the same chart/values as above and in the same datacenter the warn messages are for the other consul-servers in the cluster. This occurs if the cluster is WAN federated or not, that doesn't appear to have an impact. Currently, trying to track a couple of network issues I have been experiencing in consul 1.10. I am trying to obtain more evidence but I deleted the 1.10 cluster and went back to 1.8.4 and it did not appear to have the WARN. Can this be ignored? Not sure yet. |
I'm seeing an exact mirror of this problem on a small development cluster of Raspberry Pi 4s in a very basic configuration, all running Consul 1.10.1. In my case the errors are the server talking to itself. My 3 raft servers are made up of 3 nodes, called: nog
Node ID Address State Voter RaftProtocol
In nog's log, the IP address 10.11.216.182 is actually the IP address of the host 'nog', so the error is the server talking to itself. On the host 'jake', the log shows the same failure to connect to the host nog. On the host wesley (leader):
Jul 21 10:19:19 wesley consul[15814]: 2021-07-21T10:19:19.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.81:8300 0 wesley.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.81:8300: operation was canceled". Reconnecting...
The 10.11.216.81 IP address it's failing to talk to is wesley, itself in this case. |
Yeah, that is what I am seeing. It's the local cluster that is emitting these messages and FLOODING the logs |
I see the same issue. Three-node cluster running on VMs (Cent 8), Consul v1.10.0. All three nodes sporadically log an error about connecting to one of the other master nodes.
|
I went ahead and did more installation tests. I installed Consul 1.10.1 and Chart 0.32.1 and backed down the Consul and Chart version all the way to 1.8.4 and 0.26.0 as I am also experiencing other problems that are not related to this issue. The WARNS appear in the latest 1.10.x versions and they are emitted in both the local datacenter as well as a federated environment using mesh gateways. |
We are also seeing similar errors but there seems to be no issue with the cluster itself, we are running consul v1.10.1 on OL8 VM's. |
@ikonia I'm running Consul on two Ras Pis, and ran into this issue a couple of days ago as well. In case you haven't found a solution, I seem to have found the cause of my issue. I first noticed an inconsistency in the member status shown on each Pi. As you can see from the snippets below, Pi 1 seems to think that 02 is leaving the cluster, and 02 thinks it's still in it. So I restarted the consul service on 02 and that fixed the issue. I think the problem was caused by starting the services on both Pis at the same time: the nodes didn't negotiate properly and somehow that caused this weird bug. I use Ansible for staging the nodes and managing configs on them, and whenever I changed the configs, it restarted the services at the same time on all nodes, which is not a good idea .... (duh ... ). I'm not sure what your setup is, but maybe try to spin up the nodes one by one, which solved the problem for me.
|
Restarting in the way you describe unfortunately does not solve the problem here. The clusters appear to be healthy otherwise, but this is flooding the logs, and I do not think we have received a response as to whether the messages indicate an issue or can be ignored while waiting for a patch. |
I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure. |
+1 It seems like not ok. consul v1.10.1 |
we're having the same issue on ent version, will try to raise support ticket there |
💯 this should be moved to an info level log, normal system behavior that doesn't result in any degradation and self heals should not be something we are warned about. |
I have the same issue after upgrading to 1.10.2. |
@dnephin Is there any chance to fix this in an upcoming release? |
To clarify our current understanding of this: this is not a bug, but instead a misclassified log message (that shouldn't be Per @dnephin:
In this case, the aforementioned "need to do more investigation" is about how to make the change to reduce verbosity, not about the cause or whether there's a bug. The change requires some investigation because the message is emitted by gRPC, not Consul. |
Exactly this same error on 1.11.1 bare metal/ centos 7 |
If this log message was coming directly from Consul this would be much easier to fix. Unfortunately the log message is coming from a library (gRPC), which makes it a bit harder to fix. I think we have two options for addressing this:
Option 1 is pretty safe, but I'm not sure if it fixes much. There will still be an INFO log message that is printed periodically. I guess it is slightly better to print this as an INFO than a WARN. The downside is that other gRPC WARN messages may not be visible enough in logs at INFO level. Option 2 is much more involved, but is likely a safer long term fix. I believe the cause of this warning is this code: consul/agent/grpc/resolver/resolver.go Lines 283 to 287 in d20230f
If we trace those |
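As the comment above notes, lowering all gRPC warnings to INFO risks hiding other gRPC WARN messages. A narrower variation, sketched here purely as an illustration, would demote only the specific canceled-dial message and leave every other level untouched. `effectiveLevel` is a hypothetical helper, not part of Consul's or gRPC's API, and matching on message text is an assumption that would break if gRPC changed its wording.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveLevel returns the level at which a gRPC log line would be reported.
// Hypothetical helper: it demotes only the canceled-dial warning discussed in
// this thread and passes every other line through at its original level.
func effectiveLevel(level, msg string) string {
	isCanceledDial := strings.Contains(msg, "addrConn.createTransport failed to connect") &&
		strings.Contains(msg, "operation was canceled")
	if level == "WARN" && isCanceledDial {
		return "INFO"
	}
	return level
}

func main() {
	// Message text copied from the issue description.
	msg := `grpc: addrConn.createTransport failed to connect to {10.200.65.16:8300 0 consul-server-2.primary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.65.16:8300: operation was canceled". Reconnecting...`
	fmt.Println(effectiveLevel("WARN", msg))
	fmt.Println(effectiveLevel("WARN", "grpc: some other warning"))
}
```

String matching like this is brittle, which is one reason a fix at the source of the cancellation may be the safer long-term route.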
We're seeing a pretty healthy amount of these messages as well across our clusters. Keeping our eyes on this. Given the above, option 2 is definitely our preference. Not sure we want to even get a message in this case unless it's something to be concerned about |
Still happening on 1.11.2 running on my homelab. +1 to Option 2 |
Can we ignore this error if everything is working as expected, or do we need to be concerned about this warning/error? |
Hey @nagender1005 Yes you can ignore this warning if everything is working as expected. Per earlier in this thread:
Hope this helps! |
Still happening in |
Still happening in Consul v1.12.0 . |
PR #15701 has been merged and should land in the next patch versions for My PR should fix the periodic WARN logs during server shuffling which occurs every ~2 mins by default. Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions. This is a related but separate issue #15821 |
same on consul 1.14.2
|
Hi @kong62, The issue is expected to still be present in 1.14.2. It will be fixed as of the next set of patch releases (1.14.4, 1.13.6, and 1.12.9). |
thank you |
Can confirm that after upgrading to |
Thanks will go ahead and close this issue as it looks like this is fixed for 1.14.4 and later. Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions. This is a related but separate issue #15821 |
I've installed 1.12.9 and I can still see these messages. Not that frequent, but still
|
v1.14.4 This one is still reproducible: |
I have same problem on my Consul Cluster [WARN] agent: [core][Channel #1 SubChannel #15] grpc: addrConn.createTransport failed to connect to { |
@urosgruber @webant @mexdevops Do those warns occur regularly? We still expect to see those logs as an agent starts up but they should become infrequent over time. #15821 for reference |
Yep, I have these messages from time to time and I can't fix this. |
It appears every 2-3 days for each agent. Not related to any startup, quite constant flow of those with several hundreds of agents with ~100 of those errors every day. |
With 3 servers, no agents connected, fresh install about 10 messages a day on random. |
Could you provide log fragments with lines before and after you see these WARNs? I'd like to confirm that they are not caused by this issue (server shuffling every ~2 mins) and would like to move the investigation to either #15821 or a new issue if necessary. |
Do you need INFO level? I believe there is nothing else in case I set it to WARN |
Yes, INFO would be appreciated. |
|
From my side, no info logs around those for the whole cluster. A single agent impacted at a time. |
can we confirm if this is fixed, it's marked as fixed but there are still reports in this thread of the problem |
Hello everyone, I created issue #17842 to track the problem which I consider separate from this one. Although they are related, it is helpful for us to triage the issue separately given that the WARN logs are no longer frequent or regular. If you are still seeing WARN logs ending with |
Note from @lkysow: I'm moving this to hashicorp/consul because the discuss post shows a user on EC2 also saw this error.
Overview of the Issue
New 1.10.0 on New K8s Cluster results in
[WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.65.16:8300 0 consul-server-2.primary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.65.16:8300: operation was canceled". Reconnecting...
These WARNS appear in both the server and clients.
Reproduction Steps
values.yml
Expected behavior
WARNS should not be flooding the log and connections should be over 8301 not 8300
Environment details
If not already included, please provide the following:
consul-k8s version: 0.26.0
consul-helm version: 0.32.1
values.yaml used to deploy the helm chart: see above
Additional Context
It seems others are experiencing the same problem.
https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237