Templates referencing Consul service pools stop receiving updates #11558
Comments
Thanks for the report @dpn. I can't think of anything off the top of my head that could be causing this problem, so we would need time to investigate it further. As you mentioned, this happens sporadically, which makes it harder to reproduce. Checking the Consul logs the next time this happens may be helpful as well. It could be that the local Consul agent is losing connection with its peers.
We just landed some improvements to …

Going to mark this as waiting for reply to see if @dpn is able to produce logs, but otherwise hopefully 1.2.4 will provide them with the knobs to fix the underlying problem.
Thanks guys. As usual, apologies for the delayed response. We have built alerting around this issue and have not seen it reproduce since the initial report, so I think we're good to close this issue for now, as I no longer have logs from the original incident to provide. In the event we see this again, I will be sure to capture them and update this issue. Thanks again!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
RHEL 7.7, running on colocated bare-metal hosts and VMs as well as AWS EC2 instances.
Issue
When a template references a Consul service, the Consul Template integration sometimes fails to re-render the template when the service pool changes.
We use this pattern all over our cluster, and we have seen this 4 times across the past few years of running Nomad: once 2 years ago, and 3 times in the past 2 months. It seems to be a very rare occurrence, but we didn't have visibility into this exact issue until recently, so with the new alerting we may be able to catch it more often and provide better diagnostic information.
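For concreteness, the kind of template we mean is one whose data walks one or more Consul service pools. A minimal sketch (the output format here is illustrative, not our actual template; the service names are the ones used in the reproduction steps below):

```
{{ range service "consul-service-a" }}
server-a {{ .Address }}:{{ .Port }}
{{ end }}
{{ range service "consul-service-b" }}
server-b {{ .Address }}:{{ .Port }}
{{ end }}
{{ range service "consul-service-c" }}
server-c {{ .Address }}:{{ .Port }}
{{ end }}
```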
Reproduction steps
Let's say we have a job with a count of 5, allocations with short IDs of `00000000`-`00000004`, and that `consul-service-a` ends up being the service affected by this bug.

1. Put `consul-service-a` pool members into maintenance. Note that all allocations properly render the remaining service pool members to disk and are `SIGHUP`'d.
2. … the `consul-service-a` pool member. Note that all allocations properly render the remaining service pool members to disk and are `SIGHUP`'d.
3. Note that `00000000` is no longer logging `Template re-rendered` messages in the allocation logs in the UI, although its sibling allocations continue to do so.
4. On `00000000`, note that the node entries rendered for service `consul-service-a` are not the same as what is rendered for allocations `00000001`-`00000004`.
5. Put `consul-service-b` into maintenance. Note that all allocations correctly render out the node changes for the `consul-service-b` service pool. The rendered nodes for `consul-service-a` are still out of sync between allocation `00000000` and the rest at this point.
6. Repeat the previous step with `consul-service-c`, noting the same behavior.
7. Bounce the Nomad agent on the node running `00000000`. When the agent comes back up, the node entries for all 3 Consul services are in sync between all of the allocations.

Expected Result
The Consul template integration continues to render out Consul service pool changes for all of the services referenced in the template.
Actual Result
Updates for one of the services in the affected allocation stop until the local Nomad agent is bounced.
Job file (if appropriate)
Abridged, but boils down to something like:
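The actual job spec didn't survive in this copy of the issue, so here is a minimal sketch of the shape described above (job name, driver, image, and file paths are hypothetical; the `change_mode`/`change_signal` settings match the `SIGHUP` behavior described in the reproduction steps):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    count = 5

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }

      # Template referencing a Consul service pool; when the pool changes,
      # the file is re-rendered and the task is SIGHUP'd.
      template {
        destination   = "local/services.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOT
          {{ range service "consul-service-a" }}
          server-a {{ .Address }}:{{ .Port }}
          {{ end }}
        EOT
      }
    }
  }
}
```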
Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
We didn't have visibility into this when the incident that triggered this bug report happened, so although I'm guessing logs from that time would be the most interesting, we won't be able to gather them until we see another reproduction. At least with our increased visibility into this issue we should have a better shot at capturing them. Would anything else besides logs be useful for us to gather?
Otherwise, on working agents you'll see this log message during the breakage:
On the broken agents, you won't.