Templates referencing Consul service pools stop receiving updates #11558
Comments
Thanks for the report @dpn. I can't think of anything off the top of my head that could be causing this problem, so we would need time to investigate it further. As you mentioned, this happens sporadically, which makes it harder to reproduce. Checking the Consul logs the next time this happens may be helpful as well. It could be that the local Consul agent is losing connection with its peers.
We just landed some improvements to …

Going to mark this as waiting for reply to see if @dpn is able to produce logs, but otherwise hopefully 1.2.4 will provide them with the knobs to fix the underlying problem.
Thanks guys. As usual, apologies for the delayed response. We have built alerting around this issue and have not seen it reproduce since the initial report, so I think we're good to close this issue for now, as I no longer have logs from the original incident to provide. In the event we see this again, I will be sure to capture them and update this issue. Thanks again!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
RHEL 7.7, running on colocated bare-metal hosts and VMs as well as AWS EC2 instances.
Issue
When a template references a Consul service, the Consul Template integration sometimes fails to re-render the template when the service pool changes.
We use this pattern all over our cluster, and we have seen this 4 times across the past few years of running Nomad: once 2 years ago, and 3 times in the past 2 months. It seems to be a very rare occurrence, but we didn't have visibility into this exact issue until recently, so with the new alerting we may be able to catch it more often and provide better diagnostic information.
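For concreteness, the kind of template we mean is one whose data walks one or more Consul service pools. A minimal sketch (the output format here is illustrative, not our actual template; the service names are the ones used in the reproduction steps below):

```
{{ range service "consul-service-a" }}
server-a {{ .Address }}:{{ .Port }}
{{ end }}
{{ range service "consul-service-b" }}
server-b {{ .Address }}:{{ .Port }}
{{ end }}
{{ range service "consul-service-c" }}
server-c {{ .Address }}:{{ .Port }}
{{ end }}
```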
Reproduction steps
Let's say we have a job with a count of 5, allocations with short IDs of `00000000`-`00000004`, and that `consul-service-a` ends up being the service affected by this bug.

1. Put `consul-service-a` pool members into maintenance. Note that all allocations properly render the remaining service pool members to disk and are `SIGHUP`'d.
2. … the `consul-service-a` pool member. Note that all allocations properly render the remaining service pool members to disk and are `SIGHUP`'d.
3. Note that `00000000` is no longer logging `Template re-rendered` messages in the allocation logs in the UI, although its sibling allocations continue to do so.
4. On `00000000`, note that the node entries rendered for service `consul-service-a` are not the same as what is rendered for allocations `00000001`-`00000004`.
5. Put `consul-service-b` into maintenance. Note that all allocations correctly render out the node changes for the `consul-service-b` service pool. The rendered nodes for `consul-service-a` are still out of sync between allocation `00000000` and the rest at this point.
6. Repeat the previous step with `consul-service-c`, noting the same behavior.
7. Bounce the Nomad agent on the node running `00000000`. When the agent comes back up, the node entries for all 3 Consul services are in sync between all of the allocations.

Expected Result
The Consul template integration continues to render out Consul service pool changes for all of the services referenced in the template.
Actual Result
Updates for one of the services in the affected allocation stop until the local Nomad agent is bounced.
Job file (if appropriate)
Abridged, but boils down to something like:
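The actual job spec didn't survive in this copy of the issue, so here is a minimal sketch of the shape described above (job name, driver, image, and file paths are hypothetical; the `change_mode`/`change_signal` settings match the `SIGHUP` behavior described in the reproduction steps):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    count = 5

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }

      # Template referencing a Consul service pool; when the pool changes,
      # the file is re-rendered and the task is SIGHUP'd.
      template {
        destination   = "local/services.conf"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        data          = <<-EOT
          {{ range service "consul-service-a" }}
          server-a {{ .Address }}:{{ .Port }}
          {{ end }}
        EOT
      }
    }
  }
}
```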
Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
We didn't have visibility into this when the incident that triggered this bug report happened, so although I'm guessing logs from that time would be the most interesting, we won't be able to gather them until we see another reproduction. At least with our increased visibility into this issue we should have a better shot at capturing them. Would anything else besides logs be useful for us to gather?
Otherwise, on working agents you'll see this log message during the breakage:
On the broken agents, you won't.