Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of scheduler: fix reconciliation of reconnecting allocs into release/1.3.x #16645

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #16609 to be assessed for backporting due to the inclusion of the label backport/1.3.x.

The below text is copied from the body of the original PR.


When a disconnect client reconnects the allocReconciler must find the
allocations that were created to replace the original disconnected
allocations.

This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.

This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the untainted set.

The system SystemScheduler is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.

Closes #15483

tgross and others added 30 commits January 26, 2023 15:04
Scope the `footer` tag SCSS rule for the New Variable form to prevent it
from affecting other `<footer>` elements, such as the gutter menu Nomad
version section.
Also tightens up authentication for these endpoints by enforcing the server
certificate name is valid. We protect these endpoints currently by mTLS and
can't use an auth token because these endpoints are (uniquely) called by the
leader and followers for a given node won't have the leader's ephemeral ACL
token. Add a certificate name check that requests come from a server and not a
client, because no client should ever send these RPCs directly.
When the template hook Update() method is called it may recreate the
template manager if the Nomad or Vault token has been updated.

This caused the new template manager did not have a driver handler
because this was only being set on the Poststart hook, which is not
called for inplace updates.
* client: run alloc pre-kill hooks on last pass despite no live tasks

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

* client: do not run ar cleanup hooks if client is shutting down
The getting started Tutorial has a post-installation steps section that includes
installing CNI plugins. Many users will want to use `bridge` networking right
out of the gate, so adding these same post-install instructions to the main docs
will be a better Day 0 experience for them.
When registering a job with a service and 'consul.allow_unauthenticated=false',
we scan the given Consul token for an acceptable policy or role with an
acceptable policy, but did not scan for an acceptable service identity (which
is backed by an acceptable virtual policy). This PR updates our consul token
validation to also accept a matching service identity when registering a service
into Consul.

Fixes #15902
…#15939)

* build(deps): bump github.com/shoenig/test from 0.6.0 to 0.6.1 in /api

Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.0 to 0.6.1.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](shoenig/test@v0.6.0...v0.6.1)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* deps: update test

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <[email protected]>
Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.22+incompatible to 20.10.23+incompatible.
- [Release notes](https://github.com/docker/cli/releases)
- [Commits](docker/cli@v20.10.22...v20.10.23)

---
updated-dependencies:
- dependency-name: github.com/docker/cli
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…15847)

Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/hashicorp/vault/releases)
- [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG.md)
- [Commits](hashicorp/vault@v1.8.2...v1.8.3)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/vault/api
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
….1 (#15846)

Bumps [github.com/brianvoe/gofakeit/v6](https://github.com/brianvoe/gofakeit) from 6.19.0 to 6.20.1.
- [Release notes](https://github.com/brianvoe/gofakeit/releases)
- [Commits](brianvoe/gofakeit@v6.19.0...v6.20.1)

---
updated-dependencies:
- dependency-name: github.com/brianvoe/gofakeit/v6
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… to 20.10.23+incompatible (#15848)

* build(deps): bump github.com/docker/docker

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.21+incompatible to 20.10.23+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](moby/moby@v20.10.21...v20.10.23)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* changelog: add entry for docker/docker

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <[email protected]>
* Change api Fields for expose and paths

* Add changelog entry

* changelog: add deprecation notes about connect fields

* api: minor style tweaks

---------

Co-authored-by: Seth Hoenig <[email protected]>
* Ensure infra_image gets proper label used for reconciliation

Currently infra containers are not cleaned up as part of the dangling container
cleanup routine. The reason is that Nomad checks if a container is a Nomad owned
container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label.

Ensure we set this label on the infra container as well.

* fix unit test

* changelog: add entry

---------

Co-authored-by: Seth Hoenig <[email protected]>
While you can use any string value for a variable Item's key name
using characters that are outside of the set [unicode.Letter,
unicode.Number,`_`] will require the `index` function for direct
access.
… multiple tags (#15962)

* docker: set force=true on remove image to handle images referenced by multiple tags

This PR changes our call of docker client RemoveImage() to RemoveImageExtended with
the Force=true option set. This fixes a bug where an image referenced by more than
one tag could never be garbage collected by Nomad. The Force option only applies to
stopped containers; it does not affect running workloads.

* docker: add note about image_delay and multiple tags
DocAdam and others added 19 commits March 20, 2023 09:24
updated propogating to propagating
* cli: Add  and  flags to server members

* Update website/content/docs/commands/server/members.mdx

Co-authored-by: James Rasell <[email protected]>

* Update website/content/docs/commands/server/members.mdx

Co-authored-by: James Rasell <[email protected]>

* cli: update the server memebers tests to use must

* cli: add flags addition to changelog

---------

Co-authored-by: James Rasell <[email protected]>
* cli: add json and t flags to quota status command

* cli: add entry to changelog

* Update command/quota_status.go

Co-authored-by: James Rasell <[email protected]>

---------

Co-authored-by: James Rasell <[email protected]>
* Added  and  flag to  command

* cli[style]: small refactor to avoid confussion with tmpl variable

* Update inspect.mdx

* cli: add changelog entry

* Update .changelog/16478.txt

Co-authored-by: James Rasell <[email protected]>

* Update command/quota_inspect.go

Co-authored-by: James Rasell <[email protected]>

---------

Co-authored-by: James Rasell <[email protected]>
* Throw your mouse into traffic

* Add node metadata with a shortcut

* Re-labelled

* Adds a toast notification to job start/stop on keyboard shortcut

* Typo fix
Fixes #16517

Given a 3 Server cluster with at least 1 Client connected to Follower 1:

If a NodeMeta.{Apply,Read} for the Client request is received by
Follower 1 with `AllowStale = false` the Follower will forward the
request to the Leader.

The Leader, not being connected to the target Client, will forward the
RPC to Follower 1.

Follower 1, seeing AllowStale=false, will forward the request to the
Leader.

The Leader, not being connected to... well hoppefully you get the
picture: an infinite loop occurs.
The security fix in Go 1.20.2 does not apply to Nomad.
When a disconnect client reconnects the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.

This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.

This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the `untainted` set.

The system `SystemScheduler` is not affected by this bug because
disconnected clients don't trigger replacements: every eligible client
is already running an allocation.
@hc-github-team-nomad-core hc-github-team-nomad-core requested a review from a team March 24, 2023 23:39
@hc-github-team-nomad-core hc-github-team-nomad-core requested a review from a team as a code owner March 24, 2023 23:39
@hc-github-team-nomad-core hc-github-team-nomad-core requested review from sarahethompson and claire-labry and removed request for a team March 24, 2023 23:39
@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum branch from 06e200c to bcd49cb Compare March 24, 2023 23:39
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 55251b7 into release/1.3.x Mar 24, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum branch March 24, 2023 23:39
Copy link

github-actions bot commented Jan 7, 2025

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.