-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backport of scheduler: fix reconciliation of reconnecting allocs into release/1.3.x #16645
Merged
hc-github-team-nomad-core
merged 1,452 commits into
release/1.3.x
from
backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum
Mar 24, 2023
Merged
Backport of scheduler: fix reconciliation of reconnecting allocs into release/1.3.x #16645
hc-github-team-nomad-core
merged 1,452 commits into
release/1.3.x
from
backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum
Mar 24, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Scope the `footer` tag SCSS rule for the New Variable form to prevent it from affecting other `<footer>` elements, such as the gutter menu Nomad version section.
Also tightens up authentication for these endpoints by enforcing the server certificate name is valid. We protect these endpoints currently by mTLS and can't use an auth token because these endpoints are (uniquely) called by the leader and followers for a given node won't have the leader's ephemeral ACL token. Add a certificate name check that requests come from a server and not a client, because no client should ever send these RPCs directly.
When the template hook Update() method is called it may recreate the template manager if the Nomad or Vault token has been updated. This caused the new template manager did not have a driver handler because this was only being set on the Poststart hook, which is not called for inplace updates.
* client: run alloc pre-kill hooks on last pass despite no live tasks This PR fixes a bug where alloc pre-kill hooks were not run in the edge case where there are no live tasks remaining, but it is also the final update to process for the (terminal) allocation. We need to run cleanup hooks here, otherwise they will not run until the allocation gets garbage collected (i.e. via Destroy()), possibly at a distant time in the future. Fixes #15477 * client: do not run ar cleanup hooks if client is shutting down
The getting started Tutorial has a post-installation steps section that includes installing CNI plugins. Many users will want to use `bridge` networking right out of the gate, so adding these same post-install instructions to the main docs will be a better Day 0 experience for them.
When registering a job with a service and 'consul.allow_unauthenticated=false', we scan the given Consul token for an acceptable policy or role with an acceptable policy, but did not scan for an acceptable service identity (which is backed by an acceptable virtual policy). This PR updates our consul token validation to also accept a matching service identity when registering a service into Consul. Fixes #15902
…#15939) * build(deps): bump github.com/shoenig/test from 0.6.0 to 0.6.1 in /api Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.0 to 0.6.1. - [Release notes](https://github.com/shoenig/test/releases) - [Commits](shoenig/test@v0.6.0...v0.6.1) --- updated-dependencies: - dependency-name: github.com/shoenig/test dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * deps: update test --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <[email protected]>
…` to `_` (#15940) Signed-off-by: dttung2905 <[email protected]>
Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.22+incompatible to 20.10.23+incompatible. - [Release notes](https://github.com/docker/cli/releases) - [Commits](docker/cli@v20.10.22...v20.10.23) --- updated-dependencies: - dependency-name: github.com/docker/cli dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…15847) Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.8.2 to 1.8.3. - [Release notes](https://github.com/hashicorp/vault/releases) - [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG.md) - [Commits](hashicorp/vault@v1.8.2...v1.8.3) --- updated-dependencies: - dependency-name: github.com/hashicorp/vault/api dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
….1 (#15846) Bumps [github.com/brianvoe/gofakeit/v6](https://github.com/brianvoe/gofakeit) from 6.19.0 to 6.20.1. - [Release notes](https://github.com/brianvoe/gofakeit/releases) - [Commits](brianvoe/gofakeit@v6.19.0...v6.20.1) --- updated-dependencies: - dependency-name: github.com/brianvoe/gofakeit/v6 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… to 20.10.23+incompatible (#15848) * build(deps): bump github.com/docker/docker Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.21+incompatible to 20.10.23+incompatible. - [Release notes](https://github.com/docker/docker/releases) - [Commits](moby/moby@v20.10.21...v20.10.23) --- updated-dependencies: - dependency-name: github.com/docker/docker dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * changelog: add entry for docker/docker --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <[email protected]>
* Change api Fields for expose and paths * Add changelog entry * changelog: add deprecation notes about connect fields * api: minor style tweaks --------- Co-authored-by: Seth Hoenig <[email protected]>
* Ensure infra_image gets proper label used for reconciliation Currently infra containers are not cleaned up as part of the dangling container cleanup routine. The reason is that Nomad checks if a container is a Nomad owned container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label. Ensure we set this label on the infra container as well. * fix unit test * changelog: add entry --------- Co-authored-by: Seth Hoenig <[email protected]>
While you can use any string value for a variable Item's key name using characters that are outside of the set [unicode.Letter, unicode.Number,`_`] will require the `index` function for direct access.
… multiple tags (#15962) * docker: set force=true on remove image to handle images referenced by multiple tags This PR changes our call of docker client RemoveImage() to RemoveImageExtended with the Force=true option set. This fixes a bug where an image referenced by more than one tag could never be garbage collected by Nomad. The Force option only applies to stopped containers; it does not affect running workloads. * docker: add note about image_delay and multiple tags
updated propogating to propagating
* cli: Add and flags to server members * Update website/content/docs/commands/server/members.mdx Co-authored-by: James Rasell <[email protected]> * Update website/content/docs/commands/server/members.mdx Co-authored-by: James Rasell <[email protected]> * cli: update the server memebers tests to use must * cli: add flags addition to changelog --------- Co-authored-by: James Rasell <[email protected]>
* cli: add json and t flags to quota status command * cli: add entry to changelog * Update command/quota_status.go Co-authored-by: James Rasell <[email protected]> --------- Co-authored-by: James Rasell <[email protected]>
* Added and flag to command * cli[style]: small refactor to avoid confussion with tmpl variable * Update inspect.mdx * cli: add changelog entry * Update .changelog/16478.txt Co-authored-by: James Rasell <[email protected]> * Update command/quota_inspect.go Co-authored-by: James Rasell <[email protected]> --------- Co-authored-by: James Rasell <[email protected]>
* Throw your mouse into traffic * Add node metadata with a shortcut * Re-labelled * Adds a toast notification to job start/stop on keyboard shortcut * Typo fix
Fixes #16517 Given a 3 Server cluster with at least 1 Client connected to Follower 1: If a NodeMeta.{Apply,Read} for the Client request is received by Follower 1 with `AllowStale = false` the Follower will forward the request to the Leader. The Leader, not being connected to the target Client, will forward the RPC to Follower 1. Follower 1, seeing AllowStale=false, will forward the request to the Leader. The Leader, not being connected to... well hoppefully you get the picture: an infinite loop occurs.
The security fix in Go 1.20.2 does not apply to Nomad.
When a disconnect client reconnects the `allocReconciler` must find the allocations that were created to replace the original disconnected allocations. This process was being done in only a subset of non-terminal untainted allocations, meaning that, if the replacement allocations were not in this state the reconciler didn't stop them, leaving the job in an inconsistent state. This inconsistency is only solved in a future job evaluation, but at that point the allocation is considered reconnected and so the specific reconnection logic was not applied, leading to unexpected outcomes. This commit fixes the problem by running reconnecting allocation reconciliation logic earlier into the process, leaving the rest of the reconciler oblivious of reconnecting allocations. It also uses the full set of allocations to search for replacements, stopping them even if they are not in the `untainted` set. The system `SystemScheduler` is not affected by this bug because disconnected clients don't trigger replacements: every eligible client is already running an allocation.
…timate-opossum
hc-github-team-nomad-core
requested review from
sarahethompson and
claire-labry
and removed request for
a team
March 24, 2023 23:39
hc-github-team-nomad-core
force-pushed
the
backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum
branch
from
March 24, 2023 23:39
06e200c
to
bcd49cb
Compare
hc-github-team-nomad-core
deleted the
backport/b-reconciler-reconnect-fail/supposedly-ultimate-opossum
branch
March 24, 2023 23:39
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport
This PR is auto-generated from #16609 to be assessed for backporting due to the inclusion of the label backport/1.3.x.
The below text is copied from the body of the original PR.
When a disconnect client reconnects the
allocReconciler
must find theallocations that were created to replace the original disconnected
allocations.
This process was being done in only a subset of non-terminal untainted
allocations, meaning that, if the replacement allocations were not in
this state the reconciler didn't stop them, leaving the job in an
inconsistent state.
This inconsistency is only solved in a future job evaluation, but at
that point the allocation is considered reconnected and so the specific
reconnection logic was not applied, leading to unexpected outcomes.
This commit fixes the problem by running reconnecting allocation
reconciliation logic earlier into the process, leaving the rest of the
reconciler oblivious of reconnecting allocations.
It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the
untainted
set.The system
SystemScheduler
is not affected by this bug becausedisconnected clients don't trigger replacements: every eligible client
is already running an allocation.
Closes #15483