
Handle Upgrades and Alloc.TaskResources modification #6922

Merged
merged 9 commits into master from b-alloc-canoncalize on Jan 28, 2020

Conversation

@notnoop (Contributor) commented Jan 9, 2020

In 0.9, Allocation resources tracking changed, such that Allocation.{TaskResources|Resources|SharedResources} was effectively moved to Allocation.AllocatedResources, and we initially targeted 0.11 as the release in which to drop checking the old fields.

This strategy is not correct. Because the Allocation is a struct persisted in Raft and in the client state store, and the persisted data may remain untouched across many Nomad upgrades (e.g. if an alloc doesn't change), we cannot safely stop relying on the old fields. Removing the fallback paths would cause a panic.

This change ensures that Allocation modification follows the pattern of the Job and Node schema changes. Namely, we introduce a Canonicalize method that applies a struct schema migration on reads from a persisted store (Raft log, Raft snapshot, client state store). This eases compatibility handling, as other components can start relying on Allocation.AllocatedResources immediately and no longer need to fall back to Allocation.TaskResources. We used this Canonicalize pattern when introducing Job.Namespace and Node.SchedulingEligibility, but we missed it here.

It's worth noting that we must preserve Allocation.TaskResources effectively forever. Consider an installation with Nomad 0.1 (pre-namespaces, etc.) running a single alloc, then upgraded sequentially through all intermediate versions until 0.11 (or a future version) in a short time. We cannot assume that the alloc or job was updated, or that a Raft snapshot was created with the latest schema. As such, we should keep the fallback paths and perform schema migration indefinitely. We may revisit this if we change upgrade requirements in a way that enforces taking a snapshot with the latest schema.

Though the Allocation.TaskResources struct must be preserved, the structs.Resources methods are dead code now and can be removed. We only need to preserve the struct type and its fields to allow migration.
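To make the pattern concrete, here is a minimal, self-contained sketch of the Canonicalize-style migration described above. The type shapes and field names below are simplified stand-ins for illustration, not Nomad's actual definitions:

package main

import "fmt"

// Simplified stand-ins for the structs involved; the real Nomad types carry
// many more fields.
type Resources struct {
    CPU      int
    MemoryMB int
}

type AllocatedTaskResources struct {
    CPUShares int64
    MemoryMB  int64
}

type AllocatedResources struct {
    Tasks map[string]*AllocatedTaskResources
}

type Allocation struct {
    // Deprecated representation, preserved indefinitely for schema migration.
    TaskResources map[string]*Resources
    // Representation introduced in 0.9.
    AllocatedResources *AllocatedResources
}

// Canonicalize migrates an alloc written by an old version to the new schema.
// For an already up-to-date alloc this is effectively just a nil check.
func (a *Allocation) Canonicalize() {
    if a.AllocatedResources != nil || len(a.TaskResources) == 0 {
        return
    }
    tasks := make(map[string]*AllocatedTaskResources, len(a.TaskResources))
    for name, r := range a.TaskResources {
        tasks[name] = &AllocatedTaskResources{
            CPUShares: int64(r.CPU),
            MemoryMB:  int64(r.MemoryMB),
        }
    }
    a.AllocatedResources = &AllocatedResources{Tasks: tasks}
}

func main() {
    // An alloc as an old version (e.g. 0.8) might have persisted it.
    old := &Allocation{TaskResources: map[string]*Resources{
        "web": {CPU: 500, MemoryMB: 256},
    }}
    old.Canonicalize() // as if just read back from Raft or the client state store
    fmt.Println(old.AllocatedResources.Tasks["web"].MemoryMB) // 256
}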

To reproduce the issue

  1. Spin up a Nomad 0.8 agent; a single-node cluster in non-dev mode suffices
  2. Run an example job
  3. Upgrade cluster to 0.9 then to 0.10
  4. Run nomad alloc status -json <alloc_id> and note that "AllocatedResources" is nil, though the COMPAT comments suggest otherwise

Caveats

We don't anticipate schema migrations adding much overhead here. Canonicalizing an already up-to-date allocation struct is effectively just a nil check.

It's unfortunate that Canonicalize in the structs package refers to upgrade-path schema migration handling, while the Canonicalize methods in the api/ package refer to populating a struct with defaults. This overload is a bit confusing. I opted to keep the current pattern; we may consider renaming the structs package method to UpgradeSchema in a follow-up PR.

Mahmood Ali added 4 commits January 8, 2020 17:22
This commit ensures that Alloc.AllocatedResources is properly populated
when read from persistence stores (namely Raft and client state store).
The alloc struct may have been written previously by an arbitrary old
version that may only populate Alloc.TaskResources.
Now that alloc.Canonicalize() is called in all alloc sources in the
client (i.e. on state restore and RPC fetching), we no longer need to
check alloc.TaskResources.

alloc.AllocatedResources is always non-nil throughout the alloc runner.
However, we check for alloc validity early on, so NewTaskRunner and
TaskEnv must still check.  The `TestClient_AddAllocError` test validates
that behavior.
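The call-site pattern these commit notes describe (canonicalize on state restore and on RPC fetch) might look roughly like the following, reusing the stand-in Allocation type from the sketch earlier; encoding/json stands in for the real codec and restoreAlloc is a hypothetical name, not a function in the PR:

func restoreAlloc(raw []byte) (*Allocation, error) {
    alloc := new(Allocation)
    if err := json.Unmarshal(raw, alloc); err != nil {
        return nil, err
    }
    // Migrate allocs persisted by old versions (e.g. Nomad 0.8) before any
    // other component sees them, so callers can rely on AllocatedResources
    // without falling back to TaskResources.
    alloc.Canonicalize()
    return alloc, nil
}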
@notnoop notnoop added this to the 0.10.3 milestone Jan 9, 2020
@notnoop notnoop requested review from schmichael and tgross January 9, 2020 15:46
@notnoop notnoop self-assigned this Jan 9, 2020
client/allocrunner/taskrunner/task_runner.go (review thread resolved)
@@ -603,7 +603,6 @@ func (b *Builder) setAlloc(alloc *structs.Allocation) *Builder {

tg := alloc.Job.LookupTaskGroup(alloc.TaskGroup)

// COMPAT(0.11): Remove in 0.11
b.otherPorts = make(map[string]string, len(tg.Tasks)*2)
if alloc.AllocatedResources != nil {
Member

Do we need this check anymore? Seems like if AllocatedResources is somehow nil here we'll end up with an incomplete task environment which would be harder to detect in testing than a panic.

Sadly we don't have a logger or error return value here, so our options are limited. If you think risking an incomplete environment is better than risking a panic, let's just add that as a comment here as the code will look kind of strange down the road.

Contributor Author

I kept this conditional because TestClient_AddAllocError asserts that when an invalid alloc is passed, taskenv doesn't panic and NewTaskRunner returns an error. I'm not sure exactly what conditions the test is intended to exercise, though.
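One way the guarded check with an explanatory comment could read, again using the stand-in types from the earlier sketch rather than the real taskenv.Builder (illustrative only, not the code in the PR):

type Builder struct {
    otherPorts map[string]string
}

func (b *Builder) setAlloc(alloc *Allocation) *Builder {
    b.otherPorts = make(map[string]string)
    // COMPAT: AllocatedResources is always non-nil once allocs are
    // canonicalized on restore/RPC fetch. This guard only protects against
    // invalid allocs that fail validation later (see TestClient_AddAllocError);
    // an incomplete task environment is preferred over a panic here.
    if alloc.AllocatedResources == nil {
        return b
    }
    // ... populate otherPorts from alloc.AllocatedResources ...
    return b
}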

command/alloc_status.go (review thread resolved)
@@ -7527,15 +7527,17 @@ type Allocation struct {
// the scheduler.
Resources *Resources

// COMPAT(0.11): Remove in 0.11
Member

Why can't this be COMPAT(0.12): Remove in 0.12? If both clients and servers upgrade-allocs-on-restore in 0.11, what could still have AllocatedResources==nil in 0.12 (and therefore need these fields to populate AllocRes)? It seems like the only problem would be if someone used 0.10 or earlier agents with 0.12 code.

(Not that we have to remove it in 0.12, I'm just curious if we could while maintaining our +/-1 Y version safety.)

Contributor Author

If both clients and servers upgrade-allocs-on-restore in 0.11, what could still have AllocatedResources==nil in 0.12 (and therefore need these fields to populate AllocRes)?

If both clients and servers upgrade the on-disk representation, then yes. But we currently don't do that, nor will we with this PR. Consider the case where a cluster starts on Nomad 0.8, then the operator upgrades in rapid succession through 0.9, 0.10 (with this PR), 0.11, and then to 0.12, so quickly that Raft never generates a snapshot during these upgrades. In that case, Nomad 0.12 will read the representation that was persisted by 0.8, which lacks AllocatedResources.

To be able to fully remove it, we would need to augment the recommended upgrade path to ensure the on-disk representation gets upgraded before a user can perform a subsequent upgrade.

Member

Should we take a Raft snapshot on server agent startup?

Contributor Author

We can consider it, but we'd need to do some vetting and testing before going there. Agent startup is critical to cluster recovery, and I'd be nervous about adding a blocking call there that may fail; if done asynchronously, we'd need to properly indicate to operators when it's safe to upgrade and cope with operators potentially ignoring the warning. Maybe consider it as part of 0.12?

Member

Hm, a blocking nomad operator ... command might be sufficient + upgrade instructions to use it if "rapidly" upgrading from 0.N to 0.N+2.

Definitely seems best to leave it out of scope for this effort, but could you file an issue describing how it's currently impossible to safely remove deprecated fields that are persisted to Raft? Doesn't seem like anything we need to rush to fix, but I can see it mattering a lot more post-1.0 when people are much slower to upgrade (Consul struggles with this) and may want options to upgrade from 1.N to 1.N+x (where x > 1) quickly and easily.

nomad/structs/structs.go (review thread resolved)

// TODO: Investigate if we should canonicalize the job
// it may be out of sync with respect to the original job
// a.Job.Canonicalize()
Member

Suggested change
// a.Job.Canonicalize()
a.Job.Canonicalize()

I think this is the right thing to do:

  • On clients they may have an old version of a job persisted that may not match the expectations of the code running in the agent.
  • On servers I actually don't remember off the top of my head if Jobs are persisted on Allocs (are allocs normalized only in raft messages on the wire, or in state store too 🤔 ?). Luckily this doesn't matter! Job.Canonicalize handles job being nil and Jobs are already canonicalized on statestore restore.

So I think we might waste a couple CPU instructions, but it seems necessary on clients at least.
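For reference, the nil-tolerance being relied on here can be sketched with a stand-in Job type; the Namespace defaulting mirrors the Job.Namespace canonicalization mentioned in the PR description, and the "default" value is an assumption for illustration:

type Job struct {
    Namespace string
}

// Canonicalize tolerates a nil receiver, which is what makes calling
// a.Job.Canonicalize() from Allocation.Canonicalize safe even when the
// alloc carries no job.
func (j *Job) Canonicalize() {
    if j == nil {
        return
    }
    if j.Namespace == "" {
        j.Namespace = "default"
    }
}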

// a.Job.Canonicalize()
}

func toAllocatedResources(taskResources map[string]*Resources) map[string]*AllocatedTaskResources {
Member

We may want to just inline this code into Canonicalize, as it seems easy for it to become orphaned here if we're able to remove the canonicalization in the future.

@tgross (Member) left a comment

@schmichael beat me to finishing the review so I think he's addressed most of my specific questions/comments already. A more general question: calling Canonicalize in the appropriate places is a new invariant which could lead to panics if we miss it. Is there any way (other than "be careful") that we could guarantee we can't miss it at build/test time?

nomad/structs/structs.go (review thread resolved)
@notnoop (Contributor, Author) left a comment

A more general question: calling Canonicalize in the appropriate places is a new invariant which could lead to panics if we miss it. Is there any way (other than "be careful") that we could guarantee we can't miss it at build/test time?

To clarify, calling Canonicalize isn't a new invariant; it's the pattern we already used for the Job and Node migrations. As you point out, it's fragile and easy to miss the subtlety, so generic "be careful" advice isn't great.

My suggestion would be a combination of codifying the patterns related to persisted-struct compatibility (in a checklist or an engineering guide) and following up later with more e2e tests that exercise upgrade paths and test multiple upgrade combinations. Our current e2e framework is too weak to accommodate tests like that, so we'd need more investment there.

Mahmood Ali added 5 commits January 10, 2020 10:41
alloc.Job may be stale as well and needs to be migrated.  It does cost
extra cycles, but that should be negligible.
There is a case for always canonicalizing the alloc.Job field when
canonicalizing the alloc.  I'm less certain of the implications, though,
and the job Canonicalize logic hasn't changed in a long time.

Here, we special-case client restore from the database, as that's probably
the most relevant path.  When receiving an alloc from RPC, the data should
be fresh enough.
@notnoop notnoop merged commit b789b50 into master Jan 28, 2020
@notnoop notnoop deleted the b-alloc-canoncalize branch January 28, 2020 20:12
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 19, 2023