scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

shoenig · 2023-03-10T18:52:22Z

reflect.DeepEqual is not suitable for comparing Job structs.

By design, it

treats nil vs empty slices/maps as different

Note that a non-nil empty slice and a nil slice (for example, []byte{} and []byte(nil)) are not deeply equal.

compares private struct fields

Struct values are deeply equal if their corresponding fields, both exported and unexported, are deeply equal.

The use of reflect.DeepEqual has been the root of bugs a few times in the past.

#14917
#10896
#9330

This PR introduces the OpaqueMapEquals helper function as a replacement for DeepEqual for the specific cases of comparing map[comparable]interface{}, which is necessary in cases like opaque plugin config, and envoy configs. Unlike DE, OME exhibits the opposite behavior in the two scenarios above. For other (known) types, implement .Equal method on structs and compare values directly, or defer to other sensible helpers.

The tasksUpdated helper has also been modified to return a struct containing the first mis-match of fields triggering a destructive update (as opposed to a simple boolean). Folks tracking down misbehavior now need only to add a log line to understand why tasksUpdated is (or isn't) triggering updates.

Benchmark

before

nomad/scheduler on main
➜ go test -count=3 -bench=BenchmarkTasksUpdated -run BenchmarkTasksUpdated
BenchmarkTasksUpdated-24          105920             14791 ns/op
BenchmarkTasksUpdated-24           82866             14376 ns/op
BenchmarkTasksUpdated-24           89450             14674 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    4.485s

after

nomad/scheduler on tasks-updated-annotate
➜ go test -count=3 -bench=BenchmarkTasksUpdated -run BenchmarkTasksUpdated
BenchmarkTasksUpdated-24           54816             24469 ns/op
BenchmarkTasksUpdated-24           47431             24392 ns/op
BenchmarkTasksUpdated-24           48696             24565 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    4.436s

The slowdown is primarily caused by OpaqueMapEquals - replacing its implementation with DeepEqual gets the numbers back down. But these numbers are still negligible and the tradeoff for correctness seems worthwhile.

Closes #16290

schmichael

I'm not going to count my Consul Config comparison concerns as blockers. I bet it's ok to make these changes. The breakages we've had with Consul's API in the past have always been a lot more obvious than nil vs empty.

I think any time any of us writes an Equal method we should remind one another to someday use a code generator. 😅

schmichael · 2023-03-13T17:39:16Z

lib/lang/opaque.go

@@ -0,0 +1,48 @@
+package lang


Totally forgot this package existed. The only other file in here defines Pair which is never referenced directly but only in other types. I think helper is the package most of us look in for common comparison funcs, so my preference would be to use that and kill this, but it really doesn't matter much.

Good point, moved this into helper. At some point it might be nice to split up helper into Nomad-specific helpers (like CheckHCLKeys) vs. language constructs (like this), but we can bikeshed that on a rainy day 🙂

schmichael · 2023-03-13T17:51:28Z

nomad/structs/structs.go

+func (a *Affinity) Hash() string {
+	if a == nil {
+		return "(nil)"
+	}
+	return a.String()


What's this and Spread.Hash for? Having a Hash method return non-hash-derived values seems strange as String() formatting changes for testing/debugging will break Hash compatibility even when the actual struct's contents haven't changed. However these methods don't appear to be called directly anywhere, so I'm not even sure what they're for.

Our use of Hash() methods is already really inconsistent already both in usage pattern (does it generate a hash on every call or does it require calling SetHash first?!) and return type (ints, strings, bytes 🤷). I don't want to make that situation worse without good reason. We've kind of made a mess of any common meaning for Hash.

Affinity.Hash and Spread.Hash are being used

here: https://github.com/hashicorp/nomad/pull/16421/files#diff-25a0c4de309c3511db80cb2707981cda3326bfd49f29148222be44ba9a529fe2R488

and here: https://github.com/hashicorp/nomad/pull/16421/files#diff-25a0c4de309c3511db80cb2707981cda3326bfd49f29148222be44ba9a529fe2R517

Previously we created a slice of all the Affinity/Spread on the job from each level and used reflect.DeepEqual on the slice. Implementing Hash() and deferring to HashSet was just a lazy way doing something like creating the slice and using slices.EqualsFunc - but I can make that change.

schmichael · 2023-03-13T18:00:02Z

nomad/structs/services.go

@@ -1409,7 +1401,7 @@ func (p *ConsulProxy) Equal(o *ConsulProxy) bool {
 		return false
 	}

-	if !opaqueMapsEqual(p.Config, o.Config) {


Is this safe? Can we be sure that Consul doesn't treat a nested empty map differently from a nil map? If the map is a nested value of Config (eg Config["foo"] = map[string]string{}), I could see nil meaning unset and empty meaning set but use defaults. Even it it's safe today can we be sure it will be safe for all future versions of Consul?

Maybe to put it another way: do we gain anything by not using DeepEqual in these cases? It seems like making any assumptions about what's meaningful in the keys and values of opaque maps is ... well: not opaque!

~~On the flip side - what if Consul (or Envoy!) normalizes these fields one way or the other, and the original job uses the other form?~~

These opaque fields are a landmine no matter what, and I think it's more likely we run into the normalization problem.

After a nice nap I realized this is nonsense - when Consul/Envoy normalize these fields they come back via our requesting of the Consul API, and that comparison is done in consul/service_client, which itself uses DeepEqual for the opaque maps.

https://github.com/hashicorp/nomad/blob/main/command/agent/consul/service_client.go#L343

I'll switch the comparison for envoy fields here back to DeepEqual to match

tgross

LGTM!

Can we add a note to the jobspec checklist about when to implement an Equal method? Do we want to lean on this method for nomad/structs/diff.go as well? (Probably in a separate PR)

nomad/structs/structs.go

lgfa29

I think we talked about it in the past, but it would nice to have some automated, randomized, fuzzed tests for this kind of thing.

lgfa29 · 2023-03-13T20:53:40Z

scheduler/util.go

+
+// tasksUpdated creates a comparison between task groups to see if the tasks, their
+// drivers, environment variables or config have been modified.
+func tasksUpdated(jobA, jobB *structs.Job, taskGroup string) comparison {


Probably a very loaded question, and it may be a premature micro-optimization, but do we have a sense of which fields are most likely to be updated?

Since we return early on first change detected I think it would make sense to defer comparison for less used fields (like ephemeral disks) to the bottom and bring fields that more likely to change (like task config or env) closer to the top.

Would it be helpful to also quickly hash the structs with something like https://github.com/mitchellh/hashstructure to exit early if the hashes match, meaning that no update happened?

Since we return early on first change detected I think it would make sense to defer comparison for less used fields

I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.

Would it be helpful to also quickly hash the structs

Unfortunately no generic hashing or equals methodology will work here - there's a handful of custom comparisons we need to do (i.e. ignoring or normalizing fields) even outside of Canonicalize, which you can see in the helper methods called by tasksUpdated.

Even if we could generate a hash, we'd probably want to cache that hash value as part of the struct for it to benefit performance. Otherwise we'd have to rehash on each pass through the scheduler, which would mean chasing all the same set of pointers we need to do to call the helper methods anyways. That's why ex. the node class hash is precalculated.

I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.

Makes sense to focus on correctness. I think the problem with benchmarking something like this is that it's not an empirical performance change, it's more like a heuristic based on what we think is most often modified.

Just for a quick check I modified the benchmark so it updates a task env and moved the task loop to the top and got some improvement:

goos: darwin goarch: arm64 pkg: github.com/hashicorp/nomad/scheduler BenchmarkTasksUpdated-10 125265 9232 ns/op BenchmarkTasksUpdated-10 138152 8697 ns/op BenchmarkTasksUpdated-10 128814 8714 ns/op BenchmarkTasksUpdated-10 136887 8711 ns/op BenchmarkTasksUpdated-10 138432 8690 ns/op BenchmarkTasksUpdated-10 137956 8736 ns/op BenchmarkTasksUpdated-10 136660 8736 ns/op BenchmarkTasksUpdated-10 138224 8728 ns/op BenchmarkTasksUpdated-10 137574 8729 ns/op BenchmarkTasksUpdated-10 133634 8733 ns/op PASS ok github.com/hashicorp/nomad/scheduler 13.029s

goos: darwin goarch: arm64 pkg: github.com/hashicorp/nomad/scheduler BenchmarkTasksUpdated-10 238004 4909 ns/op BenchmarkTasksUpdated-10 237406 4527 ns/op BenchmarkTasksUpdated-10 327616 3654 ns/op BenchmarkTasksUpdated-10 324906 3646 ns/op BenchmarkTasksUpdated-10 325503 3808 ns/op BenchmarkTasksUpdated-10 325220 3822 ns/op BenchmarkTasksUpdated-10 320337 3730 ns/op BenchmarkTasksUpdated-10 279225 3694 ns/op BenchmarkTasksUpdated-10 321788 3704 ns/op BenchmarkTasksUpdated-10 317722 3680 ns/op PASS ok github.com/hashicorp/nomad/scheduler 12.379s

shoenig · 2023-03-14T14:26:25Z

Can we add a note

I added a bit of clarification to the jopspec guide. As for diffing, I think that mostly boils down to the fieldDiffs helper ... and a lot of other custom code. TBH I don't know where to begin cleaning that up, other than to say we shouldn't be writing custom parsing/diff code.

shoenig · 2023-03-14T14:45:46Z

With the swap back to DE for Envoy configs,

BenchmarkTasksUpdated-24           96236             13000 ns/op
BenchmarkTasksUpdated-24           93657             13166 ns/op
BenchmarkTasksUpdated-24           90940             13063 ns/op

... go-cmp is really slow 🙁

vercel bot deployed to Preview – nomad-storybook-and-ui March 10, 2023 18:56 View deployment

shoenig force-pushed the tasks-updated-annotate branch from f98137d to a0b4222 Compare March 10, 2023 19:15

vercel bot deployed to Preview – nomad-storybook-and-ui March 10, 2023 19:19 View deployment

shoenig force-pushed the tasks-updated-annotate branch from a0b4222 to b8de29a Compare March 13, 2023 14:57

vercel bot deployed to Preview – nomad-storybook-and-ui March 13, 2023 15:02 View deployment

shoenig force-pushed the tasks-updated-annotate branch from b8de29a to 2d155ac Compare March 13, 2023 15:56

vercel bot deployed to Preview – nomad-storybook-and-ui March 13, 2023 16:02 View deployment

scheduler: annotate tasksUpdated with reason and purge DeepEquals

2fca70f

shoenig force-pushed the tasks-updated-annotate branch from 2d155ac to 2fca70f Compare March 13, 2023 16:22

vercel bot deployed to Preview – nomad-storybook-and-ui March 13, 2023 16:26 View deployment

shoenig marked this pull request as ready for review March 13, 2023 17:22

shoenig requested review from schmichael, lgfa29 and tgross March 13, 2023 17:23

schmichael approved these changes Mar 13, 2023

View reviewed changes

tgross approved these changes Mar 13, 2023

View reviewed changes

nomad/structs/structs.go Show resolved Hide resolved

cr: move opaque into helper

06a73d8

lgfa29 approved these changes Mar 13, 2023

View reviewed changes

shoenig added 3 commits March 13, 2023 16:13

cr: swap affinity/spread hashing for slice equal

1ddd06d

contributing: update checklist-jobspec with notes about struct methods

b874495

cr: add more cases to wait config equal method

38e1c7d

vercel bot deployed to Preview – nomad-storybook-and-ui March 13, 2023 21:57 View deployment

shoenig added the backport/1.5.x backport to 1.5.x release line label Mar 13, 2023

cr: use reflect when comparing envoy config blocks

f5d0424

vercel bot deployed to Preview – nomad-storybook-and-ui March 14, 2023 14:04 View deployment

cl: add cl

55293b1

vercel bot deployed to Preview – nomad-storybook-and-ui March 14, 2023 14:29 View deployment

shoenig merged commit 1a01e87 into main Mar 14, 2023

shoenig deleted the tasks-updated-annotate branch March 14, 2023 14:46

hc-github-team-nomad-core mentioned this pull request Mar 14, 2023

Backport of scheduler: annotate tasksUpdated with reason and purge DeepEquals into release/1.5.x #16491

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

shoenig commented Mar 10, 2023 •

edited

Loading

schmichael left a comment

schmichael Mar 13, 2023

shoenig Mar 13, 2023

schmichael Mar 13, 2023

shoenig Mar 13, 2023

schmichael Mar 13, 2023

shoenig Mar 13, 2023 •

edited

Loading

shoenig Mar 14, 2023 •

edited

Loading

tgross left a comment

lgfa29 left a comment

lgfa29 Mar 13, 2023

shoenig Mar 13, 2023 •

edited

Loading

tgross Mar 14, 2023

lgfa29 Mar 14, 2023

shoenig commented Mar 14, 2023

shoenig commented Mar 14, 2023

scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

Conversation

shoenig commented Mar 10, 2023 • edited Loading

schmichael left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoenig Mar 13, 2023 • edited Loading

Choose a reason for hiding this comment

shoenig Mar 14, 2023 • edited Loading

Choose a reason for hiding this comment

tgross left a comment

Choose a reason for hiding this comment

lgfa29 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoenig Mar 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoenig commented Mar 14, 2023

shoenig commented Mar 14, 2023

shoenig commented Mar 10, 2023 •

edited

Loading

shoenig Mar 13, 2023 •

edited

Loading

shoenig Mar 14, 2023 •

edited

Loading

shoenig Mar 13, 2023 •

edited

Loading