-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421
Conversation
f98137d
to
a0b4222
Compare
a0b4222
to
b8de29a
Compare
b8de29a
to
2d155ac
Compare
2d155ac
to
2fca70f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not going to count my Consul Config comparison concerns as blockers. I bet it's ok to make these changes. The breakages we've had with Consul's API in the past have always been a lot more obvious than nil vs empty
.
I think any time any of us writes an Equal method we should remind one another to someday use a code generator. 😅
lib/lang/opaque.go
Outdated
@@ -0,0 +1,48 @@ | |||
package lang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally forgot this package existed. The only other file in here defines Pair
which is never referenced directly but only in other types. I think helper
is the package most of us look in for common comparison funcs, so my preference would be to use that and kill this, but it really doesn't matter much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, moved this into helper
. At some point it might be nice to split up helper
into Nomad-specific helpers (like CheckHCLKeys
) vs. language constructs (like this), but we can bikeshed that on a rainy day 🙂
nomad/structs/structs.go
Outdated
func (a *Affinity) Hash() string { | ||
if a == nil { | ||
return "(nil)" | ||
} | ||
return a.String() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this and Spread.Hash for? Having a Hash method return non-hash-derived values seems strange as String()
formatting changes for testing/debugging will break Hash compatibility even when the actual struct's contents haven't changed. However these methods don't appear to be called directly anywhere, so I'm not even sure what they're for.
Our use of Hash()
methods is already really inconsistent already both in usage pattern (does it generate a hash on every call or does it require calling SetHash first?!) and return type (ints, strings, bytes 🤷). I don't want to make that situation worse without good reason. We've kind of made a mess of any common meaning for Hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Affinity.Hash
and Spread.Hash
are being used
Previously we created a slice of all the Affinity/Spread on the job from each level and used reflect.DeepEqual
on the slice. Implementing Hash()
and deferring to HashSet
was just a lazy way doing something like creating the slice and using slices.EqualsFunc
- but I can make that change.
@@ -1409,7 +1401,7 @@ func (p *ConsulProxy) Equal(o *ConsulProxy) bool { | |||
return false | |||
} | |||
|
|||
if !opaqueMapsEqual(p.Config, o.Config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe? Can we be sure that Consul doesn't treat a nested empty map differently from a nil map? If the map is a nested value of Config (eg Config["foo"] = map[string]string{}
), I could see nil meaning unset
and empty meaning set but use defaults
. Even it it's safe today can we be sure it will be safe for all future versions of Consul?
Maybe to put it another way: do we gain anything by not using DeepEqual in these cases? It seems like making any assumptions about what's meaningful in the keys and values of opaque maps is ... well: not opaque!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the flip side - what if Consul (or Envoy!) normalizes these fields one way or the other, and the original job uses the other form?
These opaque fields are a landmine no matter what, and I think it's more likely we run into the normalization problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a nice nap I realized this is nonsense - when Consul/Envoy normalize these fields they come back via our requesting of the Consul API, and that comparison is done in consul/service_client, which itself uses DeepEqual for the opaque maps.
https://github.com/hashicorp/nomad/blob/main/command/agent/consul/service_client.go#L343
I'll switch the comparison for envoy fields here back to DeepEqual to match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Can we add a note to the jobspec checklist about when to implement an Equal
method? Do we want to lean on this method for nomad/structs/diff.go
as well? (Probably in a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we talked about it in the past, but it would nice to have some automated, randomized, fuzzed tests for this kind of thing.
|
||
// tasksUpdated creates a comparison between task groups to see if the tasks, their | ||
// drivers, environment variables or config have been modified. | ||
func tasksUpdated(jobA, jobB *structs.Job, taskGroup string) comparison { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a very loaded question, and it may be a premature micro-optimization, but do we have a sense of which fields are most likely to be updated?
Since we return early on first change detected I think it would make sense to defer comparison for less used fields (like ephemeral disks) to the bottom and bring fields that more likely to change (like task config
or env
) closer to the top.
Would it be helpful to also quickly hash the structs with something like https://github.com/mitchellh/hashstructure to exit early if the hashes match, meaning that no update happened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we return early on first change detected I think it would make sense to defer comparison for less used fields
I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.
Would it be helpful to also quickly hash the structs
Unfortunately no generic hashing or equals methodology will work here - there's a handful of custom comparisons we need to do (i.e. ignoring or normalizing fields) even outside of Canonicalize
, which you can see in the helper methods called by tasksUpdated
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we could generate a hash, we'd probably want to cache that hash value as part of the struct for it to benefit performance. Otherwise we'd have to rehash on each pass through the scheduler, which would mean chasing all the same set of pointers we need to do to call the helper methods anyways. That's why ex. the node class hash is precalculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.
Makes sense to focus on correctness. I think the problem with benchmarking something like this is that it's not an empirical performance change, it's more like a heuristic based on what we think is most often modified.
Just for a quick check I modified the benchmark so it updates a task env
and moved the task loop to the top and got some improvement:
goos: darwin
goarch: arm64
pkg: github.com/hashicorp/nomad/scheduler
BenchmarkTasksUpdated-10 125265 9232 ns/op
BenchmarkTasksUpdated-10 138152 8697 ns/op
BenchmarkTasksUpdated-10 128814 8714 ns/op
BenchmarkTasksUpdated-10 136887 8711 ns/op
BenchmarkTasksUpdated-10 138432 8690 ns/op
BenchmarkTasksUpdated-10 137956 8736 ns/op
BenchmarkTasksUpdated-10 136660 8736 ns/op
BenchmarkTasksUpdated-10 138224 8728 ns/op
BenchmarkTasksUpdated-10 137574 8729 ns/op
BenchmarkTasksUpdated-10 133634 8733 ns/op
PASS
ok github.com/hashicorp/nomad/scheduler 13.029s
goos: darwin
goarch: arm64
pkg: github.com/hashicorp/nomad/scheduler
BenchmarkTasksUpdated-10 238004 4909 ns/op
BenchmarkTasksUpdated-10 237406 4527 ns/op
BenchmarkTasksUpdated-10 327616 3654 ns/op
BenchmarkTasksUpdated-10 324906 3646 ns/op
BenchmarkTasksUpdated-10 325503 3808 ns/op
BenchmarkTasksUpdated-10 325220 3822 ns/op
BenchmarkTasksUpdated-10 320337 3730 ns/op
BenchmarkTasksUpdated-10 279225 3694 ns/op
BenchmarkTasksUpdated-10 321788 3704 ns/op
BenchmarkTasksUpdated-10 317722 3680 ns/op
PASS
ok github.com/hashicorp/nomad/scheduler 12.379s
I added a bit of clarification to the jopspec guide. As for diffing, I think that mostly boils down to the |
With the swap back to DE for Envoy configs,
... |
reflect.DeepEqual is not suitable for comparing
Job
structs.By design, it
The use of
reflect.DeepEqual
has been the root of bugs a few times in the past.#14917
#10896
#9330
This PR introduces the
OpaqueMapEquals
helper function as a replacement forDeepEqual
for the specific cases of comparingmap[comparable]interface{}
, which is necessary in cases like opaque plugin config, and envoy configs. Unlike DE, OME exhibits the opposite behavior in the two scenarios above. For other (known) types, implement.Equal
method on structs and compare values directly, or defer to other sensible helpers.The
tasksUpdated
helper has also been modified to return a struct containing the first mis-match of fields triggering a destructive update (as opposed to a simple boolean). Folks tracking down misbehavior now need only to add a log line to understand whytasksUpdated
is (or isn't) triggering updates.Benchmark
before
after
The slowdown is primarily caused by
OpaqueMapEquals
- replacing its implementation withDeepEqual
gets the numbers back down. But these numbers are still negligible and the tradeoff for correctness seems worthwhile.Closes #16290