finalizing scaling API work #7572
updated ScalingEvent API to record "message string,error bool" instead of confusing "reason,error *string"
Count   *int64
Target  map[string]string
Message string
Error   bool
stylistic nit, Error being a bool seems unfamiliar, HasError might be a bit verbose, Successful, Failed maybe? 🤷‍♂️
yeah, we played around with a few names there... it indicates that the autoscaler wants the Nomad operator to know that something is wrong. HasError might be better. but Success/Failed doesn't feel quite right.
i wanted to use ThereIsAProblemInTheAutoscalerOrTheScalingPolicy but it seemed... not good.
ScalingRequest.HasError and ScalingRequest.Failed both read well, and Failed has a precedent in other events structs
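to make the naming question concrete, here is a rough sketch of the change described in the commit message: two optional strings collapsing into one message plus a flag. the type names oldEvent and newEvent, the convert helper, and the old field names are assumptions for illustration, not the actual Nomad structs.

```go
package main

import "fmt"

// oldEvent approximates the earlier "reason,error *string" shape.
// Type and field names are hypothetical.
type oldEvent struct {
	Reason *string
	Error  *string
}

// newEvent carries the fields shown in the diff under review.
// The type name is hypothetical.
type newEvent struct {
	Count   *int64
	Target  map[string]string
	Message string
	Error   bool
}

// convert collapses the two optional strings into a single message
// and a flag saying whether that message reports a problem.
func convert(o oldEvent) newEvent {
	switch {
	case o.Error != nil:
		return newEvent{Message: *o.Error, Error: true}
	case o.Reason != nil:
		return newEvent{Message: *o.Reason}
	default:
		return newEvent{}
	}
}

func main() {
	msg := "policy target unreachable"
	e := convert(oldEvent{Error: &msg})
	fmt.Println(e.Message, e.Error)
}
```

with the bool form a caller checks one field instead of testing two pointers for nil, which is the readability argument behind whichever name wins.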
Co-Authored-By: Drew Bailey <[email protected]>
lgtm
@@ -51,6 +51,7 @@ const (
	ScalingPolicySnapshot
	CSIPluginSnapshot
	CSIVolumeSnapshot
	ScalingEventsSnapshot
smol smol ting, just double checking this snapshot case denotes multiple scaling events?
these snapshots are a single entry in the table. in this case, all of the scaling events for a single job.
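a small sketch of that point, assuming a table keyed by job ID: the snapshot writes one record per table entry, and each record holds every event for that job. scalingEvent, jobEvents and snapshotRecords are hypothetical names, not the real state store types.

```go
package main

import "fmt"

// scalingEvent is a hypothetical, trimmed-down event.
type scalingEvent struct {
	Message string
	Error   bool
}

// jobEvents is a hypothetical table entry: every event recorded for one job.
type jobEvents struct {
	JobID  string
	Events []scalingEvent
}

// snapshotRecords returns what a snapshot would persist: one record per
// table entry, regardless of how many events each entry holds.
func snapshotRecords(table map[string][]scalingEvent) []jobEvents {
	records := make([]jobEvents, 0, len(table))
	for jobID, events := range table {
		records = append(records, jobEvents{JobID: jobID, Events: events})
	}
	return records
}

func main() {
	table := map[string][]scalingEvent{
		"web":   {{Message: "scaled up"}, {Message: "scaled down"}, {Message: "no capacity", Error: true}},
		"batch": {{Message: "scaled up"}},
	}
	// Two jobs, four events in total: the snapshot holds two records.
	fmt.Println(len(snapshotRecords(table)))
}
```

so the plural in ScalingEventsSnapshot refers to the contents of one record, not to multiple record types.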
defer metrics.MeasureSince([]string{"nomad", "fsm", "upsert_scaling_event"}, time.Now())
var req structs.ScalingEventRequest
if err := structs.Decode(buf, &req); err != nil {
	panic(fmt.Errorf("failed to decode request: %v", err))
do we want to panic here? could we just fail the scaling event request and log the error?
this is the standard behavior for applying raft log entries to the state store. see, e.g.,:
Lines 306 to 311 in 88ff339
func (n *nomadFSM) applyUpsertNode(buf []byte, index uint64) interface{} {
	defer metrics.MeasureSince([]string{"nomad", "fsm", "register_node"}, time.Now())
	var req structs.NodeRegisterRequest
	if err := structs.Decode(buf, &req); err != nil {
		panic(fmt.Errorf("failed to decode request: %v", err))
	}
it likely means that the raft log entry doesn't contain what the leading byte says it should contain, which likely means a version mismatch or corruption. the reason for the panic, i assume, is that state is undefined if we don't know how to apply a raft entry.
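a minimal standalone sketch of that decode-or-panic pattern, for readers unfamiliar with it. this is not the Nomad code: encoding/json stands in for structs.Decode, the state is a plain map, and scalingEventRequest, applyScalingEvent and tryApply are hypothetical names. tryApply exists only so the demo can observe the panic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scalingEventRequest is a hypothetical, simplified log entry payload.
type scalingEventRequest struct {
	JobID   string
	Message string
	Error   bool
}

// applyScalingEvent decodes a log entry and applies it to state.
// An entry that cannot be decoded cannot be applied, so the state
// would be undefined from then on; the function panics rather than skip it.
func applyScalingEvent(state map[string][]string, buf []byte) {
	var req scalingEventRequest
	if err := json.Unmarshal(buf, &req); err != nil {
		panic(fmt.Errorf("failed to decode request: %v", err))
	}
	state[req.JobID] = append(state[req.JobID], req.Message)
}

// tryApply reports whether applying the entry panicked (demo helper only).
func tryApply(state map[string][]string, buf []byte) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	applyScalingEvent(state, buf)
	return false
}

func main() {
	state := map[string][]string{}
	fmt.Println(tryApply(state, []byte(`{"JobID":"web","Message":"scaled up"}`)))
	fmt.Println(tryApply(state, []byte(`not a log entry`)))
}
```

logging and continuing would let this server's state drift from servers that did manage to apply the entry, which is the tradeoff the panic avoids.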
blocking queries on the job scaling status
…to f-7422-scaling-events
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
closed #7422