Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

finalizing scaling API work #7572

Merged
merged 8 commits into from
Apr 1, 2020
Merged

finalizing scaling API work #7572

merged 8 commits into from
Apr 1, 2020

Conversation

cgbaker
Copy link
Contributor

@cgbaker cgbaker commented Mar 31, 2020

  • adding raft and state_store support to track job scaling events
  • updated the ScalingEvent API to record "message string,error bool" instead of cumbersome "reason,error *string"

closed #7422

@cgbaker cgbaker force-pushed the f-7422-scaling-events branch from 447b17f to 95d65f0 Compare April 1, 2020 00:03
updated ScalingEvent API to record "message string,error bool" instead
of confusing "reason,error *string"
@cgbaker cgbaker force-pushed the f-7422-scaling-events branch 2 times, most recently from ab16744 to 732eda8 Compare April 1, 2020 16:18
@cgbaker cgbaker force-pushed the f-7422-scaling-events branch from 732eda8 to 10ffa7e Compare April 1, 2020 16:39
@cgbaker cgbaker marked this pull request as ready for review April 1, 2020 16:39
@cgbaker cgbaker requested a review from jrasell April 1, 2020 16:42
Count *int64
Target map[string]string
Message string
Error bool
Copy link
Contributor

@drewbailey drewbailey Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistic nit, Error being a bool seems unfamiliar, HasError might be a bit verbose, Successful, Failed maybe? 🤷‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we played around with a few names there... it indicates that the autoscaler wants the Nomad operator to know that something is wrong. HasError might be better. but Success/Failed doesn't feel quite right.

i wanted to use ThereIsAProblemInTheAutoscalerOrTheScalingPolicy but it seemed... not good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalingRequest.HasError and ScalingRequest.Failed both read well, and Failed has a precedent in other events structs

nomad/job_endpoint.go Outdated Show resolved Hide resolved
cgbaker and others added 2 commits April 1, 2020 11:56
Copy link
Contributor

@drewbailey drewbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -51,6 +51,7 @@ const (
ScalingPolicySnapshot
CSIPluginSnapshot
CSIVolumeSnapshot
ScalingEventsSnapshot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smol smol ting, just double checking this snapshot case denotes multiple scaling events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these snapshots are a single entry in the table. in this case, all of the scaling events for a single job.

defer metrics.MeasureSince([]string{"nomad", "fsm", "upsert_scaling_event"}, time.Now())
var req structs.ScalingEventRequest
if err := structs.Decode(buf, &req); err != nil {
panic(fmt.Errorf("failed to decode request: %v", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to panic here? could we just fail the scaling event request and log the error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the standard behavior for applying raft log entries to the state store. see, e.g.,:

nomad/nomad/fsm.go

Lines 306 to 311 in 88ff339

func (n *nomadFSM) applyUpsertNode(buf []byte, index uint64) interface{} {
defer metrics.MeasureSince([]string{"nomad", "fsm", "register_node"}, time.Now())
var req structs.NodeRegisterRequest
if err := structs.Decode(buf, &req); err != nil {
panic(fmt.Errorf("failed to decode request: %v", err))
}

it likely means that the raft log entry doesn't contain what the leading byte says it should contain, which likely means a version mismatch or corruption. the reason for the panic, i assume, is that state is undefined if we don't know how to apply a raft entry.

@cgbaker cgbaker merged commit d3e7288 into master Apr 1, 2020
@cgbaker cgbaker deleted the f-7422-scaling-events branch April 1, 2020 18:49
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement scaling events API
3 participants