-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheduler: config option to reject job registration #11610
Conversation
ed803ee
to
f956f33
Compare
f956f33
to
c1e9b5d
Compare
c1e9b5d
to
34d37e4
Compare
34d37e4
to
574eaff
Compare
574eaff
to
1b6fae2
Compare
@@ -13,6 +13,7 @@ const ( | |||
errNoRegionPath = "No path to region" | |||
errTokenNotFound = "ACL token not found" | |||
errPermissionDenied = "Permission denied" | |||
errJobRegistrationDisabled = "Job registration, dispatch, and scale are disabled by the scheduler configuration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikenomitch @davemay99 I updated the error message we emit as discussed, and it'll look like so:
$ nomad job run ./example.nomad
Error submitting job: Unexpected response code: 403 (Job registration, dispatch, and scale are disabled by the scheduler configuration)
I'd be happy to take any suggestions on improving this message!
1b6fae2
to
c363e22
Compare
c363e22
to
bce7d6c
Compare
bce7d6c
to
44c4ae7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some suggestions. Non blocking, but interested in your thoughts.
@@ -451,6 +451,9 @@ func errCodeFromHandler(err error) (int, string) { | |||
} else if strings.HasSuffix(errMsg, structs.ErrTokenNotFound.Error()) { | |||
errMsg = structs.ErrTokenNotFound.Error() | |||
code = 403 | |||
} else if strings.HasSuffix(errMsg, structs.ErrJobRegistrationDisabled.Error()) { | |||
errMsg = structs.ErrJobRegistrationDisabled.Error() | |||
code = 403 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about 409
here instead of 403
? We're trying to get more intentional and consistent with status codes and error messages when the opportunity arrises. 409
is documented as
409 Conflict
The request could not be completed due to a conflict with the current state of the resource
This seems more in line, and the error message should let clients know what the conflict is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent of 409 is to avoid write skew and the direction in the standard is to adjust the request and submit it again. Which is exactly what we don't want here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to the idea of finding a more specific error code but I feel like it needs to be one that doesn't tell any automated system "keep retrying!" (ex. 420 or 429). The 403 tells them "you can't submit because the permissions to submit have been changed"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was afraid this might be the answer. I'll keep looking for a possible alternative code.
@@ -487,6 +490,9 @@ func (s *HTTPServer) wrap(handler func(resp http.ResponseWriter, req *http.Reque | |||
} else if strings.HasSuffix(errMsg, structs.ErrTokenNotFound.Error()) { | |||
errMsg = structs.ErrTokenNotFound.Error() | |||
code = 403 | |||
} else if strings.HasSuffix(errMsg, structs.ErrJobRegistrationDisabled.Error()) { | |||
errMsg = structs.ErrJobRegistrationDisabled.Error() | |||
code = 403 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here.
@@ -128,6 +130,8 @@ server state is authoritative. | |||
|
|||
- `MemoryOversubscriptionEnabled` `(bool: false)` <sup>1.1 Beta</sup> - When `true`, tasks may exceed their reserved memory limit, if the client has excess memory capacity. Tasks must specify [`memory_max`](/docs/job-specification/resources#memory_max) to take advantage of memory oversubscription. | |||
|
|||
- `RejectJobRegistration` `(bool: false)` - When `true`, the server will return permission denied errors for job registration, job dispatch, and job scale APIs, unless the ACL token for the request is a management token. If ACLs are disabled, no user will be able to register jobs. This allows operators to shed load from automated proceses during incident response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you switch to 409
this text will need to be updated to say conflict errors
rather than permission denied errors
. Thank you for explaining the load shedding usage clearly ❤️
During incident response, operators may find that automated processes elsewhere in the organization can be generating new workloads on Nomad clusters that are unable to handle the workload. This changeset adds a field to the `SchedulerConfiguration` API that causes all job registration calls to be rejected unless the request has a management ACL token.
44c4ae7
to
9aa5472
Compare
@DerekStrickland I've addressed most of your comments. But for the error code for the time being I'm going to merge as-is. We can always go back and trivially tweak the error code if we want any time up until the day we ship. 😀 |
For sure! Sorry. Wasn't trying to block. Also, I can't find anything better that isn't repurposing something else inappropriately. |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Fixes #11450
During incident response, operators may find that automated processes
elsewhere in the organization can be generating new workloads on Nomad
clusters that are unable to handle the workload. This changeset adds a
field to the
SchedulerConfiguration
API that causes all jobregistration calls to be rejected unless the request has a management
ACL token.
Note this doesn't include any new CLI commands (see #11450 (comment) for why)