-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce idempotency of dispatched jobs using token on dispatch request #10806
Changes from 3 commits
c4be87a
b5d21a9
f607c35
8502048
d83ff8b
63d8e92
aa3512d
4448cff
6a1a200
3a8febe
2bd2f58
16d43ae
d77329a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1899,6 +1899,7 @@ func (j *Job) Dispatch(args *structs.JobDispatchRequest, reply *structs.JobDispa | |
dispatchJob.Dispatched = true | ||
dispatchJob.Status = "" | ||
dispatchJob.StatusDescription = "" | ||
dispatchJob.DispatchIdempotencyToken = args.IdempotencyToken | ||
|
||
// Merge in the meta data | ||
for k, v := range args.Meta { | ||
|
@@ -1908,6 +1909,39 @@ func (j *Job) Dispatch(args *structs.JobDispatchRequest, reply *structs.JobDispa | |
dispatchJob.Meta[k] = v | ||
} | ||
|
||
// Check to see if an idempotency token was specified on the request | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The comments on this section are kind of the // Ensure that we have only one dispatched version of this job running concurrently
// by comparing the idempotency token against any non-terminal versions. |
||
if args.IdempotencyToken != "" { | ||
// Fetch all jobs that match the parameterized job ID prefix | ||
iter, err := snap.JobsByIDPrefix(ws, parameterizedJob.Namespace, parameterizedJob.ID) | ||
if err != nil { | ||
return fmt.Errorf("failed to retrieve jobs for idempotency check") | ||
alexmunda marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
// Iterate | ||
for { | ||
raw := iter.Next() | ||
if raw == nil { | ||
break | ||
} | ||
|
||
// Ensure the parent ID is an exact match | ||
existingJob := raw.(*structs.Job) | ||
if existingJob.ParentID != dispatchJob.ParentID { | ||
continue | ||
} | ||
|
||
// Idempotency tokens match. Ensure existing job is terminal. | ||
if existingJob.DispatchIdempotencyToken == args.IdempotencyToken { | ||
// The existing job is either pending or running. | ||
// Registering a new job would violate the idempotency token. | ||
if existingJob.Status != structs.JobStatusDead { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the benefit of special casing dead jobs? I fear that this becomes an edge case for super short dispatch jobs - where if the job finishes before the caller retries request (due to long exponential backoff sleep), the job gets re-run. Naively, I'd think we should enforce idempotency as long as possible, basically until the child job is garbage collected. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great point, I hadn't considered the short dispatch case. For our use case on HCP, the idempotency token will map to an action for a specific resource (ie There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh I see. So there are two types of transient failures: First, the dispatch request fails with network transient error (e.g. network or timeout) where the dispatch request may have failed but may also have succeed but the caller wasn't informed of the success. The API consumer will want to retry with idempotent token to avoid repeating the side-effects. I assume here we want to enforce the token for the longest reasonable time, as it never hurts. The second failure is if the dispatch request succeeded but the dispatched job failed with a transient failure (e.g. instance creation failed for transient error and should be retried). Here, api consumer want to retry as soon as the job failed. Is the second failure case that's driving this design? I would have hoped that the Also, you can achieve the second failure handling semantics even if idempotent token is enforced for dead jobs. e.g. you can have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yes, totally forgot about the |
||
return fmt.Errorf("dispatch violates idempotency token of non-terminal child job: %s", existingJob.ID) | ||
alexmunda marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
} | ||
} | ||
|
||
} | ||
|
||
// Compress the payload | ||
dispatchJob.Payload = snappy.Encode(nil, args.Payload) | ||
|
||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a new method here is good for backwards compatibility, but I'm wondering if in the long term it would be better to add a token to the
WriteOptions
struct instead. That would make it possible for us to add idempotency to any kind of write API in the future without adding a bunch of new methods.Would love to hear thoughts from @DerekStrickland @schmichael @notnoop on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of putting the token on the WriteOptions. If feels like it enables an almost aspect oriented approach for handling this generically at more of a framework level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to
WriteOptions
!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alexmunda !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm conflicted about having the field in WriteOptions and slightly lean towards how the PR had it before under JobDispatchRequest. Idempotency is such a common concern, so belonging to WriteOptions make sense. Though, it currently only applies to dispatch jobs; other APIs address a similar but not an exact problem using CAS and ModifyIndex (e.g. Update Scheduler Configuration endpoint).
I'm fine with having it in
WriteOptions
, as it's probably the pattern we want to use moving forward - but would recommend calling out that only the dispatch endpoint supports it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@notnoop I am open to either approach 😄. With the current
WriteOptions
approach, where would the best place be to call that out? Comment on the struct field?