-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfixes in RFS work coordination update queries to claim leases #769
Bugfixes in RFS work coordination update queries to claim leases #769
Conversation
The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised). The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration). That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was. Signed-off-by: Greg Schohn <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #769 +/- ##
============================================
+ Coverage 68.51% 68.53% +0.01%
+ Complexity 1578 1577 -1
============================================
Files 275 275
Lines 11596 11660 +64
Branches 735 737 +2
============================================
+ Hits 7945 7991 +46
- Misses 3248 3263 +15
- Partials 403 406 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -400,6 +402,9 @@ UpdateResult assignOneWorkItem(long expirationWindowSeconds) throws IOException | |||
|
|||
var response = httpClient.makeJsonRequest(POST_METHOD, INDEX_NAME + "/_update_by_query?refresh=true&max_docs=1", | |||
null, body); | |||
if (response.getStatusCode() == 409) { | |||
return UpdateResult.VERSION_CONFLICT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check for resultTree['noop'] > 0 on the below elseIf [line 413]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - addressed in the latest commit (if I understood the concern)
… the server-side/peer clocks don't agree that an item has been expired. That results in no update happening and used to just throw. Now it throws and the client backs off a bit before retrying again. The FullTest now puts some clock jitter into each of its clients and it's no longer seeing an IllegalStateException (or the newly subclassed PotentialClockDriftDetectedException) with the retries in place. Signed-off-by: Greg Schohn <[email protected]>
…hohn/opensearch-migrations into FixFalseUpdatesForRFSCoordination
…ething else altogether. Drift is retryable, other issues aren't. We should probably get more careful about other response codes (or timeouts), but that's a bigger and more sweeping change. Signed-off-by: Greg Schohn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I deployed this within my AWS Environment and seeing it resolved my issues with concurrency
assert numUpdated <= 1; | ||
if (numUpdated > 0) { | ||
return UpdateResult.SUCCESSFUL_ACQUISITION; | ||
} else if (resultTree.path(VERSION_CONFLICTS_FIELD_NAME).longValue() > 0) { | ||
return UpdateResult.VERSION_CONFLICT; | ||
} else if (resultTree.path("total").longValue() == 0) { | ||
return UpdateResult.NOTHING_TO_ACQUIRE; | ||
} else if (noops > 0) { | ||
throw new PotentialClockDriftDetectedException("Found " + noops + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VERSION_CONFLICT isn't being handled with an exception, what do you think about adding UpdateResult.CLOCK_DRIFT
and log the number of noops found here?
This cleans up the logic around acquireNextWorkItem(...)
to be more consistent.
Nothing is calling that code that is changed, can we add some validation since there are some clear off-by-one errors possible? |
The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised). The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration). That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was.
Description
Issues Resolved
https://opensearch.atlassian.net/jira/software/c/projects/MIGRATIONS/boards/1?assignee=63d1918f69c7ae3958d1f57f&selectedIssue=MIGRATIONS-1810
Testing
Ran through the FullTest, observing none of the "The query for the assigned work document returned 0" errors. The test still doesn't assert that that shouldn't happen though.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.