Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes in RFS work coordination update queries to claim leases #769

Conversation

gregschohn
Copy link
Collaborator

The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised). The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration). That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was.

Description

  • Category Bug fix
  • Why these changes are required? To make sure that we're making forward progress on work assignments and not creating unnecessary noise.
  • What is the old behavior before changes and new behavior after changes? We'd get lots of assertions that allegedly updated documents couldn't be found. Now we don't check because we won't falsely mark them as being updated.

Issues Resolved

https://opensearch.atlassian.net/jira/software/c/projects/MIGRATIONS/boards/1?assignee=63d1918f69c7ae3958d1f57f&selectedIssue=MIGRATIONS-1810

Testing

Ran through the FullTest, observing none of the "The query for the assigned work document returned 0" errors. The test still doesn't assert that that shouldn't happen though.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised).
The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration).  That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was.

Signed-off-by: Greg Schohn <[email protected]>
Copy link

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 4.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 68.53%. Comparing base (a978c0a) to head (4bd3273).
Report is 2 commits behind head on main.

Files Patch % Lines
...in/java/com/rfs/cms/OpenSearchWorkCoordinator.java 4.00% 24 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #769      +/-   ##
============================================
+ Coverage     68.51%   68.53%   +0.01%     
+ Complexity     1578     1577       -1     
============================================
  Files           275      275              
  Lines         11596    11660      +64     
  Branches        735      737       +2     
============================================
+ Hits           7945     7991      +46     
- Misses         3248     3263      +15     
- Partials        403      406       +3     
Flag Coverage Δ
gradle-test 61.20% <4.00%> (-0.15%) ⬇️
python-test 87.56% <ø> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gregschohn gregschohn marked this pull request as ready for review June 26, 2024 04:07
@@ -400,6 +402,9 @@ UpdateResult assignOneWorkItem(long expirationWindowSeconds) throws IOException

var response = httpClient.makeJsonRequest(POST_METHOD, INDEX_NAME + "/_update_by_query?refresh=true&max_docs=1",
null, body);
if (response.getStatusCode() == 409) {
return UpdateResult.VERSION_CONFLICT;
Copy link
Member

@AndreKurait AndreKurait Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check for resultTree['noop'] > 0 on the below elseIf [line 413]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - addressed in the latest commit (if I understood the concern)

… the server-side/peer clocks don't agree that an item has been expired.

That results in no update happening and used to just throw.  Now it throws and the client backs off a bit before retrying again.
The FullTest now puts some clock jitter into each of its clients and it's no longer seeing an IllegalStateException (or the newly subclassed PotentialClockDriftDetectedException) with the retries in place.

Signed-off-by: Greg Schohn <[email protected]>
…hohn/opensearch-migrations into FixFalseUpdatesForRFSCoordination
@gregschohn gregschohn changed the title Details oriented bugfixes in RFS work coordination Bugfixes in RFS work coordination update queries to claim leases Jun 26, 2024
…ething else altogether.

Drift is retryable, other issues aren't.  We should probably get more careful about other response codes (or timeouts), but that's a bigger and more sweeping change.

Signed-off-by: Greg Schohn <[email protected]>
Copy link
Member

@AndreKurait AndreKurait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I deployed this within my AWS Environment and seeing it resolved my issues with concurrency

assert numUpdated <= 1;
if (numUpdated > 0) {
return UpdateResult.SUCCESSFUL_ACQUISITION;
} else if (resultTree.path(VERSION_CONFLICTS_FIELD_NAME).longValue() > 0) {
return UpdateResult.VERSION_CONFLICT;
} else if (resultTree.path("total").longValue() == 0) {
return UpdateResult.NOTHING_TO_ACQUIRE;
} else if (noops > 0) {
throw new PotentialClockDriftDetectedException("Found " + noops +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VERSION_CONFLICT isn't being handled with an exception, what do you think about adding UpdateResult.CLOCK_DRIFT and log the number of noops found here?

This cleans up the logic around acquireNextWorkItem(...) to be more consistent.

@peternied
Copy link
Member

Failing after 1s — 4.00% of diff hit (target 68.51%)

Nothing is calling that code that is changed, can we add some validation since there are some clear off-by-one errors possible?

@gregschohn gregschohn merged commit f8b22b1 into opensearch-project:main Jun 26, 2024
12 of 13 checks passed
@gregschohn gregschohn deleted the FixFalseUpdatesForRFSCoordination branch June 26, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants