Bugfixes in RFS work coordination update queries to claim leases #769

gregschohn · 2024-06-26T03:36:13Z

The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised). The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration). That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was.

Description

Category Bug fix
Why these changes are required? To make sure that we're making forward progress on work assignments and not creating unnecessary noise.
What is the old behavior before changes and new behavior after changes? We'd get lots of assertions that allegedly updated documents couldn't be found. Now we don't check because we won't falsely mark them as being updated.

Issues Resolved

https://opensearch.atlassian.net/jira/software/c/projects/MIGRATIONS/boards/1?assignee=63d1918f69c7ae3958d1f57f&selectedIssue=MIGRATIONS-1810

Testing

Ran through the FullTest, observing none of the "The query for the assigned work document returned 0" errors. The test still doesn't assert that that shouldn't happen though.

Check List

New functionality includes testing
- All tests pass, including unit test, integration test and doctest
New functionality has been documented
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

The easy one which probably wasn't ever causing issues is to convert 409 conflict responses immediately to a VERSION_CONFLICT result rather than waiting to parse the response body (& possibly be surprised). The other was to set the operation to noop for the update_by_query result when the conditional checks (like whether the new expiration is > the old expiration). That removes a big (& only?) pressure where the subsequent call to get the assigned document would return 0 results - which made sense since no document was actually updated, even though the update_by_query reported that one was. Signed-off-by: Greg Schohn <[email protected]>

codecov · 2024-06-26T03:37:08Z

Codecov Report

Attention: Patch coverage is 4.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 68.53%. Comparing base (a978c0a) to head (4bd3273).
Report is 2 commits behind head on main.

Files	Patch %	Lines
...in/java/com/rfs/cms/OpenSearchWorkCoordinator.java	4.00%	24 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #769      +/-   ##
============================================
+ Coverage     68.51%   68.53%   +0.01%     
+ Complexity     1578     1577       -1     
============================================
  Files           275      275              
  Lines         11596    11660      +64     
  Branches        735      737       +2     
============================================
+ Hits           7945     7991      +46     
- Misses         3248     3263      +15     
- Partials        403      406       +3

Flag	Coverage Δ
gradle-test	`61.20% <4.00%> (-0.15%)`	⬇️
python-test	`87.56% <ø> (+0.19%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

AndreKurait · 2024-06-26T04:31:51Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

@@ -400,6 +402,9 @@ UpdateResult assignOneWorkItem(long expirationWindowSeconds) throws IOException

        var response = httpClient.makeJsonRequest(POST_METHOD,  INDEX_NAME + "/_update_by_query?refresh=true&max_docs=1",
                null, body);
+        if (response.getStatusCode() == 409) {
+            return UpdateResult.VERSION_CONFLICT;


I think we need to check for resultTree['noop'] > 0 on the below elseIf [line 413]

Thanks - addressed in the latest commit (if I understood the concern)

… the server-side/peer clocks don't agree that an item has been expired. That results in no update happening and used to just throw. Now it throws and the client backs off a bit before retrying again. The FullTest now puts some clock jitter into each of its clients and it's no longer seeing an IllegalStateException (or the newly subclassed PotentialClockDriftDetectedException) with the retries in place. Signed-off-by: Greg Schohn <[email protected]>

…hohn/opensearch-migrations into FixFalseUpdatesForRFSCoordination

…ething else altogether. Drift is retryable, other issues aren't. We should probably get more careful about other response codes (or timeouts), but that's a bigger and more sweeping change. Signed-off-by: Greg Schohn <[email protected]>

AndreKurait

Thanks for this! I deployed this within my AWS Environment and seeing it resolved my issues with concurrency

peternied · 2024-06-26T14:14:10Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

        assert numUpdated <= 1;
        if (numUpdated > 0) {
            return UpdateResult.SUCCESSFUL_ACQUISITION;
        } else if (resultTree.path(VERSION_CONFLICTS_FIELD_NAME).longValue() > 0) {
            return UpdateResult.VERSION_CONFLICT;
        } else if (resultTree.path("total").longValue() == 0) {
            return UpdateResult.NOTHING_TO_ACQUIRE;
+        } else if (noops > 0) {
+            throw new PotentialClockDriftDetectedException("Found " + noops +


VERSION_CONFLICT isn't being handled with an exception, what do you think about adding UpdateResult.CLOCK_DRIFT and log the number of noops found here?

This cleans up the logic around acquireNextWorkItem(...) to be more consistent.

peternied · 2024-06-26T16:05:30Z

Failing after 1s — 4.00% of diff hit (target 68.51%)

Nothing is calling that code that is changed, can we add some validation since there are some clear off-by-one errors possible?

gregschohn marked this pull request as ready for review June 26, 2024 04:07

gregschohn requested review from AndreKurait, chelma, lewijacn, mikaylathompson, okhasawn, peternied and sumobrian as code owners June 26, 2024 04:07

Merge branch 'main' into FixFalseUpdatesForRFSCoordination

a7ba17f

AndreKurait reviewed Jun 26, 2024

View reviewed changes

gregschohn added 2 commits June 26, 2024 07:01

Merge branch 'FixFalseUpdatesForRFSCoordination' of github.com:gregsc…

612d6e7

…hohn/opensearch-migrations into FixFalseUpdatesForRFSCoordination

gregschohn changed the title ~~Details oriented bugfixes in RFS work coordination~~ Bugfixes in RFS work coordination update queries to claim leases Jun 26, 2024

gregschohn added 2 commits June 26, 2024 07:16

Merge branch 'main' into FixFalseUpdatesForRFSCoordination

6c23514

AndreKurait approved these changes Jun 26, 2024

View reviewed changes

peternied reviewed Jun 26, 2024

View reviewed changes

Merge branch 'main' into FixFalseUpdatesForRFSCoordination

4bd3273

gregschohn merged commit f8b22b1 into opensearch-project:main Jun 26, 2024
12 of 13 checks passed

gregschohn deleted the FixFalseUpdatesForRFSCoordination branch June 26, 2024 18:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfixes in RFS work coordination update queries to claim leases #769

Bugfixes in RFS work coordination update queries to claim leases #769

gregschohn commented Jun 26, 2024

codecov bot commented Jun 26, 2024 •

edited

Loading

AndreKurait Jun 26, 2024 •

edited

Loading

gregschohn Jun 26, 2024

AndreKurait left a comment

peternied Jun 26, 2024

peternied commented Jun 26, 2024

Bugfixes in RFS work coordination update queries to claim leases #769

Bugfixes in RFS work coordination update queries to claim leases #769

Conversation

gregschohn commented Jun 26, 2024

Description

Issues Resolved

Testing

Check List

codecov bot commented Jun 26, 2024 • edited Loading

Codecov Report

AndreKurait Jun 26, 2024 • edited Loading

Choose a reason for hiding this comment

gregschohn Jun 26, 2024

Choose a reason for hiding this comment

AndreKurait left a comment

Choose a reason for hiding this comment

peternied Jun 26, 2024

Choose a reason for hiding this comment

peternied commented Jun 26, 2024

codecov bot commented Jun 26, 2024 •

edited

Loading

AndreKurait Jun 26, 2024 •

edited

Loading