Two fixes to work coordination #777

gregschohn · 2024-06-26T18:44:23Z

Description

Make the check for the setup index more reliable and only scan for work that is currently expired.

The HEAD check has now been rolled into the doUntil block for running setup() so that even 400s from the PUT command can be ignored and retried again in a sane manner. The HEAD call should be the authoritative call when it returns 200. The second fix is for searching for items to update. I had been searching for items that weren't even updated yet, causing a lot more pressure to create false positives and inefficiencies all around.

Category Bug fix
Why these changes are required? Too much contention/no guarantee on convergence for RFS Work coordination.
What is the old behavior before changes and new behavior after changes? Lots of thrashing to setup an initial index and a large number of PotentialClockDriftDetectedExceptions bleeding upward.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1810

Testing

Manual/gradle build

Check List

New functionality includes testing
- All tests pass, including unit test, integration test and doctest
New functionality has been documented
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ore reliable and only scan for work that is currently expired. The HEAD check has now been rolled into the doUntil block for running setup() so that even 400s from the PUT command can be ignored and retried again in a sane manner. The HEAD call should be the authoritative call when it returns 200. The second fix is for searching for items to update. I had been searching for items that weren't even updated yet, causing a lot more pressure to create false positives and inefficiencies all around. Signed-off-by: Greg Schohn <[email protected]>

codecov · 2024-06-26T18:46:32Z

Codecov Report

Attention: Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 69.21%. Comparing base (4bd3273) to head (8e32803).
Report is 1 commits behind head on main.

Files	Patch %	Lines
...in/java/com/rfs/cms/OpenSearchWorkCoordinator.java	50.00%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #777      +/-   ##
============================================
+ Coverage     68.53%   69.21%   +0.68%     
- Complexity     1577     1594      +17     
============================================
  Files           275      275              
  Lines         11660    11660              
  Branches        737      737              
============================================
+ Hits           7991     8071      +80     
+ Misses         3263     3168      -95     
- Partials        406      421      +15

Flag	Coverage Δ
gradle-test	`62.15% <50.00%> (+0.95%)`	⬆️
python-test	`87.56% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

chelma · 2024-06-26T20:07:04Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

@@ -123,7 +123,7 @@ public String toString() {
                            return "[ statusCode: " + r.getStatusCode() + ", payload: " + payloadStr + "]";
                        }
                    },
-                    (response, ignored) -> (response.getStatusCode() / 100) != 2);


Just curious - what was the original logic you were trying to achieve here? I didn't quite follow your explanation in the PR overview.

The test determines when we're done needing to retry. When this is 'true', the work is deemed to be done. In other words, we would keep looping until we got something other than a 200. Notice that the beginning of the method was a HEAD that didn't have the same affliction - so in the worst case scenario, we'd burn a process and be resolved the next time that the program ran

gregschohn changed the title ~~Two fixes to work coordination - make the check for the setup index m…~~ Two fixes to work coordination Jun 26, 2024

gregschohn marked this pull request as ready for review June 26, 2024 19:25

gregschohn requested review from AndreKurait, chelma, lewijacn, mikaylathompson, okhasawn, peternied and sumobrian as code owners June 26, 2024 19:25

peternied requested changes Jun 26, 2024

View reviewed changes

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java Show resolved Hide resolved

chelma reviewed Jun 26, 2024

View reviewed changes

AndreKurait approved these changes Jun 26, 2024

View reviewed changes

peternied approved these changes Jun 26, 2024

View reviewed changes

gregschohn merged commit 8c323d7 into opensearch-project:main Jun 26, 2024
12 of 13 checks passed

gregschohn deleted the RFSCoordinationImprovements branch July 13, 2024 01:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two fixes to work coordination #777

Two fixes to work coordination #777

gregschohn commented Jun 26, 2024 •

edited

Loading

codecov bot commented Jun 26, 2024 •

edited

Loading

chelma Jun 26, 2024

gregschohn Jun 26, 2024

Two fixes to work coordination #777

Two fixes to work coordination #777

Conversation

gregschohn commented Jun 26, 2024 • edited Loading

Description

Issues Resolved

Testing

Check List

codecov bot commented Jun 26, 2024 • edited Loading

Codecov Report

chelma Jun 26, 2024

Choose a reason for hiding this comment

gregschohn Jun 26, 2024

Choose a reason for hiding this comment

gregschohn commented Jun 26, 2024 •

edited

Loading

codecov bot commented Jun 26, 2024 •

edited

Loading