
[ResponseOps] results from taskManager.bulk*() calls ignored in rules client #145316

Closed

pmuellr opened this issue Nov 16, 2022 · 1 comment · Fixed by #147808
Assignees

ymao1

Labels

bug (Fixes for quality problems that affect the customer experience)
Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
Feature:Alerting/RulesManagement (Issues related to the Rules Management UX)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

pmuellr (Member) commented Nov 16, 2022

Kibana version: main

In the rules client module, there are several calls to this.taskManager.bulk*() methods that do not capture the results of the call. However, one of the operations in the bulk could fail without the call throwing an error. I believe an error is only thrown from a bulk call in catastrophic cases; in the normal case, a response is returned independently for each operation in the bulk, which means we could be treating errors as successes.

Here's an example using bulkUpdateSchedules(); the same pattern appears in the calls to the this.taskManager methods bulkEnable() and bulkDisable():

```ts
try {
  await this.taskManager.bulkUpdateSchedules(taskIds, scheduleOperation.value);
  this.logger.debug(
    `Successfully updated schedules for underlying tasks: ${taskIds.join(', ')}`
  );
} catch (error) {
  this.logger.error(
    `Failure to update schedules for underlying tasks: ${taskIds.join(
      ', '
    )}. TaskManager bulkUpdateSchedules failed with Error: ${error.message}`
  );
}
```
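
For contrast, a minimal sketch of what capturing the result could look like, in the same context as the excerpt above. The `{ errors }` result shape here is an assumed example, not the real bulkUpdateSchedules() return type:

```ts
// Sketch only: inspect per-operation results instead of discarding them.
// The `{ errors }` result shape is an assumption for illustration.
const result = await this.taskManager.bulkUpdateSchedules(
  taskIds,
  scheduleOperation.value
);
if (result.errors?.length) {
  // Treat per-operation failures as failures instead of silent successes.
  this.logger.error(
    `bulkUpdateSchedules reported per-task failures: ${result.errors
      .map((e: { id: string; message: string }) => `${e.id}: ${e.message}`)
      .join('; ')}`
  );
}
```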

To handle this, we are going to want to retry the operations that failed, in case the failure was related to Optimistic Concurrency Control (OCC). We have other OCC retry patterns in the rules client code, so we should copy one of them. I don't think we have "bulk" versions yet, though; those will be a little different in that they should retry just the operations that failed (see the sketch below).
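
A rough sketch of what a bulk OCC retry could look like, resubmitting only the operations that came back with a conflict. The `BulkResult` shape, the 409 status check, and the helper name are all assumptions for illustration, not existing APIs:

```ts
// Hypothetical shape of a bulk result; assumed for illustration only.
interface BulkResult<T> {
  successes: T[];
  errors: Array<{ id: string; status?: number; message: string }>;
}

// Retry a bulk operation, resubmitting only the ids that conflicted.
async function bulkWithConflictRetry<T>(
  ids: string[],
  bulkOp: (ids: string[]) => Promise<BulkResult<T>>,
  maxRetries = 3
): Promise<BulkResult<T>> {
  const successes: T[] = [];
  const errors: BulkResult<T>['errors'] = [];
  let pending = ids;
  for (let attempt = 0; attempt <= maxRetries && pending.length > 0; attempt++) {
    const result = await bulkOp(pending);
    successes.push(...result.successes);
    const conflicts = result.errors.filter((e) => e.status === 409);
    // Non-conflict errors are not retried; surface them immediately.
    errors.push(...result.errors.filter((e) => e.status !== 409));
    if (attempt === maxRetries) {
      // Out of retries: surface the remaining conflicts as errors too.
      errors.push(...conflicts);
      pending = [];
    } else {
      pending = conflicts.map((e) => e.id);
    }
  }
  return { successes, errors };
}
```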

I think there's a chance this is the cause of #141849 (and related failures) and why we haven't been able to find a fix for those in #144739. One explanation would be that we are in fact getting an OCC error for a task that is either just starting or just finishing and updating its own task doc, while the bulk operation is also being called: each one gets the task doc and then updates it. That race is probably pretty rare, but the failures have been fairly rare too.

pmuellr added the bug, Team:ResponseOps, Feature:Alerting/RulesFramework, and Feature:Alerting/RulesManagement labels Nov 16, 2022
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Nov 17, 2022
doakalexi self-assigned and then unassigned this Dec 7, 2022
ymao1 self-assigned this Dec 19, 2022
ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 19, 2022
ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Dec 20, 2022
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Dec 21, 2022
ymao1 added a commit that referenced this issue Dec 21, 2022
Resolves #145316, #141849, #141864

## Summary

Adds a retry on conflict errors to the saved objects bulk update call
made by task manager. Errors are returned by the saved objects client
inside an array (alongside the success response). Previously, we were
not inspecting the response array, just returning the full data. With
this PR, we inspect the response array specifically for conflict errors
and retry the update for just those tasks.

This `bulkUpdate` function is used both internally by task manager and
externally by the rules client. I default the number of retries to 0 for
bulk updates from within task manager in order to preserve existing
behavior (and to avoid increasing the amount of time it takes for task
manager to run), but use 3 retries when the function is called externally.
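
A sketch of that split; the names, the `retries` option, and the signature are assumptions for illustration, not the merged implementation:

```ts
// Illustrative only: names and the `retries` option are assumed.
type TaskDoc = { id: string; [key: string]: unknown };

class TaskStore {
  async bulkUpdate(docs: TaskDoc[], { retries = 0 }: { retries?: number } = {}) {
    // Issue the saved objects bulk update, then re-issue just the docs
    // whose responses came back with conflict errors, up to `retries`
    // times. (Body omitted in this sketch.)
  }
}

async function examples(taskStore: TaskStore, docs: TaskDoc[]) {
  // Internal task manager callers keep today's single-attempt behavior:
  await taskStore.bulkUpdate(docs);
  // External callers such as the rules client opt into conflict retries:
  await taskStore.bulkUpdate(docs, { retries: 3 });
}
```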

Also unskipped the two flaky disable tests and ran them through the
flaky test runner 400 times with no failures:

* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1652
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1653
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1660
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1661

Co-authored-by: Kibana Machine <[email protected]>