
[ML] Add graceful retry for anomaly detector result indexing failures #49508

Merged

merged 20 commits into elastic:master on Dec 12, 2019

Conversation

benwtrent
Member

This adds a new setting that allows bulk indexing of results to be retried.

In theory, this should work just fine, as the named pipes are bounded queues.

In the event of a retry:

  • The results handler stops reading from its named pipe, which can then reach capacity
  • This causes the job to stop processing results
  • The job stops pulling from the named pipe fed by the datafeed
  • That pipe reaches capacity
  • The datafeed pauses, waiting for its data to be read

Marking this as WIP, as more digging through the failure paths needs to be done.

Would also be good to get a larger-scale test in place to verify that the backpressure is propagated all the way back to the datafeed without anything getting dropped (see the sketch below).

closes #45711
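To make the intended behaviour concrete, here is a minimal, self-contained sketch in plain Java (it is not Elasticsearch code; the queue stands in for a bounded named pipe). When the consumer slows down, e.g. while sleeping between bulk-index retries, the queue fills and the producer blocks on put(), which is the same backpressure effect as the datafeed pausing:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the bounded named pipe between the datafeed and the results processor.
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(10);

        // Producer: stand-in for the datafeed posting data.
        Thread datafeed = new Thread(() -> {
            try {
                for (int i = 0; i < 200; i++) {
                    pipe.put("doc-" + i); // blocks once the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: stand-in for the results processor. Sleeping between items
        // (as a retry with backoff would) fills the queue and stalls the producer.
        Thread resultsProcessor = new Thread(() -> {
            try {
                while (true) {
                    String doc = pipe.poll(1, TimeUnit.SECONDS);
                    if (doc == null) {
                        break; // producer is done and the queue has drained
                    }
                    Thread.sleep(20); // simulated indexing/retry delay
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        datafeed.start();
        resultsProcessor.start();
        datafeed.join();
        resultsProcessor.join();
    }
}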

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

openJob(job.getId());
startDatafeed(datafeedConfig.getId(), oneDayAgo, now);

// TODO Any better way?????
Member Author

I am not sure of a way to query the internal state... I wonder if we can read the node logs to see whether there is an entry indicating that the bulk index failed?

@@ -218,6 +218,7 @@ public void executeRequest() {
BulkResponse addRecordsResponse = client.bulk(bulkRequest).actionGet();
if (addRecordsResponse.hasFailures()) {
logger.error("[{}] Bulk index of results has errors: {}", jobId, addRecordsResponse.buildFailureMessage());
throw new BulkIndexException(addRecordsResponse);
Member Author

I thought it best to throw here and handle retries up the stack. That way the retry logic knows about the processor state and can stop retrying if the processor died (or is dying).

@benwtrent
Member Author

Verified through testing that the datafeed does feel the backpressure, in two ways:

  • If a lookback completes, it will pause waiting for a flush.
  • The datafeed pauses while the postData call waits to be handled because the job is still processing previous requests.

Previously, bulk index requests did not throw exceptions. With this change, flush failure handling is different:

Flush requests can now fail because items were not written (previously the error was just logged). This is OK, as we now actually notify the listener when the flush fails. The DatafeedJob will continue executing even after a failed flush (logging the failure) and proceed to the next execution time (for real-time).

Digging through

void processResult(AutodetectResult result) throws JobResultsPersister.BulkIndexException {

It is now possible that, if a bulk index request fails, e.g. in

if (records != null && !records.isEmpty()) {
    bulkResultsPersister.persistRecords(records); // executes the bulk index request if over the request size threshold
}

Then the rest of the results processing could be skipped (since bulk index failures now throw).

This is slightly different than before, where the rest of the results processing would continue as normal (even if previous results failed to index). In reality, if one of the results failed to bulk index, it could be assumed that the rest would fail as well.

But we want to make this more reliable, so maybe the retries should be inside each of the individual processResult actions?

@droberts195 what do you think?

@benwtrent
Member Author

Another thought is now with the bulk retries, the bulk request items do not get cleared after a failure.

I think we should probably clear them out, most likely within the bulkPersistWithRetry method. Once all the attempts are exhausted, something like persister.clearBulkRequest() should be called. Otherwise, the bulk request could continue to grow without bound, probably exacerbating the failures.
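A sketch of that idea, using the names mentioned in this thread but otherwise hypothetical scaffolding; it also folds in two review suggestions that come up later (using <= so that zero retries still means one attempt, and restoring the interrupt flag):

public class BulkPersistRetrySketch {

    @FunctionalInterface
    interface CheckedRunnable<E extends Exception> {
        void run() throws E;
    }

    static class BulkIndexException extends Exception {
        BulkIndexException(String message) { super(message); }
    }

    private final int maximumFailureRetries;
    private final Runnable clearBulkRequest;

    BulkPersistRetrySketch(int maximumFailureRetries, Runnable clearBulkRequest) {
        this.maximumFailureRetries = maximumFailureRetries;
        this.clearBulkRequest = clearBulkRequest;
    }

    // Run the bulk-indexing action, retrying with exponential backoff. If every
    // attempt fails, clear the accumulated bulk request so it cannot keep growing
    // (and keep making the failures worse) across subsequent results.
    void bulkPersistWithRetry(CheckedRunnable<BulkIndexException> bulkRunnable) {
        int attempts = 0;
        while (attempts <= maximumFailureRetries) { // <= so that zero retries still means one try
            try {
                bulkRunnable.run();
                return; // success, nothing left to clear
            } catch (BulkIndexException e) {
                attempts++;
                try {
                    double backOff = ((1 << attempts) - 1) / 2.0; // no cap here; see the later discussion about capping the exponent
                    Thread.sleep((long) (backOff * 100));
                } catch (InterruptedException interrupt) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                    break;
                }
            }
        }
        clearBulkRequest.run(); // attempts exhausted (or interrupted): drop the pending items
    }
}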

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@droberts195 droberts195 left a comment

Thanks for taking this on.

I have made a few initial comments.

But as I was reading through the changes I realised this is more complicated than I thought. I'll have a closer look at the flush logic tomorrow, as that's an area that has the potential to completely lock up all processing if we get it wrong.

"xpack.ml.persist_results_max_retries",
2,
0,
Integer.MAX_VALUE - 2,
Contributor

This should probably be lower, say 100.

Member Author

@droberts195 yeah, I agree. I am also thinking the random sleep should probably be a random value between some minimum value and the current exponential backoff max.
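A sketch of that jitter idea: pick a random sleep between a fixed minimum and the current exponential cap. The constant values mirror ones that appear later in this PR; the helper itself is illustrative, not the final code:

import java.time.Duration;
import java.util.Random;

public class JitteredBackoffSketch {
    private static final int MIN_RETRY_SLEEP_MILLIS = 50;
    private static final int MAX_RETRY_SLEEP_MILLIS = (int) Duration.ofMinutes(15).toMillis();
    private static final int MAX_RETRY_EXPONENT = 24; // keeps the multiplication below Integer.MAX_VALUE

    private static final Random RANDOM = new Random();

    // Returns a random sleep in [MIN_RETRY_SLEEP_MILLIS, cap], where the cap grows
    // exponentially with the attempt number and is bounded by MAX_RETRY_SLEEP_MILLIS.
    static int nextSleepMillis(int currentAttempt) {
        int uncappedMax = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) * 50;
        int max = Math.min(Math.max(uncappedMax, MIN_RETRY_SLEEP_MILLIS), MAX_RETRY_SLEEP_MILLIS);
        return MIN_RETRY_SLEEP_MILLIS + RANDOM.nextInt(max - MIN_RETRY_SLEEP_MILLIS + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + " -> sleep " + nextSleepMillis(attempt) + " ms");
        }
    }
}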

@@ -71,6 +73,14 @@
*/
public class AutodetectResultProcessor {

public static final Setting<Integer> PERSIST_RESULTS_MAX_RETRIES = Setting.intSetting(
"xpack.ml.persist_results_max_retries",
Contributor

We should think more about the name of this setting before release. I guess it's possible that indexing data frame analytics results could also fail and need to be retried. In this case we can keep the current setting name and use the same number of retries for both anomaly detection results and data frame analytics results.

But if in the long term we think this setting will only ever be used for anomaly detection results then we should change the name of the setting to xpack.ml.persist_anomaly_results_max_retries.

I am leaning towards using the same setting eventually for data frame analytics results and keeping the name as is. What do you think @dimitris-athanasiou?

Contributor

It would definitely make sense to reuse this setting for data frame analytics.

@@ -310,6 +323,46 @@ void processResult(AutodetectResult result) {
}
}

void bulkPersistWithRetry(CheckedRunnable<JobResultsPersister.BulkIndexException> bulkRunnable) {
int attempts = 0;
while(attempts < maximumFailureRetries) {
Contributor

Shouldn't it be <=, because if retries is zero we still want to try once?

double backOff = ((1 << attempts) - 1) / 2.0;
Thread.sleep((int)(backOff * 100));
} catch (InterruptedException interrupt) {
LOGGER.warn(
Contributor

There should be a Thread.currentThread().interrupt(); in this catch block as well as the logging so that the fact this thread was interrupted is not forgotten.
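A self-contained illustration of that pattern (the helper name is made up for the example):

public class InterruptAwareSleep {

    // Sleep between retries; if interrupted, restore the interrupt flag so the caller
    // (and anything further up the stack) can see it and stop retrying promptly.
    static boolean sleepBetweenRetries(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException interrupt) {
            Thread.currentThread().interrupt(); // do not swallow the interrupt
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();       // simulate an interrupt arriving
        boolean slept = sleepBetweenRetries(100); // returns false immediately
        System.out.println("slept=" + slept + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}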

LOGGER.warn(new ParameterizedMessage("[{}] Error processing autodetect result", jobId), e);
}
}
bulkResultsPersister.clearBulkRequest();
Contributor

Another thought is now with the bulk retries, the bulk request items do not get cleared after a failure.

I think we should also be removing successful items from the bulk request before retrying it, as that will also reduce the burden on the bulk threadpool. That will have to be done in the JobResultsPersister class.

Member Author

I think we should also be removing successful items from the bulk request before retrying it

This is tricky. I will look into trying to filter out the successes. It may be doable because all the indexing requests contain a provided doc ID.
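A possible shape for that filtering, as a sketch against the Elasticsearch bulk API (the surrounding class and method are hypothetical; it assumes, as noted above, that every request carries an explicit doc ID):

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

public final class BulkRetryFilter {

    // Build a new BulkRequest containing only the items that failed in the previous attempt,
    // matched by document ID. Dropping the successful items means a retry does not re-index
    // them and puts less load on the bulk threadpool.
    static BulkRequest onlyFailedItems(BulkRequest original, BulkResponse previousResponse) {
        Set<String> failedIds = Arrays.stream(previousResponse.getItems())
            .filter(BulkItemResponse::isFailed)
            .map(BulkItemResponse::getId)
            .collect(Collectors.toSet());

        List<DocWriteRequest<?>> stillToIndex = original.requests().stream()
            .filter(request -> failedIds.contains(request.id()))
            .collect(Collectors.toList());

        BulkRequest retryRequest = new BulkRequest();
        stillToIndex.forEach(retryRequest::add);
        return retryRequest;
    }
}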

@droberts195 droberts195 changed the title [ML] Add graceful retry for bulk index results failures [ML] Add graceful retry for anomaly detector result indexing failures Dec 2, 2019
@droberts195
Contributor

We discussed this in more detail on a call and came up with the following requirements:

  1. We should retry all* ML result indexing failures, not just those that currently use the bulk endpoint
  2. The same setting should determine the number of retries - we don't want many settings
  3. Failures to index after exhausting all retries should fail the job
  4. But since we don't want 3 to happen for transient problems, the default number of retries should be high and the sum of the backoff time in between these retries should also be high - of the order of 30 minutes in total

(* retries should be added to data frame analytics in a separate PR - this one should just change anomaly detection)
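As a rough sanity check on requirement 4, a sketch that sums the backoff using the formula and caps discussed later in this PR (50 ms times 2^attempt minus 1, per-sleep cap of 15 minutes, exponent capped at 24); it ignores jitter, so it is an upper bound:

import java.time.Duration;

public class BackoffBudgetSketch {
    public static void main(String[] args) {
        final long maxSleepMillis = Duration.ofMinutes(15).toMillis();
        final int maxRetryExponent = 24;
        final int retries = 20; // illustrative retry count, not the actual default
        long totalMillis = 0;
        for (int attempt = 1; attempt <= retries; attempt++) {
            long sleep = Math.min(((1L << Math.min(attempt, maxRetryExponent)) - 1) * 50, maxSleepMillis);
            totalMillis += sleep;
        }
        // The per-sleep cap is reached around attempt 15, and the cumulative wait
        // passes the 30-minute mark at about that point.
        System.out.println("worst-case total backoff ~ " + Duration.ofMillis(totalMillis).toMinutes() + " minutes");
    }
}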

@benwtrent benwtrent marked this pull request as ready for review December 5, 2019 13:34
client().admin()
.cluster()
.prepareUpdateSettings()
.setTransientSettings(Settings.builder()
Contributor

Does it have the potential to affect other, unrelated tests?

Member Author

See the @After clause; it sets them all back to null.

Contributor

I see that part. My question was more along the lines of whether it is possible that two test classes will share these cluster settings. But I guess that's not the case.

currentMin = currentMax;
}
double backOff = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) / 2.0;
int max = (int)(backOff * 100);
Contributor

There is undocumented subtlety here. backOff * 100 can be greater than Integer.MAX_VALUE, and then the cast to int of a double greater than Integer.MAX_VALUE will result in Integer.MAX_VALUE.

But it makes me wonder whether it would be clearer to just use:

    int uncappedBackOff = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) * (100 / 2);

and change MAX_RETRY_EXPONENT to 24.

This avoids any subtlety with casting int to double and back again. Or if there is a really good reason to go via double, please comment it.

Member Author

Rounding up and multiplying by 50 should work. Let me experiment.

private static final int MAX_RETRY_SLEEP_MILLIS = (int)Duration.ofMinutes(15).toMillis();
private static final int MIN_RETRY_SLEEP_MILLIS = 50;
// Having an exponent higher than this causes integer overflow
private static final int MAX_RETRY_EXPONENT = 29;
Contributor

I think this will need changing to 24, otherwise the int will overflow when multiplied by 50

Member Author

jshell> int val = ((1 << 29) - 1) * (100 / 2);
val ==> 1073741774

Contributor

As a long the answer is 26843545550 though, so 1073741774 is due to wrapping.

Try int val = ((1 << 27) - 1) * (100 / 2);

If we're going to rely on wrapping then it would probably be clearer to just say if (currentAttempt > SOMETHING) { max = magic number }

Member Author

You are 100% right. I reduced to 24.
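For the record, a jshell check in the same style as the one above: with the exponent capped at 24 the int arithmetic stays in range, whereas the exponent-29 value only looked plausible because of wrapping:

jshell> int ok = ((1 << 24) - 1) * (100 / 2);
ok ==> 838860750

jshell> long exact = ((1L << 29) - 1) * 50;
exact ==> 26843545550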

@benwtrent
Member Author

@elasticmachine update branch

BulkResponse bulkIndexWithRetry(BulkRequest bulkRequest,
String jobId,
Supplier<Boolean> shouldRetry,
Consumer<String> msgHandler,


could the msgHandler do both: log and audit?

Member Author

Sure, whatever the handler wants, but I do think it is important for the results persister to log on its own.
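One way the caller could wire that up, sketched with a hypothetical auditor interface standing in for the ML audit notification mechanism (the composition via Consumer.andThen is the point, not the exact classes):

import java.util.function.Consumer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class RetryMessageHandlerSketch {

    // Hypothetical stand-in for the ML audit notification mechanism.
    interface Auditor {
        void warning(String jobId, String message);
    }

    private static final Logger logger = LogManager.getLogger(RetryMessageHandlerSketch.class);

    // Compose a msgHandler that both logs locally and writes an audit notification,
    // while the results persister itself still logs on its own as discussed above.
    static Consumer<String> retryMessageHandler(String jobId, Auditor auditor) {
        Consumer<String> log = msg -> logger.warn("[{}] {}", jobId, msg);
        Consumer<String> audit = msg -> auditor.warning(jobId, msg);
        return log.andThen(audit);
    }
}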

Contributor

@droberts195 droberts195 left a comment

LGTM

I think we should also add retries for the model state documents that get indexed by IndexingStateProcessor, but these can be added in a new PR.

@hendrikmuhs hendrikmuhs left a comment

LGTM

Contributor

@przemekwitek przemekwitek left a comment

LGTM


@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit 5c3dd57 into elastic:master Dec 12, 2019
@benwtrent benwtrent deleted the feature/ml-persist-results-retry branch December 12, 2019 14:03
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 12, 2019
…elastic#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.
benwtrent added a commit that referenced this pull request Dec 12, 2019
…lures(#49508) (#50145)

* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.

* fixing test
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…elastic#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.