
[ML] Add graceful retry for anomaly detector result indexing failures #49508

Merged

merged 20 commits into elastic:master on Dec 12, 2019

Conversation

benwtrent
Member

This adds a new setting that allows bulk indexing of results to be retried.

In theory, this should work just fine, as the named pipes are bounded queues.

In the event of a retry:

  • The results handler stops reading from its named pipe, which can then reach capacity
  • This causes the job to stop processing results
  • The job stops pulling from the named pipe fed by the datafeed
  • That pipe reaches capacity
  • The datafeed pauses, waiting for its data to be read

Marking this as WIP, as more digging through the failure paths needs to be done.

Would also be good to get a larger-scale test in place to verify that the backpressure is propagated all the way back to the datafeed without anything getting dropped (see the sketch below).

closes #45711
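To make the intended behaviour concrete, here is a minimal, self-contained sketch in plain Java (it is not Elasticsearch code; the queue stands in for a bounded named pipe). When the consumer slows down, e.g. while sleeping between bulk-index retries, the queue fills and the producer blocks on put(), which is the same backpressure effect as the datafeed pausing:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the bounded named pipe between the datafeed and the results processor.
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(10);

        // Producer: stand-in for the datafeed posting data.
        Thread datafeed = new Thread(() -> {
            try {
                for (int i = 0; i < 200; i++) {
                    pipe.put("doc-" + i); // blocks once the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: stand-in for the results processor. Sleeping between items
        // (as a retry with backoff would) fills the queue and stalls the producer.
        Thread resultsProcessor = new Thread(() -> {
            try {
                while (true) {
                    String doc = pipe.poll(1, TimeUnit.SECONDS);
                    if (doc == null) {
                        break; // producer is done and the queue has drained
                    }
                    Thread.sleep(20); // simulated indexing/retry delay
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        datafeed.start();
        resultsProcessor.start();
        datafeed.join();
        resultsProcessor.join();
    }
}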

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

openJob(job.getId());
startDatafeed(datafeedConfig.getId(), oneDayAgo, now);

// TODO Any better way?????
Member Author

I am not sure of a way to query the internal state... I wonder if we can read the node logs to see whether there is an entry indicating that the bulk index failed?

@@ -218,6 +218,7 @@ public void executeRequest() {
BulkResponse addRecordsResponse = client.bulk(bulkRequest).actionGet();
if (addRecordsResponse.hasFailures()) {
logger.error("[{}] Bulk index of results has errors: {}", jobId, addRecordsResponse.buildFailureMessage());
throw new BulkIndexException(addRecordsResponse);
Member Author

I thought it best to throw here and handle retries up the stack. That way the retry logic knows about the processor state and can stop retrying if the processor died (or is dying).

@benwtrent
Member Author

Verified through testing that the datafeed does feel the backpressure, in two ways:

  • If a lookback completes, it will pause waiting for a flush.
  • The datafeed pauses while the postData call waits to be handled because the job is still processing previous requests.

Previously, bulk index requests did not throw exceptions. With this change, flush failure handling is different:

Flush requests can now fail because items were not written (previously the error was just logged). This is OK, as we now actually notify the listener when the flush fails. The DatafeedJob will continue executing even after a failed flush (logging the failure) and proceed to the next execution time (for real-time).

Digging through

void processResult(AutodetectResult result) throws JobResultsPersister.BulkIndexException {

It is now possible that, if a bulk index request fails, e.g. in

if (records != null && !records.isEmpty()) {
    bulkResultsPersister.persistRecords(records); // executes the bulk index request if over the request size threshold
}

Then the rest of the results processing could be skipped (since bulk index failures now throw).

This is slightly different than before, where the rest of the results processing would continue as normal (even if previous results failed to index). In reality, if one of the results failed to bulk index, it could be assumed that the rest would fail as well.

But we want to make this more reliable, so maybe the retries should be inside each of the individual processResult actions?

@droberts195 what do you think?

@benwtrent
Member Author

Another thought is now with the bulk retries, the bulk request items do not get cleared after a failure.

I think we should probably clear them out, most likely within the bulkPersistWithRetry method. Once all the attempts are exhausted, something like persister.clearBulkRequest() should be called. Otherwise, the bulk request could continue to grow without bound, probably exacerbating the failures.
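A sketch of that idea, using the names mentioned in this thread but otherwise hypothetical scaffolding; it also folds in two review suggestions that come up later (using <= so that zero retries still means one attempt, and restoring the interrupt flag):

public class BulkPersistRetrySketch {

    @FunctionalInterface
    interface CheckedRunnable<E extends Exception> {
        void run() throws E;
    }

    static class BulkIndexException extends Exception {
        BulkIndexException(String message) { super(message); }
    }

    private final int maximumFailureRetries;
    private final Runnable clearBulkRequest;

    BulkPersistRetrySketch(int maximumFailureRetries, Runnable clearBulkRequest) {
        this.maximumFailureRetries = maximumFailureRetries;
        this.clearBulkRequest = clearBulkRequest;
    }

    // Run the bulk-indexing action, retrying with exponential backoff. If every
    // attempt fails, clear the accumulated bulk request so it cannot keep growing
    // (and keep making the failures worse) across subsequent results.
    void bulkPersistWithRetry(CheckedRunnable<BulkIndexException> bulkRunnable) {
        int attempts = 0;
        while (attempts <= maximumFailureRetries) { // <= so that zero retries still means one try
            try {
                bulkRunnable.run();
                return; // success, nothing left to clear
            } catch (BulkIndexException e) {
                attempts++;
                try {
                    double backOff = ((1 << attempts) - 1) / 2.0; // no cap here; see the later discussion about capping the exponent
                    Thread.sleep((long) (backOff * 100));
                } catch (InterruptedException interrupt) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                    break;
                }
            }
        }
        clearBulkRequest.run(); // attempts exhausted (or interrupted): drop the pending items
    }
}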

@benwtrent
Member Author

@elasticmachine update branch

Contributor

@droberts195 droberts195 left a comment

Thanks for taking this on.

I have made a few initial comments.

But as I was reading through the changes I realised this is more complicated than I thought. I'll have a closer look at the flush logic tomorrow, as that's an area that has the potential to completely lock up all processing if we get it wrong.

"xpack.ml.persist_results_max_retries",
2,
0,
Integer.MAX_VALUE - 2,
Contributor

This should probably be lower, say 100.

Member Author

@droberts195 yeah, I agree. I am also thinking the random sleep should probably be a random value between some minimum value and the current exponential backoff max.
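A sketch of that jitter idea: pick a random sleep between a fixed minimum and the current exponential cap. The constant values mirror ones that appear later in this PR; the helper itself is illustrative, not the final code:

import java.time.Duration;
import java.util.Random;

public class JitteredBackoffSketch {
    private static final int MIN_RETRY_SLEEP_MILLIS = 50;
    private static final int MAX_RETRY_SLEEP_MILLIS = (int) Duration.ofMinutes(15).toMillis();
    private static final int MAX_RETRY_EXPONENT = 24; // keeps the multiplication below Integer.MAX_VALUE

    private static final Random RANDOM = new Random();

    // Returns a random sleep in [MIN_RETRY_SLEEP_MILLIS, cap], where the cap grows
    // exponentially with the attempt number and is bounded by MAX_RETRY_SLEEP_MILLIS.
    static int nextSleepMillis(int currentAttempt) {
        int uncappedMax = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) * 50;
        int max = Math.min(Math.max(uncappedMax, MIN_RETRY_SLEEP_MILLIS), MAX_RETRY_SLEEP_MILLIS);
        return MIN_RETRY_SLEEP_MILLIS + RANDOM.nextInt(max - MIN_RETRY_SLEEP_MILLIS + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + " -> sleep " + nextSleepMillis(attempt) + " ms");
        }
    }
}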

@@ -71,6 +73,14 @@
*/
public class AutodetectResultProcessor {

public static final Setting<Integer> PERSIST_RESULTS_MAX_RETRIES = Setting.intSetting(
"xpack.ml.persist_results_max_retries",
Contributor

We should think more about the name of this setting before release. I guess it's possible that indexing data frame analytics results could also fail and need to be retried. In this case we can keep the current setting name and use the same number of retries for both anomaly detection results and data frame analytics results.

But if in the long term we think this setting will only ever be used for anomaly detection results then we should change the name of the setting to xpack.ml.persist_anomaly_results_max_retries.

I am leaning towards using the same setting eventually for data frame analytics results and keeping the name as is. What do you think @dimitris-athanasiou?

Contributor

It would definitely make sense to reuse this setting for data frame analytics.

@@ -310,6 +323,46 @@ void processResult(AutodetectResult result) {
}
}

void bulkPersistWithRetry(CheckedRunnable<JobResultsPersister.BulkIndexException> bulkRunnable) {
int attempts = 0;
while(attempts < maximumFailureRetries) {
Contributor

Shouldn't it be <=, because if retries is zero we still want to try once?

double backOff = ((1 << attempts) - 1) / 2.0;
Thread.sleep((int)(backOff * 100));
} catch (InterruptedException interrupt) {
LOGGER.warn(
Contributor

There should be a Thread.currentThread().interrupt(); in this catch block as well as the logging so that the fact this thread was interrupted is not forgotten.
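A self-contained illustration of that pattern (the helper name is made up for the example):

public class InterruptAwareSleep {

    // Sleep between retries; if interrupted, restore the interrupt flag so the caller
    // (and anything further up the stack) can see it and stop retrying promptly.
    static boolean sleepBetweenRetries(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException interrupt) {
            Thread.currentThread().interrupt(); // do not swallow the interrupt
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();       // simulate an interrupt arriving
        boolean slept = sleepBetweenRetries(100); // returns false immediately
        System.out.println("slept=" + slept + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}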

LOGGER.warn(new ParameterizedMessage("[{}] Error processing autodetect result", jobId), e);
}
}
bulkResultsPersister.clearBulkRequest();
Contributor

Another thought is now with the bulk retries, the bulk request items do not get cleared after a failure.

I think we should also be removing successful items from the bulk request before retrying it, as that will also reduce the burden on the bulk threadpool. That will have to be done in the JobResultsPersister class.

Member Author

I think we should also be removing successful items from the bulk request before retrying it

This is tricky. I will look into trying to filter out the successes. It may be doable because all the indexing requests contain a provided doc ID.
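A possible shape for that filtering, as a sketch against the Elasticsearch bulk API (the surrounding class and method are hypothetical; it assumes, as noted above, that every request carries an explicit doc ID):

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

public final class BulkRetryFilter {

    // Build a new BulkRequest containing only the items that failed in the previous attempt,
    // matched by document ID. Dropping the successful items means a retry does not re-index
    // them and puts less load on the bulk threadpool.
    static BulkRequest onlyFailedItems(BulkRequest original, BulkResponse previousResponse) {
        Set<String> failedIds = Arrays.stream(previousResponse.getItems())
            .filter(BulkItemResponse::isFailed)
            .map(BulkItemResponse::getId)
            .collect(Collectors.toSet());

        List<DocWriteRequest<?>> stillToIndex = original.requests().stream()
            .filter(request -> failedIds.contains(request.id()))
            .collect(Collectors.toList());

        BulkRequest retryRequest = new BulkRequest();
        stillToIndex.forEach(retryRequest::add);
        return retryRequest;
    }
}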

@droberts195 droberts195 changed the title [ML] Add graceful retry for bulk index results failures [ML] Add graceful retry for anomaly detector result indexing failures Dec 2, 2019
@droberts195
Contributor

We discussed this in more detail on a call and came up with the following requirements:

  1. We should retry all* ML result indexing failures, not just those that currently use the bulk endpoint
  2. The same setting should determine the number of retries - we don't want many settings
  3. Failures to index after exhausting all retries should fail the job
  4. But since we don't want 3 to happen for transient problems, the default number of retries should be high and the sum of the backoff time in between these retries should also be high - of the order of 30 minutes in total

(* retries should be added to data frame analytics in a separate PR - this one should just change anomaly detection)
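As a rough sanity check on requirement 4, a sketch that sums the backoff using the formula and caps discussed later in this PR (50 ms times 2^attempt minus 1, per-sleep cap of 15 minutes, exponent capped at 24); it ignores jitter, so it is an upper bound:

import java.time.Duration;

public class BackoffBudgetSketch {
    public static void main(String[] args) {
        final long maxSleepMillis = Duration.ofMinutes(15).toMillis();
        final int maxRetryExponent = 24;
        final int retries = 20; // illustrative retry count, not the actual default
        long totalMillis = 0;
        for (int attempt = 1; attempt <= retries; attempt++) {
            long sleep = Math.min(((1L << Math.min(attempt, maxRetryExponent)) - 1) * 50, maxSleepMillis);
            totalMillis += sleep;
        }
        // The per-sleep cap is reached around attempt 15, and the cumulative wait
        // passes the 30-minute mark at about that point.
        System.out.println("worst-case total backoff ~ " + Duration.ofMillis(totalMillis).toMinutes() + " minutes");
    }
}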

@benwtrent benwtrent marked this pull request as ready for review December 5, 2019 13:34
client().admin()
.cluster()
.prepareUpdateSettings()
.setTransientSettings(Settings.builder()
Contributor

Does it have the potential to affect other, unrelated tests?

Member Author

See the @After clause; it sets them all back to null.

Contributor

I see that part. My question was more along the lines of whether it is possible that two test classes will share these cluster settings. But I guess that's not the case.

currentMin = currentMax;
}
double backOff = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) / 2.0;
int max = (int)(backOff * 100);
Contributor

There is undocumented subtlety here. backOff * 100 can be greater than Integer.MAX_VALUE, and then the cast to int of a double greater than Integer.MAX_VALUE will result in Integer.MAX_VALUE.

But it makes me wonder whether it would be clearer to just use:

    int uncappedBackOff = ((1 << Math.min(currentAttempt, MAX_RETRY_EXPONENT)) - 1) * (100 / 2);

and change MAX_RETRY_EXPONENT to 24.

This avoids any subtlety with casting int to double and back again. Or if there is a really good reason to go via double, please comment it.

Member Author

Rounding up and multiplying by 50 should work. Let me experiment.

private static final int MAX_RETRY_SLEEP_MILLIS = (int)Duration.ofMinutes(15).toMillis();
private static final int MIN_RETRY_SLEEP_MILLIS = 50;
// Having an exponent higher than this causes integer overflow
private static final int MAX_RETRY_EXPONENT = 29;
Contributor

I think this will need changing to 24, otherwise the int will overflow when multiplied by 50

Member Author

jshell> int val = ((1 << 29) - 1) * (100 / 2);
val ==> 1073741774

Contributor

As a long the answer is 26843545550 though, so 1073741774 is due to wrapping.

Try int val = ((1 << 27) - 1) * (100 / 2);

If we're going to rely on wrapping then it would probably be clearer to just say if (currentAttempt > SOMETHING) { max = magic number }

Member Author

You are 100% right. I reduced to 24.
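For the record, a jshell check in the same style as the one above: with the exponent capped at 24 the int arithmetic stays in range, whereas the exponent-29 value only looked plausible because of wrapping:

jshell> int ok = ((1 << 24) - 1) * (100 / 2);
ok ==> 838860750

jshell> long exact = ((1L << 29) - 1) * 50;
exact ==> 26843545550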

@benwtrent
Member Author

@elasticmachine update branch

BulkResponse bulkIndexWithRetry(BulkRequest bulkRequest,
String jobId,
Supplier<Boolean> shouldRetry,
Consumer<String> msgHandler,


could the msgHandler do both: log and audit?

Member Author

Sure, whatever the handler wants, but I do think it is important for the results persister to log on its own.
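One way the caller could wire that up, sketched with a hypothetical auditor interface standing in for the ML audit notification mechanism (the composition via Consumer.andThen is the point, not the exact classes):

import java.util.function.Consumer;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class RetryMessageHandlerSketch {

    // Hypothetical stand-in for the ML audit notification mechanism.
    interface Auditor {
        void warning(String jobId, String message);
    }

    private static final Logger logger = LogManager.getLogger(RetryMessageHandlerSketch.class);

    // Compose a msgHandler that both logs locally and writes an audit notification,
    // while the results persister itself still logs on its own as discussed above.
    static Consumer<String> retryMessageHandler(String jobId, Auditor auditor) {
        Consumer<String> log = msg -> logger.warn("[{}] {}", jobId, msg);
        Consumer<String> audit = msg -> auditor.warning(jobId, msg);
        return log.andThen(audit);
    }
}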

Contributor

@droberts195 droberts195 left a comment

LGTM

I think we should also add retries for the model state documents that get indexed by IndexingStateProcessor, but these can be added in a new PR.

@hendrikmuhs hendrikmuhs left a comment

LGTM

Contributor

@przemekwitek przemekwitek left a comment

LGTM


@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit 5c3dd57 into elastic:master Dec 12, 2019
@benwtrent benwtrent deleted the feature/ml-persist-results-retry branch December 12, 2019 14:03
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 12, 2019
…elastic#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.
benwtrent added a commit that referenced this pull request Dec 12, 2019
…lures(#49508) (#50145)

* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.

* fixing test
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
…elastic#49508)

All result indexing now retries the number of times configured in `xpack.ml.persist_results_max_retries`. The retries use a semi-random exponential backoff.