[ML] Nightly maintenance is not triggered #47003

Closed
pheyos opened this issue Sep 24, 2019 · 5 comments · Fixed by #47103
Assignees: dimitris-athanasiou
Labels: >bug, :ml Machine learning

Comments

@pheyos
Member

pheyos commented Sep 24, 2019

In our test cluster, we've noticed a large number of model snapshots for our long-running jobs, dating back to the day of the upgrade from 6.5.1 -> 6.6.2 -> 6.7.0.
A first investigation of the cloud logs (only the past 7 days are available) showed that no maintenance task has been executed.

pheyos added >bug :ml Machine learning labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

dimitris-athanasiou self-assigned this Sep 24, 2019
@droberts195
Contributor

My observations from discussing this earlier this morning were:

  1. The way the daily maintenance task reschedules itself is that each invocation schedules the next one, so if some problem stops the task from running once, it will never run again.
  2. There is a race condition in MlDailyMaintenanceService: if stop and start are called close together, with start called before stop has finished its work, then the stop that was called first but finished afterwards can cancel the newly scheduled timer. (I'm not sure this can happen in production, but we could defend against it by making all the methods that access the cancellable variable synchronized; see the sketch after this list.)
  3. Similarly, there is a potential race in MlInitializationService if onMaster can be called before a call made very shortly beforehand to offMaster has finished its work. That could be defended against by making all the methods that access mlDailyMaintenanceService synchronized.
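A minimal sketch of the pattern from points 1 and 2, using a plain ScheduledExecutorService and a 24-hour delay as stand-ins (this is not the actual MlDailyMaintenanceService code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of a self-rescheduling daily task: each run schedules the next one,
// so a single missed run silently breaks the chain (point 1). The synchronized
// methods guard the cancellable future so that a stop() which started before
// start() cannot cancel the timer start() has just scheduled (point 2).
public class DailyMaintenanceSketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> cancellable;   // stands in for the real class's cancellable timer

    public synchronized void start() {
        scheduleNext();
    }

    public synchronized void stop() {
        if (cancellable != null) {
            cancellable.cancel(false);
            cancellable = null;
        }
    }

    private synchronized void scheduleNext() {
        cancellable = scheduler.schedule(this::triggerAndReschedule, 24, TimeUnit.HOURS);
    }

    private void triggerAndReschedule() {
        try {
            // ... run the maintenance actions here ...
        } finally {
            scheduleNext();   // if this line is never reached, the task never runs again
        }
    }
}
```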

@pheyos
Member Author

pheyos commented Sep 25, 2019

To debug this issue, I did a full cluster restart, which should schedule a new nightly maintenance task. But one day later (7am server time), the maintenance task still had not run.
I've also checked the logs and model snapshots of a second long-running cluster, which was created directly on 7.3.0 (without an upgrade from an older version like the first cluster), with the same result: no sign of the maintenance task running and lots of old model snapshots.

@droberts195
Contributor

The problem is much simpler than the subtle race conditions I thought of. It was introduced in 6.6: https://github.com/elastic/elasticsearch/pull/36702/files#diff-6d8a49bb3f57c032cf9c63498d1e2f89L48

It means nightly maintenance has not run for any cluster on versions 6.6.0 to 7.4.0 inclusive.

I will fix this for 6.8.4/7.4.1 (and also defend against the subtle races).
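For illustration, a hedged sketch of the shape of the fix (the listener interface and cluster service below are stand-ins, not the real Elasticsearch API): something must register a callback for local-master changes, otherwise the nightly maintenance is never scheduled at all.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the essence of the bug is that the registration performed in main(),
// or its real equivalent, stopped happening after a 6.6 refactoring, so neither onMaster()
// nor offMaster() ever fired and the nightly maintenance was never scheduled.
public class LocalMasterListenerSketch {

    /** Stand-in for the local-node-master callback interface. */
    public interface LocalMasterListener {
        void onMaster();   // this node was elected master
        void offMaster();  // this node stopped being master
    }

    /** Stand-in for the cluster service that tracks master changes. */
    public static class ClusterServiceStub {
        private final List<LocalMasterListener> listeners = new ArrayList<>();

        public void addLocalMasterListener(LocalMasterListener listener) {
            listeners.add(listener);
        }

        public void localNodeBecameMaster() {
            listeners.forEach(LocalMasterListener::onMaster);
        }

        public void localNodeStoppedBeingMaster() {
            listeners.forEach(LocalMasterListener::offMaster);
        }
    }

    public static void main(String[] args) {
        ClusterServiceStub clusterService = new ClusterServiceStub();
        // The fix boils down to making sure a registration like this happens again:
        clusterService.addLocalMasterListener(new LocalMasterListener() {
            @Override public void onMaster()  { System.out.println("schedule nightly maintenance"); }
            @Override public void offMaster() { System.out.println("cancel nightly maintenance"); }
        });
        clusterService.localNodeBecameMaster();
    }
}
```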

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Sep 25, 2019
A refactoring in 6.6 meant that the ML daily
maintenance actions have not been run at all
since then. This change installs the local
master listener that schedules the ML daily
maintenance, and also defends against some
subtle race conditions that could occur in the
future if a node flipped very quickly between
master and non-master.

Fixes elastic#47003
@droberts195
Contributor

I opened #47103 to fix this.

The reason it didn't show up in any of our unit/integration tests is that the unit tests don't exercise the real on/off master listener functionality, and the integration tests test the DeleteExpiredDataAction directly, on the basis that we cannot have an integration test that waits for the early hours of the morning.

So the tests in the elasticsearch repo exercise most pieces of the jigsaw in isolation, but not the full jigsaw. This means it's critical that the long-running QA tests keep an eye on whether daily maintenance actually runs in the early hours of each morning: if it is running, the unit/integration tests show it's probably doing the right thing; if it's not, only the long-running QA tests will tell us.

droberts195 added a commit that referenced this issue Sep 26, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Sep 26, 2019
Due to elastic#47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results:

- Average 200 documents per second
- Maximum of 10 million documents per day

(There is no throttling for state/forecast documents
as these are expected to be lower volume.)

Relates elastic#47103
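As a rough illustration of what throttling a delete-by-query to an average of 200 documents per second looks like from the Java high-level REST client (the index name, timestamp field and retention period below are made up, and this is not the actual ML maintenance code; the 10-million-documents-per-day cap is handled separately in the real change):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class ThrottledDeleteByQueryExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Hypothetical index and retention period, for illustration only.
            DeleteByQueryRequest request = new DeleteByQueryRequest("my-results-index");
            request.setQuery(QueryBuilders.rangeQuery("timestamp").lt("now-14d"));

            // Throttle the delete-by-query to an average of 200 documents per second,
            // as the commit message above describes for expired ML results.
            request.setRequestsPerSecond(200.0f);

            BulkByScrollResponse response = client.deleteByQuery(request, RequestOptions.DEFAULT);
            System.out.println("Deleted " + response.getDeleted() + " documents");
        }
    }
}
```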
droberts195 added three commits that referenced this issue Sep 30, 2019 (same commit message as above)
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
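A minimal sketch of the "rough time limit" idea described above, assuming the clean-up work can be modelled as a list of steps: the deadline is only checked between steps, so a step already in flight is never interrupted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch only: the Runnable list stands in for the individual clean-up operations
// of the real delete-expired-data action.
public class RoughTimeLimitSketch {

    public static void runWithRoughTimeLimit(List<Runnable> cleanupSteps, Duration limit) {
        Instant deadline = Instant.now().plus(limit);
        for (Runnable step : cleanupSteps) {
            if (Instant.now().isAfter(deadline)) {
                // Time is up: stop before starting the next step, never mid-step.
                break;
            }
            step.run();
        }
    }

    public static void main(String[] args) {
        runWithRoughTimeLimit(
            List.of(
                () -> System.out.println("delete expired results"),
                () -> System.out.println("delete expired model snapshots")
            ),
            Duration.ofHours(8)
        );
    }
}
```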
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103