[ML] Nightly maintenance is not triggered #47003

Closed
pheyos opened this issue Sep 24, 2019 · 5 comments · Fixed by #47103
Assignees: dimitris-athanasiou
Labels: >bug, :ml Machine learning

Comments

@pheyos
Member

pheyos commented Sep 24, 2019

In our test cluster, we've noticed a large number of model snapshots for our long-running jobs, dating back to the day of the upgrade from 6.5.1 -> 6.6.2 -> 6.7.0.
A first investigation of the cloud logs (only the past 7 days are available) showed that no maintenance task has been executed.

pheyos added >bug :ml Machine learning labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

dimitris-athanasiou self-assigned this Sep 24, 2019
@droberts195
Contributor

My observations from discussing this earlier this morning were:

  1. The way the daily maintenance task reschedules itself is that each invocation schedules the next one, so if some problem stops the task from running once, it will never run again.
  2. There is a race condition in MlDailyMaintenanceService: if stop and start are called close together, with start called before stop has finished its work, then the stop that was called first but finished afterwards can cancel the newly scheduled timer. (I'm not sure this can happen in production, but we could defend against it by making all the methods that access the cancellable variable synchronized; see the sketch after this list.)
  3. Similarly, there is a potential race in MlInitializationService if onMaster can be called before a call made very shortly beforehand to offMaster has finished its work. That could be defended against by making all the methods that access mlDailyMaintenanceService synchronized.
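A minimal sketch of the pattern from points 1 and 2, using a plain ScheduledExecutorService and a 24-hour delay as stand-ins (this is not the actual MlDailyMaintenanceService code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of a self-rescheduling daily task: each run schedules the next one,
// so a single missed run silently breaks the chain (point 1). The synchronized
// methods guard the cancellable future so that a stop() which started before
// start() cannot cancel the timer start() has just scheduled (point 2).
public class DailyMaintenanceSketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> cancellable;   // stands in for the real class's cancellable timer

    public synchronized void start() {
        scheduleNext();
    }

    public synchronized void stop() {
        if (cancellable != null) {
            cancellable.cancel(false);
            cancellable = null;
        }
    }

    private synchronized void scheduleNext() {
        cancellable = scheduler.schedule(this::triggerAndReschedule, 24, TimeUnit.HOURS);
    }

    private void triggerAndReschedule() {
        try {
            // ... run the maintenance actions here ...
        } finally {
            scheduleNext();   // if this line is never reached, the task never runs again
        }
    }
}
```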

@pheyos
Member Author

pheyos commented Sep 25, 2019

To debug this issue, I did a full cluster restart, which should schedule a new nightly maintenance task. But one day later (7am server time), the maintenance task still had not run.
I've also checked the logs and model snapshots of a second long-running cluster, which was created directly on 7.3.0 (without an upgrade from an older version like the first cluster), with the same result: no sign of the maintenance task running and lots of old model snapshots.

@droberts195
Contributor

The problem is much simpler than the subtle race conditions I thought of. It was introduced in 6.6: https://github.com/elastic/elasticsearch/pull/36702/files#diff-6d8a49bb3f57c032cf9c63498d1e2f89L48

It means nightly maintenance has not run for any cluster on versions 6.6.0 to 7.4.0 inclusive.

I will fix this for 6.8.4/7.4.1 (and also defend against the subtle races).
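For illustration, a hedged sketch of the shape of the fix (the listener interface and cluster service below are stand-ins, not the real Elasticsearch API): something must register a callback for local-master changes, otherwise the nightly maintenance is never scheduled at all.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the essence of the bug is that the registration performed in main(),
// or its real equivalent, stopped happening after a 6.6 refactoring, so neither onMaster()
// nor offMaster() ever fired and the nightly maintenance was never scheduled.
public class LocalMasterListenerSketch {

    /** Stand-in for the local-node-master callback interface. */
    public interface LocalMasterListener {
        void onMaster();   // this node was elected master
        void offMaster();  // this node stopped being master
    }

    /** Stand-in for the cluster service that tracks master changes. */
    public static class ClusterServiceStub {
        private final List<LocalMasterListener> listeners = new ArrayList<>();

        public void addLocalMasterListener(LocalMasterListener listener) {
            listeners.add(listener);
        }

        public void localNodeBecameMaster() {
            listeners.forEach(LocalMasterListener::onMaster);
        }

        public void localNodeStoppedBeingMaster() {
            listeners.forEach(LocalMasterListener::offMaster);
        }
    }

    public static void main(String[] args) {
        ClusterServiceStub clusterService = new ClusterServiceStub();
        // The fix boils down to making sure a registration like this happens again:
        clusterService.addLocalMasterListener(new LocalMasterListener() {
            @Override public void onMaster()  { System.out.println("schedule nightly maintenance"); }
            @Override public void offMaster() { System.out.println("cancel nightly maintenance"); }
        });
        clusterService.localNodeBecameMaster();
    }
}
```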

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Sep 25, 2019
A refactoring in 6.6 meant that the ML daily
maintenance actions have not been run at all
since then. This change installs the local
master listener that schedules the ML daily
maintenance, and also defends against some
subtle race conditions that could occur in the
future if a node flipped very quickly between
master and non-master.

Fixes elastic#47003
@droberts195
Contributor

I opened #47103 to fix this.

The reason it didn't show up in any of our unit/integration tests is that the unit tests don't exercise the real on/off master listener functionality, and the integration tests test the DeleteExpiredDataAction directly, on the basis that we cannot have an integration test that waits for the early hours of the morning.

So the tests in the elasticsearch repo exercise most pieces of the jigsaw in isolation, but not the full jigsaw. This means it's critical that the long-running QA tests keep an eye on whether daily maintenance actually runs in the early hours of each morning: if it is running, the unit/integration tests show it's probably doing the right thing; if it's not, only the long-running QA tests will tell us.

droberts195 added a commit that referenced this issue Sep 26, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Sep 26, 2019
Due to elastic#47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results:

- Average 200 documents per second
- Maximum of 10 million documents per day

(There is no throttling for state/forecast documents
as these are expected to be lower volume.)

Relates elastic#47103
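As a rough illustration of what throttling a delete-by-query to an average of 200 documents per second looks like from the Java high-level REST client (the index name, timestamp field and retention period below are made up, and this is not the actual ML maintenance code; the 10-million-documents-per-day cap is handled separately in the real change):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.reindex.BulkByScrollResponse;
import org.elasticsearch.index.reindex.DeleteByQueryRequest;

public class ThrottledDeleteByQueryExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Hypothetical index and retention period, for illustration only.
            DeleteByQueryRequest request = new DeleteByQueryRequest("my-results-index");
            request.setQuery(QueryBuilders.rangeQuery("timestamp").lt("now-14d"));

            // Throttle the delete-by-query to an average of 200 documents per second,
            // as the commit message above describes for expired ML results.
            request.setRequestsPerSecond(200.0f);

            BulkByScrollResponse response = client.deleteByQuery(request, RequestOptions.DEFAULT);
            System.out.println("Deleted " + response.getDeleted() + " documents");
        }
    }
}
```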
droberts195 added three commits that referenced this issue Sep 30, 2019 (same commit message as above)
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
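A minimal sketch of the "rough time limit" idea described above, assuming the clean-up work can be modelled as a list of steps: the deadline is only checked between steps, so a step already in flight is never interrupted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch only: the Runnable list stands in for the individual clean-up operations
// of the real delete-expired-data action.
public class RoughTimeLimitSketch {

    public static void runWithRoughTimeLimit(List<Runnable> cleanupSteps, Duration limit) {
        Instant deadline = Instant.now().plus(limit);
        for (Runnable step : cleanupSteps) {
            if (Instant.now().isAfter(deadline)) {
                // Time is up: stop before starting the next step, never mid-step.
                break;
            }
            step.run();
        }
    }

    public static void main(String[] args) {
        runWithRoughTimeLimit(
            List.of(
                () -> System.out.println("delete expired results"),
                () -> System.out.println("delete expired model snapshots")
            ),
            Duration.ofHours(8)
        );
    }
}
```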
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
droberts195 added a commit that referenced this issue Oct 2, 2019
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103