[ML] Fix master node deadlock during ML daily maintenance #31691

Merged
jasontedor merged 2 commits into elastic:6.3 from daily-maintenance-tp-fix on Jun 29, 2018

Conversation

davidkyle
Member

@davidkyle davidkyle commented Jun 29, 2018

In multi-node clusters, TransportDeleteExpiredDataAction can try to execute a blocking search on the transport client thread, which causes the node to stop communicating. The searches in this action should execute in the Machine Learning thread pool.

Symptoms appear after the MlDailyMaintenanceService is triggered, which corresponds to the following message in the log file:

[INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks

A work-around is to disable Machine Learning on all nodes in the cluster.

The assertions in BaseFuture should have found this in testing, but the tests ran in a single-node cluster. I added a new Gradle project, ml-native-multi-node-tests, which executes in a 3-node cluster, and moved DeleteExpiredDataIT to it so the test now hits the failure case.

Closes #31683
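
The fix described above amounts to handing the blocking work to a dedicated pool instead of running it on the thread that received the request. The following is a minimal, self-contained sketch of that pattern using only `java.util.concurrent`. It is not the code from this PR: the class `ForkingHandlerSketch`, the thread name `sketch-worker`, and the `blockingSearch` stand-in are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Illustrative only: none of these names come from the Elasticsearch code base.
final class ForkingHandlerSketch {

    // Hypothetical stand-in for a dedicated pool such as the ML thread pool.
    private final ExecutorService workerPool;

    ForkingHandlerSketch(ExecutorService workerPool) {
        this.workerPool = workerPool;
    }

    // Invoked on the calling (e.g. network) thread. It only schedules the work,
    // so the caller is never parked waiting for the blocking call to finish.
    void handle(Consumer<String> listener) {
        workerPool.execute(() -> listener.accept(blockingSearch()));
    }

    // Stand-in for a blocking search; reports which thread ran it.
    private static String blockingSearch() {
        return Thread.currentThread().getName();
    }

    // Demo harness: runs one request and returns the name of the thread that did
    // the blocking work. The latch wait here belongs to the harness, not the handler.
    static String runOnce() {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> new Thread(r, "sketch-worker"));
        try {
            AtomicReference<String> ranOn = new AtomicReference<>();
            CountDownLatch done = new CountDownLatch(1);
            new ForkingHandlerSketch(pool).handle(name -> {
                ranOn.set(name);
                done.countDown();
            });
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker did not finish in time");
            }
            return ranOn.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking work ran on: " + runOnce());
    }
}
```

The handler reports its result through a callback rather than a return value, which is what lets the receiving thread go back to serving other requests while the search runs elsewhere.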

@davidkyle davidkyle added >bug :ml Machine learning v6.3.1 labels Jun 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@davidkyle davidkyle requested review from s1monw and droberts195 June 29, 2018 15:24
@davidkyle davidkyle force-pushed the daily-maintenance-tp-fix branch from 1465a92 to 785084e Compare June 29, 2018 15:43
@davidkyle davidkyle force-pushed the daily-maintenance-tp-fix branch from 785084e to 3d2cdf4 Compare June 29, 2018 15:44
@davidkyle davidkyle requested a review from bleskes June 29, 2018 15:50
Contributor

@droberts195 droberts195 left a comment


LGTM

As we discussed, it's not ideal to be introducing yet another qa suite, ml-native-multi-node-tests. However, as a follow-up we'll move all the ML native tests into this and get rid of the current ml-native-tests suite. So relatively soon we'll be back to the current number of qa suites.

@jasontedor jasontedor requested review from jasontedor and ywelsch June 29, 2018 16:15
Contributor

@ywelsch ywelsch left a comment


LGTM. I've not looked at the tests in detail but rather focused on rechecking all MlDataRemover implementations to see whether there are other blocking calls. I have not found any, but I noticed that ExpiredForecastsRemover#findForecastsToDelete may parse up to 10000 docs in one go on the network thread. This is not ideal, but it can be addressed in a follow-up PR and does not need to hold up this blocker.

@@ -89,6 +90,8 @@ public boolean hasNext() {
    private SearchResponse initScroll() {
        LOGGER.trace("ES API CALL: search index {}", index);

        Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
Contributor


Preferably put this into the next() method instead, so it also covers the other blocking calls in this class.
Could you also write this as assert Transports.assertNotTransportThread(...)? That avoids the extra CPU cost when assertions are disabled.
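
The suggestion above relies on a general Java property: the expression in an `assert` statement is only evaluated when the JVM runs with assertions enabled (`-ea`). A boolean-returning check invoked through `assert` therefore costs nothing in production. The sketch below illustrates this under stated assumptions; `ThreadChecks`, `isNotNetworkThread`, and the `network-` thread-name prefix are hypothetical and are not the Elasticsearch `Transports` implementation.

```java
// Illustrative only: ThreadChecks and the "network-" prefix are hypothetical.
final class ThreadChecks {

    // Counts how often the check actually ran, to make the effect observable.
    static int checksRun = 0;

    // Returns a boolean so that it can be used as the condition of an assert.
    static boolean isNotNetworkThread(String threadName) {
        checksRun++;
        return !threadName.startsWith("network-");
    }

    static void blockingCall() {
        // Without -ea the JVM skips this statement entirely, including the method call.
        assert isNotNetworkThread(Thread.currentThread().getName())
            : "blocking call on a network thread";
        // ... blocking work would go here ...
    }

    // Common idiom for detecting whether assertions are enabled.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment only happens with -ea
        return enabled;
    }

    public static void main(String[] args) {
        blockingCall();
        System.out.println("assertions enabled: " + assertionsEnabled() + ", checks run: " + checksRun);
    }
}
```

Calling the check as a plain statement instead would run it on every invocation regardless of JVM flags, which is the overhead the review comment wants to avoid.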

Contributor


I wonder if we need this at all. The blocking call to the client executes it anyway. The issue was that there was no testing. I think this entire transport action can use an ML thread pool instead.

Member Author


True, it's just a more helpful error message.

Contributor


Please put it into an assert if you keep it. I’d remove it.

@@ -66,10 +67,12 @@ private void deleteExpiredData(ActionListener<DeleteExpiredDataAction.Response>

    private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                                   ActionListener<DeleteExpiredDataAction.Response> listener) {
        Transports.assertNotTransportThread("ML Daily Maintenance");
Contributor


This is unnecessary imo. Can we have a comment here explaining why we fork?

Member Author


Yes, it was to fail faster during testing/development. I'll remove both assertNotTransportThread calls and add a comment.

Contributor

@s1monw s1monw left a comment


LGTM

@jasontedor jasontedor merged commit eb782d0 into elastic:6.3 Jun 29, 2018
@davidkyle davidkyle deleted the daily-maintenance-tp-fix branch June 29, 2018 21:01
@jasontedor
Member

Does this change need to be forward-ported @davidkyle?

@davidkyle
Member Author

@jasontedor Thanks for merging.

No, not in this state. I intend to make further changes to the ML integration tests and incorporate @ywelsch's suggestions.

@jasontedor
Member

Thanks for clarifying @davidkyle!

@droberts195 droberts195 changed the title from "[ML] Expired data cleanup cannot run on the client thread" to "[ML] Fix master node deadlock during ML daily maintenance" Jul 5, 2018
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jul 6, 2018
This is the implementation for master and 6.x of elastic#31691.

Relates elastic#31683
dimitris-athanasiou added a commit that referenced this pull request Jul 7, 2018
This is the implementation for master and 6.x of #31691.
Native tests are changed to use multi-node clusters in #31757.

Relates #31683
dimitris-athanasiou added a commit that referenced this pull request Jul 7, 2018
This is the implementation for master and 6.x of #31691.
Native tests are changed to use multi-node clusters in #31757.

Relates #31683