[ML] Fix master node deadlock during ML daily maintenance #31691
Conversation
Pinging @elastic/ml-core
1465a92 to 785084e
785084e to 3d2cdf4
LGTM
As we discussed, it's not ideal to be introducing yet another qa suite, ml-native-multi-node-tests. However, as a follow-up we'll move all the ML native tests into this and get rid of the current ml-native-tests suite. So relatively soon we'll be back to the current number of qa suites.
LGTM. I've not looked at the tests in detail but rather focused on rechecking all MlDataRemover implementations to see if there might be other blocking calls there. I have not found any, but noticed that ExpiredForecastsRemover#findForecastsToDelete possibly parses up to 10000 docs in one go on the network thread. This is not ideal, but it can be addressed in a follow-up PR and does not need to block this blocker.
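A hypothetical sketch of the kind of follow-up that could address that, assuming the response handler currently runs on the thread that delivered the SearchResponse; the handler method name and the use of the generic pool are illustrative, not taken from the actual code:

    // Hand the potentially large response off to another executor instead of parsing
    // up to 10,000 forecast documents on the network thread that delivered it.
    client.search(forecastSearch, ActionListener.wrap(
            searchResponse -> threadPool.generic().execute(() -> handleForecastsToDelete(searchResponse)),
            listener::onFailure));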
@@ -89,6 +90,8 @@ public boolean hasNext() {
    private SearchResponse initScroll() {
        LOGGER.trace("ES API CALL: search index {}", index);

        Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
Preferably put this into the next() method instead so it will also cover the other blocking calls in this class. Could you also write this as assert Transports.assertNotTransportThread(...)? This will save some extra CPU in non-debug mode.
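A rough sketch of that suggestion (not the merged change), assuming Transports.assertNotTransportThread returns a boolean, as Elasticsearch's assertion helpers generally do, so the whole check disappears when assertions are disabled:

    private SearchResponse initScroll() {
        // Wrapped in an assert so production runs (assertions disabled) skip the check entirely.
        assert Transports.assertNotTransportThread("BatchedDocumentsIterator makes blocking calls");
        LOGGER.trace("ES API CALL: search index {}", index);
        // ... build and execute the scroll search as before ...
    }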
I wonder if we need this at all. The blocking call to the client executes this check anyway. The issue was that there was no testing. I think this entire transport action can use an ML threadpool instead.
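A hypothetical sketch of what running the work on an ML thread pool could look like; the pool name MachineLearning.UTILITY_THREAD_POOL_NAME and the surrounding variable names are assumptions for illustration, not necessarily what the merged change uses:

    // Fork off the transport thread before any blocking client call is made, so the
    // scroll searches issued by the data removers never run on a network thread.
    threadPool.executor(MachineLearning.UTILITY_THREAD_POOL_NAME).execute(() -> {
        try {
            deleteExpiredData(dataRemovers.iterator(), listener);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    });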
True, it's just a more helpful error message
Please put it into an assert if you keep it. I’d remove it.
@@ -66,10 +67,12 @@ private void deleteExpiredData(ActionListener<DeleteExpiredDataAction.Response>

    private void deleteExpiredData(Iterator<MlDataRemover> mlDataRemoversIterator,
                                   ActionListener<DeleteExpiredDataAction.Response> listener) {
        Transports.assertNotTransportThread("ML Daily Maintenance");
This is unnecessary imo. Can we have a comment here explaining why we fork?
Yes, it was to fail faster during testing/development. I'll remove both assertNotTransportThread calls and add a comment.
…e correct thread pool
LGTM
Does this change need to be forward-ported @davidkyle?
@jasontedor Thanks for merging. No, not in this state; I intend to make further changes to the ML integration tests and incorporate @ywelsch's suggestions.
Thanks for clarifying @davidkyle!
This is the implementation for master and 6.x of elastic#31691. Relates elastic#31683
In multi-node clusters, TransportDeleteExpiredDataAction can try to execute a blocking search on the transport client thread, which causes the node to stop communicating. The searches in this action should execute in the Machine Learning thread pool.

Symptoms appear after the MlDailyMaintenanceService is triggered, which corresponds to the following message in the log file:

[INFO ][o.e.x.m.MlDailyMaintenanceService] triggering scheduled [ML] maintenance tasks

A work-around is to disable Machine Learning on all nodes in the cluster.
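For context, the problematic shape versus the intended one looks roughly like this; both client calls are standard Client API, while the surrounding variable names are illustrative only:

    // Problematic: blocks whatever thread runs the transport action. On a transport
    // thread the node stops handling network traffic while the future waits, which
    // is what deadlocks the master during daily maintenance.
    SearchResponse response = client.search(searchRequest).actionGet();

    // Intended shape: use the async variant (and/or fork to an ML-owned executor first)
    // so no transport thread ever blocks waiting on a search.
    client.search(searchRequest, ActionListener.wrap(
            searchResponse -> { /* continue deleting expired results */ },
            listener::onFailure));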
The assertions in BaseFuture should have found this in testing, but the tests ran in a single-node cluster. I added a new gradle project, ml-native-multi-node-tests, whose tests execute in a 3-node cluster, and moved DeleteExpiredDataIT to it so the test now hits the failure case.

Closes #31683