[ML] Job opening fails during .ml-state creation #36271
Pinging @elastic/ml-core
This is an interesting problem. The `.ml-state` index is not created upfront; it is created lazily, the first time a job needs to persist state. In addition, when a job is opened, we validate that the indices the job requires, including `.ml-state` if it exists, are available. Having explained the above, we can now understand why the reported problem happens. The flow of the recognizer is as follows: it creates the jobs and then opens them one after the other, without waiting for the previous open to fully complete.

The last bit is key, as it means that by the time we open the 2nd, 3rd, etc. job, the previous ones are already running. Then it is possible that timing gives rise to the issue. One of the previous jobs starts persisting its state, which triggers the creation of `.ml-state`, while another job is being opened. The newly created index may not have active primaries yet when the open request validates it, so the validation fails. Note this specific scenario may only happen once: when the cluster is new. I will try to list options for solving this in a subsequent comment.
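To make the timing concrete, here is a minimal sketch of what the recognizer effectively does at the REST level (the job IDs and the local endpoint are made up for illustration, and this is the 6.x `_xpack/ml` open-job endpoint; this is not the recognizer's actual code):

```python
import requests

ES = "http://localhost:9200"  # assumed local 6.x cluster; .ml-state does not exist yet

# Hypothetical job IDs standing in for the jobs a recognizer module creates.
for job_id in ["module-job-1", "module-job-2", "module-job-3"]:
    # The first opened job soon persists state, which creates .ml-state on the
    # fly. If a later _open request validates that index while its primaries
    # are still being allocated, that open fails.
    resp = requests.post(f"{ES}/_xpack/ml/anomaly_detectors/{job_id}/_open")
    print(job_id, resp.status_code, resp.json())
```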
I am listing solutions I could come up with here. I have a clear favourite after discussing this with some folk from the elasticsearch team but I'll list the others too (even the bad ones).
1. Make the recognizer wait for each job to finish opening before opening the next one. This is a superficial solution. It would fix the problem with our QA tests, and it would also reduce the chances of the problem happening to our users, but it doesn't prevent the problem from happening in some other way.
2. Drop (or relax) the check on the `.ml-state` index when a job is opened. When the very first job is opened we don't do that check anyway, as the index does not exist yet.
3. Wait for the `.ml-state` index to be at least yellow before we proceed with opening the job. This seems to be the most suitable solution. The index health API allows us to do this easily. Note that we cannot do that during node selection, as that should not be blocking, but we can do it in the master operation of the open job action.

What are your thoughts @elastic/ml-core ?
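To make the favoured option concrete, here is a rough sketch of the wait at the REST level, using the cluster health API with `wait_for_status` and a timeout. The real change would live in the server-side open-job action; the endpoint URL and index name here are assumptions for illustration only.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

def wait_for_state_index_yellow(index=".ml-state", timeout="30s"):
    """Wait until the index reports at least yellow health or the timeout expires."""
    resp = requests.get(
        f"{ES}/_cluster/health/{index}",
        params={"wait_for_status": "yellow", "timeout": timeout},
    )
    body = resp.json()
    # The health API reports timed_out=true (HTTP 408) if the status was not reached in time.
    if resp.status_code == 408 or body.get("timed_out"):
        raise RuntimeError(f"{index} did not reach yellow health within {timeout}")
    return body["status"]

# e.g. call this before opening the next job on a freshly created cluster
print(wait_for_state_index_yellow())
```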
I second waiting for the index to reach at least yellow status. This is essentially what we are doing anyway, except that right now we are relying on the user to wait some amount of time before opening a second job. If the status does not reach at least yellow within some acceptable timeout, or if it degrades before the persistent task can be assigned to a node, throwing an error is acceptable (as you have stated).

Unanswered question: which timeout do we use while waiting for the index health?
The default timeout for the health status API is 30 seconds. That should be more than enough for a healthy cluster to activate the primaries of a freshly created index.
That is a good question. We could reuse the open job request's timeout, but then we would need to do more work to properly account for the timeout across the action's different steps, and unfortunately the infrastructure to do this properly isn't quite there. We could give that a try, simply accumulate the timeouts, or apply the default health timeout separately from the action's timeout. The first should be the correct behaviour, but we need to balance the effort against the benefit when making this call.
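As a rough illustration of what "properly accounting for the timeout" could mean, the sketch below deducts the time already spent from the caller's overall budget before each step. The function names and steps are illustrative only, not the actual action code.

```python
import time

def remaining(budget_s: float, started_at: float) -> float:
    """How much of the caller's timeout budget is left, never negative."""
    return max(0.0, budget_s - (time.monotonic() - started_at))

def open_job_with_budget(open_timeout_s: float = 1800.0) -> None:
    started = time.monotonic()

    # Step 1: wait for .ml-state to be at least yellow, but never for longer
    # than what is left of the open request's own timeout.
    health_wait = min(30.0, remaining(open_timeout_s, started))
    print(f"waiting up to {health_wait:.0f}s for the state index")
    time.sleep(0.1)  # stand-in for the actual health-API wait

    # Step 2: whatever budget is left is what the persistent-task assignment
    # and the rest of the open flow are allowed to consume.
    print(f"{remaining(open_timeout_s, started):.0f}s left for task assignment")

open_job_with_budget()
```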
I agree, since the default timeout for waiting against the health status API is 30 seconds, that is probably OK. The only thing that SLIGHTLY concerns me is when the user sets the timeout on the API to less than 30 seconds. However, that should be rare, as the default value is high (30 min), and this would nudge users towards providing a higher timeout value, where an additional 30 seconds of waiting would not be that big of an issue. I say let's keep it simple and not worry about accumulating the timeouts just yet. If this turns out to be painful, we can do the extra work down the line. I think the bigger concern is that this particular bug even exists; an API not timing out exactly when you request it to seems like a smaller issue to me.
@benwtrent I've tried in various test environments, but with your fix I was not able to reproduce this issue any more 🎉 Thanks for fixing this!
Found in version

Steps to reproduce

Perform the following steps on a newly created instance / cluster (i.e. the `.ml-state` index does not exist yet):

Expected result

Actual result

The `.ml-state` index is created. With bad luck on timing it can happen that the index is not yet green when the second job should open, so that the opening fails with an error message.

Additional information

We used the `nginx` recognizer module to create multiple jobs at a time.